Sunday 5 October 2014

What should I do?

Level [C1]

TLDR; : I was charged for a huge egress on one of my VMs and I have no way of knowing what caused it or whether it was an infrastructure glitch nothing to do with VM.

OK, here is the snippet of the last email I received back:

"I understand what you’re saying. Because this involves a non-windows VM, we wouldn’t be able to determine what caused this. we can only validate the usage, and as you already know, the data usage seems quite appropriate, comparing to our logs. Had this been a Windows machine, we could have engaged another team(s) to have this matter looked into. As of now, I am afraid, this is all we have. You might want to check with Ubuntu support to see what has caused this."

The story started two weeks ago. I have, you know, MSDN account courtesy of my work which provides around £95/mo free Windows Azure credit - for which I am really grateful. It has allowed me to run some kinda pre-startup stuff on a shoestring. I recently realised my free credit can take you so far so started using Azure services more liberally knowing that I am going to be charged. At the end of the day, nothing valuable comes out of nothing. But before doing that, I also registered for AWS and as you know, it provides some level of free services which I again took advantage of.

But I have not said anything about the problem yet. It was around the end of the month and I knew my remaining credit would be enough to carry me to the next month. Then I noticed my credit panel turning orange from green (this is quite handy, telling you with the rate of usage you will soon run out of credit) which I thought was bizarre and then next day I realised all my services had disappeared. Totally gone! Bang! I had run out of credit...

This was a Saturday and I spent Saturday and Sunday reinstating my services. So I learnt the lesson that I need remove spending cap, which is not the reason why you read this. The reason I ran out of credit was due to egress (=data out) from one of my Linux boxes... so this box used to have an egress of a few MB to max few hundred MB a day and suddenly shoot up to 175GB and 186GB! OK, either there is a mistake or my box has been hacked into - with the latter more likely.

Here is the egress from that "renegade" Linux box:
8/30/2014 "Data Transfer Out (GB)" "GB" 0.004967
8/31/2014 "Data Transfer Out (GB)" "GB" 0.006748
9/1/2014 "Data Transfer Out (GB)" "GB" 0.001735
9/2/2014 "Data Transfer Out (GB)" "GB" 0.17618
9/3/2014 "Data Transfer Out (GB)" "GB" 0.003499
9/4/2014 "Data Transfer Out (GB)" "GB" 0.013394
9/5/2014 "Data Transfer Out (GB)" "GB" 0.016147
9/6/2014 "Data Transfer Out (GB)" "GB" 0.005412
9/7/2014 "Data Transfer Out (GB)" "GB" 0.005803
9/8/2014 "Data Transfer Out (GB)" "GB" 0.001547
9/9/2014 "Data Transfer Out (GB)" "GB" 0.003044
9/10/2014 "Data Transfer Out (GB)" "GB" 0.002179
9/11/2014 "Data Transfer Out (GB)" "GB" 0.02876
9/12/2014 "Data Transfer Out (GB)" "GB" 0.008922
9/13/2014 "Data Transfer Out (GB)" "GB" 0.28983
9/14/2014 "Data Transfer Out (GB)" "GB" 0.099229
9/15/2014 "Data Transfer Out (GB)" "GB" 0.002653
9/16/2014 "Data Transfer Out (GB)" "GB" 0.00191
9/17/2014 "Data Transfer Out (GB)" "GB" 0.00182
9/18/2014 "Data Transfer Out (GB)" "GB" 175.69292
9/19/2014 "Data Transfer Out (GB)" "GB" 182.974478

This box was running an ElasticSearch instance which had barely 1GB of data. And yes, it was not protected so it could have been hacked into. So what I did, with a bunch of bash commands which I conveniently copied and pasted from google searches, was to create a list files that were changed on the box ordered by the date and send to the support. There was nothing suspicious there - and the support team did not find it any more useful [in fact the comment was that it was "poorly formatted", I assume due to the difference in new line character in linux :) ].

So it seemed less likely that it was hacked but maybe someone has been running queries against the ElasticSearch which had been secured only by its obscurity. But hang on! If that were the case, the ingress should somehow correspond:
8/30/2014 "Data Transfer In (GB)" "GB" 0.004335
8/31/2014 "Data Transfer In (GB)" "GB" 0.005579
9/1/2014 "Data Transfer In (GB)" "GB" 0.000744
9/2/2014 "Data Transfer In (GB)" "GB" 0.021571
9/3/2014 "Data Transfer In (GB)" "GB" 0.002983
9/4/2014 "Data Transfer In (GB)" "GB" 0.002571
9/5/2014 "Data Transfer In (GB)" "GB" 0.002961
9/6/2014 "Data Transfer In (GB)" "GB" 0.001994
9/7/2014 "Data Transfer In (GB)" "GB" 0.001642
9/8/2014 "Data Transfer In (GB)" "GB" 0.000483
9/9/2014 "Data Transfer In (GB)" "GB" 0.001879
9/10/2014 "Data Transfer In (GB)" "GB" 0.002022
9/11/2014 "Data Transfer In (GB)" "GB" 0.017067
9/12/2014 "Data Transfer In (GB)" "GB" 0.002644
9/13/2014 "Data Transfer In (GB)" "GB" 0.347959
9/14/2014 "Data Transfer In (GB)" "GB" 0.089146
9/15/2014 "Data Transfer In (GB)" "GB" 0.000404
9/16/2014 "Data Transfer In (GB)" "GB" 0.001912
9/17/2014 "Data Transfer In (GB)" "GB" 0.001733
9/18/2014 "Data Transfer In (GB)" "GB" 0.012967
9/19/2014 "Data Transfer In (GB)" "GB" 0.021446

which it does in all days other than 18th and 19th. Which made me think, it was perhaps all a mistake and maybe an Azure infrastructure agent or something has gone out of control and started doing this.

So I asked the support to start investigating the issue. And it took a week to get back to me and the investigation provided only the hourly breakdown (and I was hoping for more, perhaps some kind of explanation or identifying the IP address all this egress was going). The pattern is also bizarre. For example on 19th (at the end of which my credit ran out):
2014-09-18T00:00:00 2014-09-18T01:00:00 DataTrOut 166428 External
2014-09-18T01:00:00 2014-09-18T02:00:00 DataTrOut 374040 External
2014-09-18T02:00:00 2014-09-18T03:00:00 DataTrOut 2588121384 External
2014-09-18T03:00:00 2014-09-18T04:00:00 DataTrOut 539993671 External
2014-09-18T04:00:00 2014-09-18T05:00:00 DataTrOut 1128216 External
2014-09-18T05:00:00 2014-09-18T06:00:00 DataTrOut 25462 External
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 18308 AM2
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 63250 External
2014-09-18T07:00:00 2014-09-18T08:00:00 DataTrOut 24588 External
2014-09-18T08:00:00 2014-09-18T09:00:00 DataTrOut 82296 External
2014-09-18T09:00:00 2014-09-18T10:00:00 DataTrOut 59362 External
2014-09-18T10:00:00 2014-09-18T11:00:00 DataTrOut 10573316727 External
2014-09-18T11:00:00 2014-09-18T12:00:00 DataTrOut 11443247791 External
2014-09-18T12:00:00 2014-09-18T13:00:00 DataTrOut 13854724048 External
2014-09-18T13:00:00 2014-09-18T14:00:00 DataTrOut 8115190263 External
2014-09-18T14:00:00 2014-09-18T15:00:00 DataTrOut 13748807057 External
2014-09-18T15:00:00 2014-09-18T16:00:00 DataTrOut 10389478694 External
2014-09-18T16:00:00 2014-09-18T17:00:00 DataTrOut 19979259451 External
2014-09-18T17:00:00 2014-09-18T18:00:00 DataTrOut 21398993891 External
2014-09-18T18:00:00 2014-09-18T19:00:00 DataTrOut 22843598777 External
2014-09-18T19:00:00 2014-09-18T20:00:00 DataTrOut 23087199863 External
2014-09-18T20:00:00 2014-09-18T21:00:00 DataTrOut 16958070173 External
2014-09-18T21:00:00 2014-09-18T22:00:00 DataTrOut 13126214430 External
2014-09-18T22:00:00 2014-09-18T23:00:00 DataTrOut 352327 External
2014-09-18T23:00:00 2014-09-19T00:00:00 DataTrOut 358377 External

So what should I do?

So first of all, I have now put the ElasticSearch box behind a proxy and access to it requires authentication with the proxy. And better to do it now rather than later. And the ES box now is protected by IPSec.

But really the big question is, when you are on cloud and you don't own any of the infrastructure or its monitoring, how can you make sure you are being charged fairly. My £40 bill for the egress is not huge but makes me wonder, what if it happens again? What would I do?

There are also other questions: would that have been different on another provider? I am not really sure [although at least they could have opened a file with Linux line ending :) ] but the usage of a cloud platform requires building a trust relationship which is essential. I really appreciate the general attitude of Azure (and Microsoft) towards Open Source in embracing everything non-Windows and I think it is the right direction, but I think the support model should be also developed in line with that. AWS is a more mature platform but have you seen anything like this there? And if yes, how was your experience?


No comments:

Post a Comment

Note: only a member of this blog may post a comment.