Thursday 27 August 2015

No-nonsense Azure Monitoring in 20 Minutes (maybe 21) using ECK stack

The Azure platform has been around for 6 years now and is going from strength to strength. With the release of many different services and options (and sometimes too many services), it is now difficult to think of a technology tool or paradigm which is not “there” - albeit perhaps not exactly in the shape that you had wished for. Having said that, monitoring - even by the admission of some of the product teams - has not been the strongest of the features in Azure. Sadly, when building cloud systems, monitoring/telemetry is not a feature: it is a must.

I do not want to rant for hours why and how a product that is mainly built for external customers is different from the internal one which on its strength and success gets packaged up and released (as is the case with AWS) but a consistent and working telemetry option in Azure is pretty much missing - there are bits and pieces here and there but not a consolidated story. I am informed that even internal teams within Microsoft had to build their own monitoring solutions (something similar to what I am about to describe further down). And as the last piece of rant, let me tell you, whoever designed this chart with this puny level of data resolution must be punished with the most severe penalty ever known to man: actually using it - to investigate a production issue.

A 7-day chart, with 14 data points.

What are you on about?

Well if you have used Azure to deliver any serious solution and then tried to do any sort of support, investigation and root cause analysis, without using one of the paid telemetry solutions (and even with using them), painfully browsing through gigs of data in Table Storage, you would know the pain. Yeah, that's what I am talking about! I know you have been there, me too.

And here, I am presenting a solution to the telemetry problem that can give you these kinds of sexy charts, very quickly, on top of your existing Azure WAD tables (and other data sources) - tried, tested and working, requiring some setup and very little maintenance.

If you are already familiar with the ELK (Elasticsearch, LogStash and Kibana) stack, you might be saying you already got that. True. But while LogStash is great and has many groks, it has been very much designed with the Linux mindset: just a daemon running locally on your box/VM, reading your syslog and delivering it over to Elasticsearch. The way Azure works is totally different: the local monitoring agent running on the VM keeps shovelling your data to durable and highly available storage (Table or Blob) - which I quite like. With VMs being essentially ephemeral, it makes a lot of sense to master your logging outside the boxes and to read the data from that storage. Now, that is all well and good but when you have many instances of the same role (say you have scaled to 10 nodes) writing to the same storage, the data is usually much bigger than what a single process can handle, so the shovelling needs to be scaled out, which requires centralised scheduling.

The gist of it: I am offering ECK (Elasticsearch, ConveyorBelt and Kibana), an alternative to LogStash that is Azure friendly (it typically runs in a Worker Role), can tap into your existing WAD logs (as well as custom ones) out-of-the-box and, with a push of a button, can be horizontally scaled to N to handle the load for all your projects - and for your enterprise if you work for one. And it is open source, and can be extended to shovel data from any other sources.

At its core, ConveyorBelt employs a clustering mechanism that can break down the work into chunks (scheduling), keep a pointer to the last scheduled point, push data to Elasticsearch in parallel and in batches, and gracefully retry the work if it fails. It is headless, so any node can fail, be shut down, restarted, added or removed - without affecting the integrity of the cluster. All of this, without waking you up at night, and basically after a few days, making you forget it ever existed. In the enterprise I work for, we use just 3 medium instances to power analytics from 70 different production Storage Tables (and blobs).

Basic Concepts

Before you set up your own ConveyorBelt (CB), it is better to know a few concepts and facts.

First of all, there is a one-to-one mapping between an Elasticsearch cluster and a ConveyorBelt cluster. ConveyorBelt has a list of DiagnosticSources, typically stored in an Azure Table Storage, which contains all data (and state) pertaining to a source. A source typically is a Table Storage, or a blob folder containing diagnostic data (or other) - but CB is extensible to accept other data stores such as SQL, file or even Elasticsearch itself (yes if you ever wanted to copy data from one ES to another). DiagnosticSource contains connection information for the CB to connect. CB continuously breaks down the work (schedules) for its DiagnosticSources and keeps updating the LastOffset.

Once the work is broken down into bite-size chunks, they are picked up by actors (it internally uses BeeHive) and the data within each chunk is pushed up to your Elasticsearch cluster. There is usually a delay between the data being captured and it being copied to storage (something that you typically set in the Azure configuration: how often to copy data), so you set a Grace Period after which, if the data isn't there, it is assumed there won’t be any. Your Elasticsearch data will usually be behind realtime by the Grace Period. If you left everything as defaults, Azure copies data every minute, for which a Grace Period of 3-5 minutes is safe. For IIS logs this is usually longer (I use 15-20 minutes).
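To make the scheduling and Grace Period idea a bit more concrete, here is a minimal sketch in Python. ConveyorBelt itself is a .NET project and this is not its actual code: the function name schedule_minute_chunks, its parameters and the one-minute chunk size are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def schedule_minute_chunks(last_offset, now, grace_period_minutes, max_chunks=60):
    """Split the time between last_offset and (now - grace period) into
    one-minute chunks. Returns the chunks and the new offset to store.
    Hypothetical helper, not part of ConveyorBelt."""
    cutoff = now - timedelta(minutes=grace_period_minutes)
    one_minute = timedelta(minutes=1)
    chunks = []
    start = last_offset
    # Only schedule a minute once it is entirely older than the grace period.
    while start + one_minute <= cutoff and len(chunks) < max_chunks:
        chunks.append((start, start + one_minute))
        start += one_minute
    return chunks, start


last_offset = datetime(2015, 2, 15, 19, 34, tzinfo=timezone.utc)
now = datetime(2015, 2, 15, 19, 40, 30, tzinfo=timezone.utc)
chunks, new_offset = schedule_minute_chunks(last_offset, now, grace_period_minutes=3)
for start, end in chunks:
    print(start.isoformat(), "->", end.isoformat())
```

With a Grace Period of 3 minutes, only the minutes from 19:34 up to 19:37 are scheduled in this example, and the stored offset moves to 19:37. Everything newer waits for the next scheduling run.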

The data that is pushed to Elasticsearch requires:
  • An index name: by default the date in the yyyyMMdd format is used as the index name (but you can provide your own index)
  • The type name: default is PartitionKey + _ + RowKey (or the one you provide)
  • Elasticsearch mapping: the Elasticsearch equivalent of a schema, which defines how to store and index data for a source. These mappings are stored at a URL (a web folder or a public read-only Azure Blob folder) - the schemas for typical Azure data (WAD logs, WAD Perf data and IIS Logs) are already available by default and you just need to copy them to your site or public Blob folder.
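As a small illustration of the default naming, the sketch below derives an index name and a type name in Python. The function names are hypothetical, the sample keys are made up, and I am assuming that the date used for the index is the timestamp of the record being pushed.

```python
from datetime import datetime


def default_index_name(timestamp):
    """Index name in yyyyMMdd format (assumption: taken from the record's timestamp)."""
    return timestamp.strftime("%Y%m%d")


def default_type_name(partition_key, row_key):
    """Type name built as PartitionKey + _ + RowKey."""
    return partition_key + "_" + row_key


print(default_index_name(datetime(2015, 8, 27, 10, 15)))
print(default_type_name("shop_orders", "live_api_logs"))
```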

Set up your own monitoring suite

OK, now time to create our own ConveyorBelt cluster! Basically the CB cluster will shovel the data to an Elasticsearch cluster, and you will need Kibana to visualise your data. Further below I explain how to set up Elasticsearch and Kibana in a Linux VM box. But ...

if you are just testing the waters and want to try CB, you can create a Windows VM, download Elasticsearch and Kibana and run their batch files and then move to setting up CB. But after you have seen it working, come back to the instructions and set it up in a Linux box, its natural habitat.

So setting this up in Windows is just a matter of downloading the files from the links below, unzipping them and then running the batch files elasticsearch.bat and kibana.bat. Make sure you expose the ports 5601 and 9200 from your VM, by creating endpoints.

Set up ConveyorBelt

As discussed above, ConveyorBelt is typically deployed as an Azure Cloud Service. In order to do that, you need to clone the GitHub repo, build it and then deploy it with your own credentials and settings - and all of this should be pretty easy. Once deployed, you need to define your various diagnostic sources and point them to your Elasticsearch, and then just relax and let CB do its work. So we will look at the steps now.

Clone and build ConveyorBelt repo

You can use command line:
git clone <URL of the ConveyorBelt GitHub repo>
Or use your tool of choice to clone the repo. Then open an administrative PowerShell window, move to the build folder and execute .\build.ps1

Deploy mappings

Elasticsearch is able to guess the data types of your data and index them in a format that is usually suitable. However, this is not always true, so we need to tell Elasticsearch how to store each field, and that is why CB needs to know the mappings in advance.

To deploy mappings, create a Blob Storage container with the option "Public Container" - this allows the content to be publicly available in a read-only fashion. 

You would need the URL for the next step. It is in the format:
https://<storage account name>.blob.core.windows.net/<container name>/

Then use the tool of your choice to copy the mapping files from the mappings folder under the ConveyorBelt directory into this container.

Configure and deploy

Once you have built the solution, rename tokens.json.template file to tokens.json and edit tokens.json file (if you need some more info, find the instructions here). Then in the same PowerShell window, run the command below, replacing placeholders with your own values:
.\PublishCloudService.ps1 `
  -serviceName <name your ConveyorBelt Azure service> `
  -storageAccountName <name of the storage account needed for the deployment of the service> `
  -subscriptionDataFile <your .publishsettings file> `
  -selectedsubscription <name of subscription to use> `
  -affinityGroupName <affinity group or Azure region to deploy to>
After running the commands, you should see the PowerShell deploying CB to the cloud with a single Medium instance. In the storage account you had defined, you should now find a new table, whose name you defined in the tokens.json file.

Configure your diagnostic sources

Configuring the diagnostic sources can wildly differ depending on the type of the source. But for standard tables such as WADLogsTable, WADPerformanceCountersTable and WADWindowsEventLogsTable (whose mapping file you just copied) it will be straightforward.

Now choose an Azure diagnostic Storage Account with some data, and in the diagnostic source table, create a new row and add the entries below:

  • PartitionKey: whatever you like - commonly <top level business domain>_<mid level business domain>
  • RowKey: whatever you like - commonly <env: live/test/integration>_<service name>_<log type: logs/wlogs/perf/iis/custom>
  • ConnectionString (string): connection string to the Storage Account containing WADLogsTable (or others)
  • GracePeriodMinutes (int): Depends on how often your logs get copied to the Azure table. If it is 10 minutes then 15 should be ok, if it is 1 minute then 3 is fine.
  • IsActive (bool): True
  • MappingName (string): WADLogsTable. ConveyorBelt will look for the mapping at the URL "X/Y.json", where X is the value you defined in your tokens.json for the mappings path and Y is the TableName (see below).
  • LastOffsetPoint (string): set to ISO Date (second and millisecond MUST BE ZERO!!) from which you want the data to be copied e.g. 2015-02-15T19:34:00.0000000+00:00
  • LastScheduled (datetime): set it to a date in the past, same as the LastOffset point. Why do we have both? Well each does something different so we need both. 
  • MaxItemsInAScheduleRun (int): 100000 is fine
  • SchedulerType (string): ConveyorBelt.Tooling.Scheduling.MinuteTableShardScheduler
  • SchedulingFrequencyMinutes (int): 1
  • TableName (string): WADLogsTable, WADPerformanceCountersTable or WADWindowsEventLogsTable
And save. OK, now CB will start shovelling your data to your Elasticsearch and you should start seeing some data. If you do not, look at the entries you have created in the Table Storage and you will find an Error column which tells you what went wrong. Also to investigate further, just RDP to one of your ConveyorBelt VMs and run DebugView while having "Capture Global Win32" enabled - you should see some activity similar to below picture. Any exceptions will also show in there.
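Before saving a row, it can help to sanity-check the values that are easiest to get wrong. The sketch below is only an illustration in Python: the helper names is_minute_aligned and mapping_url, the sample keys and the example.invalid URL are all hypothetical, and it does not talk to Azure at all. It simply checks that LastOffsetPoint has zero seconds and milliseconds and shows where the mapping file would be looked up.

```python
import re

# ISO date-time with optional fractional seconds and a UTC offset.
_ISO_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:(\d{2})(?:\.(\d+))?(?:Z|[+-]\d{2}:\d{2})$"
)


def is_minute_aligned(offset_point):
    """True if the value is an ISO date whose seconds and fraction are zero."""
    match = _ISO_RE.match(offset_point)
    if match is None:
        return False
    seconds, fraction = match.group(1), match.group(2) or "0"
    return int(seconds) == 0 and int(fraction) == 0


def mapping_url(mappings_path, mapping_name):
    """Build the "X/Y.json" URL described for MappingName."""
    return mappings_path.rstrip("/") + "/" + mapping_name + ".json"


# Sample row: keys follow the list above, values are made up for illustration.
source = {
    "PartitionKey": "shop_orders",
    "RowKey": "live_api_logs",
    "ConnectionString": "<connection string of the diagnostics Storage Account>",
    "GracePeriodMinutes": 3,
    "IsActive": True,
    "MappingName": "WADLogsTable",
    "LastOffsetPoint": "2015-02-15T19:34:00.0000000+00:00",
    "LastScheduled": "2015-02-15T19:34:00.0000000+00:00",
    "MaxItemsInAScheduleRun": 100000,
    "SchedulerType": "ConveyorBelt.Tooling.Scheduling.MinuteTableShardScheduler",
    "SchedulingFrequencyMinutes": 1,
    "TableName": "WADLogsTable",
}

print(is_minute_aligned(source["LastOffsetPoint"]))
print(mapping_url("https://example.invalid/mappings/", source["MappingName"]))
```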

OK, that is it... you are done! ... well barely 20 minutes, wasn't it? :)

Now in case you are interested in setting up ES+Kibana in Linux, here is your little guide.

Set up your Elasticsearch in Linux

You can run Elasticsearch on Windows or Linux - I prefer the latter. To set up an Ubuntu box on Azure, you can follow instructions here. Ideally you need to add a Disk Volume as the VM disks are ephemeral - all you need to know is outlined here. Make sure you follow instructions to re-mount the drive after reboots. Another alternative, especially for your dev and test environments, is to go with D series machines (SSD disks) and use the ephemeral disks - they are fast and basically if you lose the data, you can always set ConveyorBelt to re-add the data, and it does it quickly. As I said before, never use Elasticsearch to master your logging data, so that you can always recover from losing it.

Almost all of the commands and settings below need to be run in an SSH session. If you are a geek with a lot of Linux experience, you might find some of the details below obvious and unnecessary - in which case just move on.

SSH is your best friend

Anyway, back to setting up ES - after you got your VM box provisioned, SSH to the box and install Oracle JDK:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
And then download the Elasticsearch 1.7.1 Debian package (elasticsearch-1.7.1.deb) and install it:
sudo dpkg -i elasticsearch-1.7.1.deb
Now you have installed ES v 1.7.1. To set Elasticsearch to start at reboots (equivalent of Windows services) run these commands in SSH:
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
Now ideally you would want to move the data and logs to the durable drive you have mounted. Just open the Elasticsearch config in vim:
sudo vim /etc/elasticsearch/elasticsearch.yml
and then change the paths (note the uncommented lines):
path.data: /mounted/elasticsearch/data

# Path to temporary files:
# path.work: /path/to/work

# Path to log files:
path.logs: /mounted/elasticsearch/data
Now you are ready to restart Elasticsearch:
sudo service elasticsearch restart
Note: Elasticsearch is memory, CPU and IO hungry. SSD drives really help, but if you do not have them (they come with D series VMs), make sure you provide plenty of RAM and enough CPU. Searches are CPU heavy, so how much you need will depend on the number of concurrent users.
If your machine has a lot of RAM, make sure you set ES memory settings as the default ones will be small. So update the file below and set the memory to 50-60% of the total memory size of the box:
sudo vim /etc/default/elasticsearch
And uncomment this line and set the memory size to roughly half of your box’s memory (here 14GB, just an example!):
ES_HEAP_SIZE=14g
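As a rough sketch of that arithmetic, the Python snippet below turns the total RAM of a box into a heap setting line. It assumes the ES_HEAP_SIZE variable that the Elasticsearch 1.x Debian package reads from /etc/default/elasticsearch; the function name es_heap_size_line and the 28GB figure are just examples.

```python
def es_heap_size_line(total_ram_gb, fraction=0.5):
    """Return the heap setting for a box with the given RAM (whole GB, at least 1)."""
    heap_gb = max(1, int(total_ram_gb * fraction))
    return "ES_HEAP_SIZE={}g".format(heap_gb)


# A box with 28GB of RAM ends up with a 14GB heap.
print(es_heap_size_line(28))
```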
There are potentially other changes that you might want to make. For example, based on the number of your nodes, you will want to set index.number_of_replicas in your elasticsearch.yml - if you have a single node, set it to 0. You will also want to turn off the multicast/Zen discovery since it will not work in Azure. But these are things you can start learning about when you are completely hooked on the power of information provided by the solution. Believe me, more addicting than narcotics!

Set up the Kibana in Linux

Up until version 4, Kibana was simply a set of static HTML+CSS+JS files that would run locally on your browser by just opening root HTML in the browser. This model could not really be sustainable and with version 4, Kibana runs as a service on a box, most likely different to your ES nodes. But for PoC and small use cases it is absolutely fine to run it on the same box.
Installing Kibana is straightforward. You just need to download and unpack it:
tar xvf kibana-4.1.1-linux-x64.tar.gz
So now Kibana will be downloaded to your home directory and be unpacked to kibana-4.1.1-linux-x64 folder. If you want to see where that folder is you can run pwd to get the folder name.
Now, to start Kibana, move to the bin folder inside the unpacked folder and run the kibana script:
cd kibana-4.1.1-linux-x64/bin
./kibana
That will do for testing if it works, but you need to configure it to start at boot. We can use upstart for this. Just create a file in the /etc/init folder:
sudo vim /etc/init/kibana.conf
and copy the below (path could be different) and save:
description "Kibana startup"
author "Ali"
start on runlevel [2345]
stop on runlevel [!2345]
exec /home/azureuser/kibana-4.1.1-linux-x64/bin/kibana
Now run this command to make sure there is no syntax error:
init-checkconf /etc/init/kibana.conf
If good then start the service:
sudo start kibana
If you have installed Kibana on the same box as the Elasticsearch and left all ports as the same, now you should be able to go to browser and browse to the server on port 5601 (make sure you expose this port on your VM by configuring endpoints) and you should see the Kibana screen (obviously no data).
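If the browser shows nothing, a quick check of both ports from your own machine can tell you whether the endpoints are reachable at all. Below is a minimal sketch in Python using only the standard library. The helper names and the host name in the commented usage are hypothetical, and the ports are the defaults mentioned above (9200 for Elasticsearch, 5601 for Kibana).

```python
import urllib.error
import urllib.request


def endpoint_urls(host, es_port=9200, kibana_port=5601):
    """Build the URLs to probe for a given host."""
    return {
        "elasticsearch": "http://{}:{}/".format(host, es_port),
        "kibana": "http://{}:{}/".format(host, kibana_port),
    }


def is_up(url, timeout=5):
    """True if the URL answers with a successful or redirect HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


# Example usage (replace the hypothetical host name with your VM's DNS name):
# for name, url in endpoint_urls("my-es-box.cloudapp.net").items():
#     print(name, url, "up" if is_up(url) else "DOWN")
```

If Elasticsearch answers but Kibana does not, the usual suspects are a missing endpoint for port 5601 or the upstart job not running.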