Friday, 13 May 2016

XML or JSON, and that is not the question

So in last couple of days, our .NET community has showed some strong reactions to the announcements in the ASP.NET team stand-up. While Ruby and more recently node community are known to endless dramas on arguably petty issues, it felt that .NET community was also capable of throwing tantrums. For those who are outside .NET community or have not caught up with the news, .NET/ASP.NET team have decided to revert the project.json (JSON) in #DotNetCore back to *.csproj/*.vbproj (XML) and resurrect msbuild. So was it petty in the end?

Some believed it was: they argued all that was changed was the format of the project file and the drama associated with it was excessive. They also pointed out that all the goodness of project.json would be ported to the familiar yet different *.csproj. I call this group the loyalists:

On the other hand, some were upset by the return of the msbuild to the story of .NET development. This portion of the community were arguing that +15-year-old msbuild does not have a place in the modern development. They have been celebrating death of this technology not knowing it was never really dead - I call them msbuild-antagnoists. The first group (loyalists), on the other hand, were flagging that the msbuild would be improved and the experience would be modernised.

Now there were another group of people were frustrated that this decision had been made despite the community feedback and solely based on the feedback of “some customers” behind the closed doors. I call them OSS-apologetics and their main issue was the seemingly lack of weight of the community feedback when it comes to the internal decisions that Microsoft takes as a commercial enterprise - especially in the light of the fact that project.json was announced almost 2 years ago and it was very late to change it.

Now there were yet another group that had invested time and effort (==money?) in building projects and tooling (some of which commercial) and they felt that the rug has been pulled from underneath them and all those hours gone to waste - for the lack of a better phrase I call them loss-bearers. And they were even more upset to see that their loss was accounted as a learning process:
Obviously there is not a great answer for them but it is usually said that it is a very minor part of the whole community who have been living on the bleeding edge and knew it could be coming any minute, as mentioned on the stand-up:


Where do I stand?

I stand somewhere in between. Cannot quite agree with the loyalists since it is not just the question of format. On the other hand, I do not bear any losses since I had decided long time ago that I will skip the betas and pick it up when the train of changes slows down - something not yet in sight.

But I do not think any of the above captures the essence of what has been happening recently. I am on the belief that this decision along with the previous disrupting ones have been important and shrewd business decisions to save the day and contain losses for Mircosoft as a commercial platform - and no one can blame Microsoft for doing that.

I had warned times and times again that the huge amount of change in the API and tooling and no easy migration path will result in dividing the community into a niche progressive #DotNetCore minority and the mainstream commercial majority who would stay on .NET Fx and need years (not months) to move on to #DotNetCore - if at all. And this potentially will create a Python-vs-3-like divide in the community.

The cross from the old .NET to the new #DotNetCore (seemingly similar on the surface yet wildly different at heart) would not be dissimilar to the cross between VB6 to .NET. And what makes it worse is that unlike then, there are many viable alternatives OSS stacks (back then there was only Java and C/C++). This could have meant that the mainstream majority might in fact decide to try an altogether different platform.

So Microsoft as a business entity had to step in and albeit late, fix the few yet key mistakes made at the start and alongside the project during the last 2 years:
  • ASP.NET team to make platform/language decisions and implement features with clever tricks rather the .NET Fx baking such features in the framework itself. An example was Assembly Neutral Interfaces.
  • Ignoring the importance upgrade path for the existing projects and customers
  • Inconsistent, confusing and ever changing layering of the stack
  • Poor and conflicting scheduling messages
  • Using Silverlight’s CoreCLR for ASP.NET resulting in dichotomy of the runtime, something that as far as I know has no parallel in any other language/platform. In the most recent slides I do not see CoreCLR being mentioned anymore yet it might be there. If it is, it will stay a technical debt to be paid later.
All in all it has been a rough ride both for the drivers and the passengers of this journey but I feel now the clarity and cohesion is back and long-standing issues have been addressed now.

Where could I be wrong?

My argument naturally brings these counterarguments:
  • Perhaps had ASP.NET team not pushed the envelope this far by single-handedly crusading to bring modern ideas and courageous undertakings such as cross-platform, we would be having .NET 5 now instead of #DotNetCore.
  • By carrying baggage from the past (msbuild), Microsoft is extending the lifespan its stacks which in the short term will be beneficial to the corporate but since it is not a clean break, in the long term results in dispersion of the community and a need for another redeux.
Hard to answer these arguments since one is a hypothetical situation and the other looks well into the uncertainty of the future. I will leave it to the readers to weigh the arguments.

Last word

It is not possible to hide that none of this has been without casualties. Some confidence lost, community at times upset and overall has not been all rosy as far as the Microsoft’s image in its OSS ventures goes. I did mention old and new Microsoft coming head-to-head, which might not be correct but as Satya Nadella said, culture does not change overnight.

Monday, 8 February 2016

Future of Information: the Good, Bad and Ugly of it

We are certainly at the cusp of a big revolution in the human civilisation - caused by the Information Technology and Machine Intelligence. There are golden moments in the history that have fundamentally changed us: late dark ages for the Astronomy, early renaissance for Physics, 1700-1800s for the Chemistry, late 1800s for the Microbiology, 1950s for the transistors… and the periods get more and more compressed. It looks like a labyrinth where it gets narrower when you get closer to its centre.

Without speculating on what the centre could look like, and considering this could be still a flat line of constant progress, we need to start thinking what the future could look like  - not because it is fun, but because an action could be warranted now. There is no shortage of speculation or commentary, one man can dream, and fathom a far future which might or might not be close to the distant reality. And that is not the point. The point is, as I will outline below, it could be getting late to do what we need to DO. Yes, this is not a sci-fi story…

On one hand, there is nothing new under the sun, and the cycle of change has always been with the mankind since the beginning. We always had the reluctant establishment fighting with the wind of change promoted by the new generation.

Figure 1 - Accelerating change [Source: Wikipedia]

On the other hand, this is the first time in the history that the cycle of change has been reduced to less than a generation (a generation is normally considered 20-25 years). You see, the politicians of the past had time to grow up with the changes, feel the needs, brew new ideas and come up with the solutions. Likewise, the nations have had the time to assimilate and react to the changes in terms of aspirations, occupation and direction as the new changes would not be fully in effect during the one person’s lifetime. What about now? Only a decade ago (pre-iPhone) looks like a century backwards. The cycle of change already looks like to be around 5-10 years [see Figure 1]. And look at our politicians: it is not a coincidence someone like Trump can capture the imagination of a nation in the lack of visionary contenders. The politics as we know it has reached the end of life - IMHO due to lack of serious left-wing ideas - but that is not the topic of this post. The point I am trying to make is politicians are no longer able to propose but the most trivial changes since their view of the world is limited by their lack of understanding of a whole new virtual world being created alongside this physical world whose rules do not exist in any books.

And it is not just the politics that is dropping far behind. Economics in the face of fast cycle of change will be different too. First of all, today’s financial malaise experienced in many developed countries might still be around for years to come. In an age of Keynesian economy and central intervention characterised by low inflation, low growth and abundance of money printed by central banks, it seems the banks are no longer relevant. Current economy sometimes referred to as the Japanisation, which was spotted back in 2011 and 5 year on feels no different. And it is no coincidence that an IMF report finds decreasing efficiency of capital in Japanese Economy - that can be applied elsewhere. Looking at the value of bank stocks provides the glaring fact that they are remnants of institutions from the past. True, they are probably still financing mine and your mortgage but their importance as the cornerstone of development during the previous centuries is gone. Why? Because the importance of capital in a world where there is so much of it around without finding a suitable investment is overrated. With 10-yr US Bonds at around 1.8% and yield on 2-yr German bund at -0.5% (!), an investment with 2% annual return is a bargain. In fact today’s banking is characterised by piling up losses year on year (for example this and this). Looking at the Citigroup or Bank of America’s 10 year chart is another witness to the same decline. In an environment when money is cheap (Because of ZIRP), it cannot be the main driver in the business, as money (and hence banks) is not the scarce commodity anymore. See? We did not even have to mention bitcoin, blockchain or crowdfunding.

Figure 2 - Deutsche Bank Stock since year 2000 [Source: Yahoo Finance]
But beyond our myopic view of the economy focused on the current climate, there is a rhetoric looking at it from a different angle and far into the future, seeing the same pattern. In one interesting essay on Economics of the future, authors find an ever decreasing role for the capital. While mentioning the importance of the suitable labour (in terms of geek labour force, currently the scarcest resource resulting in companies not growing their full potential) could be helpful, it is evident that capital is no more an issue.

In essence what all this means is, if historically the banks as the institutions controlling capital had the upper hand, in the days to come it will be those controlling Information. The future of our civilisation will be surrounding the conflicts to control the Information, on one hand by the state, on the other by the institutions and finally by us and NGOs for the privacy.

The Good

Rate of data acquisition has been described as exponential. This has been mainly with regard to the virtual world and our surroundings but very soon it will be us. From our exact whereabouts to our blood pressure to various hormonal levels and perhaps our emotional status, all will be around very soon. A lot of this is already possible, also known as quantifying self. But it is only a matter of time for this to be for everyone.

It is not difficult to think what it can do to promote health and disease prevention. Even now those who suffer from heart arrhythmia carry devices that can defibrillate their heart if a deadly ventricular fibrillation occurs. The blood pressure, sugar level, various hormonal levels, and all sorts of measurable elements can be tracked. Cancerous cells can be detected in blood (and its source identified) well before it could grow and spread. The plaques in the blood vessels will be identified by the micro devices circulating and any serious stenosis can be identified. Clots in the heart or brain vessels (resulting in stroke) can be detected at the time of formation with a device releasing thrombolytic agents immediately alleviating the problem. Going for extra medical diagnosis could be very similar to how our cars are being serviced today: a device gets connected to the myriad of micro devices in your body and full picture of health status will be immediately visible to the medical staff. You could be walking on a road or in a car, witnessing a rare yet horrific accident (would there be accidents?) and the medical team would know whether you would suffer from PTSD and whether you would need certain therapies for it - they would know where you were and whether you witnessed the incident from your various measurements.

And of course, this is only the medical side. The way we work, entertain ourselves and interact with the outside world will be completely different. It is not very hard to imagine what it will be like: one cheesy way is to just take everything that you do at home and think of adding an automation/scheduling/verbal command to it. From making coffee, to checking information, to entertainment. But I will refrain from limiting your view by my imagination. What it is clear is that the presence and impact of the virtual world will be much more pronounced.

At the end of the day, it is all about the extra information coupled with machine intelligence.

The Bad

This section is not speculating on what it could look like. We can all go and read any of the dystopian books, many to choose from and could be like any or none.

But instead it is about simple reasoning: taking what we know, projecting the rate of change and looking at what we might get. It is very reasonable to think that machine intelligence will be at a point where it can reason very efficiently with a pretty good rate of success. And on the other hand, it is reasonable to think that there will be many many data points for every person. If we as humans can be represented as intelligent machines that turn data plus our characters into decisions, it is not silly to think that if our characters (historical data) known to the machines and the input perceived by us already available via the many many agents present in and around us, it is not unreasonable to think that the systems can estimate your decisions. So when you think of advertising, this gets really frightening since you would know pretty well what the reaction will be if you have enough information. And it is about, how much, how much money do you have to decide on…

You see, the fight for your disposable income (that part of your income that you can choose how to spend) could not be more fierce: it can make or break companies in the future. The future of advertising and the fight for this disposable income is what makes Eric Schmidt to come out and almost say there won’t be online privacy in the future:
"Some governments will consider it too risky to have thousands of anonymous, untraceable and unverified citizens - hidden people - they'll want to know who is associated with each online account... Within search results, information tied to verified online profiles will be ranked higher than content without... even the most fascinating content, if tied to an anonymous profile simply won't be seen because of its excessive low ranking." - The New Digital Age / page 33
And when you see how the top four companies have already moved into media industry, you get it. Your iPhone selects a handful of news items for you to see, Facebook controls your timeline, Amazon is a full-blown media company and Google controls youtube which has overtaken conventional media for the entertainment of the millennials. We must reiterate that none of these companies are by nature evil but when it is to choose between you and their income, it is natural that they will pick the latter. And guess what: they have what the state wants too.

Let’s revisit banking for a moment to clarify the point. Banks have what politicians need: capital to fund ever more expensive political campaigns. And the state has what the banks need: regulation or rather de-regulation which banks thought will help them prosper because they can enter the stock market’s casino with the high street bank deposits (which ironically has been the source of their losses). And above all, the state catches the banks if they fall, as it did in 2008. ECB uses its various funds (EFSF, ESM, etc) to keep the banks in Greece and Italy (and others, soon Germany?) afloat. And catches the stock market when they fall as it has constantly done by various QE measures, interest rate cuts, printing money, etc. In such a financial milieu, where there are cushions all along the path, there is no real risk anymore leading to irresponsible behaviour by the banks. And the party should never end, no wonder Obama could not move an inch towards bringing back some regulation. Heads of state’s financial institutions come from ex-CEOs of the likes of Goldman Sachs. This alliance of the state and banking has contributed to the growth of inequality (ultimately leading to modern slavery) and no wonder, state is not bothered, the state is made up of politicians in alliance with bankers and the bankers.

And what does it have to do with the future of information? Exactly the same thing can happen in the future, only with the state and heads of companies owning the information. If capital no longer holds the power and it is the information, then the alliance of state and info bosses will lead to the modern slavery. States control the legislations and information companies own private data and control the media: each one having what the other needs. 

The Ugly

Why ugly? Because we are already there, almost. First of all, the states have started gathering and controlling information. NSA is just an example. The states have started requesting companies owning the data to provide them. Legislations are under way to prevent effective encryption. This could all look harmless when we are busy checking out our twitter and facebook timelines, but I have already started to freak me out: companies are already started thinking and acting in this area. As we visited, Google's Eric Schmidt is portraying a future where anonymity has little value, either you agree or otherwise you need to speak out.

Going back to the politics, we do not have politicians or lawyers that have a correct understanding of the technology and its implications, and it is not their fault: they were not prepared for it. But soon, very soon, we will have heads of companies turning politicians. Very much like CEOs of the Goldman Sachs, and I do not mean it necessarily in a bad way, why? Because the power will be in the hands of the geeks and by the same token, we need strong oppositions, we need politicians among us to rise to the occasion and lead us safely into the future where we have meaningful legislations protecting our privacy while allowing safe data sharing. Problem is, we had 2500 years or so to think about democracy and government in the physical world (from the Greek and Roman philosophers to now) but we are confronted with a virtual world where the ethics and philosophy are not well-defined and do not quite map to the physical world we live in yet every lawmaker is trying to shoehorn it to the only thing they know about. Enough is enough.

But where do we start from? My point in this post/essay has been to ask the questions, I do not claim to have the answers. We have not yet explored the problems well enough to come up with the right answers... we need the help of think tanks, many of which I see rising amongst us.

We are surrounded by the questions whose answers (like all other aspects of our industry) tangled with so much of our personal opinions. When it comes to the court of law what doesn't matter is your or my opinion. Is Edwards Snowden a hero or a traitor? Was Julian Assange a visionary or an opportunist? What is ethical hacking, and how is it different from unethical, in fact could hacking ever be legitimate? Is Anonymous a bunch of criminals or a collection of selfless vigilantes working for the betterment of the virtual world in lack of a legal alternative? What is the right to privacy, and is there a right to be anonymous?

Needless to say, there could be some quick wins. I think defining privacy and data sharing is one of the key elements. One improvement could be turning small-print legal mumbo jumbo of the terms and conditions to bullet-wise fact sheets. Similar to “Key Fact Sheet” for mortgages where the APR and various fees are clearly defined, we can enforce a privacy fact sheet where the answers to questions such as “My anonymised/non-anonymised data might/might not be sold”, “I can ask for my records can be physically erased”, “My personal information can/cannot be given to third parties”, etc are clearly defined for non-technical consumers, as well as most of the rest of us who rarely read the terms and conditions.

Whatever the solutions, we need to start… now! And it could be already late.

Tuesday, 24 November 2015

Interactive DataViz: Rock albums by the genre since 1960


Interactive DataViz here: http://wiki-rock.azurewebsites.net/top10-album-genres.html
Last week I presented a talk in #BuildStuffLT titled “From Power Chords to the Power of Models” which was a study of the Rock Music by the way of Data Mining, Mathematical Modelling and also Machine Learning. It is such a fun subject to explore, especially for me that Rock Music has been one of my passions since I was a kid.

The slides from the talk is available and the videos will be available soon (although my performance during the talk was suboptimal due to lack of sleep, a problem which seems to be shared by many at the event). BuildStuffLT is a great event, highly recommended if you have never been to. It is a software conference with known speakers such as Michael Feathers, Randy Shoup, Venkat Subramaniam, Pieter Hintjens and this year was the host of Melvin Conway (yeah, the visionary who came up with Conway’s law in 1968) with really mind stimulating talks. You also get a variety of other speakers with very interesting talks.

I will be presenting my talk in CodeMash 2016 so I cannot share all of the material yet but I think this interactive DataViz alone is many many slides in a single representation. I can see myself spending hours just looking at the trends and artist names and their album covers - yeah this is how much I love Rock Music and its history - but even for others this could be fun and also help you discover some new to listen to.

DataViz

This is an interactive percentage-based stacked area chart of top 10 genres in a year, since 1960, where Rock Music as we know it started to appear. That is a mouthful but basically for every year, top 10 genres selected so the dataset contains only those Rock (or related) genres that at some point were among the top 10 genres. You can access it here or simply clone GitHub repo (see below) and host your own.


The data was collected from Wikipedia by capturing Rock Albums and then processing their genres, finding top 10 in every year and then presenting in a chart - I am using Highcharts which is really powerful and simple to use and has a non-commercial license too. The data itself I have shared so you can run your own DataViz if you want to. The license for the data is of course Wikipedia’s, which covers these purposes.



I highly recommend you start with the Visualisation with “All Unselected” (Figure 2) and then select a genre and visualise its rise and fall in the history.


Then you can click on a point (year/genre) to list all albums of that genre for that year (Figure 3). Please note that even when the chart shows 0%, there could be some albums for that genre - which are from a year which that genre was not among the top 10 genres.

Looking at the data in a different way

Here is the 50 years of Rock (starting from 1965) with the selected albums:



Things to bear in mind

  • The data has been captured by capturing all albums for all links found in documents that traversed from the list of rock genres then to the artist pages. As far as I know, the list includes all albums by the major (and minor) rock artists - according to Wikipedia. If you find a missing album (or artist), please let me know.
  • Every album will contribute all its genres to the list. This means if it has genres “Blues Rock” and “Rock”, then it will be counted once for each of the its genres and you can find it if you look at both Rock or Blues Rock genres.
  • Data has some oddities, sometimes an album occurs more than once, mainly due to nuances of data in Wikipedia, there are multiple entries (URLs) for the same document, etc. Data has already been cleansed through many processes and these oddities do not materially change the results. In the future however, there are things that can be done remove these remaining oddities.
  • Again, it is highly recommended that you click the “Unselect All” button and click on the genres that you are interested one by one and explore the name of the albums.
  • Clicking “Select All” or “Unselect All” takes a bit too much time. I am sure it has an easy solution (turn rendering off when changing the state) but have not been able to find it. Expect your PRs!
  • There are some genres in the list which are not really Rock genres. These genres would have been mentioned alongside a rock genre in the album cover or had been a not-so-much-rock album by an otherwise Rock artist.

Code and Data

All code and data published in GitHub. Code uses Highchartsjs, knockoutjs and foundations UI framework. Have fun!

Saturday, 19 September 2015

The Rule of "The Most Accessible" and how it can make you feel better

I remember when I was a kid, I watched a documentary on how to catch a monkey. Basically you dig a hole in a tree, big enough for a stretched monkey hand to go in but not too big that fisted hand can get out and sit and watch.

Source: http://www.tarekcoaching.com/blog/dont-fall-in-the-monkey-trap/


Apart from holes, buttons and levers (things that can be pushed) are concepts very easy for animals to learn. Without getting too Freudian, furrowing and protrusions (holes and buttons) are one of the first concepts we learn.

This is nice when dealing with animals. On the other hand, it can be dangerous - especially for kids. A meat mincer machine has exactly these two: a hole and a button. Without referring to the disturbing images of its victims on internet, it is imaginable what can happen - many children sadly lose their fingers or hands this way. Safety of these machines are much better now but I grew up with a kid who was left with pretty much a claw of his right hand after the accident.

Now, the point is: in confrontation with entities that we encounter for the first time or do not have enough appreciation of their complexity, we approach from the most accessible angle we can understand.  If this phenomenon did not have a name (and it is not BikeShedding, that is different), now it has: Rule of The Most Accessible TMA. The problem is, as the examples tried to illustrate, it is dangerous. Or it can be a sign of mediocrity.

*    *    *

Now what does it have to do with our geeky world?

Have you noticed that in some projects, critical bugs go unnoticed but there are half a dozen bugs raised for the layout being one pixel out? Have you written a document and the main feedback you got was the font being used? Have you attended a technical review meeting which you get a comment on the icons of your diagram? Have you seen a performance test project that focuses only on testing the API because it is the easiest to test? Have you witnessed a code review that results in a puny little comments on namings only?

When I say it can be a sign of mediocrity, I think you know now what I am talking about. I cannot describe my frustration when we replace the most critical with the most accessible. And I bet you feel the same.

Resist TMA in you

You know how bad it is when someone falls into the TMA trap? Then you shouldn't. Take your time, and approach from the angle which matters most. If you cannot comment anything worthwhile then don't. Don't be a hypocrite.

Ask for more time, break down the complexity and get a sense of the critical parts. And then comment.

Fight TMA in others

Someone does TMA to you? Show it to their face. Remind them that we need to focus on the critical aspects first. Ask them not to waste time on petty aspects of the problem.

If it cannot be fight, laugh inside

And I guess we all have cases where the person committing TMA is a manager high up that fighting TMA can have unpleasant consequences. Then you know what? Just remember face of the monkey cartoon above and laugh inside. It will certainly make you feel better :)



Thursday, 27 August 2015

No-nonsense Azure Monitoring in 20 Minutes (maybe 21) using ECK stack

Azure platform has been there for 6 years now and going from strength to strength. With the release of many different services and options (and sometimes too many services), it is now difficult to think of a technology tool or paradigm which is not “there” - albeit perhaps not exactly in the shape that you had wished for. Having said that, monitoring - even to the admission of some of the product teams - has not been the strongest of the features in Azure. Sadly, when building cloud systems, monitoring/telemetry is not a feature: it is a must.

I do not want to rant for hours why and how a product that is mainly built for external customers is different from the internal one which on its strength and success gets packaged up and released (as is the case with AWS) but a consistent and working telemetry option in Azure is pretty much missing - there are bits and pieces here and there but not a consolidated story. I am informed that even internal teams within Microsoft had to build their own monitoring solutions (something similar to what I am about to describe further down). And as the last piece of rant, let me tell you, whoever designed this chart with this puny level of data resolution must be punished with the most severe penalty ever known to man: actually using it - to investigate a production issue.

A 7-day chart, with 14 data points. Whoever designed this UI should be punished with the most severe penalty known to man ... actually using it - to investigate a production issue.

What are you on about?

Well if you have used Azure to deliver any serious solution and then tried to do any sort of support, investigation and root cause analysis, without using one of the paid telemetry solutions (and even with using them), painfully browsing through gigs of data in Table Storage, you would know the pain. Yeah, that's what I am talking about! I know you have been there, me too.

And here, I am presenting a solution to the telemetry problem that can give you these kinds of sexy charts, very quickly, on top of your existing Azure WAD tables (and other data sources) - tried, tested and working, requiring some setup and very little maintenance.


If you are already familiar with ELK (Elasticsearch, LogStash and Kibana) stack, you might be saying you already got that. True. But while LogStash is great and has many groks, it has been very much designed with the Linux mindset: just a daemon running locally on your box/VM, reading your syslog and delivering them over to Elasticsearch. The way Azure works is totally different: the local monitoring agent running on the VM keeps shovelling your data to durable and highly available storages (Table or Blob) - which I quite like. With VMs being essentially ephemeral, it makes a lot master your logging outside boxes and to read the data from those storages. Now, that is all well and good but when you have many instances of the same role (say you have scaled to 10 nodes) writing to the same storage, the data is usually much bigger than what a single process can handle and shoveling needs to be scaled requiring a centralised scheduling.

The gist of it, I am offering ECK (Elasticsearch, ConveyorBelt and Kibana), an alternative to LogStash that is Azure friendly (typically runs in Worker Role), out-of-the-box can tap into your existing WAD logs (as well as custom ones) and with a push of a button can be horizontally scaled to N, to handle the load for all your projects - and for your enterprise if you work for one. And it is open source, and can be extended to shovel data from any other sources.

At core, ConveyorBelt employs a clustering mechanism that can break down the work into chunks (scheduling), keep a pointer to the last scheduled point, pushing data to Elasticsearch in parallel and in batches and gracefully retry the work if fails. It is headless, so any node can fail, be shut down, restarted, added or removed - without affecting integrity of the cluster. All of this, without waking you up at night, and basically after a few days, making you forget it ever existed. In the enterprise I work for, we use just 3 medium instances to power analytics from 70 different production Storage Tables (and blobs).

Basic Concepts

Before you set up your own ConveyorBelt CB, it is better to know a few concepts and facts.

First of all, there is a one-to-one mapping between an Elasticsearch cluster and a ConveyorBelt cluster. ConveyorBelt has a list of DiagnosticSources, typically stored in an Azure Table Storage, which contains all data (and state) pertaining to a source. A source typically is a Table Storage, or a blob folder containing diagnostic data (or other) - but CB is extensible to accept other data stores such as SQL, file or even Elasticsearch itself (yes if you ever wanted to copy data from one ES to another). DiagnosticSource contains connection information for the CB to connect. CB continuously breaks down the work (schedules) for its DiagnosticSources and keeps updating the LastOffset.

Once the work is broken down to bite size chunks, they are picked up by actors (it internally uses BeeHive) and data within each chunk pushed up to your Elasticsearch cluster. There is usually a delay between data captured (something that you typically set in Azure configuration: how often copy data), so you set a Grace Period after which if the data isn't there, it is assumed there won’t be. Your Elasticsearch data will usually be behind realtime by the Grace Period. If you left everything as defaults, Azure copies data every minute which Grace Period of 3-5 minutes is safe. For IIS logs this is usually longer (I use 15-20 minutes).

The data that is pushed to the Elasticsearch requires:
  • An index name: by default the date in the yyyyMMdd format is used as the index name (but you can provide your own index)
  • The type name: default is PartitionKey + _ + RowKey (or the one you provide)
  • Elasticsearch mapping: Elasticsearch equivalent of a schema which defines how to store and index data for a source. These mappings are stored on a URL (a web folder or a public read-only Azure Blob folder) - schema for typical Azure data (WAD logs, WAD Perf data and IIS Logs) already available by default and you just need to copy them to your site or public Blob folder.

Set up your own monitoring suite

OK, now time to create our own ConveyorBelt cluster! Basically the CB cluster will shovel the data to a cluster of Elasticsearch. And you would need Kibana to visualise your data. Here I will explain how to set up Elasticsearch and Kibana in a Linux VM box. Further below I am explaining how to do this. But ...

if you are just testing the waters and want to try CB, you can create a Windows VM, download Elasticsearch and Kibana and run their batch files and then move to setting up CB. But after you have seen it working, come back to the instructions and set it up in a Linux box, its natural habitat.

So setting this up in Windows is just to download the files from the links below, unzip and then running the batch files elasticsearch.bat and kibana.bat. Make sure you expose the ports 5601 and 9200 from your VM, by creating endpoints.

https://download.elastic.co/kibana/kibana/kibana-4.1.1-windows.zip
https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.zip

Set up ConveyorBelt

As discussed above, ConveyorBelt is typically deployed as an Azure Cloud Service. In order to do that, you need to clone Github repo, build and then deploy it with your own credentials and settings - and all of this should be pretty easy. Once deployed, you would need to define various diagnostic source and point them to your ElasticSearch and then just relax and let CB do its work. So we will look at the steps now.

Clone and build ConveyorBelt repo

You can use command line:
git clone https://github.com/aliostad/ConveyorBelt.git
Or use your tool of choice to clone the repo. Then open administrative PowerShell window, move to the build folder and execute .\build.ps1

Deploy mappings

Elasticsearch is able to guess the data types of your data and index them in a format that is usually suitable. However, this is not always true so we need to tell Elasticserach how to store each field and that is why CB needs to know this in advance.

To deploy mappings, create a Blob Storage container with the option "Public Container" - this allows the content to be publicly available in a read-only fashion. 

You would need the URL for the next step. It is in the format:
https://<storage account name>.blob.core.windows.net/<container name>/

Also use the tool of your choice and copy the mapping files in the mappings folder under ConveyorBelt directory.

Configure and deploy

Once you have built the solution, rename tokens.json.template file to tokens.json and edit tokens.json file (if you need some more info, find the instructions here). Then in the same PowerShell window, run the command below, replacing placeholders with your own values:
.\PublishCloudService.ps1 `
  -serviceName <name your ConveyorBelt Azure service> `
  -storageAccountName <name of the storage account needed for the deployment of the service> `
  -subscriptionDataFile <your .publishsettings file> `
  -selectedsubscription <name of subscription to use> `
  -affinityGroupName <affinity group or Azure region to deploy to>
After running the commands, you should see the PowerShell deploying CB to the cloud with a single Medium instance. In the storage account you had defined, you should now find a new table, whose name you defined in the tokens.json file.

Configure your diagnostic sources

Configuring the diagnostic sources can wildly differ depending on the type of the source. But for standard tables such as WADLogsTable, WADPerformanceCountersTable and WADWindowsEventLogsTable (whose mapping file you just copied) it will be straightforward.

Now choose an Azure diagnostic Storage Account with some data, and in the diagnostic source table, create a new row and add the entries below:

  • PartitionKey: whatever you like - commonly <top level business domain>_<mid level business domain>
  • RowKey: whatever you like - commonly <env: live/test/integration>_<service name>_<log type: logs/wlogs/perf/iis/custom>
  • ConnectionString (string): connection string to the Storage Account containing WADLogsTable (or others)
  • GracePeriodMinutes (int): Depends on how often your logs gets copied to Azure table. If it is 10 minutes then 15 should be ok, if it is 1 minute then 3 is fine.
  • IsActive (bool): True
  • MappingName (string): WADLogsTable . ConveyorBelt would look for mapping in URL "X/Y.json" where X is the value you defined in your tokens.json for mappings path   and Y is the TableName (see below).
  • LastOffsetPoint (string): set to ISO Date (second and millisecond MUST BE ZERO!!) from which you want the data to be copied e.g. 2015-02-15T19:34:00.0000000+00:00
  • LastScheduled (datetime): set it to a date in the past, same as the LastOffset point. Why do we have both? Well each does something different so we need both. 
  • MaxItemsInAScheduleRun (int): 100000 is fine
  • SchedulerType (string): ConveyorBelt.Tooling.Scheduling.MinuteTableShardScheduler
  • SchedulingFrequencyMinutes (int): 1
  • TableName (string): WADLogsTable, WADPerformanceCountersTable or WADWindowsEventLogsTable
And save. OK, now CB will start shovelling your data to your Elasticsearch and you should start seeing some data. If you do not, look at the entries you have created in the Table Storage and you will find an Error column which tells you what went wrong. Also to investigate further, just RDP to one of your ConveyorBelt VMs and run DebugView while having "Capture Global Win32" enabled - you should see some activity similar to below picture. Any exceptions will also show in there.


OK, that is it... you are done! ... well barely 20 minutes, wasn't it? :)


Now in case you are interested in setting up ES+Kibana in Linux, here is your little guide.

Set up your Elasticsearch in Linux

You can run Elasticsearch on Windows or Linux - I prefer the latter. To set up an Ubuntu box on Azure, you can follow instructions here. Ideally you need to add a Disk Volume as the VM disks are ephemeral - all you need to know is outlined here. Make sure you follow instructions to re-mount the drive after reboots. Another alternative, especially for your dev and test environments, is to go with D series machines (SSD hard disks) and use the ephemeral disks - they are fast and basically if you lose the data, you can always set ConveyorBelt to re-add the data, and it does it quickly. As I said before, never use Elasticsearch to master your logging data so you can recover losing it.

Almost all of the commands and settings below, needs to be run in an SSH session. If you are a geek with a lot of linux experience, you might find some of details below obvious and unnecessary - in which case just move on.

SSH is your best friend

Anyway, back to setting up ES - after you got your VM box provisioned, SSH to the box and install Oracle JDK:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
And then install Elasticsearch:
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.deb
sudo dpkg -i elasticsearch-1.7.1.deb
Now you have installed ES v 1.7.1. To set Elasticsearch to start at reboots (equivalent of Windows services) run these commands in SSH:
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
Now ideally you would want to move the data and logs to the durable drive you have mounted, just edit the Elasticsearch config in vim and change:
sudo vim /etc/elasticsearch/elasticsearch.yml
and then (note uncommented lines):
path.data: /mounted/elasticsearch/data
# Path to temporary files:
#
#path.work: /path/to/work

# Path to log files:
#
path.logs:  /mounted/elasticsearch/data
Now you are ready to restart Elasticsearch:
sudo service elasticsearch restart
Note: Elasticsearch is Memory, CPU and IO hungry. SSD drives really help but if you do not have them (class D VMs), make sure provide plenty of RAM and enough CPU. Searches are CPU heavy so it will depend on number of concurrent users using it.
If your machine has a lot of RAM, make sure you set ES memory settings as the default ones will be small. So update the file below and set the memory to 50-60% of the total memory size of the box:
sudo vim /etc/default/elasticsearch
And uncomment this line and set the memory size to half of your box’s memory (here 14GB, just an example!):
ES_HEAP_SIZE=14g
There are potentially other changes that you might wanna do. For example, based on number of your nodes, you wanna set the index.number_of_replicas in your elasticsearch.yml - if you have a single node set it to 0. Also turning off the multicast/Zen discovery since will not work in Azure. But these are things you can start learning about when you are completely hooked on the power of information provided by the solution. Believe me, more addicting than narcotics!

Set up the Kibana in Linux

Up until version 4, Kibana was simply a set of static HTML+CSS+JS files that would run locally on your browser by just opening root HTML in the browser. This model could not really be sustainable and with version 4, Kibana runs as a service on a box, most likely different to your ES nodes. But for PoC and small use cases it is absolutely fine to run it on the same box.
Installing Kibana is straightforward. You just need to download and unpack it:
wget https://download.elastic.co/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz
tar xvf kibana-4.1.1-linux-x64.tar.gz
So now Kibana will be downloaded to your home directory and be unpacked to kibana-4.1.1-linux-x64 folder. If you want to see where that folder is you can run pwd to get the folder name.
Now to run it you just run the command below to start kibana:
cd bin
./kibana
That will do for testing if it works but you need to configure it to start at the boot. We can use upstart for this. Just create a file in /etc/init folder:
sudo vim /etc/init/kibana.conf
and copy the below (path could be different) and save:
description "Kibana startup"
author "Ali"
start on runlevel [2345]
stop on runlevel [!2345]
exec /home/azureuser/kibana-4.1.1-linux-x64/bin/kibana
Now run this command to make sure there is no syntax error:
init-checkconf /etc/init/kibana.conf
If good then start the service:
sudo start kibana
If you have installed Kibana on the same box as the Elasticsearch and left all ports as the same, now you should be able to go to browser and browse to the server on port 5601 (make sure you expose this port on your VM by configuring endpoints) and you should see the Kibana screen (obviously no data).



Thursday, 9 July 2015

Daft Punk+Tool=Muse: word2vec model trained on a small Rock music corpus

In my last blog post, I outlined a few interesting results from a word2wec model trained on half a million news documents. This was pleasantly met with some positive reactions, some of which not necessarily due to the scientific rigour of the report but due to awareness effect of such "populist treatment of the subject" on the community. On the other hand, there were more than some negative reactions. Some believing I was "cherry-picking" and reporting only a handful of interesting results out of an ocean of mediocre performances. Others rejecting my claim that training on a small dataset in any language can produce very encouraging results. And yet others literally threatening me so that I would release the code despite I reiterating the code is small and not the point.

Am I the only one here thinking word2vec is freaking awesome?!

So I am back. And this time I have trained the model on a very small corpus of Rock artists obtained from Wikipedia, as part of my Rock History project. And I have built an API on top of the model so that you could play with the model and try out different combinations to your heart's content - [but please be easy on the API it is a small instance only] :) strictly no bots. And that's not all: I am releasing the code and the dataset (which is only 36K Wiki entries).

But now, my turn to RANT for a few paragraphs.

First of all, quantification of the performance of an unsupervised learning algo in a highly subjective field is very hard, time-consuming and potentially non-repeatable. Google in their latest paper on seq2seq had to resort to reporting mainly man-machine conversations. I feel in these subjects crowdsourcing the quantification is probably the best approach. Hence you would help by giving a rough accuracy score according to your experience.


On the other hand, sorry, those who were expecting to see a formal paper - perhaps in laTex format - you completely missed the point. As others said, there are plenty of hardcode papers out there, feel free to knock yourselves down. My point was to evangelise to a much wider audience. And, if you liked what you saw, go and try it for yourself.

Finally, alluding to "cognition" turned a lot of eyebrows but as Nando de Freitas puts it when asked about intelligence, whenever we build an intelligent machine, we will look at it as bogus not containing the "real intelligence" and we will discard it as not AI. So the world of Artifical Intelligence is a world of moving targets essentially because intelligence has been very difficult to define.

For me, word2vec is a breath of fresh air in a world of arbitrary, highly engineered and complex NLP algorithms which can breach the gap forming a meaningful relationship between tokens of your corpus. And I feel it is more a tool enhancing other algorithms rather than the end product. But even on its own, it generates fascinating results. For example in this tiny corpus, it was not only able to find the match between the name of the artists, but it can successfully find matches between similar bands - able to be used it as a Recommender system. And then, even adding the vector of artists generates interesting fusion genres which tend to correspond to real bands influenced by them.

API

BEWARE: Tokens are case-sensitive. So u2 and U2 not the same.

The API is basically a simple RESTful flask on top of the model:
http://localhost:5000/api/v1/rock/similar?pos=<pos>&neg=<neg>
where pos and neg are comma separated list of zero to many 'phrases' (pos for similar, and neg for opposite) - that are English words, or multi-word tokens including name of the bands or phrases that have a Wiki entry (such as albums or songs) - list if which can be found here .
For example:
http://localhost:5000/api/v1/rock/similar?pos=Captain%20Beefheart


You can add vectors of words, for example to mix genres:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50
or add an artist with an adjective for example a softer Bob Dylan:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan,soft&min_freq=50
Or subtract:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan&neg=U2
But the tokens do not have to be a band name or artist names:
http://localhost:5000/api/v1/rock/similar?pos=drug
If you pass a non-existent or misspelling (it is case-sensitive!) of a name or word, you will get an error:
http://localhost:5000/api/v1/rock/similar?pos=radiohead

{
  result: "Not in vocab: radiohead"
}

You may pass minimum frequency of the word in the corpus to filter the output to remove the noice:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50

Code

The code on github as I said is tiny. Perhaps the most complex part of the code is the Dictionary Tokenisation which is one of the tools I have built to tokenise the text without breaking multi-word phrases and I have found it very useful allowing to produce much more meaningful results.

The code is shared under MIT license.

To build the model, uncomment the line in wiki_rock_train.py, specifying the location of corpus:

train_and_save('data/wiki_rock_multiword_dic.txt', 'data/stop-words-english1.txt', '<THE_LOCATION>/wiki_rock_corpus/*.txt')

Dataset

As mentioned earlier, dataset/corpus is the text from 36K Rock music artist entries on the Wikipedia. This list was obtained by scraping the links from the "List of rock genres". Dataset can be downloaded from here. For information on the Copyright of the Wikipedia text and its terms of use please see here.

Sunday, 14 June 2015

Five crazy abstractions my Deep Learning word2vec model just did

Seeing is believing. 

Of course, there is a whole host of Machine Learning techniques available, thanks to the researchers, and to Open Source developers for turning them into libraries. And I am not quite a complete stranger to this field, I have been, on and off, working on Machine Learning over the last 8 years. But, nothing, absolutely nothing for me has ever come close to what blew my mind recently with word2vec: so effortless yet you feel like the model knows so much that it has obtained cognitive coherence of the vocabulary. Until neuroscientists nail cognition, I am happy to foolishly take that as some early form of machine cognition.

Singularity Dance - Wiki

But, no, don't take my word for it! If you have a corpus of 100s of thousand documents (or even 10s of thousands), feed it and see it for yourselves. What language? Doesn't really matter! My money is on that you will get results that equally blow your tops off.

What is word2vec?

word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago but due to its simplicity of algorithm and yet surprising robustness of the results, it has been widely implemented and adopted. This technique basically trains a model based on a neighborhood window of words in a corpus and then projects the result onto [an arbitrary number of] n dimensions where each word is a vector in the n dimensional space. Then the words can be compared using the cosine similarity of their vectors. And what is much more interesting is the arithmetics: vectors can be added or subtracted for example vector of Queen is almost equal to King + Woman - Man. In other words, if you remove Man from the King and add Woman to it, logically you get Queen and but this model is able to represent it mathematically.

LeCun recently proposed a variant of this approach in which he uses characters and not words. Altogether this is a fast moving space and likely to bring about significant change in the state of the art in Natural Language Processing.

Enough of this, show us ze resultz!

OK, sure. For those interested, I have brought the methods after the results.

1) Human - Animal = Ethics

Yeah, as if it knows! So if you remove the animal traits from human, what remains is Ethics. And in word2vec terms, subtracting the vector of Human by the vector of Animal results in a vector which is closest to Ethics (0.51). The other similar words to the Human - Animal vector are the words below: spirituality,  knowledge and piety. Interesting, huh?

2) Stock Market ≈ Thermometer

In my model the word Thermometer has a similarity of 0.72 to the Stock Market vector and the 6th similar word to it - most of closer words were other names for the stock market. It is not 100% clear to me how it was able to make such abstraction but perhaps proximity of Thermometer to the words increase/decrease or up/down, etc could have resulted in the similarity. In any case, likening Stock Market to Thermometer is a higher level abstraction.

3) Library - Books = Hall

What remains of a library if you were to remove the books? word2vec to the rescue. The similarity is 0.49 and next words are: Building and Dorm.  Hall's vector is already similar to that of Library (so the subtraction's effect could be incidental) but Building and Dorm are not. Now Library - Book (and not Books) is closest to Dorm with 0.51 similarity.

4) Obama + Russia - USA = Putin

This is a classic case similar to King+Woman-Man but it was interesting to see that it works. In fact finding leaders of most countries was successful using this method. For example, Obama + Britain - USA finds David Cameron (0.71).

5) Iraq - Violence = Jordan

So a country that is most similar to Iraq after taking its violence is Jordan, its neighbour. Iraq's vector itself is most similar to that of Syria - for obvious reasons. After Jordan, next vectors are Lebanon, Oman and Turkey.

Not enough? Hmm there you go with another two...

Bonus) President - Power = Prime Minister

Kinda obvious, isn't it? But of course we know it depends which one is Putin which one is Medvedev :)

Bonus 2) Politics - Lies = Germans??

OK, I admit I don't know what this one really means but according to my model, German politicians do not lie!

Now the boring stuff...

Methods

I used a corpus of publicly available online news and articles. Articles extracted from a number of different Farsi online websites and on average they contained ~ 8KB of text. The topics ranged from local and global Politics, Sports, Arts and Culture, Science and Technologies, Humanities and Religion, Health, etc.

The processing pipeline is illustrated below:

Figure 1 - Processing Pipeline
For word segmentation, an approach was used to join named entities using a dictionary of ~ 40K multi-part words and named entities.

Gensim's word2vec implementation was used to train the model. The default n=100 and window=5 worked very well but to find the optimum values, another study needs to be conducted.

In order to generate the results presented in this post, most_similar method was used. No significant difference between using most_similar and most_similar_cosmul was found.

A significant problem was discovered where words with spelling mistake in the corpus or infrequent words generate sparse vectors which result in a very high score of similar with some words. I used frequency of the word in the corpus to filter out such occasions.

Conclusion

word2vec is relatively simple algorithm with surprisingly remarkable performance. Its implementation are available in a variety of Open Source libraries, including Python's Gensim. Based on the preliminary results, it appears that word2vec is able to make higher levels abstractions which nudges towards cognitive abilities.

Despite its remarkable it is not quite clear how this ability can be used in an application, although in its current form, it can be readily used in finding antonym/synonym, spelling correction and stemming.