Thursday, 9 July 2015

Daft Punk+Tool=Muse: word2vec model trained on a small Rock music corpus

In my last blog post, I outlined a few interesting results from a word2wec model trained on half a million news documents. This was pleasantly met with some positive reactions, some of which not necessarily due to the scientific rigour of the report but due to awareness effect of such "populist treatment of the subject" on the community. On the other hand, there were more than some negative reactions. Some believing I was "cherry-picking" and reporting only a handful of interesting results out of an ocean of mediocre performances. Others rejecting my claim that training on a small dataset in any language can produce very encouraging results. And yet others literally threatening me so that I would release the code despite I reiterating the code is small and not the point.

Am I the only one here thinking word2vec is freaking awesome?!

So I am back. And this time I have trained the model on a very small corpus of Rock artists obtained from Wikipedia, as part of my Rock History project. And I have built an API on top of the model so that you could play with the model and try out different combinations to your heart's content - [but please be easy on the API it is a small instance only] :) strictly no bots. And that's not all: I am releasing the code and the dataset (which is only 36K Wiki entries).

But now, my turn to RANT for a few paragraphs.

First of all, quantification of the performance of an unsupervised learning algo in a highly subjective field is very hard, time-consuming and potentially non-repeatable. Google in their latest paper on seq2seq had to resort to reporting mainly man-machine conversations. I feel in these subjects crowdsourcing the quantification is probably the best approach. Hence you would help by giving a rough accuracy score according to your experience.

On the other hand, sorry, those who were expecting to see a formal paper - perhaps in laTex format - you completely missed the point. As others said, there are plenty of hardcode papers out there, feel free to knock yourselves down. My point was to evangelise to a much wider audience. And, if you liked what you saw, go and try it for yourself.

Finally, alluding to "cognition" turned a lot of eyebrows but as Nando de Freitas puts it when asked about intelligence, whenever we build an intelligent machine, we will look at it as bogus not containing the "real intelligence" and we will discard it as not AI. So the world of Artifical Intelligence is a world of moving targets essentially because intelligence has been very difficult to define.

For me, word2vec is a breath of fresh air in a world of arbitrary, highly engineered and complex NLP algorithms which can breach the gap forming a meaningful relationship between tokens of your corpus. And I feel it is more a tool enhancing other algorithms rather than the end product. But even on its own, it generates fascinating results. For example in this tiny corpus, it was not only able to find the match between the name of the artists, but it can successfully find matches between similar bands - able to be used it as a Recommender system. And then, even adding the vector of artists generates interesting fusion genres which tend to correspond to real bands influenced by them.


BEWARE: Tokens are case-sensitive. So u2 and U2 not the same.

The API is basically a simple RESTful flask on top of the model:
where pos and neg are comma separated list of zero to many 'phrases' (pos for similar, and neg for opposite) - that are English words, or multi-word tokens including name of the bands or phrases that have a Wiki entry (such as albums or songs) - list if which can be found here .
For example:

You can add vectors of words, for example to mix genres:
or add an artist with an adjective for example a softer Bob Dylan:
Or subtract:
But the tokens do not have to be a band name or artist names:
If you pass a non-existent or misspelling (it is case-sensitive!) of a name or word, you will get an error:

  result: "Not in vocab: radiohead"

You may pass minimum frequency of the word in the corpus to filter the output to remove the noice:


The code on github as I said is tiny. Perhaps the most complex part of the code is the Dictionary Tokenisation which is one of the tools I have built to tokenise the text without breaking multi-word phrases and I have found it very useful allowing to produce much more meaningful results.

The code is shared under MIT license.

To build the model, uncomment the line in, specifying the location of corpus:

train_and_save('data/wiki_rock_multiword_dic.txt', 'data/stop-words-english1.txt', '<THE_LOCATION>/wiki_rock_corpus/*.txt')


As mentioned earlier, dataset/corpus is the text from 36K Rock music artist entries on the Wikipedia. This list was obtained by scraping the links from the "List of rock genres". Dataset can be downloaded from here. For information on the Copyright of the Wikipedia text and its terms of use please see here.