Byte Rot: Gensim

In my last blog post, I outlined a few interesting results from a word2wec model trained on half a million news documents. This was pleasantly met with some positive reactions, some of which not necessarily due to the scientific rigour of the report but due to awareness effect of such "populist treatment of the subject" on the community. On the other hand, there were more than some negative reactions. Some believing I was "cherry-picking" and reporting only a handful of interesting results out of an ocean of mediocre performances. Others rejecting my claim that training on a small dataset in any language can produce very encouraging results. And yet others literally threatening me so that I would release the code despite I reiterating the code is small and not the point.

Am I the only one here thinking word2vec is freaking awesome?!

So I am back. And this time I have trained the model on a very small corpus of Rock artists obtained from Wikipedia, as part of my Rock History project. And I have built an API on top of the model so that you could play with the model and try out different combinations to your heart's content - [but please be easy on the API it is a small instance only] :) strictly no bots. And that's not all: I am releasing the code and the dataset (which is only 36K Wiki entries).

But now, my turn to RANT for a few paragraphs.

First of all, quantification of the performance of an unsupervised learning algo in a highly subjective field is very hard, time-consuming and potentially non-repeatable. Google in their latest paper on seq2seq had to resort to reporting mainly man-machine conversations. I feel in these subjects crowdsourcing the quantification is probably the best approach. Hence you would help by giving a rough accuracy score according to your experience.

On the other hand, sorry, those who were expecting to see a formal paper - perhaps in laTex format - you completely missed the point. As others said, there are plenty of hardcode papers out there, feel free to knock yourselves down. My point was to evangelise to a much wider audience. And, if you liked what you saw, go and try it for yourself.

Finally, alluding to "cognition" turned a lot of eyebrows but as Nando de Freitas puts it when asked about intelligence, whenever we build an intelligent machine, we will look at it as bogus not containing the "real intelligence" and we will discard it as not AI. So the world of Artifical Intelligence is a world of moving targets essentially because intelligence has been very difficult to define.

For me, word2vec is a breath of fresh air in a world of arbitrary, highly engineered and complex NLP algorithms which can breach the gap forming a meaningful relationship between tokens of your corpus. And I feel it is more a tool enhancing other algorithms rather than the end product. But even on its own, it generates fascinating results. For example in this tiny corpus, it was not only able to find the match between the name of the artists, but it can successfully find matches between similar bands - able to be used it as a Recommender system. And then, even adding the vector of artists generates interesting fusion genres which tend to correspond to real bands influenced by them.

API

BEWARE: Tokens are case-sensitive. So u2 and U2 not the same.

The API is basically a simple RESTful flask on top of the model:

http://localhost:5000/api/v1/rock/similar?pos=<pos>&neg=<neg>

where pos and neg are comma separated list of zero to many 'phrases' (pos for similar, and neg for opposite) - that are English words, or multi-word tokens including name of the bands or phrases that have a Wiki entry (such as albums or songs) - list if which can be found here .
For example:

http://localhost:5000/api/v1/rock/similar?pos=Captain%20Beefheart

You can add vectors of words, for example to mix genres:

http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50

or add an artist with an adjective for example a softer Bob Dylan:

http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan,soft&min_freq=50

Or subtract:

http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan&neg=U2

But the tokens do not have to be a band name or artist names:

http://localhost:5000/api/v1/rock/similar?pos=drug

If you pass a non-existent or misspelling (it is case-sensitive!) of a name or word, you will get an error:

http://localhost:5000/api/v1/rock/similar?pos=radiohead


{
  result: "Not in vocab: radiohead"
}

You may pass minimum frequency of the word in the corpus to filter the output to remove the noice:

http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50

Code

The code on github as I said is tiny. Perhaps the most complex part of the code is the Dictionary Tokenisation which is one of the tools I have built to tokenise the text without breaking multi-word phrases and I have found it very useful allowing to produce much more meaningful results.

The code is shared under MIT license.

To build the model, uncomment the line in wiki_rock_train.py, specifying the location of corpus:

train_and_save('data/wiki_rock_multiword_dic.txt', 'data/stop-words-english1.txt', '<THE_LOCATION>/wiki_rock_corpus/*.txt')

Dataset

As mentioned earlier, dataset/corpus is the text from 36K Rock music artist entries on the Wikipedia. This list was obtained by scraping the links from the "List of rock genres". Dataset can be downloaded from here. For information on the Copyright of the Wikipedia text and its terms of use please see here.

Seeing is believing.

Of course, there is a whole host of Machine Learning techniques available, thanks to the researchers, and to Open Source developers for turning them into libraries. And I am not quite a complete stranger to this field, I have been, on and off, working on Machine Learning over the last 8 years. But, nothing, absolutely nothing for me has ever come close to what blew my mind recently with word2vec: so effortless yet you feel like the model knows so much that it has obtained cognitive coherence of the vocabulary. Until neuroscientists nail cognition, I am happy to foolishly take that as some early form of machine cognition.

Singularity Dance - Wiki

But, no, don't take my word for it! If you have a corpus of 100s of thousand documents (or even 10s of thousands), feed it and see it for yourselves. What language? Doesn't really matter! My money is on that you will get results that equally blow your tops off.

What is word2vec?

word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago but due to its simplicity of algorithm and yet surprising robustness of the results, it has been widely implemented and adopted. This technique basically trains a model based on a neighborhood window of words in a corpus and then projects the result onto [an arbitrary number of] n dimensions where each word is a vector in the n dimensional space. Then the words can be compared using the cosine similarity of their vectors. And what is much more interesting is the arithmetics: vectors can be added or subtracted for example vector of Queen is almost equal to King + Woman - Man. In other words, if you remove Man from the King and add Woman to it, logically you get Queen and but this model is able to represent it mathematically.

LeCun recently proposed a variant of this approach in which he uses characters and not words. Altogether this is a fast moving space and likely to bring about significant change in the state of the art in Natural Language Processing.

Enough of this, show us ze resultz!

OK, sure. For those interested, I have brought the methods after the results.

1) Human - Animal = Ethics

Yeah, as if it knows! So if you remove the animal traits from human, what remains is Ethics. And in word2vec terms, subtracting the vector of Human by the vector of Animal results in a vector which is closest to Ethics (0.51). The other similar words to the Human - Animal vector are the words below: spirituality, knowledge and piety. Interesting, huh?

2) Stock Market ≈ Thermometer

In my model the word Thermometer has a similarity of 0.72 to the Stock Market vector and the 6th similar word to it - most of closer words were other names for the stock market. It is not 100% clear to me how it was able to make such abstraction but perhaps proximity of Thermometer to the words increase/decrease or up/down, etc could have resulted in the similarity. In any case, likening Stock Market to Thermometer is a higher level abstraction.

3) Library - Books = Hall

What remains of a library if you were to remove the books? word2vec to the rescue. The similarity is 0.49 and next words are: Building and Dorm. Hall's vector is already similar to that of Library (so the subtraction's effect could be incidental) but Building and Dorm are not. Now Library - Book (and not Books) is closest to Dorm with 0.51 similarity.

4) Obama + Russia - USA = Putin

This is a classic case similar to King+Woman-Man but it was interesting to see that it works. In fact finding leaders of most countries was successful using this method. For example, Obama + Britain - USA finds David Cameron (0.71).

5) Iraq - Violence = Jordan

So a country that is most similar to Iraq after taking its violence is Jordan, its neighbour. Iraq's vector itself is most similar to that of Syria - for obvious reasons. After Jordan, next vectors are Lebanon, Oman and Turkey.

Not enough? Hmm there you go with another two...

Bonus) President - Power = Prime Minister

Kinda obvious, isn't it? But of course we know it depends which one is Putin which one is Medvedev :)

Bonus 2) Politics - Lies = Germans??

OK, I admit I don't know what this one really means but according to my model, German politicians do not lie!

Now the boring stuff...

Methods

I used a corpus of publicly available online news and articles. Articles extracted from a number of different Farsi online websites and on average they contained ~ 8KB of text. The topics ranged from local and global Politics, Sports, Arts and Culture, Science and Technologies, Humanities and Religion, Health, etc.

The processing pipeline is illustrated below:

Figure 1 - Processing Pipeline

For word segmentation, an approach was used to join named entities using a dictionary of ~ 40K multi-part words and named entities.

Gensim's word2vec implementation was used to train the model. The default n=100 and window=5 worked very well but to find the optimum values, another study needs to be conducted.

In order to generate the results presented in this post, most_similar method was used. No significant difference between using most_similar and most_similar_cosmul was found.

A significant problem was discovered where words with spelling mistake in the corpus or infrequent words generate sparse vectors which result in a very high score of similar with some words. I used frequency of the word in the corpus to filter out such occasions.

Conclusion

word2vec is relatively simple algorithm with surprisingly remarkable performance. Its implementation are available in a variety of Open Source libraries, including Python's Gensim. Based on the preliminary results, it appears that word2vec is able to make higher levels abstractions which nudges towards cognitive abilities.

Despite its remarkable it is not quite clear how this ability can be used in an application, although in its current form, it can be readily used in finding antonym/synonym, spelling correction and stemming.

Byte Rot

Thursday, 9 July 2015

Daft Punk+Tool=Muse: word2vec model trained on a small Rock music corpus

API