Sunday 14 June 2015

Five crazy abstractions my Deep Learning word2vec model just did

Seeing is believing. 

Of course, there is a whole host of Machine Learning techniques available, thanks to the researchers, and to Open Source developers for turning them into libraries. And I am not quite a complete stranger to this field, I have been, on and off, working on Machine Learning over the last 8 years. But, nothing, absolutely nothing for me has ever come close to what blew my mind recently with word2vec: so effortless yet you feel like the model knows so much that it has obtained cognitive coherence of the vocabulary. Until neuroscientists nail cognition, I am happy to foolishly take that as some early form of machine cognition.

Singularity Dance - Wiki

But, no, don't take my word for it! If you have a corpus of 100s of thousand documents (or even 10s of thousands), feed it and see it for yourselves. What language? Doesn't really matter! My money is on that you will get results that equally blow your tops off.

What is word2vec?

word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago but due to its simplicity of algorithm and yet surprising robustness of the results, it has been widely implemented and adopted. This technique basically trains a model based on a neighborhood window of words in a corpus and then projects the result onto [an arbitrary number of] n dimensions where each word is a vector in the n dimensional space. Then the words can be compared using the cosine similarity of their vectors. And what is much more interesting is the arithmetics: vectors can be added or subtracted for example vector of Queen is almost equal to King + Woman - Man. In other words, if you remove Man from the King and add Woman to it, logically you get Queen and but this model is able to represent it mathematically.

LeCun recently proposed a variant of this approach in which he uses characters and not words. Altogether this is a fast moving space and likely to bring about significant change in the state of the art in Natural Language Processing.

Enough of this, show us ze resultz!

OK, sure. For those interested, I have brought the methods after the results.

1) Human - Animal = Ethics

Yeah, as if it knows! So if you remove the animal traits from human, what remains is Ethics. And in word2vec terms, subtracting the vector of Human by the vector of Animal results in a vector which is closest to Ethics (0.51). The other similar words to the Human - Animal vector are the words below: spirituality,  knowledge and piety. Interesting, huh?

2) Stock Market ≈ Thermometer

In my model the word Thermometer has a similarity of 0.72 to the Stock Market vector and the 6th similar word to it - most of closer words were other names for the stock market. It is not 100% clear to me how it was able to make such abstraction but perhaps proximity of Thermometer to the words increase/decrease or up/down, etc could have resulted in the similarity. In any case, likening Stock Market to Thermometer is a higher level abstraction.

3) Library - Books = Hall

What remains of a library if you were to remove the books? word2vec to the rescue. The similarity is 0.49 and next words are: Building and Dorm.  Hall's vector is already similar to that of Library (so the subtraction's effect could be incidental) but Building and Dorm are not. Now Library - Book (and not Books) is closest to Dorm with 0.51 similarity.

4) Obama + Russia - USA = Putin

This is a classic case similar to King+Woman-Man but it was interesting to see that it works. In fact finding leaders of most countries was successful using this method. For example, Obama + Britain - USA finds David Cameron (0.71).

5) Iraq - Violence = Jordan

So a country that is most similar to Iraq after taking its violence is Jordan, its neighbour. Iraq's vector itself is most similar to that of Syria - for obvious reasons. After Jordan, next vectors are Lebanon, Oman and Turkey.

Not enough? Hmm there you go with another two...

Bonus) President - Power = Prime Minister

Kinda obvious, isn't it? But of course we know it depends which one is Putin which one is Medvedev :)

Bonus 2) Politics - Lies = Germans??

OK, I admit I don't know what this one really means but according to my model, German politicians do not lie!

Now the boring stuff...

Methods

I used a corpus of publicly available online news and articles. Articles extracted from a number of different Farsi online websites and on average they contained ~ 8KB of text. The topics ranged from local and global Politics, Sports, Arts and Culture, Science and Technologies, Humanities and Religion, Health, etc.

The processing pipeline is illustrated below:

Figure 1 - Processing Pipeline
For word segmentation, an approach was used to join named entities using a dictionary of ~ 40K multi-part words and named entities.

Gensim's word2vec implementation was used to train the model. The default n=100 and window=5 worked very well but to find the optimum values, another study needs to be conducted.

In order to generate the results presented in this post, most_similar method was used. No significant difference between using most_similar and most_similar_cosmul was found.

A significant problem was discovered where words with spelling mistake in the corpus or infrequent words generate sparse vectors which result in a very high score of similar with some words. I used frequency of the word in the corpus to filter out such occasions.

Conclusion

word2vec is relatively simple algorithm with surprisingly remarkable performance. Its implementation are available in a variety of Open Source libraries, including Python's Gensim. Based on the preliminary results, it appears that word2vec is able to make higher levels abstractions which nudges towards cognitive abilities.

Despite its remarkable it is not quite clear how this ability can be used in an application, although in its current form, it can be readily used in finding antonym/synonym, spelling correction and stemming.