Sunday, 14 June 2015

Five crazy abstractions my Deep Learning word2vec model just did

Seeing is believing. 

Of course, there is a whole host of Machine Learning techniques available, thanks to the researchers and to the Open Source developers who turned them into libraries. And I am not quite a complete stranger to this field: I have been working on Machine Learning, on and off, over the last 8 years. But nothing, absolutely nothing, has ever come close to what blew my mind recently with word2vec: it is so effortless, yet you feel the model knows so much that it has obtained cognitive coherence of the vocabulary. Until neuroscientists nail cognition, I am happy to foolishly take that as some early form of machine cognition.

Singularity Dance - Wiki

But no, don't take my word for it! If you have a corpus of hundreds of thousands of documents (or even tens of thousands), feed it in and see for yourself. What language? Doesn't really matter! My money is on you getting results that equally blow your mind.

What is word2vec?

word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago, but thanks to the simplicity of the algorithm and the surprising robustness of its results, it has been widely implemented and adopted. The technique trains a model on a neighborhood window of words in a corpus and then projects the result onto n dimensions (n is arbitrary), so that each word is a vector in the n-dimensional space. Words can then be compared using the cosine similarity of their vectors. What is much more interesting is the arithmetic: vectors can be added or subtracted. For example, the vector of Queen is almost equal to King + Woman - Man. In other words, if you remove Man from King and add Woman to it, logically you get Queen, and this model is able to represent that mathematically.
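To make the arithmetic concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (a real model learns 100 or more dimensions from a corpus), and the helper names cosine and analogy are hypothetical, not part of any library. Real implementations typically normalise the vectors before combining them; this sketch skips that step.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration; axes loosely mean (royalty, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Query words are excluded, otherwise "king" would usually rank first.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = analogy(["king", "woman"], ["man"], vectors)
print(ranked[0][0])  # the top-ranked word for King + Woman - Man
```

With these toy numbers the top-ranked word is queen, but only because the vectors were hand-picked to behave that way; on a real corpus the ranking depends entirely on the data.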

LeCun recently proposed a variant of this approach in which he uses characters and not words. Altogether this is a fast moving space and likely to bring about significant change in the state of the art in Natural Language Processing.

Enough of this, show us ze resultz!

OK, sure. For those interested, I have put the methods after the results.

1) Human - Animal = Ethics

Yeah, as if it knows! If you remove the animal traits from a human, what remains is Ethics. In word2vec terms, subtracting the vector of Animal from the vector of Human results in a vector that is closest to Ethics (0.51). The next most similar words to the Human - Animal vector are spirituality, knowledge and piety. Interesting, huh?

2) Stock Market ≈ Thermometer

In my model the word Thermometer has a similarity of 0.72 to the Stock Market vector and is the 6th most similar word to it; most of the closer words were other names for the stock market. It is not 100% clear to me how it was able to make such an abstraction, but perhaps the proximity of Thermometer to words such as increase/decrease or up/down could have resulted in the similarity. In any case, likening the Stock Market to a Thermometer is a higher-level abstraction.

3) Library - Books = Hall

What remains of a library if you were to remove the books? word2vec to the rescue: the closest word is Hall, with a similarity of 0.49, and the next words are Building and Dorm. Hall's vector is already similar to that of Library (so the subtraction's effect could be incidental) but those of Building and Dorm are not. Now Library - Book (and not Books) is closest to Dorm, with 0.51 similarity.

4) Obama + Russia - USA = Putin

This is a classic case similar to King+Woman-Man but it was interesting to see that it works. In fact finding leaders of most countries was successful using this method. For example, Obama + Britain - USA finds David Cameron (0.71).

5) Iraq - Violence = Jordan

So the country most similar to Iraq after taking away its violence is Jordan, its neighbour. Iraq's vector itself is most similar to that of Syria, for obvious reasons. After Jordan, the next vectors are Lebanon, Oman and Turkey.

Not enough? Hmm there you go with another two...

Bonus) President - Power = Prime Minister

Kinda obvious, isn't it? But of course we know it depends on which one is Putin and which one is Medvedev :)

Bonus 2) Politics - Lies = Germans??

OK, I admit I don't know what this one really means but according to my model, German politicians do not lie!

Now the boring stuff...


I used a corpus of publicly available online news and articles. The articles were extracted from a number of different Farsi websites and contained ~8KB of text on average. The topics ranged across local and global Politics, Sports, Arts and Culture, Science and Technology, Humanities and Religion, Health, etc.

The processing pipeline is illustrated below:

Figure 1 - Processing Pipeline

For word segmentation, multi-part words and named entities were joined into single tokens using a dictionary of ~40K entries.
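One way such dictionary-based joining could work is a greedy longest-match pass over the tokens of each sentence. The sketch below is only an illustration under my own assumptions: the function name join_phrases, the underscore separator, the max_len limit and the English dictionary entries are all hypothetical (the actual corpus was Farsi and the real pipeline is not shown in this post).

```python
def join_phrases(tokens, phrases, max_len=4, sep="_"):
    """Greedy longest-match: merge runs of tokens found in `phrases` into one token."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "New York City" beats "New York".
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append(sep.join(candidate))
                i += n
                break
        else:
            # No multi-word phrase starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical dictionary entries, stored as tuples of words.
phrases = {
    ("Barack", "Obama"),
    ("stock", "market"),
    ("New", "York"),
    ("New", "York", "City"),
}

sentence = "Barack Obama spoke about the stock market in New York City".split()
joined = join_phrases(sentence, phrases)
print(joined)
```

The joined token lists, one per sentence, are what would then be handed to the training step, so that a phrase such as Barack Obama or stock market gets a single vector of its own.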

Gensim's word2vec implementation was used to train the model. The default n=100 and window=5 worked very well, but finding the optimum values would require a separate study.

To generate the results presented in this post, the most_similar method was used. No significant difference was found between most_similar and most_similar_cosmul.
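For the curious, the difference between the two methods can be sketched in plain Python. As far as I know, most_similar scores a candidate by its cosine to the mean of the unit-normalised positive and negated negative vectors, while most_similar_cosmul multiplies and divides cosines shifted into [0, 1], following the multiplicative objective proposed by Levy and Goldberg. The toy vectors and function names below are invented for illustration, and this is not Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1 so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_additive(candidate, positive, negative):
    """Cosine between the candidate and the mean of +positive and -negative unit vectors."""
    terms = [unit(p) for p in positive] + [[-x for x in unit(n)] for n in negative]
    mean = [sum(col) / len(terms) for col in zip(*terms)]
    return dot(unit(candidate), unit(mean))

def score_multiplicative(candidate, positive, negative, eps=1e-6):
    """Product of shifted cosines to positives over product of shifted cosines to negatives."""
    c = unit(candidate)

    def shifted(v):
        # Map the cosine from [-1, 1] to [0, 1] so the products stay non-negative.
        return (1.0 + dot(c, unit(v))) / 2.0

    num = math.prod(shifted(p) for p in positive)
    den = math.prod(shifted(n) for n in negative)
    return num / (den + eps)

# Toy vectors, invented for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}
positive = [vectors["king"], vectors["woman"]]
negative = [vectors["man"]]
candidates = ["queen", "apple"]

best_add = max(candidates, key=lambda w: score_additive(vectors[w], positive, negative))
best_mul = max(candidates, key=lambda w: score_multiplicative(vectors[w], positive, negative))
print(best_add, best_mul)
```

On these toy vectors both scoring rules pick the same word, which is in line with what I saw on the real model.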

A significant problem was discovered: misspelled or infrequent words in the corpus generate sparse vectors, which result in a very high similarity score with some words. I used the frequency of the word in the corpus to filter out such cases.
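A frequency filter of this kind can be sketched as below. The function name filter_by_frequency, the threshold of 20 and the tiny stand-in corpus are assumptions made for illustration; the results list only mimics the (word, score) pairs that most_similar returns and its numbers are made up.

```python
from collections import Counter

def filter_by_frequency(results, counts, min_count=20):
    """Keep only (word, score) pairs whose word occurs at least `min_count` times."""
    return [(word, score) for word, score in results if counts[word] >= min_count]

# Tiny stand-in corpus; "thermometre" plays the role of a rare misspelling.
corpus = [["thermometer"] * 30, ["barometer"] * 25, ["thermometre"] * 2]
counts = Counter(token for sentence in corpus for token in sentence)

# Hypothetical neighbour list in the (word, score) shape returned by most_similar.
results = [("thermometre", 0.97), ("thermometer", 0.72), ("barometer", 0.65)]

filtered = filter_by_frequency(results, counts, min_count=20)
print(filtered)
```

Gensim's Word2Vec also has a min_count parameter that drops rare words before training, which may be a simpler way to get a similar effect.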


word2vec is a relatively simple algorithm with surprisingly remarkable performance. Implementations are available in a variety of Open Source libraries, including Python's Gensim. Based on these preliminary results, it appears that word2vec is able to make higher-level abstractions, which nudges it towards cognitive abilities.

Despite its remarkable performance, it is not quite clear how this ability can be used in an application, although in its current form it can readily be used for finding antonyms/synonyms, spelling correction and stemming.


  1. Is there a chance to find your code on github or similar websites?

    1. There is nothing magic about the code I have written. The word2vec part is around 10-20 lines of code, similar to what you find here

      The key is to have a corpus. You can try some freely available corpora that are part of NLTK.

    2. God damnit, "it's very easy", "nothing magic about it", "so simple"... cut the crap and release the source, preferably on GitHub. A .zip-file does the job also, for the case you don't know anything about git.

    3. Was it really necessary to be so hostile? You can get similar results from the word2vec website

  2. How do you calculate the vector for a bigram?
    For example, you gave the example of a stock market. Did you average the vectors for stock and market?
    I ask because gensim takes only unigrams, so in the end I have vectors only for unigrams.

    1. Gensim does not care: you provide it with segmented words, as a list or any iterable.

      That happens at the segmentation phase. Basically, I first segment the document into sentences. Then I segment each sentence into words, using a dictionary of named entities and phrases that are to be treated as a single token. I believe this is very important, but all the examples I have seen use simple tokenisation. This way Barack Obama is a single token.

  3. Dear Rot,

    Excellent blog! I find your posts very interesting, especially the ones regarding Machine Learning.

    I am one of the executive editors at .NET Code Geeks, a sister site to Java Code Geeks. We have the NCG program, a program that aims to build partnerships between .NET Code Geeks and community bloggers, that I think you’d be perfect for.

    If you’re interested, send me an email to nikos[dot]souris[at]dotnetcodegeeks[dot]com and we can discuss further.

    Best regards,
    Nikos Souris

  4. word2vec is great, but none of your results correspond to gensim results on the Google News corpus, except almost #4. Here are my top 3 results for your examples, and the code that generated them. I would add a test for "stock market" ~ "thermometer", but the "stock_market" token does not appear in the corpus.

    + human - animal = mankind humankind humanity
    + library - books = Library Terraceview_Lodge rec_center
    + Obama + Russia - USA = Medvedev Putin Kremlin
    + Iraq - violence = Kuwait Iraqi Chalabi
    + President - power = president Vice_President Presdient
    + politics - lies = partisan_politics Politics political

    from gensim.models import KeyedVectors

    # Load the pretrained Google News vectors (word2vec binary format).
    # KeyedVectors is the loader in current Gensim; older releases exposed it on Word2Vec.
    model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def similar(positive, negative):
        # Print the query followed by its three nearest neighbours.
        results = model.most_similar(positive=positive, negative=negative, topn=3)
        print(' '.join(['+ ' + x for x in positive] +
                       ['- ' + x for x in negative] + ['='] +
                       [result[0] for result in results]))

    similar(['human'], ['animal'])
    similar(['library'], ['books'])
    similar(['Obama', 'Russia'], ['USA'])
    similar(['Iraq'], ['violence'])
    similar(['President'], ['power'])
    similar(['politics'], ['lies'])

    1. Well, in fact the results are pretty good. One problem is that you would need to normalise the data; the casing throws it off.

      + human - animal was in the Persian corpus I had, which had a lot of spiritual content. So I guess that was why.

      + library - books: the result is strange, what is Terraceview Lodge anyway?!

      + Obama + Russia - USA: strange that Medvedev is first, but well, Putin is second.

      + Iraq - violence: This works

      + President - power: This also worked, if casing is normalised

      + politics - lies: again, the corpus was Persian. But it is also interesting that it comes up with partisan politics.

  5. Thx for your article,
    I work with word2vec on a corpus spanning 200 years. In this corpus there are a lot of scan mistakes and the accuracy of the model is quite low.
    You wrote about this problem just before the conclusion. Could you elaborate a little? Did you fix this issue?

    1. I never had this issue since I worked with the text extracted from web documents. I guess you can use word2vec itself to fix a lot of scan errors: use a dictionary and if it does not match, see which word it is mostly connected to. But the problem is these words do not have a lot of occurrence I suppose.

      I would consult the literature to look for methods, I am sure you are not the only one having this problem. But I am afraid I do not have any experience in this kind of problem.

  6. I am interested in implementing word2vec for the Urdu language. Can I use gensim, and how can I use my corpus in it?

  7. Can anyone send me a little Gensim code for word2vec with a small corpus?