Byte Rot: Five crazy abstractions my Deep Learning word2vec model just did

Sunday, 14 June 2015

Seeing is believing.

Of course, there is a whole host of Machine Learning techniques available, thanks to the researchers and to the Open Source developers who turn them into libraries. And I am not quite a complete stranger to this field: I have been working on Machine Learning, on and off, over the last 8 years. But nothing, absolutely nothing, has ever come close to what blew my mind recently with word2vec: it is so effortless, yet you feel the model knows so much that it has obtained cognitive coherence of the vocabulary. Until neuroscientists nail cognition, I am happy to foolishly take that as some early form of machine cognition.

Singularity Dance - Wiki

But no, don't take my word for it! If you have a corpus of hundreds of thousands of documents (or even tens of thousands), feed it in and see for yourselves. What language? Doesn't really matter! My money is on you getting results that equally blow your tops off.

What is word2vec?

word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago, but thanks to the simplicity of the algorithm and the surprising robustness of its results, it has been widely implemented and adopted. The technique trains a model based on a neighborhood window of words in a corpus and then projects the result onto n dimensions (n is arbitrary), so that each word is a vector in the n-dimensional space. Words can then be compared using the cosine similarity of their vectors. What is much more interesting is the arithmetic: vectors can be added or subtracted. For example, the vector of Queen is almost equal to King + Woman - Man. In other words, if you remove Man from King and add Woman to it, logically you get Queen, and this model is able to represent that mathematically.
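To make the comparison and the arithmetic concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (a real model has around 100 dimensions and learns its values from the corpus), and the helper names cosine and nearest are hypothetical, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, exclude=()):
    """Return the (word, similarity) pair whose vector is closest to `query`."""
    candidates = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical toy vectors; the axes loosely mean (royalty, maleness, femaleness).
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# King + Woman - Man, computed element by element.
query = [k + w - m for k, w, m in zip(toy["king"], toy["woman"], toy["man"])]

# The query words themselves are excluded, as is usual for analogy queries.
word, score = nearest(query, toy, exclude=("king", "woman", "man"))
print(word, round(score, 2))
```

With these toy values the nearest remaining word is queen. The point is only the mechanics: add and subtract element by element, then rank everything else by cosine.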

LeCun recently proposed a variant of this approach in which he uses characters rather than words. Altogether, this is a fast-moving space and likely to bring about significant change in the state of the art in Natural Language Processing.

Enough of this, show us ze resultz!

OK, sure. For those interested, I have put the methods after the results.

1) Human - Animal = Ethics

Yeah, as if it knows! So if you remove the animal traits from a human, what remains is Ethics. In word2vec terms, subtracting the vector of Animal from the vector of Human results in a vector which is closest to Ethics (0.51). The next most similar words to the Human - Animal vector are spirituality, knowledge and piety. Interesting, huh?

2) Stock Market ≈ Thermometer

In my model, the word Thermometer has a similarity of 0.72 to the Stock Market vector and is the 6th most similar word to it; most of the closer words were other names for the stock market. It is not 100% clear to me how it was able to make such an abstraction, but perhaps the proximity of Thermometer to words such as increase/decrease or up/down could have resulted in the similarity. In any case, likening the Stock Market to a Thermometer is a higher-level abstraction.

3) Library - Books = Hall

What remains of a library if you were to remove the books? word2vec to the rescue. The similarity is 0.49 and the next words are Building and Dorm. Hall's vector is already similar to that of Library (so the subtraction's effect could be incidental), but Building and Dorm are not. Now, Library - Book (and not Books) is closest to Dorm, with 0.51 similarity.
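One way to tell an incidental result from a real effect of the subtraction is to compare a word's rank before and after. The sketch below does this on made-up two-dimensional vectors; the helper rank_of is hypothetical and this is not the code used for this post.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_of(word, query, vectors, exclude=()):
    """1-based rank of `word` among all words sorted by similarity to `query`."""
    ranked = sorted(
        (w for w in vectors if w not in exclude),
        key=lambda w: cosine(query, vectors[w]),
        reverse=True,
    )
    return ranked.index(word) + 1

# Hypothetical toy vectors; the axes loosely mean (place-ness, book-ness).
toy = {
    "library": [0.8, 0.8],
    "books":   [0.1, 0.6],
    "hall":    [0.9, 0.3],
    "dorm":    [0.9, 0.0],
    "novel":   [0.1, 1.0],
}
skip = ("library", "books")

# Query vector for Library - Books.
difference = [l - b for l, b in zip(toy["library"], toy["books"])]

for word in ("hall", "dorm"):
    before = rank_of(word, toy["library"], toy, exclude=skip)
    after = rank_of(word, difference, toy, exclude=skip)
    print(word, "rank before:", before, "rank after:", after)
```

In this toy setup hall is ranked first both before and after, so its appearance says little about the subtraction, whereas dorm only moves up once books is removed.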

4) Obama + Russia - USA = Putin

This is a classic case similar to King + Woman - Man, but it was interesting to see that it works. In fact, finding the leaders of most countries was successful using this method. For example, Obama + Britain - USA finds David Cameron (0.71).

5) Iraq - Violence = Jordan

So the country most similar to Iraq after taking away its violence is Jordan, its neighbour. Iraq's vector itself is most similar to that of Syria, for obvious reasons. After Jordan, the next vectors are Lebanon, Oman and Turkey.

Not enough? Hmm there you go with another two...

Bonus) President - Power = Prime Minister

Kinda obvious, isn't it? But of course we know it depends on which one is Putin and which one is Medvedev :)

Bonus 2) Politics - Lies = Germans??

OK, I admit I don't know what this one really means but according to my model, German politicians do not lie!

Now the boring stuff...

Methods

I used a corpus of publicly available online news and articles. The articles were extracted from a number of different Farsi websites and on average contained ~8KB of text. The topics ranged from local and global Politics, Sports, Arts and Culture, Science and Technologies, Humanities and Religion to Health, etc.

The processing pipeline is illustrated below:

Figure 1 - Processing Pipeline

For word segmentation, multi-part words and named entities were joined into single tokens using a dictionary of ~40K such entries.
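The joining step can be sketched as follows. The post does not say how the dictionary lookup was done, so the greedy longest-match strategy, the underscore joiner and the function name join_multiword are all assumptions, and English tokens stand in for the Farsi ones.

```python
def join_multiword(tokens, phrases, max_len=3):
    """Greedily merge known multi-part words into single underscore-joined tokens.

    `phrases` is a set of token tuples; at each position the longest match wins.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single token as it is.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical stand-in for the ~40K-entry dictionary.
phrases = {("stock", "market"), ("prime", "minister"), ("david", "cameron")}
tokens = "the stock market fell after the prime minister spoke".split()
print(join_multiword(tokens, phrases))
```

This prints ['the', 'stock_market', 'fell', 'after', 'the', 'prime_minister', 'spoke'], so a phrase such as Stock Market reaches the trainer as one token and gets its own vector.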

Gensim's word2vec implementation was used to train the model. The default n=100 and window=5 worked very well, but finding the optimum values would need a separate study.
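To show what the window parameter controls, here is a simplified sketch of how (target, context) pairs are drawn from a sentence. It uses a fixed window and a hypothetical function name, context_pairs; real implementations, Gensim's included, add refinements such as subsampling frequent words and varying the effective window size.

```python
def context_pairs(tokens, window=5):
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]

sentence = "the stock market fell sharply today".split()
pairs = list(context_pairs(sentence, window=2))
print(pairs[:4])
print(len(pairs))
```

A larger window pulls in more distant neighbours, which tends to make similarity more topical; n only sets how many numbers each word vector has.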

To generate the results presented in this post, the most_similar method was used. No significant difference was found between most_similar and most_similar_cosmul.
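For readers wondering how the two methods differ: most_similar combines the query vectors additively before taking the cosine, while most_similar_cosmul follows the multiplicative objective proposed by Levy and Goldberg. The sketch below re-implements both scoring rules in plain Python on made-up vectors. The names score_add and score_mul are hypothetical and the code approximates the idea; it is not Gensim's source.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def score_add(candidate, positive, negative):
    """Additive rule: cosine between the candidate and (positives - negatives)."""
    query = [0.0] * len(candidate)
    for v in positive:
        query = [q + x for q, x in zip(query, unit(v))]
    for v in negative:
        query = [q - x for q, x in zip(query, unit(v))]
    return cosine(candidate, query)

def score_mul(candidate, positive, negative, eps=1e-6):
    """Multiplicative rule: product of positive similarities over negative ones.

    Cosines are shifted into [0, 1] first so that the products stay non-negative.
    """
    def shifted(v):
        return (1.0 + cosine(candidate, v)) / 2.0
    numerator = math.prod(shifted(v) for v in positive)
    denominator = math.prod(shifted(v) for v in negative) + eps
    return numerator / denominator

# Hypothetical toy vectors, not taken from any trained model.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}
positive = [toy["king"], toy["woman"]]
negative = [toy["man"]]
candidates = ["queen", "apple"]

best_add = max(candidates, key=lambda w: score_add(toy[w], positive, negative))
best_mul = max(candidates, key=lambda w: score_mul(toy[w], positive, negative))
print(best_add, best_mul)
```

On this toy data both rules pick the same word, in line with the observation above that the two methods did not differ significantly.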

A significant problem was discovered: misspelled or infrequent words in the corpus generate sparse vectors, which result in very high similarity scores with some words. I used the frequency of each word in the corpus to filter out such cases.
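A frequency filter of this kind can be sketched in a few lines. The threshold of 20, the function name filter_by_frequency and the sample words and scores are all made up for illustration; the post does not state the cut-off actually used.

```python
from collections import Counter

def filter_by_frequency(neighbours, counts, min_count=20):
    """Keep only (word, score) pairs whose word occurs at least `min_count` times."""
    return [(w, s) for w, s in neighbours if counts[w] >= min_count]

# Hypothetical corpus: "markte" is a rare misspelling of "market".
corpus_tokens = ["market"] * 50 + ["bourse"] * 30 + ["markte"] * 2
counts = Counter(corpus_tokens)

# Hypothetical neighbour list of the kind a similarity query returns.
neighbours = [("markte", 0.93), ("bourse", 0.81), ("market", 0.78)]
print(filter_by_frequency(neighbours, counts))
```

This prints [('bourse', 0.81), ('market', 0.78)]. Gensim's Word2Vec also takes a min_count parameter that drops rare words before training, which is another way to get a similar effect.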

Conclusion

word2vec is a relatively simple algorithm with surprisingly remarkable performance. Implementations are available in a variety of Open Source libraries, including Python's Gensim. Based on these preliminary results, it appears that word2vec is able to make higher-level abstractions, which nudges it towards cognitive abilities.

Despite its remarkable performance, it is not quite clear how this ability can be used in an application, although in its current form it can readily be used for finding antonyms/synonyms, spelling correction and stemming.

16 comments:

Unknown, 14 June 2015 at 20:30
Is there a chance to find your code on GitHub or similar websites?
Unknown, 17 June 2015 at 14:49
How do you calculate the vector for a bigram?
You gave the example of the stock market. Did you average the vectors for stock and market?
I ask because Gensim takes only unigrams, so in the end I have vectors only for unigrams.
Unknown, 17 June 2015 at 15:18
Thanks.
Unknown, 17 July 2015 at 12:50
Dear Rot,

Excellent blog! I find your posts very interesting, especially the ones regarding Machine Learning.

I am one of the executive editors at .NET Code Geeks (www.dotnetcodegeeks.com), a sister site to Java Code Geeks (www.javacodegeeks.com). We have the NCG program, a program that aims to build partnerships between .NET Code Geeks and community bloggers (see http://www.dotnetcodegeeks.com/join-us/ncg/), that I think you’d be perfect for.

If you’re interested, send me an email to nikos[dot]souris[at]dotnetcodegeeks[dot]com and we can discuss further.

Best regards,
Nikos Souris
Unknown, 11 January 2016 at 10:36
word2vec is great, but none of your results correspond to Gensim results on the Google News corpus, except (almost) #4. Here are my top 3 results for your examples, and the code that generated them. I would add a test for "stock market" ~ "thermometer", but the "stock_market" token does not appear in the corpus.

+ human - animal = mankind humankind humanity
+ library - books = Library Terraceview_Lodge rec_center
+ Obama + Russia - USA = Medvedev Putin Kremlin
+ Iraq - violence = Kuwait Iraqi Chalabi
+ President - power = president Vice_President Presdient
+ politics - lies = partisan_politics Politics political

from gensim.models import KeyedVectors

# Pre-trained Google News vectors in word2vec binary format.
# Older Gensim releases exposed this loader as Word2Vec.load_word2vec_format.
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def similar(positive, negative):
    # Print the query followed by its three nearest words.
    results = model.most_similar(positive=positive, negative=negative, topn=3)
    print(' '.join(['+ ' + x for x in positive] +
                   ['- ' + x for x in negative] + ['='] +
                   [result[0] for result in results]))

similar(['human'], ['animal'])
similar(['library'], ['books'])
similar(['Obama', 'Russia'], ['USA'])
similar(['Iraq'], ['violence'])
similar(['President'], ['power'])
similar(['politics'], ['lies'])
Unknown, 31 May 2016 at 17:18
Thanks for your article.
I work with word2vec on a 200-year corpus. This corpus contains a lot of scanning mistakes, and the accuracy of the model is quite low.
You wrote about this problem just before the conclusion. Could you elaborate a little? Did you fix this issue?
Dr. Ali Saeed, 1 December 2016 at 09:30
I am interested in implementing word2vec for the Urdu language. Can I use Gensim, and how can I use my corpus in it?
Dr. Ali Saeed, 1 December 2016 at 09:31
Can anyone send me a little Gensim code for word2vec with a small corpus?