Am I the only one here thinking word2vec is freaking awesome?!
So I am back. This time I have trained the model on a very small corpus of Rock artists obtained from Wikipedia, as part of my Rock History project. I have also built an API on top of the model so that you can play with it and try out different combinations to your heart's content - but please go easy on the API, it is a small instance only :) and strictly no bots. And that's not all: I am releasing the code and the dataset (which is only 36K Wiki entries).
But now, my turn to RANT for a few paragraphs.
First of all, quantifying the performance of an unsupervised learning algo in a highly subjective field is very hard, time-consuming and potentially non-repeatable. Google, in their latest paper on seq2seq, had to resort to reporting mainly man-machine conversations. I feel that in these subjects crowdsourcing the quantification is probably the best approach, so you would help by giving a rough accuracy score based on your experience.
On the other hand, sorry to those who were expecting to see a formal paper - perhaps in LaTeX format - you completely missed the point. As others said, there are plenty of hardcore papers out there; feel free to knock yourselves out. My point was to evangelise to a much wider audience. And, if you liked what you saw, go and try it for yourself.
Finally, alluding to "cognition" raised a lot of eyebrows, but as Nando de Freitas puts it when asked about intelligence: whenever we build an intelligent machine, we will look at it as bogus, not containing the "real intelligence", and we will discard it as not AI. So the world of Artificial Intelligence is a world of moving targets, essentially because intelligence has been very difficult to define.
For me, word2vec is a breath of fresh air in a world of arbitrary, highly engineered and complex NLP algorithms: it bridges the gap by forming meaningful relationships between the tokens of your corpus. I feel it is more a tool for enhancing other algorithms than an end product, but even on its own it generates fascinating results. For example, on this tiny corpus it was not only able to find matches between artist names, it could also successfully find matches between similar bands - so it could be used as a recommender system. And even adding the vectors of artists generates interesting fusion genres, which tend to correspond to real bands influenced by them.
API
BEWARE: Tokens are case-sensitive, so u2 and U2 are not the same. The API is basically a simple RESTful Flask app on top of the model:
http://localhost:5000/api/v1/rock/similar?pos=<pos>&neg=<neg>
where pos and neg are comma-separated lists of zero or more 'phrases' (pos for similar, neg for opposite) - English words, or multi-word tokens including band names or phrases that have a Wiki entry (such as albums or songs) - the list of which can be found here. For example:
http://localhost:5000/api/v1/rock/similar?pos=Captain%20Beefheart
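If you would rather script against the endpoint than paste URLs into a browser, a minimal Python sketch could look like the following (assuming the response is JSON with a "result" key, as in the error example further down; the exact success payload may differ):

import requests

BASE_URL = "http://localhost:5000/api/v1/rock/similar"

def similar(pos=None, neg=None, min_freq=None):
    """Query the wiki-rock word2vec API: pos/neg are lists of case-sensitive phrases."""
    params = {}
    if pos:
        params["pos"] = ",".join(pos)       # comma-separated list, e.g. "Daft Punk,Tool"
    if neg:
        params["neg"] = ",".join(neg)
    if min_freq is not None:
        params["min_freq"] = min_freq
    resp = requests.get(BASE_URL, params=params)
    resp.raise_for_status()
    return resp.json()["result"]            # assumed response key

print(similar(pos=["Captain Beefheart"]))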
You can add vectors of words, for example to mix genres:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50
or add an artist with an adjective, for example a softer Bob Dylan:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan,soft&min_freq=50
Or subtract:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan&neg=U2
But the tokens do not have to be band or artist names:
http://localhost:5000/api/v1/rock/similar?pos=drug
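Under the hood, these pos/neg queries map naturally onto gensim's most_similar, which adds the positive vectors, subtracts the negative ones, and ranks the vocabulary by cosine similarity. A rough sketch of the equivalent offline call (the model path is hypothetical and the actual server code may differ):

from gensim.models import Word2Vec

# hypothetical path - load the model produced by the training script
model = Word2Vec.load("wiki_rock.model")

# "Bob Dylan" minus "U2": positive vectors are added, negative ones subtracted,
# and the vocabulary is ranked by cosine similarity to the combined vector
for token, score in model.wv.most_similar(positive=["Bob Dylan"], negative=["U2"], topn=10):
    print(f"{token}\t{score:.3f}")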
If you pass a non-existent or misspelled (it is case-sensitive!) name or word, you will get an error:
http://localhost:5000/api/v1/rock/similar?pos=radiohead
{
    "result": "Not in vocab: radiohead"
}
You may pass the minimum frequency of a word in the corpus to filter the output and remove noise:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50
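On the server side such a filter amounts to dropping matches whose corpus count is below the threshold; a hedged sketch (using gensim 4.x attribute names, which may not match the repository's code):

def filter_by_frequency(model, matches, min_freq):
    """Keep only matches whose token occurred at least min_freq times in the corpus."""
    return [(token, score) for token, score in matches
            if model.wv.get_vecattr(token, "count") >= min_freq]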
Code
The code on GitHub, as I said, is tiny. Perhaps the most complex part is the Dictionary Tokenisation, one of the tools I have built to tokenise the text without breaking multi-word phrases; I have found it very useful, as it produces much more meaningful results. The code is shared under the MIT license.
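To give a flavour of what dictionary tokenisation does, here is a simplified greedy longest-match sketch; it is illustrative only, and the implementation in the repository is more involved:

def dictionary_tokenise(text, phrases, max_words=5):
    """Greedy longest-match tokenisation: prefer multi-word phrases from `phrases`
    (e.g. "Captain Beefheart") over splitting them into single words."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in phrases:
                tokens.append(candidate)
                i += n
                break
    return tokens

print(dictionary_tokenise("Bob Dylan influenced Captain Beefheart", {"Bob Dylan", "Captain Beefheart"}))
# ['Bob Dylan', 'influenced', 'Captain Beefheart']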
To build the model, uncomment the line in wiki_rock_train.py, specifying the location of the corpus:
train_and_save('data/wiki_rock_multiword_dic.txt', 'data/stop-words-english1.txt', '<THE_LOCATION>/wiki_rock_corpus/*.txt')
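For orientation, train_and_save plausibly boils down to a gensim pipeline along these lines (a sketch with assumed parameter values and an assumed output path, not the exact code in wiki_rock_train.py):

import glob
from gensim.models import Word2Vec

def train_and_save(dictionary_path, stopwords_path, corpus_glob, out_path="wiki_rock.model"):
    # load the multi-word phrase dictionary and the stop-word list
    phrases = set(open(dictionary_path, encoding="utf-8").read().splitlines())
    stopwords = set(open(stopwords_path, encoding="utf-8").read().splitlines())

    # tokenise each Wikipedia article, keeping multi-word phrases intact
    # (dictionary_tokenise as sketched in the Code section above)
    sentences = []
    for path in glob.glob(corpus_glob):
        text = open(path, encoding="utf-8").read()
        sentences.append([t for t in dictionary_tokenise(text, phrases) if t not in stopwords])

    # train a skip-gram word2vec model (gensim 4.x argument names) and save it
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1, workers=4)
    model.save(out_path)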