Am I the only one here thinking word2vec is freaking awesome?!
So I am back. And this time I have trained the model on a very small corpus of Rock artists obtained from Wikipedia, as part of my Rock History project. And I have built an API on top of the model so that you can play with it and try out different combinations to your heart's content - [but please go easy on the API, it is a small instance only] :) strictly no bots. And that's not all: I am releasing the code and the dataset (which is only 36K Wiki entries).
But now, my turn to RANT for a few paragraphs.
First of all, quantifying the performance of an unsupervised learning algo in a highly subjective field is very hard, time-consuming and potentially non-repeatable. Google in their latest paper on seq2seq had to resort to reporting mainly man-machine conversations. I feel that in these subjects crowdsourcing the quantification is probably the best approach. Hence you would help by giving a rough accuracy score based on your experience.
On the other hand, sorry to those who were expecting to see a formal paper - perhaps in LaTeX format - you completely missed the point. As others said, there are plenty of hardcore papers out there, feel free to knock yourselves out. My point was to evangelise to a much wider audience. And, if you liked what you saw, go and try it for yourself.
Finally, alluding to "cognition" raised a lot of eyebrows, but as Nando de Freitas puts it when asked about intelligence: whenever we build an intelligent machine, we will look at it as bogus, not containing the "real intelligence", and we will discard it as not AI. So the world of Artificial Intelligence is a world of moving targets, essentially because intelligence has been very difficult to define.
For me, word2vec is a breath of fresh air in a world of arbitrary, highly engineered and complex NLP algorithms: it can bridge the gap by forming meaningful relationships between the tokens of your corpus. And I feel it is more a tool for enhancing other algorithms than an end product. But even on its own, it generates fascinating results. For example, on this tiny corpus, it was not only able to find matches between artist names, it could successfully find matches between similar bands - so it could be used as a recommender system. And then, even adding the vectors of artists generates interesting fusion genres which tend to correspond to real bands influenced by them.
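If you want to poke at the model directly rather than through the API, the gensim calls behind all of this are one-liners. A minimal sketch, assuming a trained gensim Word2Vec model; wiki_rock.model is my placeholder file name, not necessarily what the repo saves:

# Minimal sketch; assumes a trained gensim Word2Vec model.
# 'wiki_rock.model' is an illustrative file name, not necessarily the repo's.
from gensim.models import Word2Vec

model = Word2Vec.load('wiki_rock.model')

# Similar bands - the recommender use case
print(model.wv.most_similar(positive=['Captain Beefheart'], topn=10))

# Adding artist vectors to hunt for fusion genres
print(model.wv.most_similar(positive=['Daft Punk', 'Tool'], topn=10))

# Subtracting one artist from another
print(model.wv.most_similar(positive=['Bob Dylan'], negative=['U2'], topn=10))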
API
BEWARE: Tokens are case-sensitive, so u2 and U2 are not the same. The API is basically a simple RESTful Flask app on top of the model:
http://localhost:5000/api/v1/rock/similar?pos=<pos>&neg=<neg>
where pos and neg are comma-separated lists of zero to many 'phrases' (pos for similar, and neg for opposite). These are English words, or multi-word tokens including names of bands or phrases that have a Wiki entry (such as albums or songs) - a list of which can be found here. For example:
http://localhost:5000/api/v1/rock/similar?pos=Captain%20Beefheart
You can add vectors of words, for example to mix genres:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50
or add an artist with an adjective, for example a softer Bob Dylan:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan,soft&min_freq=50
Or subtract:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan&neg=U2
But the tokens do not have to be band or artist names:
http://localhost:5000/api/v1/rock/similar?pos=drug
If you pass a non-existent name or a misspelling (it is case-sensitive!) of a name or word, you will get an error:
http://localhost:5000/api/v1/rock/similar?pos=radiohead
{
result: "Not in vocab: radiohead"
}
You may pass the minimum frequency of the word in the corpus to filter the output and remove the noise:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50
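For the curious, the whole endpoint boils down to a few lines of Flask around gensim's most_similar. Here is a hedged sketch of the idea - not the exact code from the repo; the parameter names mirror the API above and the model file name is again a placeholder:

# Sketch only - not the repo's exact code. Assumes gensim 4.x and the
# query parameters described above (pos, neg, min_freq).
from flask import Flask, request, jsonify
from gensim.models import Word2Vec

app = Flask(__name__)
model = Word2Vec.load('wiki_rock.model')  # placeholder file name

@app.route('/api/v1/rock/similar')
def similar():
    pos = [t for t in request.args.get('pos', '').split(',') if t]
    neg = [t for t in request.args.get('neg', '').split(',') if t]
    min_freq = int(request.args.get('min_freq', 0))

    # Reject tokens that are not in the vocabulary (case-sensitive!)
    missing = [t for t in pos + neg if t not in model.wv]
    if missing:
        return jsonify(result='Not in vocab: ' + ', '.join(missing))

    matches = model.wv.most_similar(positive=pos, negative=neg, topn=50)
    # min_freq filters out rare, noisy tokens by corpus frequency
    if min_freq:
        matches = [(w, s) for w, s in matches
                   if model.wv.get_vecattr(w, 'count') >= min_freq]
    return jsonify(result=matches[:10])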
Code
The code on GitHub, as I said, is tiny. Perhaps the most complex part is the Dictionary Tokenisation, one of the tools I have built to tokenise the text without breaking multi-word phrases; I have found it very useful, as it produces much more meaningful results. The code is shared under the MIT license.
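The idea behind the dictionary tokeniser, roughly: glue known multi-word phrases (here, Wiki entry titles) into single tokens before splitting on whitespace, longest phrases first. A toy sketch of that idea - not the repo's actual implementation:

# Toy sketch of dictionary tokenisation - not the repo's actual code.
import re

def dictionary_tokenise(text, phrases):
    # Longest phrases first, so 'Captain Beefheart' wins over 'Captain'
    for phrase in sorted(phrases, key=len, reverse=True):
        text = re.sub(re.escape(phrase), phrase.replace(' ', '_'), text)
    return text.split()

print(dictionary_tokenise(
    'I saw Captain Beefheart before Bob Dylan went electric',
    ['Captain Beefheart', 'Bob Dylan']))
# ['I', 'saw', 'Captain_Beefheart', 'before', 'Bob_Dylan', 'went', 'electric']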
To build the model, uncomment the line in wiki_rock_train.py, specifying the location of the corpus:
train_and_save('data/wiki_rock_multiword_dic.txt', 'data/stop-words-english1.txt', '<THE_LOCATION>/wiki_rock_corpus/*.txt')
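If you are wondering what that call does before running it, something along these lines is what to expect - a hedged reconstruction, not a verbatim copy of wiki_rock_train.py; it assumes gensim 4.x, the dictionary_tokenise sketch above, and one-phrase-per-line dictionary and stop-word files:

# Hedged reconstruction of what train_and_save plausibly does; the
# argument order follows the call above. Uses dictionary_tokenise from
# the earlier sketch; hyperparameters and output name are placeholders.
import glob
from gensim.models import Word2Vec

def train_and_save(dic_path, stopwords_path, corpus_glob):
    phrases = [line.strip() for line in open(dic_path, encoding='utf-8')]
    stops = set(line.strip() for line in open(stopwords_path, encoding='utf-8'))
    sentences = []
    for path in glob.glob(corpus_glob):
        for line in open(path, encoding='utf-8'):
            # keep original case: tokens are case-sensitive (u2 vs U2)
            tokens = dictionary_tokenise(line, phrases)
            sentences.append([t for t in tokens if t not in stops])
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
    model.save('wiki_rock.model')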