Comments on Byte Rot: "Five crazy abstractions my Deep Learning word2vec model just did"
Anonymous (2016-12-01):
Can anyone send me a little Gensim code example for word2vec with a small corpus?

Anonymous (2016-12-01):
I am interested in implementing word2vec for the Urdu language. Can I use gensim, and how can I use my own corpus with it?

aliostad (2016-05-31):
I never had this issue, since I worked with text extracted from web documents. I guess you could use word2vec itself to fix a lot of scan errors: check each word against a dictionary and, if it does not match, see which word it is most strongly connected to. The problem is that these misspelt words probably do not occur often enough. I would consult the literature for methods; I am sure you are not the only one with this problem. But I am afraid I do not have any experience with this kind of problem.
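The dictionary-check idea above can be sketched in a few lines. This is a hypothetical illustration, not code from the post: the toy `dictionary` of word frequencies and the `correct_token` helper are invented for the example, and string similarity (Python's `difflib`) stands in for the "see which word it is most strongly connected to" test, which a real pipeline might do in vector space.

```python
from difflib import get_close_matches

# Toy dictionary of known-good words with their corpus frequencies.
dictionary = {"market": 120, "stock": 95, "thermometer": 7, "temperature": 40}

def correct_token(token, dictionary):
    """Return token unchanged if it is a known word; otherwise
    replace it with the closest dictionary entry, if any."""
    if token in dictionary:
        return token
    # get_close_matches ranks candidates by string similarity (SequenceMatcher).
    candidates = get_close_matches(token, dictionary, n=3, cutoff=0.75)
    if not candidates:
        return token  # leave rare/unfixable tokens alone
    # Prefer the most frequent candidate: a crude stand-in for
    # checking which word the token is most connected to.
    return max(candidates, key=lambda w: dictionary[w])

print(correct_token("rnarket", dictionary))  # OCR confusion of 'm' -> 'rn'; prints "market"
print(correct_token("stock", dictionary))    # known word, unchanged
```

As the comment notes, the weakness of any such scheme is that scanning errors are rare tokens, so there is little distributional evidence to lean on.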
Anonymous (2016-05-31):
Thanks for your article. I work with word2vec on a corpus spanning 200 years. This corpus contains a lot of scan mistakes, and the accuracy of the model is quite low. You wrote about this problem just before the conclusion; could you expand on it a little? Did you fix this issue?

aliostad (2016-01-11):
Well, in fact the results are pretty good. One problem is that you would need to normalise the data; the casing throws it off.

+ human - animal: this came from the Persian corpus I had, which had a lot of spiritual content, so I guess that was why.
+ library - books: the result is strange; what is Terraceview Lodge anyway?!
+ Obama + Russia - USA: strange that Medvedev is first, but well, Putin is second.
+ Iraq - violence: this works.
+ President - power: this also works, if casing is normalised.
+ politics - lies: again, my corpus was Persian. But it is interesting that it comes up with partisan politics.

Anonymous (2016-01-11):
word2vec is great, but none of your results correspond to the gensim results on the Google News corpus, except (almost) #4. Here are my top-3 results for your examples, and the code that generated them.
I would add a test for "stock market" ~ "thermometer", but the "stock_market" token does not appear in the corpus.

+ human - animal = mankind humankind humanity
+ library - books = Library Terraceview_Lodge rec_center
+ Obama + Russia - USA = Medvedev Putin Kremlin
+ Iraq - violence = Kuwait Iraqi Chalabi
+ President - power = president Vice_President Presdient
+ politics - lies = partisan_politics Politics political

    from gensim.models import Word2Vec

    model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def similar(positive, negative):
        results = model.most_similar(positive=positive, negative=negative, topn=3)
        print ' '.join(['+ ' + x for x in positive] +
                       ['- ' + x for x in negative] + ['='] +
                       [result[0] for result in results])

    similar(['human'], ['animal'])
    similar(['library'], ['books'])
    similar(['Obama', 'Russia'], ['USA'])
    similar(['Iraq'], ['violence'])
    similar(['President'], ['power'])
    similar(['politics'], ['lies'])

Nikos Souris (2015-07-17):
Dear Rot,
Excellent blog! I find your posts very interesting, especially the ones on Machine Learning.

I am one of the executive editors at .NET Code Geeks (www.dotnetcodegeeks.com), a sister site to Java Code Geeks (www.javacodegeeks.com). We have the NCG program, which aims to build partnerships between .NET Code Geeks and community bloggers (see http://www.dotnetcodegeeks.com/join-us/ncg/), and I think you'd be perfect for it.

If you're interested, send me an email at nikos[dot]souris[at]dotnetcodegeeks[dot]com and we can discuss further.

Best regards,
Nikos Souris

Unknown (2015-06-17):
Thanks.

aliostad (2015-06-17):
Gensim does not care: you provide it a list (or any iterable) of segmented words. Bigram handling happens at the segmentation phase. I first segment the document into sentences, then segment each sentence into words, using a dictionary of named entities and phrases that are to be treated as a single token. I believe this is very important, yet all the examples I have seen use simple tokenisation. This way, "Barack Obama" becomes a single token.
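The phrase-merging step described above can be sketched as follows. This is a hypothetical illustration: the `PHRASES` table and `segment` helper are invented for the example; a real pipeline would build the phrase dictionary from a named-entity list, or learn it with a tool such as gensim's `Phrases`.

```python
# Greedily merge known multi-word entities/phrases into single tokens
# before handing sentences to word2vec.
PHRASES = {("barack", "obama"): "barack_obama",
           ("stock", "market"): "stock_market"}

def segment(sentence, phrases=PHRASES, max_len=2):
    words = sentence.lower().split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for n in range(max_len, 1, -1):
            chunk = tuple(words[i:i + n])
            if chunk in phrases:
                tokens.append(phrases[chunk])
                i += n
                break
        else:
            tokens.append(words[i])  # no phrase matched; keep the single word
            i += 1
    return tokens

print(segment("Barack Obama spoke about the stock market"))
# -> ['barack_obama', 'spoke', 'about', 'the', 'stock_market']
```

With this in place, "barack_obama" and "stock_market" get their own vectors, which is exactly how multi-word tokens such as `partisan_politics` appear in the Google News results quoted earlier in the thread.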
Unknown (2015-06-17):
How do you calculate the vector for a bigram? You gave the example of "stock market": did you average the vectors for "stock" and "market"? I ask because gensim takes only unigrams, so in the end I have vectors only for unigrams.

David (2015-06-15):
Was it really necessary to be so hostile? You can get similar results from the word2vec website.

Anonymous (2015-06-14):
God damnit, "it's very easy", "nothing magic about it", "so simple"... cut the crap and release the source, preferably on GitHub. A .zip file does the job too, in case you don't know anything about git.

aliostad (2015-06-14):
There is nothing magic about the code I have written. The word2vec part is around 10-20 lines of code, similar to what you find at https://radimrehurek.com/gensim/models/word2vec.html. The key is to have a corpus. You can try some freely available corpora that are part of NLTK.

Anonymous (2015-06-14):
Is there a chance to find your code on GitHub or similar websites?