== General Information on word embeddings ==
Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur in close proximity in a text than words which are far apart. See this article for details: [https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/]
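To make the idea of "closeness" concrete, here is a minimal sketch with plain NumPy and made-up 3-dimensional toy vectors (real embeddings typically have 100-300 dimensions): the cosine similarity of two vectors is high when the corresponding words tend to share contexts.

<syntaxhighlight lang="python">
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'points in the same direction'."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, invented for illustration only.
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.4])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, dog))  # high: "cat" and "dog" occur in similar contexts
print(cosine_similarity(cat, car))  # lower: "cat" and "car" rarely share a context
</syntaxhighlight>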
The whole process goes through a number of stages:
=== 1. Text corpus ===
This is the raw data used for learning. It determines the language, the topics that are covered, and the semantics.
Typical sources are Wikipedia and news articles.
=== 2. Tokens ===
The corpus is split into words. These might be processed further, e.g. to clean up junk or to take inflections (such as verb forms) into account.
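As a minimal sketch of this step (using gensim's <code>simple_preprocess</code> helper; handling inflections would additionally need a stemmer or lemmatizer, e.g. from NLTK, which is not shown here):

<syntaxhighlight lang="python">
from gensim.utils import simple_preprocess

raw = "The cats were sitting on the mat, weren't they?"

# simple_preprocess lowercases the text, strips punctuation and
# drops very short or very long tokens.
tokens = simple_preprocess(raw)
print(tokens)
# e.g. ['the', 'cats', 'were', 'sitting', 'on', 'the', 'mat', 'weren', 'they']
</syntaxhighlight>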
=== 3. Contexts ===
The words are grouped into contexts. These might be all words in a sentence, a certain number of words in a neighborhood (a.k.a. "bag of words"), or words that somehow relate to each other grammatically (as in "dependency-based word embeddings" [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/]).
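A minimal sketch of the "bag of words" style context, i.e. a symmetric window around each word (the window size of 2 is an arbitrary example):

<syntaxhighlight lang="python">
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) pairs using a symmetric window."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ["the", "cat", "sat", "on", "the", "mat"]
for center, context in context_windows(tokens, window=2):
    print(center, context)
# the ['cat', 'sat']
# cat ['the', 'sat', 'on']
# ...
</syntaxhighlight>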
=== 4. The algorithm ===
Different algorithms can be used to map the relationship between words and their contexts into a vector space. The main contenders are...
* '''Word2vec''' by Google, which uses neural networks.
* '''FastText''' by Facebook, based on word2vec. Splits words into smaller units (character n-grams) in order to capture syntactic relations (like apparent --> apparently). Explained here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]. Needs a lot of memory.
* '''GloVe''' by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec].
The different algorithms seem to perform quite similarly, and results depend on the benchmark and training data. Word2vec seems to be a little less memory-hungry, though.
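As a rough illustration of this step, here is a minimal word2vec training sketch with gensim (assuming gensim 4.x parameter names such as <code>vector_size</code> and <code>epochs</code>; the toy corpus is far too small to give meaningful vectors):

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences (real training needs millions of tokens).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects the skip-gram variant; window is the context size from step 3.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("cat", topn=3))
</syntaxhighlight>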
=== 5. Keyed vectors ===
Here comes the '''good news''': all of the algorithms provide a table with words and their positions in vector space... So '''all you need is that table'''!
FastText is special in also being able to match words that it hasn't seen before... but we probably don't even need that...
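In gensim such a table is represented by the <code>KeyedVectors</code> class. A minimal sketch (the file name and the words are just examples; the file is assumed to be in word2vec text format):

<syntaxhighlight lang="python">
from gensim.models import KeyedVectors

# Load a pre-trained word -> vector table (see the list of models below).
wv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

print(wv["king"][:5])                  # the raw vector of one word (first 5 dimensions)
print(wv.similarity("king", "queen"))  # cosine similarity between two words
</syntaxhighlight>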
==== Pre-trained models ====
Here is a collection of word-to-vector tables ("models") that other people have created from big corpora. This is probably what you want:
* [https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models https://github.com/3Top/word2vec-api: mostly GloVe, some word2vec, English, trained on news, Wikipedia, Twitter, '''a good mix''']
* [https://github.com/Kyubyong/wordvectors https://github.com/Kyubyong/wordvectors: word2vec and FastText, '''multiple languages''', no English, trained on Wikipedia]
* [https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md: FastText, all imaginable languages, trained on Wikipedia, HUGE files]
* [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/: an interesting approach that gives similarities between syntactically equivalent words]
To convert GloVe tables to the word2vec format, the following script can be used:
[https://radimrehurek.com/gensim/scripts/glove2word2vec.html]
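A minimal sketch of how that conversion script is typically called from Python (file names are examples only; newer gensim versions, 4.x, can also load GloVe files directly via <code>KeyedVectors.load_word2vec_format(..., no_header=True)</code>):

<syntaxhighlight lang="python">
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# File names are examples; use whichever GloVe file you downloaded.
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.word2vec.txt")

# The converted file can then be loaded like any other word2vec table.
wv = KeyedVectors.load_word2vec_format("glove.6B.100d.word2vec.txt", binary=False)
print(wv.most_similar("frog", topn=5))
</syntaxhighlight>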
== Installation + getting started: ==
==Word2vec==
Included in the ''gensim'' package.
To install, just type
<code>pip install gensim</code><br>
into a command window.
Here are some of the things you can do with the model: [http://textminingonline.com/getting-started-with-word2vec-and-glove-in-python]<br>
Here is a bit of background information and an explanation of how to train your own models: [https://rare-technologies.com/word2vec-tutorial/].
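A minimal usage sketch with a pre-trained model (the GoogleNews file is one commonly used example from the lists above; it is several GB in size, and any other word2vec-format file works the same way):

<syntaxhighlight lang="python">
from gensim.models import KeyedVectors

# File name is an example; binary=True because this particular model is a .bin file.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(wv.most_similar("computer", topn=5))                            # nearest neighbours
print(wv.most_similar(positive=["king", "woman"], negative=["man"]))  # king - man + woman, roughly "queen"
print(wv.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))    # odd one out
</syntaxhighlight>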
==FastText==
Made by Facebook, based on word2vec. Better at capturing syntactic relations (like apparent --> apparently); see here:
[https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]<br>
Pretrained model files are HUGE; this will be a problem on computers with less than 16 GB of memory.
=== Installation + getting started: ===
Included in the ''gensim'' package.
To install, just type
<code>pip install gensim</code><br>
into a command window.
Documentation is here: [https://radimrehurek.com/gensim/models/wrappers/fasttext.html]
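A minimal sketch of the feature mentioned above, i.e. getting vectors for words that were never seen in training (assuming gensim 4.x, where <code>load_facebook_vectors</code> is available; the file name and the made-up word are examples only):

<syntaxhighlight lang="python">
from gensim.models.fasttext import load_facebook_vectors

# Load one of Facebook's pre-trained .bin models (these files are several GB,
# hence the memory warning above).
wv = load_facebook_vectors("wiki.simple.bin")

# FastText builds word vectors from character n-grams, so it can also produce
# a vector for a word that is not in the vocabulary.
print("apparentlyish" in wv.key_to_index)        # False: not in the vocabulary
print(wv.most_similar("apparentlyish", topn=3))  # ...but it still gets a vector and neighbours
</syntaxhighlight>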
==GloVe==
Invented by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than word2vec and FastText.