Latest revision as of 13:10, 9 May 2017

General Information on word embeddings

Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur in close proximity in a text than words that are far apart. See this article for details: [1]

The whole process goes through a number of stages:

1. Text corpus

This is the raw data used for learning. It determines the language, the topics that are covered and the semantics. Typical sources are Wikipedia and news articles.

2. Tokens

The corpus is split into words. These might be processed further, e.g. to clean up junk or to take inflections (e.g. verb forms) into account.
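A minimal sketch of this step, assuming a very simple rule (lowercase everything and keep runs of letters, digits and apostrophes). The function name `tokenize` is made up for illustration; real tokenizers are usually more careful about punctuation and inflections.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes.

    This is an illustrative rule only, not the tokenizer of any particular library.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The corpus is split into words. It's simple!")
print(tokens)
```

Punctuation is dropped and case is ignored, which is one common way to "clean up junk" before training.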

3. Contexts

The words are grouped into contexts. These might be all words in a sentence, a certain number of neighboring words (a.k.a. "bag of words") or words that relate to each other grammatically (as in "dependency based word embeddings" [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/]).
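The "certain number of neighboring words" variant can be sketched as a sliding window. The helper name `window_contexts` and the default window size of 2 are assumptions chosen for illustration.

```python
def window_contexts(tokens, window=2):
    """Return a list of (word, context) pairs.

    The context of a word is the list of up to `window` tokens on each side of it.
    """
    pairs = []
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((word, left + right))
    return pairs

for word, context in window_contexts(["the", "cat", "sat", "down"], window=1):
    print(word, context)
```

Sentence-based or dependency-based contexts would replace the slicing logic but produce the same kind of (word, context) pairs.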

4. The algorithm

Different algorithms can be used to map the relationship between words and their contexts into a vector space. The main contenders are:

  • Word2vec by Google, uses neural networks.
  • fastText by Facebook, based on word2vec. Splits words into smaller pieces (character n-grams) in order to capture syntactic relations (like apparent --> apparently). Explained here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]. Needs a lot of memory.
  • GloVe by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec].

The different algorithms seem to perform quite similarly, and results depend on the benchmark and the training data. Word2vec seems to be a little less memory-hungry, though.
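To make the "more conventional math" of GloVe a little more concrete: GloVe starts from counts of how often words occur near each other. The sketch below only builds such a count table; it is not GloVe itself, and the name `cooccurrence_counts` and the window size are assumptions for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered pair (word, neighbor) appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(["a", "b", "a"], window=1)
print(counts)
```

Word2vec and fastText instead learn from the (word, context) pairs one at a time, without building the full table first.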

5. Keyed Vectors

Here comes the good news: all of the algorithms provide a table of words and their positions in vector space. So all you need is that table!
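A rough sketch of what "that table" amounts to, assuming it is held as a plain dictionary from word to vector. The three-dimensional vectors are made-up toy values (real models typically have a few hundred dimensions), and `cosine` and `most_similar` are hypothetical helper names, not a library API.

```python
import math

# Toy table: word -> position in vector space (values invented for illustration).
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, table, topn=3):
    """Return the `topn` other words whose vectors are closest to `word`."""
    query = table[word]
    scores = [(other, cosine(query, vec)) for other, vec in table.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("king", vectors, topn=2))
```

Once the table exists, lookups like this no longer depend on which algorithm produced it.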

fastText is special in being able to also match words that it has not seen before, but we probably don't even need that.
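The reason unseen words can still be matched is that fastText works with character n-grams of a word rather than only the whole word. A minimal sketch of the splitting step follows; the `<` and `>` boundary markers and the 3 to 6 length range follow the usual description of fastText, while the function name `char_ngrams` is an assumption for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `<word>` with lengths n_min..n_max."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

shared = set(char_ngrams("apparent")) & set(char_ngrams("apparently"))
print(sorted(shared))
```

Because "apparent" and "apparently" share many n-grams, a vector for one can be approximated from pieces learned for the other.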

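Pre-trained models

Here is a collection of word-to-vector tables ("models") that other people have created from big corpora. This is what you probably want:

  • https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models: mostly GloVe, some word2vec, English, trained on news, Wikipedia and Twitter, a good mix
  • https://github.com/Kyubyong/wordvectors: Word2vec and fastText, multiple languages, no English, trained on Wikipedia
  • https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md: fastText, all imaginable languages, trained on Wikipedia, HUGE files
  • https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/: an interesting approach that gives similarities between syntactically equivalent words

In order to convert from GloVe to Word2vec tables, the following script can be used: [https://radimrehurek.com/gensim/scripts/glove2word2vec.html]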
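To illustrate what such a conversion involves: as far as the text formats go, a word2vec table starts with a header line giving the number of words and the vector size, while a GloVe table has no header. The sketch below works on lines in memory; `glove_to_word2vec_lines` is a hypothetical helper written for illustration, not the gensim script itself.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the "<word count> <vector size>" header that the word2vec text format expects."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        return ["0 0"]
    # Each row is a word followed by its vector components.
    dimensions = len(rows[0].split()) - 1
    return [f"{len(rows)} {dimensions}"] + rows

converted = glove_to_word2vec_lines(["the 0.1 0.2\n", "cat 0.3 0.4\n"])
print("\n".join(converted))
```

For real model files, the linked script is the safer choice, since it handles large files and encodings.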

Installation + getting started:

Word2vec

Included in the gensim package.

To install, just type

pip install gensim

into a command window.

Here are some of the things you can do with the model: [7]
Here is a bit of background information and an explanation of how to train your own models: [8].

fastText

Made by Facebook, based on word2vec. Better at capturing syntactic relations (like apparent --> apparently), see here: [9]

Pretrained model files are HUGE; this will be a problem on computers with less than 16 GB of memory.

Installation + getting started:

Included in the gensim package.

To install, just type

pip install gensim

into a command window.

Documentation is here: [10]

GloVe

Invented by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly worse than Word2vec and fastText.