Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur near each other in a text than words that are far apart. See this article for details: [1]
The whole process goes through a number of stages:
The corpus is the raw data used for learning. It determines the language, the topics that are covered, and the semantics. Typical sources are Wikipedia and news articles.
The corpus is split into words. These might be processed further, e.g. to clean up junk or to take inflections (e.g. verb forms) into account.
The words are grouped into contexts. A context might be all words in a sentence, a certain number of neighboring words (a.k.a. "bag of words"), or words that relate to each other grammatically (as in "dependency based word embeddings" [2]).
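As a rough illustration of the splitting step, here is a minimal sketch in plain Python. The `tokenize` helper and its regular expression are assumptions made for this example (lowercase ASCII letters only, with apostrophes kept inside words); it is not what any particular embedding tool does internally. Handling inflections would need a stemmer or lemmatizer on top, which is not shown.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation and digits."""
    # Assumption: ASCII letters only; an apostrophe is kept when letters follow it.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("The cats aren't sleeping; the cat sleeps!")
print(tokens)
```

Note that "cats" and "cat" stay separate tokens here, which is exactly the kind of thing inflection handling would merge.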
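The neighborhood variant of a context can be sketched in a few lines. The `context_pairs` function and its `window` parameter are hypothetical names chosen for this example; real trainers apply further tricks (subsampling, weighting by distance) that are left out.

```python
def context_pairs(tokens, window=2):
    """Yield (word, context_word) pairs for every word within `window` positions."""
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield word, tokens[j]

pairs = list(context_pairs(["the", "cat", "sat", "down"], window=1))
print(pairs)
```

With `window=1`, each word is paired only with its direct left and right neighbors.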
Different algorithms can be used to map the relationship between words and their context into a vector space. The main contenders are...
The different algorithms seem to perform quite similarly, and results depend on the benchmark and training data. Word2Vec seems to be a little less memory-hungry, though.
Here comes the good news: all of the algorithms provide a table of words and their positions in vector space... So all you need is that table!
fastText is special in being able to match words that it hasn't seen before... but we probably don't even need that...
Here is a collection of word-to-vector tables ("models") that other people have created from big corpora. This is what you probably want:
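To show why the table alone is enough, here is a small sketch that looks up neighbors by cosine similarity. The three-dimensional vectors are made up for the example (real tables typically have tens to hundreds of dimensions), and `cosine` and `most_similar` are helper names invented here.

```python
import math

# Toy table with invented 3-dimensional vectors, for illustration only.
table = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, table):
    """Return the other word whose vector has the highest cosine similarity."""
    candidates = (w for w in table if w != word)
    return max(candidates, key=lambda w: cosine(table[word], table[w]))

print(most_similar("king", table))
```

Everything else (training algorithm, corpus, context definition) only influences what numbers end up in the table.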
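The matching of unseen words works because the vectors are built from character n-grams rather than from whole words only. The sketch below shows just the n-gram splitting; the `char_ngrams` helper, the `<`/`>` boundary markers, and the 3 to 6 length range are assumptions for illustration, and the step of summing n-gram vectors is not shown.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, with < and > marking word boundaries."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

# An unseen word still shares many n-grams with words that were in the corpus.
shared = char_ngrams("apparent") & char_ngrams("apparently")
print(sorted(shared))
```

Because "apparently" shares most of its n-grams with "apparent", a vector can be estimated for it even if only one of the two words appeared in the training data.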
In order to convert from GloVe to Word2Vec tables, the following script can be used: [6]
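For orientation, the two text formats differ only in a header line: the word2vec text format starts with a line holding the number of words and the number of dimensions, while GloVe files start directly with the first word. A minimal sketch of the conversion under that assumption follows; `glove_to_word2vec` is a name made up for this example and is not the script linked above.

```python
def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dimensions>' header line."""
    # First pass: count the rows and read the dimensionality from the first row.
    count, dimensions = 0, 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                # The first field is the word itself, the rest are vector components.
                dimensions = len(line.rstrip().split(" ")) - 1
            count += 1
    # Second pass: write the header, then copy the rows unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dimensions}\n")
        for line in src:
            dst.write(line)
    return count, dimensions
```

The file is read twice instead of being held in memory, since pretrained tables can be several gigabytes.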
Included in the gensim package.
To install, just type
pip install gensim
into a command window.
Here are some of the things you can do with the model: [7]
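As a starting point, a table in word2vec text format can be loaded and queried with gensim's `KeyedVectors`. This is a sketch, assuming a recent gensim is installed; the `load_and_query` wrapper and the file name in the usage note are hypothetical.

```python
def load_and_query(path, word, topn=5, limit=None):
    """Load a word2vec-format text table with gensim and return the nearest neighbours of `word`."""
    # Imported here so that defining the function does not itself require gensim.
    from gensim.models import KeyedVectors

    # binary=False reads the plain-text format; limit=N loads only the first N rows,
    # which helps when the full table does not fit into memory.
    vectors = KeyedVectors.load_word2vec_format(path, binary=False, limit=limit)
    return vectors.most_similar(word, topn=topn)
```

A call such as `load_and_query("vectors.txt", "king")` (with `vectors.txt` standing in for whatever table you downloaded) returns a list of `(word, similarity)` tuples.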
Here is a bit of background information and an explanation of how to train your own models: [8].
Made by Facebook, based on word2vec. Better at capturing syntactic relations (like apparent ---> apparently); see here:
[9]
Pretrained model files are HUGE - this will be a problem on computers with less than 16 GB of memory.
Included in the gensim package.
To install, just type
pip install gensim
into a command window.
Documentation is here: [10]
Invented by the Natural Language Processing Group at Stanford [11]. Uses more conventional math instead of neural network "black magic" [12]. Seems to perform just slightly worse than word2vec and fastText.