GMU:The Hidden Layer:Topics

== General Information on word embeddings ==
Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur in close proximity in a text than words which are far apart. See this article for details: [https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/]
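To get a feeling for what "close together in that space" means, here is a minimal sketch that compares made-up three-dimensional vectors with cosine similarity. Real embeddings have a few hundred dimensions, and the numbers below are invented purely for illustration.

<syntaxhighlight lang="python">
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 = similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" with invented values; real models use 100-300 dimensions.
vectors = {
    "cat":   np.array([0.9, 0.1, 0.2]),
    "dog":   np.array([0.8, 0.2, 0.1]),
    "piano": np.array([0.1, 0.9, 0.7]),
}

print(cosine_similarity(vectors["cat"], vectors["dog"]))    # high: related words
print(cosine_similarity(vectors["cat"], vectors["piano"]))  # low: unrelated words
</syntaxhighlight>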
 
The whole process goes through a number of stages:
 
=== 1.  The text corpus ===
This is the raw data used for learning. It determines the language, the topics that are covered, and the semantics.
Typical sources are Wikipedia and news articles.
=== 2.  The tokens ===
The corpus is split into words. These might be processed further, e.g. to clean up junk or to take inflections (such as verb forms) into account.
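As a sketch of this step, the ''gensim'' package (used further down) ships a simple tokenizer that lowercases, strips punctuation and drops very short tokens; handling inflections would need an extra lemmatization step that is not shown here.

<syntaxhighlight lang="python">
from gensim.utils import simple_preprocess

raw = "The cats were sitting on the mat."

# Lowercases, removes punctuation and tokens shorter than two characters.
tokens = simple_preprocess(raw)
print(tokens)  # ['the', 'cats', 'were', 'sitting', 'on', 'the', 'mat']
</syntaxhighlight>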
=== 3. Contexts ===
The words are grouped into contexts. These might be all words in a sentence, a certain number of neighbouring words, or words that relate to each other grammatically (as in "dependency based word embeddings" [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/]).
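A minimal sketch of the simplest variant, a fixed window of neighbouring words (the function name and window size are just illustrative choices):

<syntaxhighlight lang="python">
def context_windows(tokens, window=2):
    """Pair each word with the words at most `window` positions to its left and right."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield word, left + right

sentence = ["the", "cat", "sat", "on", "the", "mat"]
for word, context in context_windows(sentence):
    print(word, "->", context)
# e.g. "sat" -> ['the', 'cat', 'on', 'the']
</syntaxhighlight>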
=== 4. The algorithm ===
Different algorithms can be used to map the relationship between words and their contexts into a vector space. The main contenders are:
* '''Word2vec''' by Google, uses a shallow neural network
* '''fastText''' by Facebook, based on word2vec. Splits words into smaller sub-word units in order to capture syntactic relations (like apparent ---> apparently).  Explained here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]
* '''GloVe''' by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of Neural Network "Black Magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than Word2vec and fastText.
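Pretrained vectors for several of these algorithms can be tried out through the downloader in ''gensim'', which makes them easy to compare without training anything yourself. The dataset name below is one of gensim's bundled GloVe models and is fetched from the internet on first use.

<syntaxhighlight lang="python">
import gensim.downloader as api

# Small pretrained GloVe vectors; downloaded on first use (needs internet access).
glove = api.load("glove-wiki-gigaword-50")

# Nearest neighbours in the vector space.
print(glove.most_similar("piano", topn=5))

# The classic analogy test: king - man + woman ≈ queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
</syntaxhighlight>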
 
 
==Word2vec==
Made by Google; uses a neural network and performs well on semantic tasks.
=== Installation + getting started: ===
Included in the ''gensim'' package.
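A minimal training sketch with gensim's <code>Word2Vec</code> class on a toy corpus; in practice the input would be an iterable over millions of tokenized sentences, and the parameter names below follow gensim 4.x (older releases use <code>size</code> instead of <code>vector_size</code>).

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences. A real corpus would be far larger.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "and", "a", "dog", "played"],
]

# Parameter names follow gensim 4.x; older releases use `size` and `iter`.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"][:5])           # first dimensions of the learned vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours (not meaningful on such a tiny corpus)
</syntaxhighlight>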