Commit
removed some specific information from README, as I just changed the scripts
tmikolov committed Aug 1, 2013
1 parent 402ae80, commit ada9ca3
Showing 1 changed file with 13 additions and 21 deletions.
@@ -1,29 +1,21 @@
 Tools for computing distributed representation of words
 ------------------------------------------------------
 
-We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG).
+We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.
 
-Given a text corpus, the word2vec program learns a vector for every word using the Continuous
-Bag-of-Words or the Skip-Gram model. The user needs to specify the following:
+Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
+Bag-of-Words or the Skip-Gram neural network architectures. The user should specify the following:
 - desired vector dimensionality
 - the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
-- Whether hierarchical softmax is used
-- Whether negative sampling is used, and if so, how many negative samples should be used
-- A threshold for downsampling frequent words
-- Number of threads to use
-- Whether to save the vectors in a text format or a binary format
+- training algorithm: hierarchical softmax and / or negative sampling
+- threshold for downsampling the frequent words
+- number of threads to use
+- the format of the output word vector file (text or binary)
 
-Thus the programs require a very modest number of parameters. In particular, learning rates
-need not be selected.
+Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.
 
-The file demo-word.sh downloads a small (100MB) text corpus, and trains a 200-dimensional CBOW model
-with a window of size 5, negative sampling with 5 negative samples, a downsampling of 1e-3, 12 threads, and binary files.
+The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training
+is finished, the user can interactively explore the similarity of the words.
 
-./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -sample 1e-3 -threads 12 -binary 1
-
-Then, to evaluate the quality of the vectors, we can run the following command, which runs
-a battery of tests that evaluate the vectors' ability to perform linear analogies.
-
-./distance vectors.bin
+More information about the scripts is provided at https://code.google.com/p/word2vec/
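
For reference, the specific training command removed above still illustrates the parameters listed in the README. A minimal sketch of an equivalent run, assuming the makefile shipped with the repository and the text8 corpus used by the old example (the updated demo scripts may pick different settings):

# build the tools
make
# train 200-dimensional CBOW vectors on text8: context window 5, negative sampling with 5 samples,
# no hierarchical softmax, 1e-3 downsampling of frequent words, 12 threads, binary output
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -sample 1e-3 -threads 12 -binary 1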
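
Similarly, the interactive exploration mentioned in the new text uses the distance tool from the deleted example; a sketch of such a session, where the query word is only an illustration:

# run the demo end to end: it downloads the corpus, trains the model, and starts the query loop
./demo-word.sh
# or load an already trained vector file directly
./distance vectors.bin
# at the prompt, entering a word such as 'france' prints the words whose vectors are
# most similar to it, which is a quick sanity check of the trained model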