
Killed process #1

Open · nick-magnini opened this issue Jan 6, 2016 · 8 comments

@nick-magnini

Hi,

Thanks for making the code available. I have an embedding model in this format:

word1 4 -2 3 1 1 1 0 -2 2 3 1 0 0 0 -3 -4 0 0 3 -4 1 -5 2 -2 0 -1 -2 0 0 1 0 0 2 2 0 3 -4 -2 0 -5 -1 1 1 2 -2 0 -2 0 -2 -3 -1 -3 0 0 -5 0 5 -2 -1 -2 0 2 0 0 0 2 5 -3 1 2 1 -3 0 1 3 0 -3 0 1 -2 2 -1 -1 0 -4 2 0 -1 0 0 -1 1 0 -5 2 0 0 0 -2 -2
word2 ...

It contains 10,008,676 lines and is about 2.5 GB in size. I use Python 2.7. My command is:
$> ./qvec-python2.7.py --in_vectors $embedding --in_oracle oracles/semcor_noun_verb.supersenses.en

After printing "Loading VSM file: ....", it runs for around 10-20 minutes and then stops. The only output after it stops is "Killed". It can't be memory, since I tried bigger embeddings and they went through. What could the possible reason be?
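For context, here is a minimal sketch of how a loader for this "word v1 v2 ... vN" text format might read everything into a word-to-vector dict. This is my own illustration of the pattern, not qvec's actual code:

```python
# Hypothetical loader for the plain "word v1 v2 ... vN" format;
# qvec's real implementation may differ in details.
def load_vectors(path):
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors
```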

@ytsvetko
Owner

ytsvetko commented Jan 7, 2016

qvec was designed to load the whole embedding file into memory, because that makes it easier to calculate the column-wise correlations. If you want to use this implementation as-is, you will need a machine with enough RAM to hold the whole dataset.
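To illustrate why the full matrix is needed (a sketch of the idea, not qvec's actual code): the column-wise step pairs every embedding dimension with every oracle feature, so each column must be complete before any correlation can be computed.

```python
import numpy as np

def column_correlations(X, S):
    """Pearson r between each column of X (words x dims)
    and each column of S (words x oracle features)."""
    Xc = X - X.mean(axis=0)
    Sc = S - S.mean(axis=0)
    cov = Xc.T.dot(Sc)  # dims x features
    norms = np.outer(np.linalg.norm(Xc, axis=0),
                     np.linalg.norm(Sc, axis=0))
    return cov / norms
```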

I am now working on an improved version of qvec that uses a CCA algorithm instead of the sum of correlations; see qvec_cca.py. That implementation still loads everything into memory, but it does not have to, and it could be modified to process the data on the fly. However, it requires Matlab to be installed to perform the actual CCA calculation. Please see if it works better for you.
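qvec_cca.py delegates the CCA step to Matlab; purely as a stand-in to show the shape of the computation, scikit-learn's CCA does the analogous thing in Python (dummy data below):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

X = np.random.rand(1000, 50)  # dummy embeddings: words x dims
S = np.random.rand(1000, 40)  # dummy oracle matrix: words x features
cca = CCA(n_components=1).fit(X, S)
x_c, s_c = cca.transform(X, S)
print(np.corrcoef(x_c[:, 0], s_c[:, 0])[0, 1])  # first canonical correlation
```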

@nick-magnini
Author

The memory is actually enough.
$> free -g
             total       used       free     shared    buffers     cached
Mem:            93         42         51          0          0          6

The machine has 51 GB of free memory, so it shouldn't be a memory issue. I suspected that, which is why I ran it on a big machine.

@ytsvetko
Owner

ytsvetko commented Jan 7, 2016

Sorry, I didn't notice that you wrote in the first message that you have 51 GB free. However, I still think this is a memory issue, because the "Killed" message comes not from qvec but from your OS, most likely the kernel's out-of-memory killer. Even though you tried bigger embeddings, that does not necessarily mean the bigger file needs more memory: the data is stored in a Python dictionary, so if the bigger file has repeated lines or extra spaces, it may still take less memory once loaded. I suggest running qvec in one tmux pane and monitoring memory usage with htop in another.
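A quick way to confirm this from inside the process (a standard-library helper of my own, not part of qvec) is to log peak resident memory while the file loads; the kernel also records OOM kills, which dmesg typically shows.

```python
import resource

def peak_rss_gb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on OS X)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024.0 ** 2)

# e.g. print peak_rss_gb() every million lines during loading; if it
# climbs toward the machine's limit just before "Killed" appears,
# the OOM killer is the culprit.
```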

@nick-magnini
Author

Well, it's still surprising. Memory usage should depend only on the number of rows and the number of columns; everything else is the same.
An embedding file with 10,008,676 unique words and 100 dimensions per word should take much more memory than one with the same 10,008,676 unique words and only 15 dimensions each. Isn't that true?
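Rows and columns set the lower bound, but CPython's per-object overhead multiplies it. A rough back-of-envelope, under the assumption that qvec keeps a dict of lists of Python floats:

```python
# Approximate sizes for CPython 2.7 on a 64-bit machine.
words, dims = 10008676, 100
raw = words * dims * 8                   # packed 8-byte doubles: ~7.5 GiB
float_objs = words * dims * 24           # a boxed Python float is ~24 B
list_overhead = words * (8 * dims + 72)  # pointer array + list header (approx.)
print(raw / 2.0 ** 30)                                # ~7.5
print((float_objs + list_overhead) / 2.0 ** 30)       # ~30+
# ...before counting dict buckets, key strings, and the transient
# substrings created by split() on every line.
```

So ~7.5 GiB of raw numbers can plausibly become 30+ GiB once boxed, and with parsing transients on top that can get uncomfortably close to 51 GB.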

@nick-magnini
Author

Running it with gensim resolves the problem, though!
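For anyone landing here, a sketch of the gensim route (the loader shown is the API of gensim 1.x and later; older releases exposed the same method on gensim.models.Word2Vec). Note that the word2vec text format expects a header line with vocabulary size and dimensionality, which a file like the one described above would need:

```python
from gensim.models import KeyedVectors

# "embeddings.txt" is a placeholder path; binary=False for the text format
vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)
print(vectors["word1"])  # one float32 numpy row per word, held in a single
                         # contiguous matrix rather than a dict of lists
```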

@ytsvetko
Owner

Great, thanks for the update :)


@nick-magnini
Author

As a suggestion, it would be great to make your code compatible with gensim, since gensim is widely used.

@tmylk
Copy link

tmylk commented Aug 12, 2016
