
unable to open .../WNdb/dict/index.adv #115

Open
tommedema opened this issue Dec 23, 2013 · 16 comments

@tommedema

When I perform the following a couple of million times I get the error unable to open .../WNdb/dict/index.adv:

function isWord(text, cb) {
    wordnet.lookup(text, function(results) {
        cb(Array.isArray(results) && results.length > 0);
    });
}

Is there anything I can do to resolve this?

@kkoch986 kkoch986 added Bugs and removed Bugs labels Mar 7, 2014
@kkoch986
Member

@tommedema I'm having a hard time pinpointing this (mainly because these wordnet lookups are really slow). It did crash for me, but I didn't find the same error.

In the meantime, is it essential to use wordnet for this? I hate to keep pushing this on people, but the Trie is basically purpose-built for the 'isWord' test. I've built my trie using this code, and then you can do something like this:

trie.contains(word);

to get a synchronous and lightning-fast answer.
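To illustrate why a trie-based isWord test is so fast, here is a minimal hand-rolled trie in plain JavaScript. This is a sketch, not natural's actual Trie implementation, but the shape of the lookup is the same: a synchronous, in-memory walk of at most word.length nodes, with no file I/O at all.

```javascript
// Minimal trie: insert words once up front, then each membership
// test is a synchronous walk over at most word.length nodes.
class Trie {
  constructor() {
    this.root = {};            // each node maps char -> child node
  }
  add(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch] || (node[ch] = {});
    }
    node.$ = true;             // mark end-of-word (sketch: assumes '$' is not a letter)
  }
  contains(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch];
      if (!node) return false; // path missing: not a word, not even a prefix
    }
    return node.$ === true;    // reject bare prefixes like 'runn'
  }
}

const trie = new Trie();
['run', 'runner', 'running'].forEach((w) => trie.add(w));

console.log(trie.contains('runner'));  // true
console.log(trie.contains('runn'));    // false: a prefix, but not a word
```

Because nothing here touches the filesystem, there is no file handle to leak, which is why this approach sidesteps the crash entirely.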

Here are some times (not rock-solid benchmarks, but you get the idea) for comparison:

Using WordNet: crashed after about 4k lookups, after a minute or two
Using Trie: 23,588,700 lookups in ~39 seconds

Threw the code into a gist if you want to check it out: https://gist.github.com/kkoch986/9899177

In the meantime, I'll tag this as a bug.

Thanks,
-Ken

@kkoch986 kkoch986 added the Bugs label Mar 31, 2014
@tommedema
Author

Thanks, I am no longer working on this issue.

@kkoch986
Member

OK, no problem. Just curious, did you manage to resolve the error?

@tommedema
Author

I didn't :)

@moos
Contributor

moos commented May 3, 2014

As already mentioned, the WordNet module is bare-bones and notoriously slow for anything beyond simple lookups. You may want to look at https://github.com/moos/wordpos, built on top of natural's WordNet, with performance optimized using additional fast-index files and cached disk reads.

Although for simple isWord operations, I agree @kkoch986's suggestion might be better.

@kkoch986
Member

kkoch986 commented May 5, 2014

Actually going to close this unless someone else runs into a similar problem. I think it's safe to say the WNdb code is not best used directly this way. @moos, I haven't had a chance to try the new wordpos module, but it looks pretty cool. Thanks for the tip!

@kkoch986 kkoch986 closed this as completed May 5, 2014
@ahamid

ahamid commented Oct 14, 2014

I encounter the same error using the wordnet.lookup(word, cb) API.
If I wait a few seconds I get the same error for data.adv. Both index.adv and data.adv exist on disk at the reported location and are readable under the current user.

Edit: some more debugging: this appears to be a problem with too many open file handles:

{ [Error: EMFILE, open '/home/aaron/blah/blah/node_modules/WNdb/dict/index.adv']
  errno: 20,
  code: 'EMFILE',
  path: '/home/aaron/blah/blah/node_modules/WNdb/dict/index.adv' }

http://stackoverflow.com/questions/8965606/node-and-error-emfile-too-many-open-files

It looks like index_file.js and data_file.js may not be appropriately calling the file-close callback in their, um, callback...

Ouch, this is thorny: even if the file handles are in theory being (eventually) closed, because the API is async, we can have potentially unlimited pending calls, and therefore opened file handles. Given I'm looking up hundreds, possibly thousands, of synonyms, this is likely the case :(

@moos
Contributor

moos commented Oct 14, 2014

natural opens the index file on each lookup, and if you've got thousands of simultaneous lookups, that's how many open files you'll have. wordpos is optimized for multiple async reads and is much faster. You could combine wordpos' getPOS() or isX() method with its lookupX() for better performance than natural's lookup().

@kkoch986
Member

Going to reopen this; it seems like the issue is the files not being closed promptly. I'll try to dig in further and see if I can find anything.

@kkoch986 kkoch986 reopened this Oct 15, 2014
@ahamid

ahamid commented Oct 15, 2014

It turns out the Bluebird promise library's map function supports a concurrency option that can limit the number of pending promises, so I used that to work around this problem.

Promise.map words, ((word) => @lookupWordNetInfo word), concurrency: 10
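For anyone not using Bluebird (or CoffeeScript), the same bounded-concurrency idea can be sketched dependency-free in plain JavaScript. The `lookupWordNetInfo` function below is a hypothetical stand-in for whatever async lookup you are rate-limiting; the point is that at most `limit` calls (and therefore at most `limit` open file handles) are ever in flight at once.

```javascript
// Map an async fn over items with at most `limit` calls in flight.
// Bounding concurrency bounds the number of simultaneously open files,
// which is what avoids the EMFILE error.
function mapWithConcurrency(items, fn, limit) {
  const results = new Array(items.length);
  let next = 0;                      // index of the next unclaimed item
  async function worker() {
    while (next < items.length) {
      const i = next++;              // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  return Promise.all(workers).then(() => results);
}

// Hypothetical async lookup standing in for wordnet.lookup.
const lookupWordNetInfo = (word) =>
  new Promise((resolve) => setImmediate(() => resolve(word.length)));

mapWithConcurrency(['cat', 'horse', 'ox'], lookupWordNetInfo, 10)
  .then((lengths) => console.log(lengths));  // [ 3, 5, 2 ]
```

Results come back in input order regardless of completion order, matching Bluebird's Promise.map behavior.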

@kkoch986
Member

@ahamid So do you think the problem is too many concurrent calls? That could potentially explain why too many files are open at once.

@ahamid

ahamid commented Oct 23, 2014

@kkoch986 Yeah, I'm pretty sure that's the case (well, I haven't proven the opposite, that files aren't eventually getting closed, but the code looked fine on casual inspection). It's just the tradeoff of using an async-only API. I never got around to using wordpos since the map trick did the job; it has become my go-to hammer for this sort of thing.

@kkoch986
Member

Yeah, I think it's worth a closer look, maybe an option to just load the whole thing into memory. There's no reason to keep reading it from files every time anyway, especially if you're doing a large number of lookups.

@kkoch986
Member

So just to give everyone the latest news on this: I am kicking off a rewrite of natural's wordnet layer, which should result in cleaner code and better performance. Hopefully in the next few weeks I'll have something to show for it and we can finally close this issue.

@moos
Contributor

moos commented Feb 8, 2015

I think wordpos already solves this problem; on top of that, its 'fastIndex' provides a 30x performance boost over natural's WordNet methods. I'm happy to contribute any or all of wordpos's code to this effort, whether as a rewrite, a sub-module, or a drop-in plugin.
If you go the wholesale-rewrite route, I'm afraid it'll break wordpos, since it was built on top of the WordNet module's API.

@kkoch986
Member

kkoch986 commented Feb 9, 2015

@moos see #211 and #170 the plan is to reimplement for performance/stability while maintaining the base API.

There's a good chance we will build more functionality on top of the basic API, but the main plan is to at least stabilize the code using the same API and move the wordnet downloading into an in-library corpus manager. Would love your input on this whole effort as well; I'm just getting into the actual wordnet files and coming up with a plan for indexing them more efficiently.
