Better query parsing #7
Comments
Here is a Google Scholar blog post about detecting reference strings: https://scholar.googleblog.com/2016/01/quickly-lookup-references.html The jargon-y term for this use case is "known item lookup".
An initial version of this has been implemented and is live. Testing and iteration are probably still needed.
Some user queries are getting re-written poorly with the current system:
The original query was probably:
Some of this may be due to copy/paste from other sources? Eg, an email or a multi-line record on a website. For one thing, we probably shouldn't return the re-written (quoted) query in the search box; we should return the original query string. Any time we rewrite/modify the query, we should indicate that it happened, though, and link to the query documentation. Other possible improvements or workarounds are to have an "advanced search" page, or to have separate search boxes/options for different types of query. I'd like to try a little more to stick with the "one simple box" experience though.
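A minimal sketch of the "keep the original, flag the rewrite" idea, in Python. Everything here is hypothetical: `ParsedQuery`, `prepare_query`, and the unbalanced-quote rule are made up for illustration and are not the current implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedQuery:
    """Hypothetical container: what the user typed vs. what we actually search."""

    original: str   # echoed back verbatim in the search box
    effective: str  # sent to the search backend
    rewritten: bool  # True if the UI should show a "query was modified" notice


def prepare_query(raw: str) -> ParsedQuery:
    # Collapse newlines and repeated whitespace (common with copy/paste).
    # This is treated as harmless normalization, not as a rewrite.
    cleaned = " ".join(raw.split())

    # Illustrative rewrite rule (an assumption): an odd number of double
    # quotes cannot form valid phrases, so drop the quotes and flag it.
    if cleaned.count('"') % 2 == 1:
        return ParsedQuery(raw, cleaned.replace('"', ""), rewritten=True)

    return ParsedQuery(raw, cleaned, rewritten=False)
```

The template would then render `original` in the search box and, when `rewritten` is set, show a short notice linking to the query documentation.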
A particular user request is to be able to paste a citation string into the search box and have "the right thing" happen in most cases. The current query parser (Elasticsearch's built-in) doesn't work well for this; it expects a structured query string (with booleans etc).
A great solution would be a custom query parser with perfect detection of user intent that "does the expected thing". In the meantime, more practically, we could try to differentiate between regular queries and citation string queries, and have two code paths. The query string path would be the current behavior. The citation string path would use, eg, GROBID and/or biblio-glutton to parse the raw citation into a structured citation, then try to do a fuzzy match against the live fatcat metadata index (generally faster than the scholar fulltext index), and if there is a hit, do an exact identifier lookup against scholar elasticsearch. The latter half of this code path would be similar to the current behavior for identifier lookups (eg, remove all filters and sort order).
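As a rough sketch of the "two code paths" split, something like the following heuristic could decide which path a query takes. This is only an illustration: the regexes, the six-token minimum, and the score threshold are guesses that would need tuning against real query logs, and `classify_query` / `route_query` are hypothetical names. The actual GROBID / biblio-glutton and fatcat lookups are left as callables passed in by the caller.

```python
import re
from typing import Callable, Literal, TypeVar

T = TypeVar("T")

# Signs of an intentionally structured query: uppercase booleans or a
# lowercase `field:value` prefix. (Assumed patterns, not the real syntax list.)
STRUCTURED = re.compile(r"\b(?:AND|OR|NOT)\b|\b[a-z_]+:\S")

# Weak signals that the text is a pasted citation.
YEAR = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")
VOLUME_PAGES = re.compile(r"\b\d+\s*\(\d+\)|\b\d+\s*[-\u2013]\s*\d+\b|\bpp?\.\s*\d+")
AUTHOR_INITIALS = re.compile(r"\b[A-Z][a-z]+,?\s+(?:[A-Z]\.\s*)+")
ET_AL = re.compile(r"\bet al\b", re.IGNORECASE)


def classify_query(raw: str) -> Literal["query_string", "citation"]:
    text = " ".join(raw.split())  # flatten multi-line pastes
    if STRUCTURED.search(text):
        return "query_string"  # respect explicit query syntax
    if len(text.split()) < 6:
        return "query_string"  # too short to be a full citation
    signals = (YEAR, VOLUME_PAGES, AUTHOR_INITIALS, ET_AL)
    score = sum(1 for pattern in signals if pattern.search(text))
    return "citation" if score >= 2 else "query_string"


def route_query(
    raw: str,
    run_query_string: Callable[[str], T],
    run_citation_lookup: Callable[[str], T],
) -> T:
    # The citation path would parse, fuzzy match, then do an identifier lookup.
    if classify_query(raw) == "citation":
        return run_citation_lookup(raw)
    return run_query_string(raw)
```

A reasonable default would be to fall back to the query string path whenever the citation path finds no match, so a false positive from the heuristic degrades to the current behavior rather than to an empty result page.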