Better query parsing #7

bnewbold · 2020-10-02T03:48:01Z

A particular user request is to be able to paste a citation string into the search box and have "the right thing" happen in most cases. The current query parser (Elasticsearch's built-in) doesn't work well for this; it expects a structured query string (with booleans etc).

A great solution would be a custom query parser with perfect detection of user intent that "does the expected thing". In the meantime, more practically, we could try to differentiate between regular queries and citation string queries, and have two code paths. The query string path would be the current behavior. The citation string path would use, eg, GROBID and/or biblio-glutton to parse the raw citation into a structured citation, then try to do a fuzzy match against the live fatcat metadata index (generally faster than the scholar fulltext index), and if there is a hit do an exact identifier lookup against scholar elasticsearch. The latter half of this code path would be similar to the current behavior for identifier lookups (eg, remove all filters and sort order).
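A rough sketch of the two code paths, to make the idea concrete. Everything here is an assumption for illustration: the heuristic in `looks_like_citation`, its thresholds, and the list of field names are made up, and the four callables passed to `search` are hypothetical stand-ins for the existing search, a GROBID/biblio-glutton parse step, the fatcat fuzzy match, and the identifier lookup. Falling back to the regular path when there is no hit is also an assumption.

```python
import re
from typing import Any, Callable, Optional

# Hypothetical field names; the real query syntax may support others.
FIELD_PREFIX = re.compile(r"\b(?:journal|year|title|author):")
BOOLEAN_OP = re.compile(r"\b(?:AND|OR|NOT)\b")
YEAR = re.compile(r"\b(?:1[5-9]\d\d|20\d\d)\b")


def looks_like_citation(query: str) -> bool:
    """Crude guess at whether a query is a pasted citation string."""
    q = query.strip()
    # Explicit field prefixes or boolean operators suggest a structured query.
    if FIELD_PREFIX.search(q) or BOOLEAN_OP.search(q):
        return False
    # Citations tend to be long, contain a year, and have some punctuation.
    if len(q.split()) < 6:
        return False
    has_year = bool(YEAR.search(q))
    has_punct = q.count(",") + q.count(".") >= 2
    return has_year and has_punct


def search(
    query: str,
    *,
    plain_search: Callable[[str], Any],
    parse_citation: Callable[[str], Optional[dict]],
    fuzzy_match: Callable[[dict], Optional[str]],
    lookup_by_ident: Callable[[str], Any],
) -> Any:
    """Dispatch between the query string path and the citation string path."""
    if not looks_like_citation(query):
        return plain_search(query)
    structured = parse_citation(query)
    ident = fuzzy_match(structured) if structured else None
    if ident is None:
        # Assumed fallback: no confident match, so behave as before.
        return plain_search(query)
    # Exact identifier lookup, ignoring filters and sort order.
    return lookup_by_ident(ident)
```

Passing the backends in as callables keeps the dispatch logic testable without a running GROBID or elasticsearch instance.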

bnewbold · 2020-10-12T19:57:06Z

Here is a Google Scholar blog post about detecting reference strings: https://scholar.googleblog.com/2016/01/quickly-lookup-references.html

The jargon-y term for this use case is "known item lookup".

bnewbold · 2021-01-26T01:29:51Z

An initial version of this has been implemented and is live. Testing and iteration probably needed.

bnewbold · 2021-02-12T20:25:24Z

Some user queries are getting re-written poorly with the current system:

"journal:" Post Communist Economies "year:" 2021
"Title:" A multi-speed fiscal "Europe?" Fiscal rules and fiscal performance in the EU former communist countries. It appears to be online content from Post Communist Economies 31Jan 2021. "Link:" "https://www-tandfonline-com.libproxy-imf.imf.org/doi/full/10.1080/14631377.2020.1867432"

The original query was probably:

journal: Post Communist Economies year: 2021

Some of this may be due to copy/paste from other sources? Eg, an email or multi-line record on a website.

For one thing, we probably shouldn't return the re-written (quoted) query; we should return the original query string (in the search box). Any time we rewrite/modify the query, we should indicate that it happened though, and link to the query documentation.
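One way this could look, as a minimal sketch: keep both the original and the executed query around, and only surface the original in the search box. The `QueryInfo` class, the `search_box_state` helper, and the `/help/search` docs path are all hypothetical names for illustration, not existing code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryInfo:
    original: str  # exactly what the user typed or pasted
    executed: str  # what was actually sent to elasticsearch

    @property
    def was_rewritten(self) -> bool:
        return self.original != self.executed


def search_box_state(info: QueryInfo, docs_url: str = "/help/search") -> dict:
    """Values for the template: box contents plus an optional rewrite notice."""
    notice: Optional[str] = None
    if info.was_rewritten:
        notice = f"Your query was modified before searching. See {docs_url}"
    # The search box always shows the original string, never the rewrite.
    return {"box_value": info.original, "notice": notice}
```

With the example above, the box would keep showing `journal: Post Communist Economies year: 2021` and the page would carry a notice linking to the docs.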

Other possible improvements or workarounds are to have an "advanced search" page, or to have separate search boxes/options for different types of query. I'd like to try a little more to stick with the "one simple box" experience though.

bnewbold added the enhancement and help wanted labels and removed the help wanted label · Oct 2, 2020
