You've seen how to add features to feature sets. We want to show you how to address common feature engineering tasks that come up when developing a learning to rank solution.
Many learning to rank solutions use raw term statistics in training. For example, the total term frequency for a term, the document frequency, and other statistics. Luckily, Elasticsearch LTR comes with a query primitive, match_explorer
, that extracts these statistics for you for a set of terms. In its simplest form, match_explorer
lets you specify a statistic you're interested in and a match you'd like to explore. For example:
POST tmdb/_search { "query": { "match_explorer": { "type": "max_raw_df", "query": { "match": { "title": "rambo rocky" } } } } }
This query returns the highest document frequency between the two terms.
A large number of statistics are available. The type
parameter can be prepended with the operation to be performed across terms for the statistic max
, min
, sum
, and stddev
.
The statistics available include:
raw_df
-- the direct document frequency for a term. So if rambo occurs in 3 movie titles, this is 3.classic_idf
-- the IDF calculation of the classic similaritylog((NUM_DOCS+1)/(raw_df+1)) + 1
.raw_ttf
-- the total term frequency for the term across the index. So if rambo is mentioned a total of 100 times in the overview field, this would be 100.raw_tf
-- the term frequency for a document. So if rambo occurs in 3 in movie synopsis in same document, this is 3.
Putting the operation and the statistic together, you can see some examples. To get stddev of classic_idf, you would write stddev_classic_idf
. To get the minimum total term frequency, you'd write min_raw_ttf
.
Term position statistics
The type
parameter can be prepended with the operation to be performed across term position for the statistic min
, max
and avg
.
For any of the cases, 0 will be returned if there isn't any occurrence of the terms in the document.
The statistics available include, e.g. using the query "dance monkey" we have:
min_raw_tp
-- return the minimum occurrence, i.e. the first one, of any term on the query. So if dance occurs at positions [2, 5 ,9], and monkey occurs at positions [1, 4] in a text in the same document, the minimum is 1.max_raw_tp
-- return the maximum occurrence, i.e. the last one, of any term on the query. So if dance occurs at positions [2, 5 ,9] and monkey occurs at positions [1, 4] in a text in the same document, the maximum is 9.avg_raw_tp
-- return the average of all occurrence of the terms on the query. So if dance occurs at positions [2, 5 ,9] its average is5.33
, and monkey has average2.5
for positions [1, 4]. So the returned average is3.91
, computed by(5.33 + 2.5)/2
.
Finally a special stat exists for just counting the number of search terms. That stat is unique_terms_count
.
Another common case in learning to rank is features such as popularity or recency, tied only to the document. Elasticsearch's function_score
query has the functionality you need to pull this data out. You already saw an example when adding features in the last section:
{ "query": { "function_score": { "functions": [{ "field_value_factor": { "field": "vote_average", "missing": 0 } }], "query": { "match_all": {} } } } }
The score for this query corresponds to the value of the vote_average
field.
If you have an index that updates regularly, trends that held true today, may not hold true tomorrow! On an e-commerce store, sandals might be very popular in the summer, but impossible to find in the winter. Features that drive purchases for one time period, may not hold true for another. It's always a good idea to monitor your model's performance regularly, retrain as needed.
Next up, we discuss the all-important task of logging features in :doc:`logging-features`.