RankLibBM25F-IA

A Learning to Rank Library.

Forked from: jattenberg/RankLib

which was in turn copied from: http://people.cs.umass.edu/~vdang/ranklib.html

Additional Features

BM25F Ranking Function

The BM25F ranking function is for documents containing multiple fields. In this context, a field is a categorized part of a document; for example, names, descriptions, and prices are all possible fields of product detail pages.

BM25F supports multiple categories of fields, each with its own weight. The function is represented by the formula:

The BM25F formula

where q is a set of query keywords, beta is a document (i.e., a set of term occurrences), ka is a query keyword, k1 is a saturation factor for term frequency, N is the number of documents in the corpus, and df(ka) is the number of documents in the corpus in which ka occurs.
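For reference, here is a LaTeX sketch of this formula, assuming the standard BM25F formulation and using the definitions above:

\mathrm{BM25F}(q, \beta) = \sum_{k_a \in q} \frac{\mathit{weight}(k_a, \beta)}{k_1 + \mathit{weight}(k_a, \beta)} \cdot \log \frac{N}{\mathit{df}(k_a)}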

The score of beta for ka, weight(ka, beta), is given by the following weighting formula:

The BM25F weighting formula

where occurs(ka, f, beta) is the number of occurrences of ka in the field f of beta, boost_f is the weight of f relative to the other fields, b_f is the strength of the length-based normalization of occurs in f, length is the length of f in beta, and avgLength is the average length of f.
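Similarly, a LaTeX sketch of this weighting formula, assuming the standard BM25F field weighting and using the definitions above:

\mathit{weight}(k_a, \beta) = \sum_{f} \frac{\mathit{occurs}(k_a, f, \beta) \cdot \mathit{boost}_f}{(1 - b_f) + b_f \cdot \mathit{length}(f, \beta) / \mathit{avgLength}(f)}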

Strict sorting of input features is required to use this ranking function.

/**
 * @author tmanabe
 *
 * This class implements the BM25F model and its optimization with the method known as Coordinate Ascent.
 *
 * You must sort kf + 1 features in the following order where
 * k means |keywords|, f |fields|, tf(i, j) TF of i-th keyword in j-th field,
 * H(i) entropy of i-th keyword, and lp(j) length penalty of j-th field.
 * [k,     H(1),     H(2),     H(3),     ..., H(k),
 *  lp(1), tf(1, 1), tf(2, 1), tf(3, 1), ..., tf(k, 1),
 *  lp(2), tf(1, 2), tf(2, 2), tf(3, 2), ..., tf(k, 2),
 *  lp(3), tf(1, 3), tf(2, 3), tf(3, 3), ..., tf(k, 3),
 *  ...
 *  lp(f), tf(1, f), tf(2, f), tf(3, f), ..., tf(k, f)]
 * Note that H(i) is the (i+1)-th feature,
 * lp(j) the ((k+1)j+1)-th, and tf(i, j) the ((k+1)j+i+1)-th.
 */

The H(k), lp(f), and tf(k, f) feature values in the above JavaDoc correspond to the partial formulae illustrated below:

The correspondence between features and BM25F weighting formula

Note: A different number of feature values for different query IDs is allowed, because rankings for multiple queries with different numbers of keywords may be optimized at once.
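The following is a minimal Java sketch (not part of the library) of assembling one document's feature vector in the order documented above; entropy, lengthPenalty, and tf are hypothetical precomputed values for H(i), lp(j), and tf(i, j):

import java.util.Arrays;

// Sketch of the BM25F feature ordering: [k, H(1..k), lp(1), tf(1..k, 1), lp(2), tf(1..k, 2), ...]
public class BM25FFeatureOrderSketch {
    static float[] buildFeatures(int k, int f, float[] entropy,
                                 float[] lengthPenalty, float[][] tf) {
        float[] features = new float[1 + k + f * (1 + k)];
        int pos = 0;
        features[pos++] = k;                        // 1st feature: |keywords|
        for (int i = 0; i < k; i++) {
            features[pos++] = entropy[i];           // H(i+1)
        }
        for (int j = 0; j < f; j++) {
            features[pos++] = lengthPenalty[j];     // lp(j+1)
            for (int i = 0; i < k; i++) {
                features[pos++] = tf[i][j];         // tf(i+1, j+1)
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Two keywords, two fields, arbitrary illustrative values.
        float[] features = buildFeatures(2, 2,
                new float[] {1.5f, 0.8f},            // H(1), H(2)
                new float[] {0.9f, 1.1f},            // lp(1), lp(2)
                new float[][] {{3f, 0f}, {1f, 2f}}); // tf(i, j)
        System.out.println(Arrays.toString(features));
        // [2.0, 1.5, 0.8, 0.9, 3.0, 1.0, 1.1, 0.0, 2.0]
    }
}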

Use of this ranking function triggers optimization of the k1, b_f, and boost_f parameters with coordinate ascent. They are used internally and output in this order:

// [k1, b(1), b(2), b(3), ..., b(f), boost(1), boost(2), boost(3), ..., boost(f)]
// Note that b(j) is the j-th element and also that boost(j) is the (f+j)-th.

SpanF Ranking Function

The SpanF ranking function is a variation of BM25F that considers the proximity between keyword occurrences. The details of the function are given in a short note:

https://arxiv.org/pdf/1709.03260

Strict sorting of input features is required to use this ranking function.

/**
 * @author tmanabe
 *
 * This class implements the Span model proposed by Song et al. [1] then
 * naturally extended for multiple fields by tmanabe [2], and
 * its optimization with the method known as Coordinate Ascent.
 *
 * [1] R. Song, M. J. Taylor, J.-R. Wen, H.-W. Hon, and Y. Yu. Viewing term
 * proximity from a different perspective. In ECIR, pages 346–357, 2008.
 * [2] T. Manabe and S. Fujita. A Short Note on Proximity-based Scoring of
 * Documents with Multiple Fields. CoRR, arXiv:1709.03260, 2017.
 *
 * You must sort features in the following order where
 * k means |keywords|, f |fields|,
 * H(j) the entropy of j-th keyword,
 * lp(i) the length penalty of i-th field,
 * n(i,j) |spans of i-th field in which j-th keyword occurs|,
 * s(h,i) |keywords occurring in the h-th span of i-th field|, and
 * w(h,i,j) the width of h-th span of i-th field in which j-th keyword occurs.
 *
 * [k, f, H(1), H(2), H(3), ..., H(k), lp(1), lp(2), lp(3), ..., lp(f),
 *
 *  n(1,1), s(1,1), w(1,1,1), s(2,1), w(2,1,1), ..., s(n(1,1),1), w(n(1,1),1,1),
 *  n(2,1), s(1,2), w(1,2,1), s(2,2), w(2,2,1), ..., s(n(2,1),2), w(n(2,1),2,1),
 *  n(3,1), s(1,3), w(1,3,1), s(2,3), w(2,3,1), ..., s(n(3,1),3), w(n(3,1),3,1),
 *  ...
 *  n(f,1), s(1,f), w(1,f,1), s(2,f), w(2,f,1), ..., s(n(f,1),f), w(n(f,1),f,1),
 *
 *  n(1,2), s(1,1), w(1,1,2), s(2,1), w(2,1,2), ..., s(n(1,2),1), w(n(1,2),1,2),
 *  n(2,2), s(1,2), w(1,2,2), s(2,2), w(2,2,2), ..., s(n(2,2),2), w(n(2,2),2,2),
 *  n(3,2), s(1,3), w(1,3,2), s(2,3), w(2,3,2), ..., s(n(3,2),3), w(n(3,2),3,2),
 *  ...
 *  n(f,2), s(1,f), w(1,f,2), s(2,f), w(2,f,2), ..., s(n(f,2),f), w(n(f,2),f,2),
 *
 *  n(1,3), s(1,1), w(1,1,3), s(2,1), w(2,1,3), ..., s(n(1,3),1), w(n(1,3),1,3),
 *  n(2,3), s(1,2), w(1,2,3), s(2,2), w(2,2,3), ..., s(n(2,3),2), w(n(2,3),2,3),
 *  n(3,3), s(1,3), w(1,3,3), s(2,3), w(2,3,3), ..., s(n(3,3),3), w(n(3,3),3,3),
 *  ...
 *  n(f,3), s(1,f), w(1,f,3), s(2,f), w(2,f,3), ..., s(n(f,3),f), w(n(f,3),f,3),
 *
 *  ...
 *
 *  n(1,k), s(1,1), w(1,1,k), s(2,1), w(2,1,k), ..., s(n(1,k),1), w(n(1,k),1,k),
 *  n(2,k), s(1,2), w(1,2,k), s(2,2), w(2,2,k), ..., s(n(2,k),2), w(n(2,k),2,k),
 *  n(3,k), s(1,3), w(1,3,k), s(2,3), w(2,3,k), ..., s(n(3,k),3), w(n(3,k),3,k),
 *  ...
 *  n(f,k), s(1,f), w(1,f,k), s(2,f), w(2,f,k), ..., s(n(f,k),f), w(n(f,k),f,k)]
 */
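Analogously, a minimal Java sketch (not part of the library) of assembling one document's feature list in the documented order; entropy, lengthPenalty, spanKeywords, and spanWidth are hypothetical precomputed values for H(j), lp(i), s(h, i), and w(h, i, j), with spanWidth[j][i].length playing the role of n(i, j):

import java.util.ArrayList;
import java.util.List;

// Sketch of the SpanF feature ordering: [k, f, H(1..k), lp(1..f), then one block per keyword,
// each block containing one row per field: n(i, j) followed by (s, w) pairs for its spans.]
public class SpanFFeatureOrderSketch {
    static List<Float> buildFeatures(int k, int f, float[] entropy, float[] lengthPenalty,
                                     float[][][] spanKeywords, float[][][] spanWidth) {
        List<Float> features = new ArrayList<>();
        features.add((float) k);                                      // |keywords|
        features.add((float) f);                                      // |fields|
        for (int j = 0; j < k; j++) features.add(entropy[j]);         // H(1..k)
        for (int i = 0; i < f; i++) features.add(lengthPenalty[i]);   // lp(1..f)
        for (int j = 0; j < k; j++) {                                 // one block per keyword
            for (int i = 0; i < f; i++) {                             // one row per field
                int n = spanWidth[j][i].length;                       // n(i+1, j+1)
                features.add((float) n);
                for (int h = 0; h < n; h++) {
                    features.add(spanKeywords[j][i][h]);              // s(h+1, i+1)
                    features.add(spanWidth[j][i][h]);                 // w(h+1, i+1, j+1)
                }
            }
        }
        return features;
    }
}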

Use of this ranking function triggers optimization of the k1, z_f, x_f, b_f, and boost_f parameters with coordinate ascent. They are used internally and output in this order:

/**
 * [k1,
 *  z(1), z(2), z(3), ..., z(f),
 *  x(1), x(2), x(3), ..., x(f),
 *  b(1), b(2), b(3), ..., b(f),
 *  boost(1), boost(2), boost(3), ..., boost(f)]
 *  Note that z(i) is the i-th element,
 *  that x(i) is the (f+i)-th,
 *  that b(i) is the (2f+i)-th, and
 *  that boost(i) is the (3f+i)-th.
 */

Intent-Aware Labels

Multiple users may have different intents behind a query. To represent such queries, you can input multiple comma-separated answer labels for each document. Labels for the same intent should be aligned in the same column.

Note: A different number of labels for different query IDs is also allowed.
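A hypothetical illustration of the input, assuming the standard RankLib (SVMlight-style) line format with the single relevance label replaced by comma-separated per-intent labels; here the first query has three intents and the second has two:

2,0,1 qid:1 1:3.0 2:0.5 3:1.0 # docA
0,1,0 qid:1 1:2.0 2:0.1 3:0.0 # docB
1,2 qid:2 1:1.0 2:0.7 3:2.0 # docC
0,1 qid:2 1:0.5 2:0.3 3:1.5 # docD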

MAP-IA Evaluation Measure

MAP-IA (intent-aware mean average precision) is an evaluation measure for rankings of queries with multiple possible intents. It is the weighted sum of the MAP scores for the individual intents, where the weights are the probabilities that the respective intents occur.

Unfortunately, there is currently no way to input the intent probabilities, so the implementation calculates the simple arithmetic mean of the per-intent MAP scores. In other words, all intents behind each query are assumed to be equally probable.
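As a LaTeX sketch of the above, where I_q is the set of intents behind query q and MAP_i is the MAP computed against the labels of intent i:

\mathrm{MAP\text{-}IA}(q) = \sum_{i \in I_q} P(i \mid q) \cdot \mathrm{MAP}_i(q) \approx \frac{1}{|I_q|} \sum_{i \in I_q} \mathrm{MAP}_i(q)

The approximation on the right is what the current implementation computes.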

Multiple Documents Sharing ID

If multiple documents share the same document ID, only the top-scoring one is output.
