Learning to Rank is a class of techniques that apply supervised machine learning to solve ranking problems. It is widely used in information retrieval (IR), natural language processing (NLP), and recommender systems.
Machine Learning
Ranking
Recommender Systems
Author
Daniel Fat
Published
April 23, 2021
Imports
Code
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport xgboost as xgbfrom sklearn.linear_model import LogisticRegressionfrom sklearn.model_selection import train_test_splitfrom sklearn import metricsimport shapplt.style.use('seaborn')
What is Ranking ?
Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines.
Given a query \(q\) and a collection \(D\) of documents that match the query, the problem is to rank, that is, sort, the documents in \(D\) according to some criterion so that the “best” results appear early in the result list displayed to the user.
Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems.
A majority of search engines use ranking algorithms to provide users with accurate and relevant results.
Applications of Ranking: - Information Retrieval - Collaborative Filtering - Automated Text Scoring (Essay Scoring) - Machine Translation - Sentence Parsing
Learning to Rank
Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. “relevant” or “not relevant”) for each item. The ranking model purposes to rank, i.e. producing a permutation of items in new, unseen lists in a similar way to rankings in the training data.
Supervised Ranking
Pointwise model
Pointwise approaches look at a single document at a time in the loss function. They essentially take a single document and train a classifier / regressor on it to predict how relevant it is for the current query. The final ranking is achieved by simply sorting the result list by these document scores. For pointwise approaches, the score for each document is independent of the other documents that are in the result list for the query. All the standard regression and classification algorithms can be directly used for pointwise learning to rank.
Regression, Classification, Ordinal regression (items to be ranked are treated in isolation)
Scenario: We have customers searching for different movies and those have been ranked by a customer relevance score
let’s take a copy of our test dataframe and apply the model
Code
val = test.copy().reset_index(drop=True)val['y_pred'] = clf.predict(test.drop([target],axis=1))val['y_pred_proba'] = clf.predict_proba(test.drop([target],axis=1))[:,1]val[:3]
price
rating
relevance
genere_Action
genere_Adventure
genere_Animation
genere_Children’s
genere_Comedy
genere_Crime
genere_Documentary
...
genere_Horror
genere_Musical
genere_Mystery
genere_Romance
genere_Sci-Fi
genere_Thriller
genere_War
genere_Western
y_pred
y_pred_proba
0
44.963785
45.100109
0.0
0
0
0
0
0
0
0
...
0
1
0
0
0
0
0
0
0.0
0.002454
1
21.442252
35.388441
1.0
0
0
0
0
0
0
0
...
0
0
0
0
0
0
1
0
0.0
0.128310
2
30.280563
26.777554
0.0
0
0
0
0
0
1
0
...
0
0
0
0
0
0
0
0
0.0
0.010166
3 rows × 23 columns
We use the Compute Normalized Discounted Cumulative Gain(NDCG) to assess our model
Sum the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount.
Then divide by the best possible score (Ideal DCG, obtained for a perfect ranking) to obtain a score between 0 and 1.
Code
metrics.ndcg_score([val[target]],[val['y_pred']])
0.9131585064991307
Pairwise model
Pairwise approaches look at a pair of documents at a time in the loss function. Given a pair of documents, they try and come up with the optimal ordering for that pair and compare it to the ground truth. The goal for the ranker is to minimize the number of inversions in ranking i.e. cases where the pair of results are in the wrong order relative to the ground truth. Pairwise approaches work better in practice than pointwise approaches because predicting relative order is closer to the nature of ranking than predicting class label or relevance score. Some of the most popular Learning to Rank algorithms like RankNet, LambdaRank and LambdaMART are pairwise approaches.
Rank-preference models (items to be ranked are treated in pairs)
Scenario: We have a lot of papers and we some very defined queries - we want to find the most relevant documents given the query
in the next models we need to specify the number of samples related to group/query/domain
Code
groups = docs_train.groupby('qid').size().to_frame('size')['size'].to_numpy()
Actually, in Learning to Rank field, we are trying to predict the relative score for each document to a specific query. That is, this is not a regression problem or classification problem. Hence, if a document, attached to a query, gets a negative predict score, it means and only means that it’s relatively less relative to the query, when comparing to other document(s), with positive scores.
Then we can train our XGBoost model using the Pairwise objective
Listwise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it.
There are 2 main sub-techniques for doing listwise Learning to Rank: - Direct optimization of IR measures such as NDCG. E.g. SoftRank , AdaRank - Minimize a loss function that is defined based on understanding the unique properties of the kind of ranking you are trying to achieve. E.g. ListNet , ListMLE
Treat each list as an instance. Usually tries to directly optimise the evaluation measure (e.g. mean average precision)
Using the same data as above we can train another model but this time using the NDCG objective
Expedia has provided you logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), whether or not the search result was a travel package. The data in this competition is a random selection from Expedia and is not representative of the overall statistics.
Expedia is interested in predicting which hotel group a user is going to book. Expedia has in-house algorithms to form hotel clusters, where similar hotels for a search (based on historical price, customer star ratings, geographical locations relative to city center, etc) are grouped together. These hotel clusters serve as good identifiers to which types of hotels people are going to book, while avoiding outliers such as new hotels that don’t have historical data.
Column name
Description
Data type
date_time
Timestamp
string
site_name
ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, …)
int
posa_continent
ID of continent associated with site_name
int
user_location_country
The ID of the country the customer is located
int
user_location_region
The ID of the region the customer is located
int
user_location_city
The ID of the city the customer is located
int
orig_destination_distance
Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated
double
user_id
ID of user
int
is_mobile
1 when a user connected from a mobile device, 0 otherwise
tinyint
is_package
1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise
int
channel
ID of a marketing channel
int
srch_ci
Checkin date
string
srch_co
Checkout date
string
srch_adults_cnt
The number of adults specified in the hotel room
int
srch_children_cnt
The number of (extra occupancy) children specified in the hotel room
int
srch_rm_cnt
The number of hotel rooms specified in the search
int
srch_destination_id
ID of the destination where the hotel search was performed
int
srch_destination_type_id
Type of destination
int
hotel_continent
Hotel continent
int
hotel_country
Hotel country
int
hotel_market
Hotel market
int
is_booking
1 if a booking, 0 if a click
tinyint
cnt
Numer of similar events in the context of the same user session
we use this fuction for feature engineering - we want to take the number of the mount from the check in search date - we want to extract the number of days the user is looking to book
Code
def my_feature_engineering(df): data = df.copy()for c in ['srch_ci','srch_co']: data[c] = pd.to_datetime(data[c]) data[f'{c}_month'] = data[c].dt.month data['srch_days'] = (data['srch_co'] - data['srch_ci']).dt.days return data
ntree_limit is deprecated, use `iteration_range` or model slicing instead.
At first glance we can see: - a low value/category for hotel_continent is increasing the relevance score where a high value is decreasing it - on the other hand a high value/category for user location region is increasing the score and the reverse of it when the feature value is low - many features have a contradictoy impact whre a high value is having a both negative and positive imapct toward the model output
Code
shap.summary_plot(shap_values,test[features])
Conclusions
Learning to rank is another subranch of machine learning focused on ranking and usaully used in Informatio Retrieval(IR)
Very large datasets are used in training the models in this area, but in the same time filtering capabilities are serving an important role
Framing the problem right give data and goal is an essential part for solving ranking problem