Learning to Rank
Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import shap
plt.style.use('seaborn')
Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines.
Given a query $q$ and a collection $D$ of documents that match the query, the problem is to rank, that is, sort, the documents in $D$ according to some criterion so that the "best" results appear early in the result list displayed to the user.
Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems.
A majority of search engines use ranking algorithms to provide users with accurate and relevant results.
Applications of Ranking:
- Information Retrieval
- Collaborative Filtering
- Automated Text Scoring (Essay Scoring)
- Machine Translation
- Sentence Parsing
Learning to Rank
Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The ranking model purposes to rank, i.e. producing a permutation of items in new, unseen lists in a similar way to rankings in the training data.
Pointwise approaches look at a single document at a time in the loss function. They essentially take a single document and train a classifier / regressor on it to predict how relevant it is for the current query. The final ranking is achieved by simply sorting the result list by these document scores. For pointwise approaches, the score for each document is independent of the other documents that are in the result list for the query. All the standard regression and classification algorithms can be directly used for pointwise learning to rank.
Regression, Classification, Ordinal regression (items to be ranked are treated in isolation)
Scenario: We have customers searching for different movies and those have been ranked by a customer relevance score
size = 1000
movie_genere = ['Action','Adventure','Animation','Children’s','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']
movies = pd.DataFrame({
'customer_id': np.random.choice(['c1','c2','c3','c4','c5'],size=size),
'movie': [ f'movie_{x+1}' for x in range(size)],
'genere': np.random.choice(movie_genere,size=size),
'price': np.random.gamma(50,size=size) * np.random.uniform(0.0,1.0,size=size),
'rating': np.random.randint(1,100,size=size) * np.random.uniform(0.0,1.0,size=size),
})
for gnr in movie_genere:
mean_price = movies.loc[movies.genere == gnr,'price'].mean()
mean_rating = movies.loc[movies.genere == gnr,'rating'].mean()
movies.loc[(movies.genere == gnr) & (movies.price < mean_price ) & (movies.rating > mean_rating),'relevance'] = 1
movies.loc[movies.relevance != 1, 'relevance'] = 0
movies
Now that we have the data we can define the features we will use in our model and the target variable
features = ['genere','price','rating']
target = 'relevance'
Since we have a categorical variable we will dumify the data
movies_processed = movies.copy()
movies_processed = pd.get_dummies(movies_processed[features + [target]])
then we can train/test split our dataset
train,test = train_test_split(movies_processed, test_size=0.33, random_state=42, stratify=movies[['genere','relevance']])
so now we can train our classifier to predict how relevant is a movie for a customer
clf = LogisticRegression()
clf.fit(train.drop([target],axis=1),train[target])
let's take a copy of our test dataframe and apply the model
val = test.copy().reset_index(drop=True)
val['y_pred'] = clf.predict(test.drop([target],axis=1))
val['y_pred_proba'] = clf.predict_proba(test.drop([target],axis=1))[:,1]
val[:3]
We use the Compute Normalized Discounted Cumulative Gain(NDCG) to assess our model
Sum the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount.
Then divide by the best possible score (Ideal DCG, obtained for a perfect ranking) to obtain a score between 0 and 1.
metrics.ndcg_score([val[target]],[val['y_pred']])
Pairwise approaches look at a pair of documents at a time in the loss function. Given a pair of documents, they try and come up with the optimal ordering for that pair and compare it to the ground truth. The goal for the ranker is to minimize the number of inversions in ranking i.e. cases where the pair of results are in the wrong order relative to the ground truth. Pairwise approaches work better in practice than pointwise approaches because predicting relative order is closer to the nature of ranking than predicting class label or relevance score. Some of the most popular Learning to Rank algorithms like RankNet, LambdaRank and LambdaMART are pairwise approaches.
Rank-preference models (items to be ranked are treated in pairs)
Scenario: We have a lot of papers and we some very defined queries - we want to find the most relevant documents given the query
docs_size = 1000
query_types = ['query_1','query_2','query_3','query_4','query_5']
docs = pd.DataFrame({
'qid': np.random.choice(query_types,size=docs_size),
'did': [ f'doc_{x+1}' for x in range(docs_size)],
'doc_size': np.random.uniform(1,500,size=docs_size),
'doc_authors': np.random.randint(2,6,size=docs_size),
'doc_paragraphs': np.random.randint(50,200,size=docs_size),
'relevance': np.random.uniform(0.0,1.0,size=docs_size)
})
docs['rank'] = pd.cut(docs['relevance'],5,labels=list(range(1,6)))
docs = docs.sort_values(by='rank')
docs
again we define the features we are going to use in our models
docs_feats = ['doc_size','doc_authors','doc_paragraphs']
docs_target = 'rank'
We train test split the data, but also sort it by the rank we have assigned
docs_train, docs_test = train_test_split(docs, test_size=0.33, random_state=42)
docs_train = docs_train.sort_values(by='rank')
docs_test = docs_test.sort_values(by='rank')
in the next models we need to specify the number of samples related to group/query/domain
groups = docs_train.groupby('qid').size().to_frame('size')['size'].to_numpy()
Actually, in Learning to Rank field, we are trying to predict the relative score for each document to a specific query. That is, this is not a regression problem or classification problem. Hence, if a document, attached to a query, gets a negative predict score, it means and only means that it's relatively less relative to the query, when comparing to other document(s), with positive scores.
Then we can train our XGBoost model using the Pairwise
objective
ranker = xgb.XGBRanker(objective='rank:pairwise',booster='gbtree')
ranker.fit(docs_train[docs_feats],docs_train['rank'],group=groups)
and predict the rank score using our model
results = docs_test.copy()
results['rank_score_pairwise'] = ranker.predict(results[docs_feats])
results = results.sort_values(by='rank_score_pairwise',ascending=False)
results
We are using again the NDCG score to see how our model performs
metrics.ndcg_score([results['rank']],[results['rank_score_pairwise']])
results.sort_values(by=['qid','rank_score_pairwise'],ascending=False)
Listwise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it.
There are 2 main sub-techniques for doing listwise Learning to Rank:
- Direct optimization of IR measures such as NDCG. E.g. SoftRank , AdaRank
- Minimize a loss function that is defined based on understanding the unique properties of the kind of ranking you are trying to achieve. E.g. ListNet , ListMLE
Treat each list as an instance. Usually tries to directly optimise the evaluation measure (e.g. mean average precision)
Using the same data as above we can train another model but this time using the NDCG
objective
ranker = xgb.XGBRanker(objective='rank:ndcg',booster='gbtree')
ranker.fit(docs_train[docs_feats],docs_train['rank'],group=groups)
and predict the ranking score using this model as well
results['rank_score_ndcg'] = ranker.predict(results[docs_feats])
results = results.sort_values(by='rank_score_ndcg',ascending=False)
results
We are using again the NDCG score to see how our model performs
metrics.ndcg_score([results['rank']],[results['rank_score_ndcg']])
results.sort_values(by=['qid','rank_score_ndcg'],ascending=False)
Expedia has provided you logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), whether or not the search result was a travel package. The data in this competition is a random selection from Expedia and is not representative of the overall statistics.
Expedia is interested in predicting which hotel group a user is going to book. Expedia has in-house algorithms to form hotel clusters, where similar hotels for a search (based on historical price, customer star ratings, geographical locations relative to city center, etc) are grouped together. These hotel clusters serve as good identifiers to which types of hotels people are going to book, while avoiding outliers such as new hotels that don't have historical data.
Column name | Description | Data type |
---|---|---|
date_time | Timestamp | string |
site_name | ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...) | int |
posa_continent | ID of continent associated with site_name | int |
user_location_country | The ID of the country the customer is located | int |
user_location_region | The ID of the region the customer is located | int |
user_location_city | The ID of the city the customer is located | int |
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double |
user_id | ID of user | int |
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint |
is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int |
channel | ID of a marketing channel | int |
srch_ci | Checkin date | string |
srch_co | Checkout date | string |
srch_adults_cnt | The number of adults specified in the hotel room | int |
srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int |
srch_rm_cnt | The number of hotel rooms specified in the search | int |
srch_destination_id | ID of the destination where the hotel search was performed | int |
srch_destination_type_id | Type of destination | int |
hotel_continent | Hotel continent | int |
hotel_country | Hotel country | int |
hotel_market | Hotel market | int |
is_booking | 1 if a booking, 0 if a click | tinyint |
cnt | Numer of similar events in the context of the same user session | bigint |
hotel_cluster | ID of a hotel cluster | int |
expedia = pd.read_csv('data/expedia-hotel-recommendations/train.csv')[:200000].copy()
expedia.head(5)
Our objective here is to predict a list of hotel clusters for each user.
But first let's make a list of features and a variable for target and group_id
we use group_id to define the column that will count the number of samples related to a group/query/domain
features = [
'site_name','posa_continent','user_location_country','user_location_region', 'user_location_city',
'orig_destination_distance', 'is_mobile', 'is_package','channel', 'srch_adults_cnt', 'srch_children_cnt',
'srch_rm_cnt', 'srch_destination_type_id','is_booking', 'cnt', 'hotel_continent', 'hotel_country',
'hotel_market', 'srch_days', 'srch_ci_month'
]
group_id = 'user_id'
target = 'hotel_cluster'
we use this fuction for feature engineering
- we want to take the number of the mount from the check in search date
- we want to extract the number of days the user is looking to book
def my_feature_engineering(df):
data = df.copy()
for c in ['srch_ci','srch_co']:
data[c] = pd.to_datetime(data[c])
data[f'{c}_month'] = data[c].dt.month
data['srch_days'] = (data['srch_co'] - data['srch_ci']).dt.days
return data
we apply the feature engineering function
expedia = my_feature_engineering(expedia)
after that we can train/test split our data
train,test = train_test_split(expedia, test_size=0.33, random_state=42)
then we want to get the number of samples by our group_id column in this case is the used_id
column
Note: the sum of the number of groups must match the size of the training set
train_groups = train.groupby(group_id).size()
len(train_groups)
finally we can train our ranking model
ranker = xgb.XGBRanker(objective='rank:ndcg',booster='gbtree')
ranker.fit(train[features], train[target], group=train_groups)
Now that we've trained our model we can take a copy of our test data and apply the model.
Then we predict the relevance score for each group in our dataframe.
%%capture
results = test.copy()
results['rank_score_ndcg'] = ranker.predict(results[features])
results = results.sort_values(by='rank_score_ndcg',ascending=False)
results[:5]
Again we use the NDCG score to see how our model performs
metrics.ndcg_score([results[target]],[results['rank_score_ndcg']])
Now let's look at the feature importance and impact
expl = shap.TreeExplainer(ranker)
shap_values = expl.shap_values(test[features])
At first glance we can see:
- a low value/category for hotel_continent is increasing the relevance score where a high value is decreasing it
- on the other hand a high value/category for user location region is increasing the score and the reverse of it when the feature value is low
- many features have a contradictoy impact whre a high value is having a both negative and positive imapct toward the model output
shap.summary_plot(shap_values,test[features])
- Learning to rank is another subranch of machine learning focused on ranking and usaully used in Informatio Retrieval(IR)
- Very large datasets are used in training the models in this area, but in the same time filtering capabilities are serving an important role
- Framing the problem right give
data
andgoal
is an essential part for solving ranking problem
- Tiered machine learned ranking improves relevance for the retail search
- Learning to Rank
- learning-to-rank-1
- Learning to Rank with Partially-Labeled Data
- Learning to Rank with XGBoost and GPU
- Ranking Ads with Machine Learning
- How we’ve reworked our listings at OLX
- Notes on the NDCG metric used in the Visual Dialog Challenge 2019
- Learning to Rank with Partially-Labeled Data