This is a Python library that serves as a wrapper for the DELTR model for fair ranking in ElasticSearch with the Learning to Rank plugin.
This library requires:
- Python 3.4+
- The Python dependencies listed in the requirements.txt file
- ElasticSearch and the Learning to Rank (LTR) plugin for ElasticSearch
- Start a supported version of ElasticSearch and follow the installation steps
There are several steps you need to take. In the following, we describe how to use the adapter to search on a collection of e-mails from W3C, included in /data/, which is one of the examples we used in the paper (see bibliography below).
Index the training corpus. A sample data set is provided as zip files in /data/candidates/candidates*.zip. Make sure to unzip them first. Then you can index them with:
python deltr.py --index --document-dir ./data/candidates --index-name resumes
This will (re)index the JSON files under the folder /data/candidates in an index named resumes.
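The archives have to be extracted before the indexing command is run. A minimal sketch of that step, using only the Python standard library, could look like the following. The helper name `unzip_candidates` is hypothetical and not part of this library, and the sketch assumes the archives should be extracted in place so that the JSON files end up in the same folder that is passed to --document-dir.

```python
import glob
import os
import zipfile


def unzip_candidates(directory):
    """Extract every candidates*.zip archive found in `directory`, in place.

    Returns the sorted list of archive paths that were extracted.
    (Hypothetical helper; not part of the library.)
    """
    archives = sorted(glob.glob(os.path.join(directory, "candidates*.zip")))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            # Assumption: the extracted files belong next to the archives.
            zf.extractall(directory)
    return archives


if __name__ == "__main__":
    for path in unzip_candidates("./data/candidates"):
        print("extracted", path)
```

If the archives contain nested folders, the extracted layout may differ, so it is worth checking the folder contents before indexing.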
Later, at any point, you can add the real documents you want to search with the trained ranking model. Those documents do not need to be in the same index; most commonly they will be in a different one.
Create the features you want to use in LTR. We have created sample features in /data/features.json.
Next, we need to upload these features to ElasticSearch:
python deltr.py --prepare --feature-set-file ./data/features.json --feature-set-name w3c
This will upload the features defined in /data/features.json to ElasticSearch under the name w3c.
Now that we have defined and uploaded the features and indexed the data, we can create a model to use for retrieval.
To build a DELTR model, we need to provide it with some training data. We have created a sample training set contained in two files: /data/queries.csv and /data/judgements.csv. You can train a model with:
python deltr.py --train --queries ./data/queries.csv --judgements ./data/judgements.csv --model deltr_vanilla --feature-set-name w3c
This will train a DELTR model (with default parameters) named deltr_vanilla, using the queries in /data/queries.csv and the judgements for those queries in /data/judgements.csv, with the features defined in the feature set named w3c.
The library uses the features we defined in LTR to train the model. For debugging purposes, it creates a features.csv file in the folder from which the command is executed. There you can see which features were generated for each document.
It also creates a model.txt file where you can see the final model that was uploaded to LTR.
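When debugging, a quick look at the first few rows of features.csv is often enough. The sketch below is one way to do that with the standard library. The column layout of features.csv is not described here, so the sketch makes no assumption about headers and simply returns raw rows; the helper name `preview_csv` is hypothetical.

```python
import csv
import itertools
import os


def preview_csv(path, limit=5):
    """Return the first `limit` rows of a CSV file as lists of strings.

    (Hypothetical helper; it does not interpret the columns.)
    """
    with open(path, newline="") as handle:
        return list(itertools.islice(csv.reader(handle), limit))


if __name__ == "__main__":
    # features.csv is written to the folder where the training command ran.
    if os.path.exists("features.csv"):
        for row in preview_csv("features.csv"):
            print(row)
```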
Note: You can also specify tuning parameters from the command line, e.g.:
python deltr.py --train --queries ./data/queries.csv --judgements ./data/judgements.csv --model deltr_not_vanilla --feature-set-name w3c --gamma 0.8
This will create a new model from the same files, but with the gamma parameter set to 0.8. Run python deltr.py --help to see all available options.
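To compare several gamma values, the training command can be generated in a loop. The following is a rough sketch that only builds the argument lists from the flags shown above and does not call the library directly. The helper name `build_train_command` and the model naming scheme (`deltr_gamma_0_8` and so on) are assumptions made for illustration.

```python
def build_train_command(model, gamma=None,
                        queries="./data/queries.csv",
                        judgements="./data/judgements.csv",
                        feature_set_name="w3c"):
    """Build the documented --train command as an argument list.

    (Hypothetical helper; it only assembles the flags shown in this README.)
    """
    command = ["python", "deltr.py", "--train",
               "--queries", queries,
               "--judgements", judgements,
               "--model", model,
               "--feature-set-name", feature_set_name]
    if gamma is not None:
        command += ["--gamma", str(gamma)]
    return command


if __name__ == "__main__":
    for gamma in (0.2, 0.5, 0.8):
        # Assumed naming scheme: one model per gamma value.
        model = "deltr_gamma_" + str(gamma).replace(".", "_")
        # Printed rather than executed; pass the list to
        # subprocess.check_call to actually run a training job.
        print(" ".join(build_train_command(model, gamma=gamma)))
```

Giving each run its own model name keeps the earlier models available in LTR for side-by-side searches.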
Once we have the model, we can start using it to run searches:
python3 deltr.py --search --query html --model deltr_vanilla --index-name resumes
This will run a query with the keyword html using the model deltr_vanilla on the index resumes.
Note: You can also request verbose output, which contains the features calculated for each document returned:
python3 deltr.py --search --query html --model deltr_vanilla --index-name resumes --verbose
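The four steps above (index, prepare, train, search) can be collected into one small script. This is a sketch under stated assumptions, not part of the library: the function names `pipeline_commands` and `run_pipeline` are hypothetical, the interpreter name is a parameter because the examples above use both python and python3, and the script only wraps the command lines already shown.

```python
import subprocess


def pipeline_commands(python="python"):
    """Return the four documented commands, in order, as argument lists."""
    base = [python, "deltr.py"]
    return [
        base + ["--index", "--document-dir", "./data/candidates",
                "--index-name", "resumes"],
        base + ["--prepare", "--feature-set-file", "./data/features.json",
                "--feature-set-name", "w3c"],
        base + ["--train", "--queries", "./data/queries.csv",
                "--judgements", "./data/judgements.csv",
                "--model", "deltr_vanilla", "--feature-set-name", "w3c"],
        base + ["--search", "--query", "html", "--model", "deltr_vanilla",
                "--index-name", "resumes"],
    ]


def run_pipeline(commands, runner=subprocess.check_call):
    """Run each command in order; check_call raises if a step fails."""
    for command in commands:
        runner(command)


if __name__ == "__main__":
    # Print instead of executing, so nothing touches ElasticSearch by accident.
    for command in pipeline_commands():
        print(" ".join(command))
```

Calling `run_pipeline(pipeline_commands())` would execute the steps for real, which requires a running ElasticSearch instance with the LTR plugin installed.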
Run the following command to get the full list of options:
python deltr.py --help
- Clone this repository
git clone https://github.com/fair-search/fairsearch-deltr-for-elasticsearch
- Change directory to the directory where you cloned the repository
cd WHERE_ITS_DOWNLOADED/fairsearch-deltr-for-elasticsearch
- Use any IDE to work with the code
The DELTR algorithm is described in this paper:
- Meike Zehlike, Gina-Theresa Diehn, Carlos Castillo. "Reducing Disparate Exposure in Ranking: A Learning to Rank Approach." preprint arXiv:1805.08716 (2018).
This library was developed by Ivan Kitanovski based on the paper. See the license file for more information.
For any questions contact Meike Zehlike.
You can also see: