End-to-end training and evaluation
dhdhagar committed Jan 11, 2023
1 parent 7e55f40 commit 109ffa6
Showing 27 changed files with 3,747 additions and 218 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -136,4 +136,4 @@ foo-*
*.cpp

.idea
data
data/
167 changes: 25 additions & 142 deletions README.md
@@ -1,14 +1,9 @@
# S2AND
This repository provides access to the S2AND dataset and S2AND reference model described in the paper [S2AND: A Benchmark and Evaluation System for Author Name Disambiguation](https://api.semanticscholar.org/CorpusID:232233421) by Shivashankar Subramanian, Daniel King, Doug Downey, Sergey Feldman.
# Probabilistic Entity Resolution

The reference model is live on semanticscholar.org, and the trained model is available now as part of the data download (see below).

## Installation
## Setup the Conda Environment
To install this package, run the following:

```bash
git clone https://github.com/allenai/S2AND.git
cd S2AND
conda create -y --name s2and python==3.7
conda activate s2and
pip install -r requirements.in
@@ -20,15 +15,15 @@ If you run into cryptic errors about GCC on macOS while installing the requirements
CFLAGS='-stdlib=libc++' pip install -r requirements.in
```

## Data
## Download S2AND Data
To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):
```[Expected download size is: 50.4 GiB]```

`aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/`

Note that this software package comes with tools specifically designed to access and model the dataset.

## Configuration
## Setup Configuration
Modify the config file at `data/path_config.json`. This file should look like this:
```
{
@@ -39,146 +34,34 @@
As the dummy file says, `main_data_dir` should be set to the location of wherever you downloaded the data to, and
`internal_data_dir` can be ignored, as it is used for some scripts that rely on unreleased data, internal to Semantic Scholar.
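
If you want to sanity-check the configuration programmatically, here is a minimal sketch; it assumes only the two keys described above and that you synced the data into `data/`:

```python
import json
from pathlib import Path

# Load the path config and verify that main_data_dir points at the synced S3 data.
# internal_data_dir can be left at its dummy value, as noted above.
config = json.loads(Path("data/path_config.json").read_text())
main_data_dir = Path(config["main_data_dir"])

assert main_data_dir.is_dir(), f"main_data_dir does not exist: {main_data_dir}"
print("Using data from:", main_data_dir)
```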

## How to use S2AND for loading data and training a model
Once you have downloaded the datasets, you can go ahead and load up one of them:

```python
from os.path import join
from s2and.data import ANDData

dataset_name = "pubmed"
parent_dir = f"data/{dataset_name}"
dataset = ANDData(
signatures=join(parent_dir, f"{dataset_name}_signatures.json"),
papers=join(parent_dir, f"{dataset_name}_papers.json"),
mode="train",
specter_embeddings=join(parent_dir, f"{dataset_name}_specter.pickle"),
clusters=join(parent_dir, f"{dataset_name}_clusters.json"),
block_type="s2",
train_pairs_size=100000,
val_pairs_size=10000,
test_pairs_size=10000,
name=dataset_name,
n_jobs=8,
)
## Preprocess Dataset
Run the preprocessing step for each dataset; this step creates the following directory structure:
```

This may take a few minutes - there is a lot of text pre-processing to do.

The first step in the S2AND pipeline is to specify a featurizer and then train a binary classifier
that tries to guess whether two signatures are referring to the same person.

We'll do hyperparameter selection with the validation set and then get the test area under ROC curve.

Here's how to do all that:

```python
from s2and.model import PairwiseModeler
from s2and.featurizer import FeaturizationInfo, featurize
from s2and.eval import pairwise_eval

featurization_info = FeaturizationInfo()
# the cache will make it faster to train multiple times - it stores the features on disk for you
train, val, test = featurize(dataset, featurization_info, n_jobs=8, use_cache=True)
X_train, y_train = train
X_val, y_val = val
X_test, y_test = test

# calibration fits isotonic regression after the binary classifier is fit
# monotone constraints help the LightGBM classifier behave sensibly
pairwise_model = PairwiseModeler(
n_iter=25, calibrate=True, monotone_constraints=featurization_info.lightgbm_monotone_constraints
)
# this does hyperparameter selection, which is why we need to pass in the validation set.
pairwise_model.fit(X_train, y_train, X_val, y_val)

# this will also dump a lot of useful plots (ROC, PR, SHAP) to the figs_path
pairwise_metrics = pairwise_eval(X_test, y_test, pairwise_model.classifier, figs_path='figs/', title='example')
print(pairwise_metrics)
/data
-> /{dataset}
-> /seed{seed #}
-> pickle files stored here
```

The second stage in the S2AND pipeline is to tune hyperparameters for the clusterer on the validation data
and then evaluate the full clustering pipeline on the test blocks.

We use agglomerative clustering as implemented in `fastcluster` with average linkage.
There is only one hyperparameter to tune.
Two kinds of pickle files are created and stored for each split of the data (train/test/val), following
this naming convention: `train_features.pkl`, `train_signatures.pkl`.

```python
from s2and.model import Clusterer, FastCluster
from s2and.eval import cluster_eval
from hyperopt import hp
The features pickle contains a dictionary of type:
```Dict[block_id: str, Tuple[features: np.ndarray, labels: np.ndarray, cluster_ids: np.ndarray]]```.
NOTE: The pairwise features are compressed so that they are stored as an n(n-1)/2 matrix rather than an n x n symmetric matrix (see the loading sketch after the sample command below).
The signatures pickle contains all the metadata for each signature in a block.

clusterer = Clusterer(
featurization_info,
pairwise_model,
cluster_model=FastCluster(linkage="average"),
search_space={"eps": hp.uniform("eps", 0, 1)},
n_iter=25,
n_jobs=8,
)
clusterer.fit(dataset)

# the metrics_per_signature are there so we can break out the facets if needed
metrics, metrics_per_signature = cluster_eval(dataset, clusterer)
print(metrics)
Sample command:
```commandline
python pipeline/preprocess_s2and_data.py --data_home_dir="./data" --dataset_name="pubmed"
```
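
To verify the preprocessing output, here is a minimal loading sketch. The dataset and seed directory names are illustrative, and the condensed-to-square expansion assumes one row per signature pair in the standard condensed ordering:

```python
import pickle

from scipy.spatial.distance import squareform

# Load the features pickle for one split (dataset and seed directory are illustrative).
with open("data/pubmed/seed1/train_features.pkl", "rb") as f:
    blocks = pickle.load(f)  # Dict[block_id, (features, labels, cluster_ids)]

block_id, (features, labels, cluster_ids) = next(iter(blocks.items()))
print(block_id, features.shape, labels.shape, cluster_ids.shape)

# The pairwise rows are stored in condensed form: n*(n-1)/2 rows for a block of
# n signatures. A single feature column can be expanded back into an n x n
# symmetric matrix (with a zero diagonal) like so:
first_feature_square = squareform(features[:, 0])
print(first_feature_square.shape)
```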

For a fuller example, please see the transfer script: `scripts/transfer_experiment.py`.

## How to use S2AND for predicting with a saved model
Assuming you have a clusterer already fit, you can dump the model to disk like so
```python
import pickle
## End-to-end model training
The end-to-end model is defined in `pipeline/model.py`. To train it, run the script `sample_scripts/train_e2e_model.py` (see the sample command further below).

with open("saved_model.pkl", "wb") as _pkl_file:
pickle.dump(clusterer, _pkl_file)
```

You can then reload it, load a new dataset, and run prediction
```python
import pickle

from s2and.data import ANDData  # needed below to build the inference dataset

with open("saved_model.pkl", "rb") as _pkl_file:
clusterer = pickle.load(_pkl_file)

anddata = ANDData(
signatures=signatures,
papers=papers,
specter_embeddings=paper_embeddings,
name="your_name_here",
mode="inference",
block_type="s2",
)
pred_clusters, pred_distance_matrices = clusterer.predict(anddata.get_blocks(), anddata)
```

Our released models are in the `s3` folder referenced above, and are called `production_model.pickle` and `full_union_seed_*.pickle`. They can be loaded the same way, except that the pickled object is a dictionary, with a `clusterer` key.
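
For example, a minimal sketch of loading one of the released models (the path depends on where you synced the S3 bucket):

```python
import pickle

# The released pickles are dictionaries; the fitted clusterer lives under the
# "clusterer" key, as noted above.
with open("data/production_model.pickle", "rb") as _pkl_file:
    saved = pickle.load(_pkl_file)

clusterer = saved["clusterer"]
```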

### Incremental prediction
There is also a `predict_incremental` function on the `Clusterer` that allows prediction for just a small set of *new* signatures. When instantiating `ANDData`, you can pass in `cluster_seeds`, which will be used instead of model predictions for those signatures. When you call `predict_incremental`, the full distance matrix is not created; each new signature is assigned to the existing cluster it has the lowest average distance to, provided that distance is below the model's `eps`, and any new signature not within `eps` of an existing cluster is reclustered separately with the other unassigned signatures.
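
Here is a hedged sketch of how incremental prediction might be wired up, reusing the names from the snippet above; the exact `predict_incremental` call signature and the `cluster_seeds` format are assumptions, so check the docstrings in `s2and`:

```python
# Reuse signatures/papers/paper_embeddings and the loaded clusterer from above.
anddata = ANDData(
    signatures=signatures,
    papers=papers,
    specter_embeddings=paper_embeddings,
    cluster_seeds=existing_clusters,  # previously assigned signatures (assumed format; see the ANDData docstring)
    name="your_name_here",
    mode="inference",
    block_type="s2",
)

# Assumed call signature: the ids of the *new* signatures in a block plus the dataset.
new_signature_ids = ["sig_id_1", "sig_id_2"]  # illustrative
pred_clusters = clusterer.predict_incremental(new_signature_ids, anddata)
```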

## Reproducibility
The experiments in the paper were run with the Python (3.7.9) package versions in `paper_experiments_env.txt`. You can install these packages exactly by running `pip install pip==21.0.0` and then `pip install -r paper_experiments_env.txt --use-feature=fast-deps --use-deprecated=legacy-resolver`. Rerunning `scripts/paper_experiments.sh` on the branch `s2and_paper` should produce the same numbers as in the paper (we will update here if this becomes not true).

Note that by default we are using the `--use_cache` flag, which will cache all the features so future reruns are faster. There are two things to be aware of: (a) the cache is stored in RAM and can be huge (100gb+) and (b) if you intend to change the features and rerun, you'll have to turn off the cache or the new features won't be used.

## Licensing
The code in this repo is released under the Apache 2.0 license (license included in the repo). The dataset is released under ODC-BY (included in the S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).

## Citation

If you use S2AND in your research, please cite [S2AND: A Benchmark and Evaluation System for Author Name Disambiguation](https://api.semanticscholar.org/CorpusID:232233421).

```
@inproceedings{subramanian2021s2and,
title={{S}2{AND}: {A} {B}enchmark and {E}valuation {S}ystem for {A}uthor {N}ame {D}isambiguation},
author={Subramanian, Shivashankar and King, Daniel and Downey, Doug and Feldman, Sergey},
year={2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {{JCDL} '21: Proceedings of the {ACM/IEEE} Joint Conference on Digital Libraries in 2021},
series = {JCDL '21}
}
```

Sample command:
```commandline
python sample_scripts/train_e2e_model.py --wandb_run_params=configs/wandb_overfit_1_batch.json
```

S2AND is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
3 changes: 3 additions & 0 deletions configs/wandb_overfit_1_batch.json
@@ -0,0 +1,3 @@
{
"overfit_batch_idx": 35
}
Empty file added ecc/__init__.py
Empty file.
