Commit

Removed global config and replaced with argparse + Other fixes...
TimDettmers committed Oct 1, 2019
1 parent c8aa06c commit 5feb358
Showing 5 changed files with 177 additions and 114 deletions.
112 changes: 76 additions & 36 deletions README.md
@@ -43,7 +43,7 @@ ConvE with 8 times less parameters is still more powerful than DistMult. Relatio

This repo supports Linux and Python installation via Anaconda.

1. Install [PyTorch](https://github.com/pytorch/pytorch) using [Anaconda](https://www.continuum.io/downloads). If you compiled PyTorch from source, please checkout the [v0.5 branch](https://github.com/TimDettmers/ConvE/tree/pytorch_v0.5): `git checkout pytorch_0.5`
1. Install [PyTorch](https://github.com/pytorch/pytorch) using [Anaconda](https://www.continuum.io/downloads).
2. Install the requirements `pip install -r requirements.txt`
3. Download the default English model used by [spaCy](https://github.com/explosion/spaCy), which is installed in the previous step `python -m spacy download en`
4. Run the preprocessing script for WN18RR, FB15k-237, YAGO3-10, UMLS, Kinship, and Nations: `sh preprocess.sh`
@@ -53,22 +53,28 @@ This repo supports Linux and Python installation via Anaconda.

Parameters are specified as command-line arguments, for example:
```
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE dataset FB15k-237 \
input_drop 0.2 hidden_drop 0.3 feat_drop 0.2 \
lr 0.003 process True
CUDA_VISIBLE_DEVICES=0 python main.py --model conve --data FB15k-237 \
--input-drop 0.2 --hidden-drop 0.3 --feat-drop 0.2 \
--lr 0.003 --preprocess
```
will run a ConvE model on FB15k-237.

To run a model, you first need to preprocess the data. This can be done by specifying the `process` parameter:
To run a model, you first need to preprocess the data once. This can be done by specifying the `--preprocess` parameter:
```
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE dataset FB15k-237 process True
CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME --preprocess
```
After the dataset is preprocessed it will be saved to disk and this parameter can be omitted.
```
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE dataset FB15k-237
CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME
```
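
The on/off behaviour of `--preprocess` described above can be sketched with a stripped-down parser. This is a minimal illustration, not the full parser from `main.py`: it reproduces only three of the flags from this commit, and the variable names `first_run` and `later_run` are made up for the example.

```python
import argparse

# Stand-in for the parser in main.py; only three of its flags are reproduced here.
parser = argparse.ArgumentParser(description='Link prediction for knowledge graphs')
parser.add_argument('--data', type=str, default='FB15k-237')
parser.add_argument('--preprocess', action='store_true')
parser.add_argument('--resume', action='store_true')

# First run on a dataset: the bare switch turns preprocessing on.
first_run = parser.parse_args(['--data', 'WN18RR', '--preprocess'])

# Later runs: omitting the switch leaves it at its default of False.
later_run = parser.parse_args(['--data', 'WN18RR'])

print(first_run.preprocess, later_run.preprocess)
```

Because these are `store_true` switches, they take no value: `argparse` rejects something like `--preprocess True` with an "unrecognized arguments" error. Every `store_true` flag, including `--resume` and `--use-bias`, is `False` unless it is passed.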
The following values can be used for the `--model` parameter:
```
conve
distmult
complex
```

Here a list of parameters for the available datasets:
The following datasets can be used for the `--data` parameter:
```
FB15k-237
WN18RR
@@ -78,40 +84,75 @@ kinship
nations
```

The following models are available:
```
ConvE
DistMult
ComplEx
And here is a complete list of parameters:
```
Link prediction for knowledge graphs

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        input batch size for training (default: 128)
  --test-batch-size TEST_BATCH_SIZE
                        input batch size for testing/validation (default: 128)
  --epochs EPOCHS       number of epochs to train (default: 1000)
  --lr LR               learning rate (default: 0.003)
  --seed S              random seed (default: 17)
  --log-interval LOG_INTERVAL
                        how many batches to wait before logging training
                        status
  --data DATA           Dataset to use: {FB15k-237, YAGO3-10, WN18RR, umls,
                        nations, kinship}, default: FB15k-237
  --l2 L2               Weight decay value to use in the optimizer. Default:
                        0.0
  --model MODEL         Choose from: {conve, distmult, complex}
  --embedding-dim EMBEDDING_DIM
                        The embedding dimension (1D). Default: 200
  --embedding-shape1 EMBEDDING_SHAPE1
                        The first dimension of the reshaped 2D embedding. The
                        second dimension is inferred. Default: 20
  --hidden-drop HIDDEN_DROP
                        Dropout for the hidden layer. Default: 0.3.
  --input-drop INPUT_DROP
                        Dropout for the input embeddings. Default: 0.2.
  --feat-drop FEAT_DROP
                        Dropout for the convolutional features. Default: 0.2.
  --lr-decay LR_DECAY   Decay the learning rate by this factor every epoch.
                        Default: 0.995
  --loader-threads LOADER_THREADS
                        How many loader threads to use for the batch loaders.
                        Default: 4
  --preprocess          Preprocess the dataset. Needs to be executed only
                        once.
  --resume              Resume a model.
  --use-bias            Use a bias in the convolutional layer. Default: True
  --label-smoothing LABEL_SMOOTHING
                        Label smoothing value to use. Default: 0.1
  --hidden-size HIDDEN_SIZE
                        The size of the hidden layer. The required size
                        changes with the size of the embeddings. Default: 9728
                        (embedding size 200).
```
To reproduce most of the results in the ConvE paper, you can use the default parameters and execute the command below:
```
CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME
```
For the reverse model, you can run the provided script with the dataset name and a threshold probability:

The following parameters can be used for the models:
```
batch_size
input_drop = input_dropout
feat_drop = feature_map_dropout
hidden_drop = hidden_dropout
embedding_dim
L2
epochs
lr_decay = learning_rate_decay
lr = learning_rate
label_smoothing = label_smoothing_epsilon
python inverse_model.py WN18RR 0.9
```
The parameters with the equal sign are equivalent and short-forms of each other.

To reproduce most of the results in the ConvE paper, you can use command below:
### Changing the embedding size for ConvE

```
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE input_drop 0.2 hidden_drop 0.3 \
feat_drop 0.2 lr 0.003 lr_decay 0.995 \
dataset DATASET_NAME
```
For the reverse model, you can run the provided file with the name of the dataset name and a threshold probability:
If you want to change the embedding size, you can do that via the `--embedding-dim` parameter. However, for ConvE, since the embedding is reshaped into a 2D embedding, you also need to pass the first dimension of the reshaped embedding (`--embedding-shape1`); the second dimension is inferred. When you change the embedding size, the hidden layer size (`--hidden-size`) also needs to change, but it is difficult to determine before run time. The easiest way to determine the hidden size is to run the model, let it fail with a shape-mismatch error, and then set `--hidden-size` according to the dimension in the error message.

Example: Change the embedding size to 100. We want 10x10 2D embeddings. We run `python main.py --embedding-dim 100 --embedding-shape1 10` and run into an error caused by the wrong hidden dimension:
```python
ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [128 x 4608], m2: [9728 x 100] at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/generic/THCTensorMathBlas.cu:273
```
python inverse_model.py WN18RR 0.9
```

Now we change the hidden dimension to 4608 accordingly: `python main.py --embedding-dim 100 --embedding-shape1 10 --hidden-size 4608`. Now the model runs with an embedding size of 100 and 10x10 2D embeddings.
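
The hidden sizes quoted above (9728 for embedding size 200, 4608 for embedding size 100) can also be reproduced with a little arithmetic. The sketch below is not taken from this commit and rests on assumptions about the ConvE architecture: the entity and relation embeddings are stacked along the first 2D axis, and a single stride-1 convolution without padding uses 32 filters of size 3x3. The function name `conve_hidden_size` is hypothetical. If the model differs from these assumptions, the dimension in the error message is the one to trust.

```python
def conve_hidden_size(embedding_dim, embedding_shape1, num_filters=32, kernel=3):
    """Flattened size of the conv feature maps, under the assumptions stated above."""
    if embedding_dim % embedding_shape1 != 0:
        raise ValueError('embedding_dim must be divisible by embedding_shape1')
    embedding_shape2 = embedding_dim // embedding_shape1

    # Assumption: entity and relation embeddings are stacked along the first 2D axis.
    height = 2 * embedding_shape1
    width = embedding_shape2

    # A stride-1 convolution without padding shrinks each axis by kernel - 1.
    out_height = height - kernel + 1
    out_width = width - kernel + 1
    return num_filters * out_height * out_width


print(conve_hidden_size(200, 20))  # 9728, the default --hidden-size
print(conve_hidden_size(100, 10))  # 4608, the value from the error message above
```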

### Adding new datasets

@@ -124,8 +165,7 @@ You can easily write your own knowledge graph model by extending the barebone mo
### Quirks

There are some quirks of this framework.
1. If you use a different embedding size, the ConvE concatenation size cannot be determined automatically and you have to set it yourself in line [103/104](https://github.com/TimDettmers/ConvE/blob/master/model.py#L103). Also the first dimension of the projection layer will change. You will need to comment out the print function ([line 115](https://github.com/TimDettmers/ConvE/blob/master/model.py#L115)) to get the needed dimension, and adjust the size of the fully connected layer in [line 95](https://github.com/TimDettmers/ConvE/blob/master/model.py#L95).
2. The model currently ignores data that does not fit into the specified batch size, for example if your batch size is 100 and your test data is 220, then 20 samples will be ignored. This is designed in that way to improve performance on small datasets. To test on the full test-data you can save the model checkpoint, load the model (with the `load=True` variable) and then evaluate with a batch size that fits the test data (for 220 you could use a batch size of 110). Another solution is to just use a fitting batch size from the start, that is, you could train with a batch size of 110.
1. The model currently ignores data that does not fit into the specified batch size. For example, if your batch size is 100 and your test data has 220 samples, then 20 samples will be ignored. It is designed this way to improve performance on small datasets. To test on the full test data, you can save the model checkpoint, load the model (with the `--resume` flag), and then evaluate with a batch size that fits the test data (for 220 samples you could use a batch size of 110). Another solution is to use a fitting batch size from the start; that is, you could train with a batch size of 110.
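
The arithmetic behind this quirk can be checked quickly. The sketch below assumes, as described above, that the final incomplete batch is simply dropped. The helper names `dropped_samples` and `fitting_batch_sizes` are made up for illustration and are not part of the repository.

```python
def dropped_samples(num_samples, batch_size):
    """Samples left over after filling as many complete batches as possible."""
    return num_samples % batch_size


def fitting_batch_sizes(num_samples, max_batch_size):
    """Batch sizes up to max_batch_size that divide the data without a remainder."""
    return [b for b in range(1, max_batch_size + 1) if num_samples % b == 0]


print(dropped_samples(220, 100))      # 20 samples would be ignored
print(dropped_samples(220, 110))      # 0 samples would be ignored
print(fitting_batch_sizes(220, 128))  # candidate batch sizes for 220 samples
```

For a real dataset, the sample count would be the number of examples in the split being evaluated.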

### Issues

5 changes: 2 additions & 3 deletions evaluation.py
@@ -2,7 +2,6 @@
import numpy as np
import datetime

from spodernet.utils.global_config import Config
from spodernet.utils.logger import Logger
from torch.autograd import Variable
from sklearn import metrics
@@ -38,7 +37,7 @@ def ranking_and_hits(model, dev_rank_batcher, vocab, name):
pred1, pred2 = pred1.data, pred2.data
e1, e2 = e1.data, e2.data
e2_multi1, e2_multi2 = e2_multi1.data, e2_multi2.data
for i in range(Config.batch_size):
for i in range(e1.shape[0]):
# these filters contain ALL labels
filter1 = e2_multi1[i].long()
filter2 = e2_multi2[i].long()
@@ -62,7 +61,7 @@ def ranking_and_hits(model, dev_rank_batcher, vocab, name):

argsort1 = argsort1.cpu().numpy()
argsort2 = argsort2.cpu().numpy()
for i in range(Config.batch_size):
for i in range(e1.shape[0]):
# find the rank of the target entities
rank1 = np.where(argsort1[i]==e2[i, 0].item())[0][0]
rank2 = np.where(argsort2[i]==e1[i, 0].item())[0][0]
Binary file modified logs.tar.gz
Binary file not shown.
112 changes: 68 additions & 44 deletions main.py
@@ -24,26 +24,12 @@
from spodernet.hooks import LossHook, ETAHook
from spodernet.utils.util import Timer
from spodernet.preprocessing.processors import TargetIdx2MultiTarget
np.set_printoptions(precision=3)

cudnn.benchmark = True

# parse console parameters and set global variables
Config.backend = Backends.TORCH
Config.parse_argv(sys.argv)
import argparse

Config.cuda = True
Config.embedding_dim = 200
#Logger.GLOBAL_LOG_LEVEL = LogLevel.DEBUG

np.set_printoptions(precision=3)

#model_name = 'DistMult_{0}_{1}'.format(Config.input_dropout, Config.dropout)
model_name = '{2}_{0}_{1}'.format(Config.input_dropout, Config.dropout, Config.model_name)
epochs = 1000
load = False
if Config.dataset is None:
Config.dataset = 'FB15k-237'
model_path = 'saved_models/{0}_{1}.model'.format(Config.dataset, model_name)
cudnn.benchmark = True


''' Preprocess knowledge graph using spodernet. '''
@@ -67,7 +53,7 @@ def preprocess(dataset_name, delete_data=False):

# process full vocabulary and save it to disk
d.set_path(full_path)
p = Pipeline(Config.dataset, delete_data, keys=input_keys, skip_transformation=True)
p = Pipeline(args.data, delete_data, keys=input_keys, skip_transformation=True)
p.add_sent_processor(ToLower())
p.add_sent_processor(CustomTokenizer(lambda x: x.split(' ')),keys=['e2_multi1', 'e2_multi2'])
p.add_token_processor(AddToVocab())
@@ -87,43 +73,42 @@ def preprocess(dataset_name, delete_data=False):
p.execute(d)


def main():
if Config.process: preprocess(Config.dataset, delete_data=True)
def main(args, model_path):
if args.preprocess: preprocess(args.data, delete_data=True)
input_keys = ['e1', 'rel', 'rel_eval', 'e2', 'e2_multi1', 'e2_multi2']
p = Pipeline(Config.dataset, keys=input_keys)
p = Pipeline(args.data, keys=input_keys)
p.load_vocabs()
vocab = p.state['vocab']

num_entities = vocab['e1'].num_token

train_batcher = StreamBatcher(Config.dataset, 'train', Config.batch_size, randomize=True, keys=input_keys)
dev_rank_batcher = StreamBatcher(Config.dataset, 'dev_ranking', Config.batch_size, randomize=False, loader_threads=4, keys=input_keys)
test_rank_batcher = StreamBatcher(Config.dataset, 'test_ranking', Config.batch_size, randomize=False, loader_threads=4, keys=input_keys)
train_batcher = StreamBatcher(args.data, 'train', args.batch_size, randomize=True, keys=input_keys, loader_threads=args.loader_threads)
dev_rank_batcher = StreamBatcher(args.data, 'dev_ranking', args.test_batch_size, randomize=False, loader_threads=args.loader_threads, keys=input_keys)
test_rank_batcher = StreamBatcher(args.data, 'test_ranking', args.test_batch_size, randomize=False, loader_threads=args.loader_threads, keys=input_keys)


if Config.model_name is None:
model = ConvE(vocab['e1'].num_token, vocab['rel'].num_token)
elif Config.model_name == 'ConvE':
model = ConvE(vocab['e1'].num_token, vocab['rel'].num_token)
elif Config.model_name == 'DistMult':
model = DistMult(vocab['e1'].num_token, vocab['rel'].num_token)
elif Config.model_name == 'ComplEx':
model = Complex(vocab['e1'].num_token, vocab['rel'].num_token)
if args.model is None:
model = ConvE(args, vocab['e1'].num_token, vocab['rel'].num_token)
elif args.model == 'conve':
model = ConvE(args, vocab['e1'].num_token, vocab['rel'].num_token)
elif args.model == 'distmult':
model = DistMult(args, vocab['e1'].num_token, vocab['rel'].num_token)
elif args.model == 'complex':
model = Complex(args, vocab['e1'].num_token, vocab['rel'].num_token)
else:
log.info('Unknown model: {0}', Config.model_name)
log.info('Unknown model: {0}', args.model)
raise Exception("Unknown model!")

train_batcher.at_batch_prepared_observers.insert(1,TargetIdx2MultiTarget(num_entities, 'e2_multi1', 'e2_multi1_binary'))


eta = ETAHook('train', print_every_x_batches=100)
eta = ETAHook('train', print_every_x_batches=args.log_interval)
train_batcher.subscribe_to_events(eta)
train_batcher.subscribe_to_start_of_epoch_event(eta)
train_batcher.subscribe_to_events(LossHook('train', print_every_x_batches=100))
train_batcher.subscribe_to_events(LossHook('train', print_every_x_batches=args.log_interval))

if Config.cuda:
model.cuda()
if load:
model.cuda()
if args.resume:
model_params = torch.load(model_path)
print(model)
total_param_size = []
@@ -144,16 +129,16 @@ def main():
print(params)
print(np.sum(params))

opt = torch.optim.Adam(model.parameters(), lr=Config.learning_rate, weight_decay=Config.L2)
for epoch in range(epochs):
opt = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.l2)
for epoch in range(args.epochs):
model.train()
for i, str2var in enumerate(train_batcher):
opt.zero_grad()
e1 = str2var['e1']
rel = str2var['rel']
e2_multi = str2var['e2_multi1_binary'].float()
# label smoothing
e2_multi = ((1.0-Config.label_smoothing_epsilon)*e2_multi) + (1.0/e2_multi.size(1))
e2_multi = ((1.0-args.label_smoothing)*e2_multi) + (1.0/e2_multi.size(1))

pred = model.forward(e1, rel)
loss = model.loss(pred, e2_multi)
@@ -168,11 +153,50 @@ def main():

model.eval()
with torch.no_grad():
ranking_and_hits(model, dev_rank_batcher, vocab, 'dev_evaluation')
if epoch % 3 == 0:
if epoch % 5 == 0 and epoch > 0:
ranking_and_hits(model, dev_rank_batcher, vocab, 'dev_evaluation')
if epoch % 5 == 0:
if epoch > 0:
ranking_and_hits(model, test_rank_batcher, vocab, 'test_evaluation')


if __name__ == '__main__':
main()
parser = argparse.ArgumentParser(description='Link prediction for knowledge graphs')
parser.add_argument('--batch-size', type=int, default=128, help='input batch size for training (default: 128)')
parser.add_argument('--test-batch-size', type=int, default=128, help='input batch size for testing/validation (default: 128)')
parser.add_argument('--epochs', type=int, default=1000, help='number of epochs to train (default: 1000)')
parser.add_argument('--lr', type=float, default=0.003, help='learning rate (default: 0.003)')
parser.add_argument('--seed', type=int, default=17, metavar='S', help='random seed (default: 17)')
parser.add_argument('--log-interval', type=int, default=100, help='how many batches to wait before logging training status')
parser.add_argument('--data', type=str, default='FB15k-237', help='Dataset to use: {FB15k-237, YAGO3-10, WN18RR, umls, nations, kinship}, default: FB15k-237')
parser.add_argument('--l2', type=float, default=0.0, help='Weight decay value to use in the optimizer. Default: 0.0')
parser.add_argument('--model', type=str, default='conve', help='Choose from: {conve, distmult, complex}')
parser.add_argument('--embedding-dim', type=int, default=200, help='The embedding dimension (1D). Default: 200')
parser.add_argument('--embedding-shape1', type=int, default=20, help='The first dimension of the reshaped 2D embedding. The second dimension is inferred. Default: 20')
parser.add_argument('--hidden-drop', type=float, default=0.3, help='Dropout for the hidden layer. Default: 0.3.')
parser.add_argument('--input-drop', type=float, default=0.2, help='Dropout for the input embeddings. Default: 0.2.')
parser.add_argument('--feat-drop', type=float, default=0.2, help='Dropout for the convolutional features. Default: 0.2.')
parser.add_argument('--lr-decay', type=float, default=0.995, help='Decay the learning rate by this factor every epoch. Default: 0.995')
parser.add_argument('--loader-threads', type=int, default=4, help='How many loader threads to use for the batch loaders. Default: 4')
parser.add_argument('--preprocess', action='store_true', help='Preprocess the dataset. Needs to be executed only once.')
parser.add_argument('--resume', action='store_true', help='Resume a model.')
parser.add_argument('--use-bias', action='store_true', help='Use a bias in the convolutional layer. Default: True')
parser.add_argument('--label-smoothing', type=float, default=0.1, help='Label smoothing value to use. Default: 0.1')
parser.add_argument('--hidden-size', type=int, default=9728, help='The size of the hidden layer. The required size changes with the size of the embeddings. Default: 9728 (embedding size 200).')

args = parser.parse_args()



# parse console parameters and set global variables
Config.backend = 'pytorch'
Config.cuda = True
Config.embedding_dim = args.embedding_dim
#Logger.GLOBAL_LOG_LEVEL = LogLevel.DEBUG


model_name = '{2}_{0}_{1}'.format(args.input_drop, args.hidden_drop, args.model)
model_path = 'saved_models/{0}_{1}.model'.format(args.data, model_name)

torch.manual_seed(args.seed)
main(args, model_path)