Description
Hello~
It seems that the following code in preprocess.py (lines 116-121) may inadvertently leak test anchor pairs into the training set. The variable ILL contains all anchor pairs (labels) loaded from icews_wiki/ref_pairs, which have already been divided into training and testing sets, saved in the files icews_wiki/sup_pairs and icews_wiki/ref_pairs, respectively.
However, the re-divided train split may include anchor pairs that belong to the testing set icews_wiki/ref_pairs, potentially leading to data leakage.
train = ILL[:1500]
test = ILL[1500:]
same_name = {}
for id_1,id_2 in train:
    name = id_1+"-"+id_2
    same_name[name] = [id_1,id_2]
Here, train is used to create same_name, which is subsequently used to build node2same, assigning identical structure embeddings to each pair of anchor nodes in train (in get_deep_emb.py). In short, the anchor pairs in train are treated as known alignments, so they must not include any testing data.
To prevent this, the code should be modified as follows:
train = load_file(self.path + 'sup_pairs')
test = load_file(self.path + 'ref_pairs')
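A quick way to confirm whether a given split leaks is to intersect the two pair lists. The snippet below is only a minimal sketch: find_leaked_pairs is a hypothetical helper that is not part of the repository, and the toy lists merely stand in for the contents of sup_pairs and ref_pairs (the real files are not reproduced here). The ordering of the combined list is also an assumption made for illustration.

```python
def find_leaked_pairs(train, test):
    """Return the anchor pairs that appear in both splits, in train order."""
    test_set = {tuple(pair) for pair in test}
    return [tuple(pair) for pair in train if tuple(pair) in test_set]


# Toy data standing in for sup_pairs (training) and ref_pairs (testing).
sup_pairs = [("0", "100"), ("1", "101")]
ref_pairs = [("2", "102"), ("3", "103"), ("4", "104")]

# Assumed scenario: a combined list is re-divided by position, so the
# first slice does not coincide with the official training file.
ILL = ref_pairs + sup_pairs
train_resliced = ILL[:2]

print(find_leaked_pairs(train_resliced, ref_pairs))  # [('2', '102'), ('3', '103')]
print(find_leaked_pairs(sup_pairs, ref_pairs))       # []
```

Running the same check on the real train and test lists right after they are built in preprocess.py should return an empty list if the split is clean.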