Whether there is a risk of leaking test sets in preprocess.py #1

@Longmeix

Hello~
It seems that the following code in preprocess.py (lines 116-121) may inadvertently leak test anchor pairs into the training set. The variable ILL contains all anchor pairs (labels) loaded from icews_wiki/ref_pairs. These pairs have already been divided into training and testing sets, saved in the files icews_wiki/sup_pairs and icews_wiki/ref_pairs, respectively.

However, the re-split train may include anchor pairs that belong to the testing set icews_wiki/ref_pairs, which would be data leakage.

train = ILL[:1500]
test = ILL[1500:]
same_name = {}
for id_1,id_2 in train:
    name = id_1+"-"+id_2
    same_name[name] = [id_1,id_2]

Here, train is used to create same_name, which is subsequently used to generate node2same, so that both nodes of each anchor pair in train are assigned identical structure embeddings (in get_deep_emb.py). In short, the anchor pairs in train are treated as known labels, so train should not include any testing data.
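To make the concern concrete, here is a minimal sketch of how the overlap could be measured. The helper name leaked_pairs and the toy IDs are my own and are not from the repository. The toy ILL and the cut point are assumptions chosen only to show the effect of re-splitting by index.

```python
def leaked_pairs(train, held_out):
    """Return the training pairs that also appear in the held-out test set."""
    held_out_set = {tuple(p) for p in held_out}
    return [tuple(p) for p in train if tuple(p) in held_out_set]


# Toy stand-ins for the official split (made-up IDs, not real data).
sup_pairs = [("1", "101"), ("2", "102")]
ref_pairs = [("3", "103"), ("4", "104"), ("5", "105")]

# Assumption for illustration: ILL holds every pair, and the index cut
# does not line up with the official split boundary.
ILL = sup_pairs + ref_pairs
train = ILL[:3]

leaked = leaked_pairs(train, ref_pairs)
print(len(leaked), "test pair(s) found in train:", leaked)
```

Running the same check on the real train and on the contents of icews_wiki/ref_pairs would show whether any test pairs end up in same_name.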

To prevent this, I suggest changing the code as follows:

train = load_file(self.path + 'sup_pairs')
test = load_file(self.path + 'ref_pairs')
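As an optional safeguard, the two files could also be checked for overlap when they are loaded. The sketch below is not the repository's load_file. The helper names read_pairs and load_split are hypothetical, and I am assuming each file holds one whitespace-separated ID pair per line.

```python
import os
import tempfile


def read_pairs(path):
    """Read one whitespace-separated ID pair per line (assumed file format)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                pairs.append((parts[0], parts[1]))
    return pairs


def load_split(data_dir):
    """Load the predefined split and fail loudly if the two sets overlap."""
    train = read_pairs(os.path.join(data_dir, "sup_pairs"))
    test = read_pairs(os.path.join(data_dir, "ref_pairs"))
    overlap = set(train) & set(test)
    if overlap:
        raise ValueError(f"{len(overlap)} anchor pairs appear in both splits")
    return train, test


# Usage with toy files (made-up IDs) written to a temporary directory.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "sup_pairs"), "w", encoding="utf-8") as f:
        f.write("1\t101\n2\t102\n")
    with open(os.path.join(d, "ref_pairs"), "w", encoding="utf-8") as f:
        f.write("3\t103\n4\t104\n")
    train, test = load_split(d)
```

Raising an error on overlap is a design choice: it stops a run before leaked pairs can reach same_name, rather than silently dropping them.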
