Is there a risk of test-set leakage in preprocess.py?

Hello~
It seems that the following code in preprocess.py (lines 116-121) may inadvertently leak test anchor pairs into the training set. The variable `ILL` contains all anchor pairs (labels) loaded from `icews_wiki/ref_pairs`, which have already been divided into training and testing sets, saved in the files `icews_wiki/sup_pairs` and `icews_wiki/ref_pairs`, respectively.

However, because the re-divided `train` is simply the first 1500 entries of `ILL`, it may include anchor pairs that belong to the testing set `icews_wiki/ref_pairs`, potentially leading to data leakage.

```
train = ILL[:1500]
test = ILL[1500:]
same_name = {}
for id_1,id_2 in train:
    name = id_1+"-"+id_2
    same_name[name] = [id_1,id_2]
```

Here, `train` is used to create `same_name`, which is subsequently used to generate `node2same`, assigning identical structure embeddings to each pair of anchor nodes in `train` (in `get_deep_emb.py`). In short, the anchor pairs in `train` are treated as known alignments, so they should not include any testing data.

To prevent this, the code should be modified as follows:
```
train = load_file(self.path + 'sup_pairs')
test = load_file(self.path + 'ref_pairs')
```
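
For what it's worth, here is a minimal sketch of how the overlap could be checked before and after the change. It is not taken from the repository: `find_leaked_pairs` is a hypothetical helper, the ids are made-up toy data, and I am assuming pairs are stored as two-element sequences of entity ids. Whether the real `ILL` is ordered so that test pairs fall inside the first 1500 entries would need to be confirmed on the actual files.

```python
def find_leaked_pairs(train_pairs, test_pairs):
    """Return the training pairs that also appear in the test split."""
    test_set = {tuple(pair) for pair in test_pairs}
    return [tuple(pair) for pair in train_pairs if tuple(pair) in test_set]


# Toy data standing in for the real files (ids are invented for illustration).
sup_pairs = [("0", "100"), ("1", "101")]  # official training anchors
ref_pairs = [("2", "102"), ("3", "103")]  # official testing anchors

# Hypothetical ordering of ILL in which a test pair precedes a training pair.
ILL = [("0", "100"), ("2", "102"), ("1", "101"), ("3", "103")]

train = ILL[:2]  # analogous to ILL[:1500] in preprocess.py
leaked = find_leaked_pairs(train, ref_pairs)
print(f"{len(leaked)} leaked pair(s): {leaked}")

# With the proposed change, train comes straight from sup_pairs,
# so no test pair can end up in it.
assert find_leaked_pairs(sup_pairs, ref_pairs) == []
```

Running the same check on the real `ILL[:1500]` against the contents of `ref_pairs` would show directly whether any test anchors end up in `train`.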
