add metadata
phusroyal committed Feb 10, 2023
1 parent 5e01b19 commit a5a51d9
Showing 12 changed files with 52 additions and 4 deletions.
4 changes: 2 additions & 2 deletions codes/Models/PhoBERT_HOS.ipynb
@@ -76,8 +76,8 @@
 "source": [
 "# Efficiently load training, dev, and test data sets for train and optimal model evaluation\n",
 "\n",
-"train = pd.read_csv('data/Sequence labeling-based version/Word/train_BIO_Word.csv')\n",
-"dev = pd.read_csv('data/Sequence labeling-based version/Word/dev_BIO_Word.csv')\n",
+"train = pd.read_csv('data/Sequence_labeling_based_version/Word/train_BIO_Word.csv')\n",
+"dev = pd.read_csv('data/Sequence_labeling_based_version/Word/dev_BIO_Word.csv')\n",
 "test = pd.read_csv('data/Span Extraction-based version/test.csv')\n",
 "\n",
 "# Delete redundant column\n",
4 changes: 2 additions & 2 deletions codes/Models/XLMR_HOS.ipynb
@@ -74,8 +74,8 @@
 "source": [
 "# Efficiently load training, dev, and test data sets for train and optimal model evaluation\n",
 "\n",
-"train = pd.read_csv('data/Sequence labeling-based version/Syllable/train_BIO_syllabel.csv')\n",
-"dev = pd.read_csv('data/Sequence labeling-based version/Syllable/dev_BIO_syllabel.csv')\n",
+"train = pd.read_csv('data/Span_Extraction_based_version/Syllable/train_BIO_syllable.csv')\n",
+"dev = pd.read_csv('data/Span_Extraction_based_version/Syllable/dev_BIO_syllable.csv')\n",
 "test = pd.read_csv('data/Span Extraction-based version/test.csv')\n",
 "\n",
 "# Delete redundant column\n",
48 changes: 48 additions & 0 deletions data/README.md
@@ -0,0 +1,48 @@
Here is our data folder structure!
```
.
└── data/
    ├── Sequence labeling-based version/
    │   ├── Syllable/
    │   │   ├── dev_BIO_syllable.csv
    │   │   ├── test_BIO_syllable.csv
    │   │   └── train_BIO_syllable.csv
    │   └── Word/
    │       ├── dev_BIO_Word.csv
    │       ├── test_BIO_Word.csv
    │       └── train_BIO_Word.csv
    ├── Span Extraction-based version/
    │   ├── dev.csv
    │   └── train.csv
    └── Test_data/
        └── test.csv
```
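
As a rough sketch of how this layout might be addressed from code, the helper below builds the CSV paths from the tree above using only `pathlib`. The function name `dataset_paths` and its `root` argument are hypothetical, and the folder names are the ones written in this README. The notebooks touched by this commit reference underscore-style folder names such as `Sequence_labeling_based_version`, so adjust the names to whatever actually exists on disk.

```python
from pathlib import Path


def dataset_paths(root="data"):
    """Return a dict of CSV paths that mirrors the tree shown in this README.

    Hypothetical helper: it only builds paths and does not check that the
    files exist.
    """
    root = Path(root)
    seq = root / "Sequence labeling-based version"
    span = root / "Span Extraction-based version"
    return {
        "syllable_train": seq / "Syllable" / "train_BIO_syllable.csv",
        "syllable_dev": seq / "Syllable" / "dev_BIO_syllable.csv",
        "word_train": seq / "Word" / "train_BIO_Word.csv",
        "word_dev": seq / "Word" / "dev_BIO_Word.csv",
        "span_train": span / "train.csv",
        "span_dev": span / "dev.csv",
        "test": root / "Test_data" / "test.csv",
    }
```

Each value is a `Path`, so it can be passed directly to a CSV reader such as `pd.read_csv`.
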
# Sequence labeling-based version
## Syllable
Description:
- This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
  - **index**: The id of the word.
  - **word**: The words in the sentence after tokenization with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer, followed by splitting on underscores.
    The reason for this is that some words are badly formatted:
    e.g. with space tokenization, "điện.thoại của tôi" is split into ["điện.thoại", "của", "tôi"] instead of ["điện", "thoại", "của", "tôi"], which is not the correct syllable format.
    Therefore, we tokenize with VnCoreNLP first and then split the words into syllables:
    e.g. "điện.thoại của tôi" ---(VnCoreNLP)---> ["điện_thoại", "của", "tôi"] ---(split by "_")---> ["điện", "thoại", "của", "tôi"].
  - **tag**: The tag of the word, following the BIO scheme. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a span), or O (outside any span).
- The train_BIO_syllable and dev_BIO_syllable files are used for training and validating the XLMR model, respectively.
- The test_BIO_syllable file is for reference only. It is not used for testing the model. **Please use the test.csv file in the Test_data folder for testing the model.**
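
The underscore-splitting step described above can be sketched as follows. This is a minimal illustration, not the preprocessing script used to build the data: it assumes the VnCoreNLP word segmentation has already been run and its output is available as a list of strings, and the function name `words_to_syllables` is hypothetical.

```python
def words_to_syllables(segmented_words):
    """Split word-segmented tokens (syllables joined by "_") into syllables."""
    syllables = []
    for word in segmented_words:
        # Drop empty pieces in case a token starts or ends with "_".
        syllables.extend(piece for piece in word.split("_") if piece)
    return syllables


# Input taken from the example in this README (output of the VnCoreNLP step).
print(words_to_syllables(["điện_thoại", "của", "tôi"]))
# ['điện', 'thoại', 'của', 'tôi']
```

Note that under this assumption a token consisting only of underscores would be dropped entirely, which may or may not match the original preprocessing.
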
## Word
Description:
- This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
  - **index**: The id of the word.
  - **word**: The words in the sentence after tokenization with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer.
  - **tag**: The tag of the word, following the BIO scheme. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a span), or O (outside any span).
- The train_BIO_Word and dev_BIO_Word files are used for training and validating the PhoBERT model, respectively.
- The test_BIO_Word file is for reference only. It is not used for testing the model. **Please use the test.csv file in the Test_data folder for testing the model.**
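
To illustrate how the **tag** column can be read, here is a small sketch that groups a tag sequence into token-index ranges. It assumes the usual BIO convention, where B-T opens a tagged span, I-T continues it, and O lies outside any span. The function name `bio_to_spans` is hypothetical and not part of this repository.

```python
def bio_to_spans(tags):
    """Group B-T/I-T tags into (start, end) token-index ranges, end inclusive."""
    spans = []
    start = None  # index of the first token of the span currently open
    for i, tag in enumerate(tags):
        if tag == "B-T":
            # A new span begins; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "I-T":
            # Tolerate a span that starts with I-T instead of B-T.
            if start is None:
                start = i
        else:
            # "O" (or any other tag) closes the open span.
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans
```

For example, the tags `["O", "B-T", "I-T", "O", "B-T"]` yield two ranges, one covering tokens 1 to 2 and one covering token 4.
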

# Span Extraction-based version
Description:
- This folder contains the data for the span extraction-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
  - **content**: The content of the sentence.
  - **index_spans**: The indices of the hate and offensive spans in the sentence. Each span is given in the format [start, end], where start is the index of the first character of the span and end is the index of its last character.
- The train and dev files are used for training and validating the BiLSTM-CRF model, respectively.
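
As a minimal sketch of how **index_spans** relates to **content**, the function below slices out the text of each span. It assumes the spans have already been parsed into a list of [start, end] integer pairs with an inclusive end, as described above. How the column is serialized in the CSV is not specified here, and the function name `extract_spans` is hypothetical.

```python
def extract_spans(content, index_spans):
    """Return the substrings covered by [start, end] pairs, with end inclusive."""
    # Python slices exclude the stop index, so add 1 to include the last character.
    return [content[start:end + 1] for start, end in index_spans]
```

For instance, `extract_spans("abc def ghi", [[4, 6]])` returns `["def"]`.
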

File renamed without changes.
File renamed without changes.
