diff --git a/codes/Models/PhoBERT_HOS.ipynb b/codes/Models/PhoBERT_HOS.ipynb
index 8c3d962..6a94a52 100644
--- a/codes/Models/PhoBERT_HOS.ipynb
+++ b/codes/Models/PhoBERT_HOS.ipynb
@@ -76,8 +76,8 @@
    "source": [
     "# Efficiently load training, dev, and test data sets for train and optimal model evaluation\n",
     "\n",
-    "train = pd.read_csv('data/Sequence labeling-based version/Word/train_BIO_Word.csv')\n",
-    "dev = pd.read_csv('data/Sequence labeling-based version/Word/dev_BIO_Word.csv')\n",
+    "train = pd.read_csv('data/Sequence_labeling_based_version/Word/train_BIO_Word.csv')\n",
+    "dev = pd.read_csv('data/Sequence_labeling_based_version/Word/dev_BIO_Word.csv')\n",
     "test = pd.read_csv('data/Span Extraction-based version/test.csv')\n",
     "\n",
     "# Delete redundant column\n",
diff --git a/codes/Models/XLMR_HOS.ipynb b/codes/Models/XLMR_HOS.ipynb
index 81e3634..9e8dd35 100644
--- a/codes/Models/XLMR_HOS.ipynb
+++ b/codes/Models/XLMR_HOS.ipynb
@@ -74,8 +74,8 @@
    "source": [
     "# Efficiently load training, dev, and test data sets for train and optimal model evaluation\n",
     "\n",
-    "train = pd.read_csv('data/Sequence labeling-based version/Syllable/train_BIO_syllabel.csv')\n",
-    "dev = pd.read_csv('data/Sequence labeling-based version/Syllable/dev_BIO_syllabel.csv')\n",
+    "train = pd.read_csv('data/Sequence_labeling_based_version/Syllable/train_BIO_syllable.csv')\n",
+    "dev = pd.read_csv('data/Sequence_labeling_based_version/Syllable/dev_BIO_syllable.csv')\n",
     "test = pd.read_csv('data/Span Extraction-based version/test.csv')\n",
     "\n",
     "# Delete redundant column\n",
diff --git a/data/README.md b/data/README.md
new file mode 100644
index 0000000..02e6a0b
--- /dev/null
+++ b/data/README.md
@@ -0,0 +1,48 @@
+Here is our data folder structure!
+```
+.
+└── data/
+    ├── Sequence_labeling_based_version/
+    │   ├── Syllable/
+    │   │   ├── dev_BIO_syllable.csv
+    │   │   ├── test_BIO_syllable.csv
+    │   │   └── train_BIO_syllable.csv
+    │   └── Word/
+    │       ├── dev_BIO_Word.csv
+    │       ├── test_BIO_Word.csv
+    │       └── train_BIO_Word.csv
+    ├── Span_Extraction_based_version/
+    │   ├── dev.csv
+    │   └── train.csv
+    └── Test_data/
+        └── test.csv
+```
+# Sequence labeling-based version
+## Syllable
+Description:
+- This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
+    - **index**: The id of the word.
+    - **word**: The tokens of the sentence, obtained by tokenizing with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer and then splitting on underscores.
+        The reason is that some words are badly formatted:
+        e.g. with whitespace tokenization, "điện.thoại của tôi" is split into ["điện.thoại", "của", "tôi"] instead of ["điện", "thoại", "của", "tôi"], which is not valid syllable-level output.
+        Therefore, we tokenize with VnCoreNLP first and then split the words into syllables.
+        e.g. "điện.thoại của tôi" ---(VnCoreNLP)---> ["điện_thoại", "của", "tôi"] ---(split by "_")---> ["điện", "thoại", "của", "tôi"].
+    - **tag**: The tag of the word. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a span), or O (outside any span).
+- The train_BIO_syllable and dev_BIO_syllable files are used for training and validating the XLMR model, respectively.
+- The test_BIO_syllable file is for reference only. It is not used for testing the model. **Please use the test.csv file in the Test_data folder for testing the model.**
+## Word
+Description:
+- This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
+    - **index**: The id of the word.
+    - **word**: The words of the sentence after tokenization with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer.
+    - **tag**: The tag of the word. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a span), or O (outside any span).
+- The train_BIO_Word and dev_BIO_Word files are used for training and validating the PhoBERT model, respectively.
+- The test_BIO_Word file is for reference only. It is not used for testing the model. **Please use the test.csv file in the Test_data folder for testing the model.**
+
+# Span Extraction-based version
+Description:
+- This folder contains the data for the span extraction-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
+    - **content**: The content of the sentence.
+    - **index_spans**: The indices of the hate and offensive spans in the sentence, in the format [start, end], where start is the index of the first character of the span and end is the index of its last character.
+- The train and dev files are used for training and validating the BiLSTM-CRF model, respectively.
+
\ No newline at end of file
diff --git a/data/Sequence labeling-based version/Syllable/dev_BIO_syllabel.csv b/data/Sequence_labeling_based_version/Syllable/dev_BIO_syllable.csv
similarity index 100%
rename from data/Sequence labeling-based version/Syllable/dev_BIO_syllabel.csv
rename to data/Sequence_labeling_based_version/Syllable/dev_BIO_syllable.csv
diff --git a/data/Sequence labeling-based version/Syllable/test_BIO_syllabel.csv b/data/Sequence_labeling_based_version/Syllable/test_BIO_syllable.csv
similarity index 100%
rename from data/Sequence labeling-based version/Syllable/test_BIO_syllabel.csv
rename to data/Sequence_labeling_based_version/Syllable/test_BIO_syllable.csv
diff --git a/data/Sequence labeling-based version/Syllable/train_BIO_syllabel.csv b/data/Sequence_labeling_based_version/Syllable/train_BIO_syllable.csv
similarity index 100%
rename from data/Sequence labeling-based version/Syllable/train_BIO_syllabel.csv
rename to data/Sequence_labeling_based_version/Syllable/train_BIO_syllable.csv
diff --git a/data/Sequence labeling-based version/Word/dev_BIO_Word.csv b/data/Sequence_labeling_based_version/Word/dev_BIO_Word.csv
similarity index 100%
rename from data/Sequence labeling-based version/Word/dev_BIO_Word.csv
rename to data/Sequence_labeling_based_version/Word/dev_BIO_Word.csv
diff --git a/data/Sequence labeling-based version/Word/test_BIO_Word.csv b/data/Sequence_labeling_based_version/Word/test_BIO_Word.csv
similarity index 100%
rename from data/Sequence labeling-based version/Word/test_BIO_Word.csv
rename to data/Sequence_labeling_based_version/Word/test_BIO_Word.csv
diff --git a/data/Sequence labeling-based version/Word/train_BIO_Word.csv b/data/Sequence_labeling_based_version/Word/train_BIO_Word.csv
similarity index 100%
rename from data/Sequence labeling-based version/Word/train_BIO_Word.csv
rename to data/Sequence_labeling_based_version/Word/train_BIO_Word.csv
diff --git a/data/Span Extraction-based version/dev.csv b/data/Span_Extraction_based_version/dev.csv
similarity index 100%
rename from data/Span Extraction-based version/dev.csv
rename to data/Span_Extraction_based_version/dev.csv
diff --git a/data/Span Extraction-based version/train.csv b/data/Span_Extraction_based_version/train.csv
similarity index 100%
rename from data/Span Extraction-based version/train.csv
rename to data/Span_Extraction_based_version/train.csv
diff --git a/data/Span Extraction-based version/test.csv b/data/Test_data/test.csv
similarity index 100%
rename from data/Span Extraction-based version/test.csv
rename to data/Test_data/test.csv
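
Note on the Syllable data described in data/README.md: the README says words are first tokenized with VnCoreNLP and then split on "_" to obtain syllables. The second step can be sketched as below. This is a minimal illustration, not the authors' preprocessing code: the function name `split_to_syllables` is hypothetical, and the input list is a hand-written stand-in for VnCoreNLP output (taken from the README example) rather than a real call to the tokenizer.

```python
def split_to_syllables(word_tokens):
    """Split word tokens whose syllables are joined by "_" into single syllables.

    Hypothetical helper: it assumes the tokens were already produced by a
    word segmenter such as VnCoreNLP, which this sketch does not call.
    """
    syllables = []
    for token in word_tokens:
        # Skip empty pieces, which appear when a token starts or ends
        # with "_" or contains "__". A token that is only "_" is dropped.
        syllables.extend(piece for piece in token.split("_") if piece)
    return syllables


# Stand-in for the VnCoreNLP output shown in the README example.
word_tokens = ["điện_thoại", "của", "tôi"]
syllables = split_to_syllables(word_tokens)
print(syllables)  # ['điện', 'thoại', 'của', 'tôi']
```

If the real data contains literal underscores that must be kept as tokens, the filter on empty pieces would need to be adjusted.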