Showing 12 changed files with 52 additions and 4 deletions.
@@ -0,0 +1,48 @@
Here is our data folder structure!
```
.
└── data/
    ├── Sequence labeling-based version/
    │   ├── Syllable/
    │   │   ├── dev_BIO_syllable.csv
    │   │   ├── test_BIO_syllable.csv
    │   │   └── train_BIO_syllable.csv
    │   └── Word/
    │       ├── dev_BIO_Word.csv
    │       ├── test_BIO_Word.csv
    │       └── train_BIO_Word.csv
    ├── Span Extraction-based version/
    │   ├── dev.csv
    │   └── train.csv
    └── Test_data/
        └── test.csv
```
# Sequence labeling-based version
## Syllable
Description:
- This folder contains the syllable-level data for the sequence labeling-based version of the task. The data is divided into three files: train, dev, and test. Each file contains the following columns:
    - **index**: The id of the word.
    - **word**: The tokens of the sentence, obtained by tokenizing with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer and then splitting on underscores.
      Plain space tokenization is not enough because some words are badly formatted:
      e.g. "điện.thoại của tôi" is split into ["điện.thoại", "của", "tôi"] instead of ["điện", "thoại", "của", "tôi"] if we use space tokenization, which is not valid syllable-level output.
      Therefore, we tokenize with VnCoreNLP first and then split each word into syllables:
      e.g. "điện.thoại của tôi" ---(VnCoreNLP)---> ["điện_thoại", "của", "tôi"] ---(split by "_")---> ["điện", "thoại", "của", "tôi"].
    - **tag**: The tag of the word. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a span), or O (outside any span).
- The train_BIO_syllable and dev_BIO_syllable files are used for training and validating the XLMR model, respectively.
- The test_BIO_syllable file is for reference only. It is not used for testing the model. **Please use the test.csv file in the Test_data folder for testing the model.**
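
The second step of the tokenization described above (splitting word-segmented tokens on "_") can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function name `words_to_syllables` is hypothetical, and the input is assumed to be a list of tokens already segmented by VnCoreNLP.

```python
def words_to_syllables(segmented_words):
    """Split word-segmented tokens (joined by "_") into syllables.

    Assumption: `segmented_words` is the token list produced by a word
    segmenter such as VnCoreNLP, e.g. ["điện_thoại", "của", "tôi"].
    """
    syllables = []
    for word in segmented_words:
        # Drop empty pieces so that "a__b" or a trailing "_" adds no blanks.
        # Note: a token that is only "_" would disappear entirely here.
        syllables.extend(part for part in word.split("_") if part)
    return syllables


# Example taken from the description above.
print(words_to_syllables(["điện_thoại", "của", "tôi"]))
```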
## Word
Description:
- This folder contains the word-level data for the sequence labeling-based version of the task. The data is divided into three files: train, dev, and test. Each file contains the following columns:
    - **index**: The id of the word.
    - **word**: The words of the sentence, obtained by tokenizing with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer.
    - **tag**: The tag of the word. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a span), or O (outside any span).
- The train_BIO_Word and dev_BIO_Word files are used for training and validating the PhoBERT model, respectively.
- The test_BIO_Word file is for reference only. It is not used for testing the model. **Please use the test.csv file in the Test_data folder for testing the model.**
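
To make the tag column concrete, here is a small sketch that groups a sequence of B-T / I-T / O tags into token-level spans. It assumes the tags follow the usual BIO convention (B-T opens a span, I-T continues it, O is outside); the helper name `bio_to_spans` and the toy tag list are illustrative only and are not taken from the dataset.

```python
def bio_to_spans(tags):
    """Return a list of (start, end) token positions, end inclusive."""
    spans = []
    start = None  # position where the currently open span began
    for i, tag in enumerate(tags):
        if tag == "B-T":
            # A new span begins; close the previous one if still open.
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "I-T":
            # Tolerate an I-T with no preceding B-T by opening a span here.
            if start is None:
                start = i
        else:  # "O" or anything unexpected closes the open span
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


# Toy tag sequence (hypothetical, for illustration only).
print(bio_to_spans(["O", "B-T", "I-T", "O", "B-T"]))
```

Applied to the tag column of one sentence, this gives the token ranges that the model is expected to label.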
# Span Extraction-based version
Description:
- This folder contains the data for the span extraction-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
    - **content**: The text of the sentence.
    - **index_spans**: The character indices of the hate and offensive spans in the sentence, in the format [start, end], where start is the index of the first character of the span and end is the index of the last character of the span.
- The train and dev files are used for training and validating the BiLSTM-CRF model, respectively.
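
As a rough sketch of how the [start, end] format described above maps back to text: because end is the index of the last character, the slice needs `end + 1`. The function name `extract_span` is hypothetical, and it is an assumption that a CSV cell may hold the pair as a string such as "[4, 6]"; check the actual files before relying on this.

```python
import ast


def extract_span(content, index_spans):
    """Return the substring of `content` covered by an inclusive [start, end] pair."""
    if isinstance(index_spans, str):
        # Assumption: the CSV cell holds a Python-style list literal.
        index_spans = ast.literal_eval(index_spans)
    start, end = index_spans
    # `end` is inclusive in the description, so add 1 for Python slicing.
    return content[start:end + 1]


# Neutral placeholder text (not from the dataset).
print(extract_span("abc def ghi", "[4, 6]"))
```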
File renamed without changes.