Commit 0285b09

Merge pull request #28 from ml6team/feature/nlp_gpt3mix
Feature/nlp gpt3mix
2 parents 3b76749 + f7616fc commit 0285b09

49 files changed: +2497 -0 lines changed
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# Text Augmentation using large-scale LMs and prompt engineering

Typically, the more data we have, the better the performance we can achieve 🤙. However, annotating a large amount of training data is sometimes difficult and/or expensive 😞. In such cases, data augmentation can help boost model performance.

Large-scale language models (LMs) are excellent few-shot learners, which allows them to be controlled via natural-language prompts. In this tip, we leverage three large-scale LMs (GPT-3, GPT-J and GPT-Neo) and prompt engineering to generate very realistic samples from a very small dataset. The model takes two real samples from our dataset as input, embedded in a carefully designed prompt, and generates an augmented mixed sample influenced by those sample sentences. We use the [Emotion](https://huggingface.co/datasets/emotion) dataset and a pre-trained distilled BERT model and show that this augmentation method boosts model performance and generates very realistic samples. For more information on text augmentation using large-scale LMs, see [GPT3Mix](https://arxiv.org/pdf/2104.08826.pdf).

We recommend opening the notebook in Colab for an interactive, explainable experience and optimal rendering of the visuals 👇:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml6team/quick-tips/blob/main/nlp/2021_11_25_augmentation_lm/nlp_augmentation_lm.ipynb)
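
The README describes embedding two real samples in a prompt so that the LM continues with a mixed sample. A minimal sketch of that prompt-building step might look like the following. The template wording, the function name `build_mix_prompt`, and the toy sentences are assumptions made for illustration, not the notebook's actual prompt or data; the resulting string would then be sent to whichever LM is used (that call is omitted here).

```python
import random

# Label names of the Emotion dataset; treat this list as an assumption
# and adjust it if your copy of the dataset differs.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Toy (text, label) pairs, invented for this sketch.
TOY_DATASET = [
    ("i feel so alone tonight", "sadness"),
    ("i am thrilled about the trip", "joy"),
    ("i cannot stand this noise", "anger"),
]


def build_mix_prompt(examples, label_names):
    """Embed real (text, label) examples in a prompt that ends with an open slot.

    The LM is expected to fill the open slot with a new text and label
    that blend the examples shown above it.
    """
    header = (
        "Each item in the following list contains a text and its emotion. "
        "Emotion is one of " + ", ".join(f"'{n}'" for n in label_names) + "."
    )
    lines = [header]
    for text, label in examples:
        lines.append(f"Text: {text} (Emotion: {label})")
    # Open slot: the model continues from here.
    lines.append("Text:")
    return "\n".join(lines)


rng = random.Random(0)              # seeded for reproducibility
pair = rng.sample(TOY_DATASET, 2)   # two distinct real samples
prompt = build_mix_prompt(pair, LABELS)
print(prompt)
```

Ending the prompt on an open `Text:` line is what nudges the model to produce one more item in the same format, which can then be parsed back into a text and a label.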
1.44 KB
Binary file not shown.
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
{
  "builder_name": null,
  "citation": "",
  "config_name": null,
  "dataset_size": null,
  "description": "",
  "download_checksums": null,
  "download_size": null,
  "features": {
    "text": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "label": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": "",
  "post_processed": null,
  "post_processing_size": null,
  "size_in_bytes": null,
  "splits": null,
  "supervised_keys": null,
  "task_templates": null,
  "version": null
}
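
To make the schema above concrete, here is a small sketch that reads the `features` section and maps each column to its declared dtype. It uses only the standard library; the helper name `column_dtypes` is hypothetical, and the JSON string is an abbreviated copy of the file above.

```python
import json

# Abbreviated copy of the metadata shown above (only the "features" key).
DATASET_INFO = """
{
  "features": {
    "text":  {"dtype": "string", "id": null, "_type": "Value"},
    "label": {"dtype": "int64",  "id": null, "_type": "Value"}
  }
}
"""


def column_dtypes(info_json):
    """Return {column name: dtype} for every plain 'Value' feature."""
    info = json.loads(info_json)
    return {
        name: spec["dtype"]
        for name, spec in info["features"].items()
        if spec.get("_type") == "Value"
    }


dtypes = column_dtypes(DATASET_INFO)
print(dtypes)
```

Note that `label` is stored here as a plain `int64` value, so the mapping from integer to emotion name is not recorded in this file and has to come from elsewhere.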
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "83739e53a9e544c6",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_output_all_columns": false,
  "_split": null
}
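
The state file above points at the Arrow file that holds the rows. As a rough sketch, the referenced data files can be resolved relative to the dataset directory as shown below. The directory name `augmented_dataset` and the helper `data_file_paths` are hypothetical. With the Hugging Face `datasets` library installed, `datasets.load_from_disk` on that directory would normally be the way to load the data itself.

```python
import json
from pathlib import PurePosixPath

# Copy of the state file shown above.
STATE = """
{
  "_data_files": [{"filename": "dataset.arrow"}],
  "_fingerprint": "83739e53a9e544c6",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_output_all_columns": false,
  "_split": null
}
"""


def data_file_paths(state_json, dataset_dir):
    """List the data files named in the state, relative to dataset_dir."""
    state = json.loads(state_json)
    base = PurePosixPath(dataset_dir)
    return [str(base / entry["filename"]) for entry in state["_data_files"]]


# "augmented_dataset" is a placeholder directory name.
paths = data_file_paths(STATE, "augmented_dataset")
print(paths)
```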
8.42 KB
Binary file not shown.
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
{
  "builder_name": null,
  "citation": "",
  "config_name": null,
  "dataset_size": null,
  "description": "",
  "download_checksums": null,
  "download_size": null,
  "features": {
    "text": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "label": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": "",
  "post_processed": null,
  "post_processing_size": null,
  "size_in_bytes": null,
  "splits": null,
  "supervised_keys": null,
  "task_templates": null,
  "version": null
}
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "304595d22fc3c18b",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_output_all_columns": false,
  "_split": null
}
15.7 KB
Binary file not shown.
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
{
  "builder_name": null,
  "citation": "",
  "config_name": null,
  "dataset_size": null,
  "description": "",
  "download_checksums": null,
  "download_size": null,
  "features": {
    "text": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "label": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": "",
  "post_processed": null,
  "post_processing_size": null,
  "size_in_bytes": null,
  "splits": null,
  "supervised_keys": null,
  "task_templates": null,
  "version": null
}
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "d84c706c458fde16",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_output_all_columns": false,
  "_split": null
}
