ml6team
diff --git a/nlp/2021_11_25_augmentation_lm/README.md b/nlp/2021_11_25_augmentation_lm/README.md
Lines changed: 8 additions & 0 deletions
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/10/dataset.arrow b/nlp/2021_11_25_augmentation_lm/data/gpt-3/10/dataset.arrow
1.44 KB
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/10/dataset_info.json b/nlp/2021_11_25_augmentation_lm/data/gpt-3/10/dataset_info.json
Lines changed: 30 additions & 0 deletions
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/10/state.json b/nlp/2021_11_25_augmentation_lm/data/gpt-3/10/state.json
Lines changed: 14 additions & 0 deletions
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/100/dataset.arrow b/nlp/2021_11_25_augmentation_lm/data/gpt-3/100/dataset.arrow
8.42 KB
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/100/dataset_info.json b/nlp/2021_11_25_augmentation_lm/data/gpt-3/100/dataset_info.json
Lines changed: 30 additions & 0 deletions
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/100/state.json b/nlp/2021_11_25_augmentation_lm/data/gpt-3/100/state.json
Lines changed: 14 additions & 0 deletions
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/200/dataset.arrow b/nlp/2021_11_25_augmentation_lm/data/gpt-3/200/dataset.arrow
15.7 KB
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/200/dataset_info.json b/nlp/2021_11_25_augmentation_lm/data/gpt-3/200/dataset_info.json
Lines changed: 30 additions & 0 deletions
diff --git a/nlp/2021_11_25_augmentation_lm/data/gpt-3/200/state.json b/nlp/2021_11_25_augmentation_lm/data/gpt-3/200/state.json
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Text Augmentation using large-scale LMs and prompt engineering
+
+Typically, the more training data we have, the better the performance we can achieve 🤙. However, annotating a large amount of training data is often difficult and/or expensive 😞. In such cases, data augmentation can help boost model performance.
+
+Large-scale language models (LMs) are excellent few-shot learners, which means they can be steered with natural-language prompts. In this tip, we use three large-scale LMs (GPT-3, GPT-J and GPT-Neo) together with prompt engineering to generate realistic samples from a very small dataset. Each model receives two real samples from our dataset, embedded in a carefully designed prompt, and generates an augmented, mixed sample influenced by both. Using the [Emotion](https://huggingface.co/datasets/emotion) dataset and a pre-trained DistilBERT model, we show that this augmentation method improves model performance. For more information on text augmentation using large-scale LMs, see [GPT3Mix](https://arxiv.org/pdf/2104.08826.pdf).
+
+We recommend opening the notebook in Colab for an interactive experience and optimal rendering of the visuals 👇:
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml6team/quick-tips/blob/main/nlp/2021_11_25_augmentation_lm/nlp_augmentation_lm.ipynb)
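
The README describes embedding two real samples in a prompt so that the LM produces a mixed sample. A minimal sketch of that prompt-building step could look like the following. The function name `build_mix_prompt`, the prompt wording, and the label names in the usage note are assumptions for illustration only; the actual prompt lives in the notebook and may differ.

```python
import random


def build_mix_prompt(examples, label_names, k=2, seed=None):
    """Embed k randomly chosen (text, label_id) pairs in a few-shot prompt.

    Hypothetical helper: the prompt template below is an assumption,
    loosely modelled on the GPT3Mix idea, not the notebook's exact text.
    """
    rng = random.Random(seed)  # a local RNG keeps the sampling reproducible
    picked = rng.sample(examples, k)

    lines = [
        "Each item in the following list contains a text and its emotion.",
        f"Emotion is one of: {', '.join(label_names)}.",
        "",
    ]
    for text, label_id in picked:
        lines.append(f"Text: {text} (Emotion: {label_names[label_id]})")
    # Leave the last item open so the LM completes a new text and label.
    lines.append("Text:")
    return "\n".join(lines)
```

For example, `build_mix_prompt([("i feel great today", 1), ("i am so scared", 4)], ["sadness", "joy", "love", "anger", "fear", "surprise"])` returns a prompt ending in an open `Text:` line. The completion from GPT-3, GPT-J or GPT-Neo would then need to be parsed back into a text and a label before it is added to the training set.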
@@ -0,0 +1,30 @@
+{
+  "builder_name": null,
+  "citation": "",
+  "config_name": null,
+  "dataset_size": null,
+  "description": "",
+  "download_checksums": null,
+  "download_size": null,
+  "features": {
+    "text": {
+      "dtype": "string",
+      "id": null,
+      "_type": "Value"
+    },
+    "label": {
+      "dtype": "int64",
+      "id": null,
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": "",
+  "post_processed": null,
+  "post_processing_size": null,
+  "size_in_bytes": null,
+  "splits": null,
+  "supervised_keys": null,
+  "task_templates": null,
+  "version": null
+}
@@ -0,0 +1,14 @@
+{
+  "_data_files": [
+    {
+      "filename": "dataset.arrow"
+    }
+  ],
+  "_fingerprint": "83739e53a9e544c6",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_indexes": {},
+  "_output_all_columns": false,
+  "_split": null
+}
@@ -0,0 +1,30 @@
+{
+  "builder_name": null,
+  "citation": "",
+  "config_name": null,
+  "dataset_size": null,
+  "description": "",
+  "download_checksums": null,
+  "download_size": null,
+  "features": {
+    "text": {
+      "dtype": "string",
+      "id": null,
+      "_type": "Value"
+    },
+    "label": {
+      "dtype": "int64",
+      "id": null,
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": "",
+  "post_processed": null,
+  "post_processing_size": null,
+  "size_in_bytes": null,
+  "splits": null,
+  "supervised_keys": null,
+  "task_templates": null,
+  "version": null
+}
@@ -0,0 +1,14 @@
+{
+  "_data_files": [
+    {
+      "filename": "dataset.arrow"
+    }
+  ],
+  "_fingerprint": "304595d22fc3c18b",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_indexes": {},
+  "_output_all_columns": false,
+  "_split": null
+}
@@ -0,0 +1,30 @@
+{
+  "builder_name": null,
+  "citation": "",
+  "config_name": null,
+  "dataset_size": null,
+  "description": "",
+  "download_checksums": null,
+  "download_size": null,
+  "features": {
+    "text": {
+      "dtype": "string",
+      "id": null,
+      "_type": "Value"
+    },
+    "label": {
+      "dtype": "int64",
+      "id": null,
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": "",
+  "post_processed": null,
+  "post_processing_size": null,
+  "size_in_bytes": null,
+  "splits": null,
+  "supervised_keys": null,
+  "task_templates": null,
+  "version": null
+}
@@ -0,0 +1,14 @@
+{
+  "_data_files": [
+    {
+      "filename": "dataset.arrow"
+    }
+  ],
+  "_fingerprint": "d84c706c458fde16",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_indexes": {},
+  "_output_all_columns": false,
+  "_split": null
+}
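
Each directory above holds a `dataset.arrow`, a `dataset_info.json` and a `state.json`, which looks like the on-disk layout that the Hugging Face `datasets` library writes when a dataset is saved to disk; if so, `datasets.load_from_disk` would normally be the way to reload it. As a dependency-free sanity check, the sketch below reads only the two JSON files and summarizes them. The helper names are hypothetical, and the directory path in the usage note is taken from the diff paths rather than verified.

```python
import json
from pathlib import Path


def summarize_metadata(info, state):
    """Summarize parsed dataset_info.json and state.json contents."""
    # Map each column name to its declared dtype, e.g. {"text": "string"}.
    columns = {name: spec["dtype"] for name, spec in info["features"].items()}
    # state.json lists the Arrow files that hold the actual rows.
    data_files = [entry["filename"] for entry in state["_data_files"]]
    return {
        "columns": columns,
        "data_files": data_files,
        "fingerprint": state["_fingerprint"],
    }


def load_metadata(dataset_dir):
    """Read both JSON files from a saved dataset directory (hypothetical helper)."""
    root = Path(dataset_dir)
    info = json.loads((root / "dataset_info.json").read_text(encoding="utf-8"))
    state = json.loads((root / "state.json").read_text(encoding="utf-8"))
    return summarize_metadata(info, state)
```

Calling `load_metadata("nlp/2021_11_25_augmentation_lm/data/gpt-3/10")` from the repository root should report a `string` column `text` and an `int64` column `label`, matching the JSON shown in the hunks above.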