added ALBERT benchmarks

plkmo · Mar 13, 2020 · 11e6aae
1 parent 0322ec9

Showing 6 changed files with 15 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -100,13 +100,13 @@ Predicted:  Cause-Effect(e2,e1)
 
 ## Benchmark Results
 ### MTB pre-training
-Base architecture: ALBERT base uncased (12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters)
+2) Base architecture: ALBERT base uncased (12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters)
 MTB training results:
 ![](https://github.com/plkmo/BERT-Relation-Extraction/blob/master/results/CNN/loss_vs_epoch_1.png) 
 ![](https://github.com/plkmo/BERT-Relation-Extraction/blob/master/results/CNN/accuracy_vs_epoch_1.png) 
 
 ### SemEval2010 Task 8
-Base architecture: BERT base uncased (12-layer, 768-hidden, 12-heads, 110M parameters)
+1) Base architecture: BERT base uncased (12-layer, 768-hidden, 12-heads, 110M parameters)
 With MTB pre-training: F1 results when trained on 100 % training data:
 ![](https://github.com/plkmo/BERT-Relation-Extraction/blob/master/results/CNN/blanks_task_test_f1_vs_epoch_0.png) 
 
@@ -115,8 +115,18 @@ Without MTB pre-training: F1 results when trained on 100 % training data:
 
 With 100 % training data, both models perform similarly, consistent with the results reported in the paper. Cases where training data is limited are yet to be tested.
 
+2) Base architecture: ALBERT base uncased (12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters)
+With MTB pre-training: F1 results when trained on 100 % training data:
+![](https://github.com/plkmo/BERT-Relation-Extraction/blob/master/results/CNN/blanks_task_test_f1_vs_epoch_1.png) 
+
+Without MTB pre-training: F1 results when trained on 100 % training data:
+![](https://github.com/plkmo/BERT-Relation-Extraction/blob/master/results/CNN/task_test_f1_vs_epoch_1.png) 
+
+For ALBERT, pre-training with MTB appears to cause the model to overfit. Using ALBERT directly on SemEval2010 Task 8 gives a much better F1.  
+It seems that ALBERT's modifications (parameter sharing across layers and factorization of the embedding parametrization) are not well suited to MTB pre-training.  
+
 ## To add
-- ~~inference~~ & results on benchmarks (SemEval2010 Task 8) with & without MTB pre-training 
+- ~~inference & results on benchmarks (SemEval2010 Task 8) with & without MTB pre-training~~
 - ~~fine-tuning MTB on supervised relation extraction tasks~~
 - FewRel task
 
diff --git a/main_task.py b/main_task.py
@@ -32,9 +32,9 @@
     parser.add_argument("--gradient_acc_steps", type=int, default=1, help="No. of steps of gradient accumulation")
     parser.add_argument("--max_norm", type=float, default=1.0, help="Clipped gradient norm")
     parser.add_argument("--fp16", type=int, default=0, help="1: use mixed precision ; 0: use floating point 32") # mixed precision doesn't seem to train well
-    parser.add_argument("--num_epochs", type=int, default=23, help="No of epochs")
+    parser.add_argument("--num_epochs", type=int, default=10, help="No of epochs")
     parser.add_argument("--lr", type=float, default=0.00005, help="learning rate")
-    parser.add_argument("--model_no", type=int, default=0, help='''Model ID: 0 - BERT\n
+    parser.add_argument("--model_no", type=int, default=1, help='''Model ID: 0 - BERT\n
                                                                             1 - ALBERT''')
 
     parser.add_argument("--train", type=int, default=1, help="0: Don't train, 1: train")

diff --git a/results/CNN/blanks_task_test_f1_vs_epoch_1.png b/results/CNN/blanks_task_test_f1_vs_epoch_1.png
diff --git a/results/CNN/blanks_task_train_accuracy_vs_epoch_1.png b/results/CNN/blanks_task_train_accuracy_vs_epoch_1.png
diff --git a/results/CNN/task_test_f1_vs_epoch_1.png b/results/CNN/task_test_f1_vs_epoch_1.png
diff --git a/results/CNN/task_train_accuracy_vs_epoch_1.png b/results/CNN/task_train_accuracy_vs_epoch_1.png
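
Note on the `main_task.py` hunk: the commit changes two defaults, so a run with no flags now selects ALBERT (`--model_no 1`) for 10 epochs rather than BERT for 23. The snippet below is a minimal sketch of that behaviour, not the repository's actual script. It rebuilds only the two changed flags from the hunk; `build_parser`, `MODEL_NAMES`, and `describe_run` are hypothetical names introduced here for illustration.

```python
import argparse

# Hypothetical lookup for illustration; the IDs mirror the --model_no help text.
MODEL_NAMES = {0: "BERT", 1: "ALBERT"}


def build_parser() -> argparse.ArgumentParser:
    """Rebuild only the two flags whose defaults changed in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_epochs", type=int, default=10, help="No of epochs")
    parser.add_argument("--model_no", type=int, default=1,
                        help="Model ID: 0 - BERT, 1 - ALBERT")
    return parser


def describe_run(argv):
    """Summarise which model and epoch count a given command line selects."""
    args = build_parser().parse_args(argv)
    return f"{MODEL_NAMES[args.model_no]} for {args.num_epochs} epochs"


# No flags: the post-commit defaults apply.
print(describe_run([]))
# Explicit flags restore the pre-commit behaviour.
print(describe_run(["--model_no", "0", "--num_epochs", "23"]))
```

To reproduce the earlier BERT results after this commit, the previous values would therefore need to be passed explicitly, as in the second call.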