
# Rosetta 🌹

An analysis of distilbert-base-uncased vs. distilbert-base-cased for NER on the SkillSpan dataset.

## Model Comparison Analysis

| Metric | distilbert-base-uncased | distilbert-base-cased |
|---|---|---|
| eval_loss | 0.276139 | 0.308579 |
| eval_precision | 0.474191 | 0.481634 |
| eval_recall | 0.547598 | 0.519128 |
| eval_f1 | 0.508258 | 0.499679 |
| eval_knowledge_f1 | 0.598270 | 0.588942 |
| eval_skill_f1 | 0.408696 | 0.397241 |
| eval_skill_precision | 0.388751 | 0.398524 |
| eval_skill_recall | 0.430797 | 0.395967 |
| eval_knowledge_precision | 0.548666 | 0.548917 |
| eval_knowledge_recall | 0.657736 | 0.635264 |
| eval_samples_per_second | 292.413 | 251.477 |
| eval_steps_per_second | 36.623 | 31.496 |
| epoch | 5.0 | 5.0 |
| train_time (s) | 756.488262 | 878.169684 |
| num_params | 66366725 | 65194757 |
| avg_inference_latency (s) | 0.003466 | 0.004008 |
| dev_f1 | 0.509213 | 0.489948 |
| overfit_gap | 0.000955 | -0.009731 |
| top_confusion | [(('O', 'I-Skill'), 820), (('I-Skill', 'O'), 7... | [(('I-Skill', 'O'), 891), (('O', 'I-Skill'), 6... |

*Full results: 2 rows × 21 columns; some columns elided.*

## Performance Metrics Comparison

| Metric | Uncased | Cased | Δ | Winner |
|---|---|---|---|---|
| F1 | 0.508 | 0.500 | +0.009 | uncased |
| Precision | 0.474 | 0.482 | -0.007 | cased |
| Recall | 0.548 | 0.519 | +0.028 | uncased |
| Skill F1 | 0.409 | 0.397 | +0.011 | uncased |
| Knowledge F1 | 0.598 | 0.589 | +0.009 | uncased |
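The Δ column is simply the uncased score minus the cased score, with the winner picked by the sign. A minimal sketch of that computation, using the evaluation values reported above:

```python
# Per-metric delta (uncased minus cased) and winner, using the
# evaluation scores from the results table above.
metrics = {
    "F1": (0.508258, 0.499679),
    "Precision": (0.474191, 0.481634),
    "Recall": (0.547598, 0.519128),
    "Skill F1": (0.408696, 0.397241),
    "Knowledge F1": (0.598270, 0.588942),
}

rows = []
for name, (uncased, cased) in metrics.items():
    delta = uncased - cased
    winner = "uncased" if delta > 0 else "cased"
    rows.append((name, round(uncased, 3), round(cased, 3), round(delta, 3), winner))
```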

## Performance Visualization

*(figure: performance metrics comparison)*

## Efficiency Metrics

| Model | Parameters (M) | Train Time (min) | Inference Latency (ms) |
|---|---|---|---|
| uncased | 66.4 | 12.6 | 3.47 |
| cased | 65.2 | 14.6 | 4.01 |
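These efficiency figures are unit conversions of the raw results (parameter count to millions, train time in seconds to minutes, latency in seconds to milliseconds). A small sketch, using the raw values from the results table:

```python
# Convert raw efficiency metrics into the units shown above:
# params -> millions, train time -> minutes, latency -> milliseconds.
raw = {
    "uncased": {"num_params": 66366725, "train_time_s": 756.488262, "latency_s": 0.003466},
    "cased":   {"num_params": 65194757, "train_time_s": 878.169684, "latency_s": 0.004008},
}

table = {
    model: {
        "params_m": round(m["num_params"] / 1e6, 1),
        "train_min": round(m["train_time_s"] / 60, 1),
        "latency_ms": round(m["latency_s"] * 1000, 2),
    }
    for model, m in raw.items()
}
```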

## Overfitting Analysis

*(figure: overfitting comparison)*

✓ Both models show minimal overfitting (gap < 0.02)
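The overfit gap reported in the results table is simply dev F1 minus test (eval) F1; a quick check against the values above:

```python
# Overfit gap = dev F1 - test (eval) F1; a small absolute gap means the
# model generalizes from dev to test. Values from the results table above.
scores = {
    "distilbert-base-uncased": {"dev_f1": 0.509213, "test_f1": 0.508258},
    "distilbert-base-cased":   {"dev_f1": 0.489948, "test_f1": 0.499679},
}

gaps = {m: round(s["dev_f1"] - s["test_f1"], 6) for m, s in scores.items()}
minimal = all(abs(g) < 0.02 for g in gaps.values())
```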

## Error Analysis

```text
Top Confusion Pairs (True Label → Predicted Label)
============================================================

distilbert-base-uncased:
  O               → I-Skill         :  820 errors
  I-Skill         → O               :  793 errors
  B-Skill         → O               :  243 errors
  O               → B-Skill         :  235 errors
  O               → B-Knowledge     :  206 errors

distilbert-base-cased:
  I-Skill         → O               :  891 errors
  O               → I-Skill         :  692 errors
  B-Skill         → O               :  306 errors
  O               → I-Knowledge     :  227 errors
  I-Knowledge     → O               :  195 errors
```

*(figure: confusion pair comparison)*
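Confusion pairs like these can be tallied directly from aligned gold/predicted tag sequences. A minimal sketch using `collections.Counter`; the sequences here are toy examples, not the actual SkillSpan data:

```python
from collections import Counter

def confusion_pairs(true_tags, pred_tags):
    """Count (true, pred) label pairs where the prediction was wrong."""
    counts = Counter()
    for true_seq, pred_seq in zip(true_tags, pred_tags):
        for t, p in zip(true_seq, pred_seq):
            if t != p:
                counts[(t, p)] += 1
    return counts.most_common()

# Toy example (not real SkillSpan predictions):
true = [["O", "B-Skill", "I-Skill", "I-Skill"], ["B-Knowledge", "O"]]
pred = [["O", "B-Skill", "O",       "I-Skill"], ["B-Knowledge", "I-Skill"]]
# confusion_pairs(true, pred)
# → [(('I-Skill', 'O'), 1), (('O', 'I-Skill'), 1)]
```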

## Sample Predictions

```text
Sample Predictions (distilbert-base-uncased)
======================================================================

Tokens: Full Stack Software Engineer - Java / JavaScript
Token                True            Pred            Match
-------------------------------------------------------
Full                 O               O               ✓
Stack                O               O               ✓
Software             O               O               ✓
Engineer             O               O               ✓
-                    O               O               ✓
Java                 O               O               ✓
/                    O               O               ✓
JavaScript           O               B-Knowledge     ✗

Tokens: javascript reactjs java
Token                True            Pred            Match
-------------------------------------------------------
javascript           B-Knowledge     B-Knowledge     ✓
reactjs              B-Knowledge     B-Knowledge     ✓
java                 B-Knowledge     B-Knowledge     ✓
```
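A report like the one above can be produced by zipping tokens with their gold and predicted tags; a minimal sketch (the formatting helper is illustrative, not the project's actual code):

```python
def prediction_report(tokens, true_tags, pred_tags):
    """Render a token-level comparison of gold vs. predicted tags."""
    lines = [f"{'Token':<20} {'True':<15} {'Pred':<15} Match"]
    for tok, t, p in zip(tokens, true_tags, pred_tags):
        mark = "✓" if t == p else "✗"
        lines.append(f"{tok:<20} {t:<15} {p:<15} {mark}")
    return "\n".join(lines)

report = prediction_report(
    ["javascript", "reactjs", "java"],
    ["B-Knowledge", "B-Knowledge", "B-Knowledge"],
    ["B-Knowledge", "B-Knowledge", "B-Knowledge"],
)
print(report)
```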

## Summary

**Key findings**

1. **Performance:** distilbert-base-uncased wins with F1 0.508 vs. 0.500. Uncased has better recall (+2.8 points), while cased has slightly better precision.
2. **Skill vs. Knowledge:** Knowledge entities are easier to detect (F1 ≈ 0.59) than Skills (F1 ≈ 0.40). Knowledge spans are often single distinctive tokens ("python", "javascript", "aws"), while Skills are often multi-word phrases ("problem solving", "attention to detail").
3. **Error patterns:** the dominant error is confusing I-Skill ↔ O, i.e. missed span continuations. The cased model misses more skill tokens (891 vs. 793 I-Skill → O errors).
4. **Efficiency:** the models are similar in size (~66M parameters); uncased is slightly faster both to train and at inference.
5. **Overfitting:** neither model overfits (dev-test gap < 0.02).

**Recommendation:** use distilbert-base-uncased for skill/knowledge extraction.

## Citation

If you use this code or the SkillSpan dataset, please cite:

```bibtex
@inproceedings{zhang-etal-2022-skillspan,
    title = "{S}kill{S}pan: Hard and Soft Skill Extraction from {E}nglish Job Postings",
    author = "Zhang, Mike  and
      Jensen, Kristian N{\o}rgaard  and
      Sonniks, Sif  and
      Plank, Barbara",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    url = "https://aclanthology.org/2022.naacl-main.366",
    pages = "4962--4984",
}
```

## About

Production NLP pipeline that takes job postings and resumes, extracts skills as structured entities, and normalizes them into a canonical taxonomy using embeddings and clustering.
