DistilBERT Uncased vs. Cased for NER on SkillSpan
=================================================

An analysis of distilbert-base-uncased vs. distilbert-base-cased for named entity recognition (NER) on the SkillSpan dataset.
| model | eval_loss | eval_precision | eval_recall | eval_f1 | eval_knowledge_f1 | eval_skill_f1 | eval_skill_precision | eval_skill_recall | eval_knowledge_precision | eval_knowledge_recall | ... | eval_samples_per_second | eval_steps_per_second | epoch | train_time (s) | num_params | avg_inference_latency (s) | dev_f1 | overfit_gap | top_confusion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| distilbert-base-uncased | 0.276139 | 0.474191 | 0.547598 | 0.508258 | 0.598270 | 0.408696 | 0.388751 | 0.430797 | 0.548666 | 0.657736 | ... | 292.413 | 36.623 | 5.0 | 756.488262 | 66366725 | 0.003466 | 0.509213 | 0.000955 | [(('O', 'I-Skill'), 820), (('I-Skill', 'O'), 7... |
| distilbert-base-cased | 0.308579 | 0.481634 | 0.519128 | 0.499679 | 0.588942 | 0.397241 | 0.398524 | 0.395967 | 0.548917 | 0.635264 | ... | 251.477 | 31.496 | 5.0 | 878.169684 | 65194757 | 0.004008 | 0.489948 | -0.009731 | [(('I-Skill', 'O'), 891), (('O', 'I-Skill'), 6... |
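The entity-level precision/recall/F1 above are presumably computed seqeval-style: BIO tag sequences are converted to labeled spans, and only exact span matches count as true positives. A minimal sketch of that computation in pure Python (function names are illustrative, not the project's code):

```python
from typing import List, Set, Tuple

def bio_to_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Convert a BIO tag sequence to a set of (label, start, end) spans."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        # A span boundary occurs at B-, at O, or at I- with a different label.
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.add((label, start, i))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return spans

def entity_f1(true_tags: List[str], pred_tags: List[str]):
    """Micro precision/recall/F1 over exact (label, start, end) matches."""
    gold = bio_to_spans(true_tags)
    pred = bio_to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that under this scheme a span is only credited when both its label and its exact token boundaries match, which is why the span-level F1 sits well below token accuracy.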
| Metric | Uncased | Cased | Δ | Winner |
|---|---|---|---|---|
| F1 | 0.508 | 0.500 | +0.009 | uncased |
| Precision | 0.474 | 0.482 | -0.007 | cased |
| Recall | 0.548 | 0.519 | +0.028 | uncased |
| Skill F1 | 0.409 | 0.397 | +0.011 | uncased |
| Knowledge F1 | 0.598 | 0.589 | +0.009 | uncased |
| Model | Parameters (M) | Train Time (min) | Inference Latency (ms) |
|---|---|---|---|
| uncased | 66.4 | 12.6 | 3.47 |
| cased | 65.2 | 14.6 | 4.01 |
✓ Both models show minimal overfitting (gap < 0.02)
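The latency figures above are presumably mean wall-clock time per forward pass; a minimal timing sketch, assuming that measurement protocol (`model_fn` stands in for the model's forward call and is not from the project code):

```python
import time

def mean_latency(model_fn, batch, n_warmup: int = 3, n_runs: int = 20) -> float:
    """Mean wall-clock seconds per call to model_fn; warm-up runs are excluded."""
    for _ in range(n_warmup):          # warm caches/allocators before timing
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(batch)
    return (time.perf_counter() - start) / n_runs
```

Averaging over many runs after a warm-up phase smooths out first-call overhead such as lazy initialization, which would otherwise dominate a single-shot measurement.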
Top Confusion Pairs (True Label → Predicted Label)
============================================================
distilbert-base-uncased:

| True → Predicted | Errors |
|---|---|
| O → I-Skill | 820 |
| I-Skill → O | 793 |
| B-Skill → O | 243 |
| O → B-Skill | 235 |
| O → B-Knowledge | 206 |

distilbert-base-cased:

| True → Predicted | Errors |
|---|---|
| I-Skill → O | 891 |
| O → I-Skill | 692 |
| B-Skill → O | 306 |
| O → I-Knowledge | 227 |
| I-Knowledge → O | 195 |
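Confusion pairs like these can be tallied from aligned true/predicted tag sequences with a `Counter`; a small sketch (the helper name is illustrative):

```python
from collections import Counter
from typing import List, Tuple

def top_confusions(true_tags: List[str], pred_tags: List[str], k: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Count (true, pred) tag pairs that disagree, most frequent first."""
    errors = Counter(
        (t, p) for t, p in zip(true_tags, pred_tags) if t != p
    )
    return errors.most_common(k)
```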
Sample Predictions (distilbert-base-uncased)
======================================================================
Tokens: Full Stack Software Engineer - Java / JavaScript

| Token | True | Pred | Match |
|---|---|---|---|
| Full | O | O | ✓ |
| Stack | O | O | ✓ |
| Software | O | O | ✓ |
| Engineer | O | O | ✓ |
| - | O | O | ✓ |
| Java | O | O | ✓ |
| / | O | O | ✓ |
| JavaScript | O | B-Knowledge | ✗ |
Tokens: javascript reactjs java

| Token | True | Pred | Match |
|---|---|---|---|
| javascript | B-Knowledge | B-Knowledge | ✓ |
| reactjs | B-Knowledge | B-Knowledge | ✓ |
| java | B-Knowledge | B-Knowledge | ✓ |
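Token-by-token tables like the above can be rendered by zipping tokens with their gold and predicted tags; a small formatting helper (illustrative, not the project's code):

```python
from typing import List

def format_predictions(tokens: List[str], true_tags: List[str], pred_tags: List[str]) -> str:
    """Render one 'Token / True / Pred / Match' row per token."""
    rows = [f"{'Token':<12}{'True':<14}{'Pred':<14}Match"]
    for tok, t, p in zip(tokens, true_tags, pred_tags):
        mark = "✓" if t == p else "✗"   # flag token-level disagreements
        rows.append(f"{tok:<12}{t:<14}{p:<14}{mark}")
    return "\n".join(rows)
```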
KEY FINDINGS
============
1. PERFORMANCE:
• distilbert-base-uncased wins with F1 0.508 vs 0.500
   • Uncased has better recall (+2.8 points); cased has slightly better precision (+0.7 points)
2. SKILL vs KNOWLEDGE:
• Knowledge entities are easier to detect (F1 ~0.59) than Skills (F1 ~0.40)
• Knowledge: often single distinctive tokens ("python", "javascript", "aws")
• Skills: often multi-word phrases ("problem solving", "attention to detail")
3. ERROR PATTERNS:
   • Main error: confusing I-Skill ↔ O (missed or spurious skill-token continuations)
• Cased model misses more skill tokens (891 vs 793 I-Skill→O errors)
4. EFFICIENCY:
• Both models are similar size (~66M params)
• Uncased is slightly faster to train and at inference
5. OVERFITTING:
• Neither model shows overfitting (dev-test gap < 0.02)
RECOMMENDATION: Use distilbert-base-uncased for skill/knowledge extraction.
If you use this code or the SkillSpan dataset, please cite:
@inproceedings{zhang-etal-2022-skillspan,
title = "{S}kill{S}pan: Hard and Soft Skill Extraction from {E}nglish Job Postings",
author = "Zhang, Mike and
Jensen, Kristian N{\o}rgaard and
Sonniks, Sif and
Plank, Barbara",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
url = "https://aclanthology.org/2022.naacl-main.366",
pages = "4962--4984",
}

