DistilBERT Uncased vs. Cased for NER on SkillSpan
=================================================

An analysis of distilbert-base-uncased vs. distilbert-base-cased for named entity recognition (NER) on the SkillSpan dataset.
| model | eval_loss | eval_precision | eval_recall | eval_f1 | eval_knowledge_f1 | eval_skill_f1 | eval_skill_precision | eval_skill_recall | eval_knowledge_precision | eval_knowledge_recall | ... | eval_samples_per_second | eval_steps_per_second | epoch | train_time (s) | num_params | avg_inference_latency (s) | dev_f1 | overfit_gap | top_confusion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| distilbert-base-uncased | 0.276139 | 0.474191 | 0.547598 | 0.508258 | 0.598270 | 0.408696 | 0.388751 | 0.430797 | 0.548666 | 0.657736 | ... | 292.413 | 36.623 | 5.0 | 756.488262 | 66366725 | 0.003466 | 0.509213 | 0.000955 | [(('O', 'I-Skill'), 820), (('I-Skill', 'O'), 7... |
| distilbert-base-cased | 0.308579 | 0.481634 | 0.519128 | 0.499679 | 0.588942 | 0.397241 | 0.398524 | 0.395967 | 0.548917 | 0.635264 | ... | 251.477 | 31.496 | 5.0 | 878.169684 | 65194757 | 0.004008 | 0.489948 | -0.009731 | [(('I-Skill', 'O'), 891), (('O', 'I-Skill'), 6... |
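The entity-level precision/recall/F1 above are presumably computed seqeval-style: BIO tag sequences are converted to labeled spans, and only exact span matches count as true positives. A minimal sketch of that computation in pure Python (function names are illustrative, not the project's code):

```python
from typing import List, Set, Tuple

def bio_to_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Convert a BIO tag sequence to a set of (label, start, end) spans."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        # A span boundary occurs at B-, at O, or at I- with a different label.
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.add((label, start, i))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return spans

def entity_f1(true_tags: List[str], pred_tags: List[str]):
    """Micro precision/recall/F1 over exact (label, start, end) matches."""
    gold = bio_to_spans(true_tags)
    pred = bio_to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that under this scheme a span is only credited when both its label and its exact token boundaries match, which is why the span-level F1 sits well below token accuracy.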
| Metric | Uncased | Cased | Δ | Winner |
|---|---|---|---|---|
| F1 | 0.508 | 0.500 | +0.009 | uncased |
| Precision | 0.474 | 0.482 | -0.007 | cased |
| Recall | 0.548 | 0.519 | +0.028 | uncased |
| Skill F1 | 0.409 | 0.397 | +0.011 | uncased |
| Knowledge F1 | 0.598 | 0.589 | +0.009 | uncased |
| Model | Parameters (M) | Train Time (min) | Inference Latency (ms) |
|---|---|---|---|
| uncased | 66.4 | 12.6 | 3.47 |
| cased | 65.2 | 14.6 | 4.01 |
✓ Both models show minimal overfitting (gap < 0.02)
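The latency figures above are presumably mean wall-clock time per forward pass; a minimal timing sketch, assuming that measurement protocol (`model_fn` stands in for the model's forward call and is not from the project code):

```python
import time

def mean_latency(model_fn, batch, n_warmup: int = 3, n_runs: int = 20) -> float:
    """Mean wall-clock seconds per call to model_fn; warm-up runs are excluded."""
    for _ in range(n_warmup):          # warm caches/allocators before timing
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(batch)
    return (time.perf_counter() - start) / n_runs
```

Averaging over many runs after a warm-up phase smooths out first-call overhead such as lazy initialization, which would otherwise dominate a single-shot measurement.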
Top Confusion Pairs (True Label → Predicted Label)
============================================================
distilbert-base-uncased:

| True → Predicted | Errors |
|---|---|
| O → I-Skill | 820 |
| I-Skill → O | 793 |
| B-Skill → O | 243 |
| O → B-Skill | 235 |
| O → B-Knowledge | 206 |

distilbert-base-cased:

| True → Predicted | Errors |
|---|---|
| I-Skill → O | 891 |
| O → I-Skill | 692 |
| B-Skill → O | 306 |
| O → I-Knowledge | 227 |
| I-Knowledge → O | 195 |
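Confusion pairs like these can be tallied from aligned true/predicted tag sequences with a `Counter`; a small sketch (the helper name is illustrative):

```python
from collections import Counter
from typing import List, Tuple

def top_confusions(true_tags: List[str], pred_tags: List[str], k: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Count (true, pred) tag pairs that disagree, most frequent first."""
    errors = Counter(
        (t, p) for t, p in zip(true_tags, pred_tags) if t != p
    )
    return errors.most_common(k)
```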
Sample Predictions (distilbert-base-uncased)
======================================================================
Tokens: Full Stack Software Engineer - Java / JavaScript

| Token | True | Pred | Match |
|---|---|---|---|
| Full | O | O | ✓ |
| Stack | O | O | ✓ |
| Software | O | O | ✓ |
| Engineer | O | O | ✓ |
| - | O | O | ✓ |
| Java | O | O | ✓ |
| / | O | O | ✓ |
| JavaScript | O | B-Knowledge | ✗ |
Tokens: javascript reactjs java

| Token | True | Pred | Match |
|---|---|---|---|
| javascript | B-Knowledge | B-Knowledge | ✓ |
| reactjs | B-Knowledge | B-Knowledge | ✓ |
| java | B-Knowledge | B-Knowledge | ✓ |
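Token-by-token tables like the above can be rendered by zipping tokens with their gold and predicted tags; a small formatting helper (illustrative, not the project's code):

```python
from typing import List

def format_predictions(tokens: List[str], true_tags: List[str], pred_tags: List[str]) -> str:
    """Render one 'Token / True / Pred / Match' row per token."""
    rows = [f"{'Token':<12}{'True':<14}{'Pred':<14}Match"]
    for tok, t, p in zip(tokens, true_tags, pred_tags):
        mark = "✓" if t == p else "✗"   # flag token-level disagreements
        rows.append(f"{tok:<12}{t:<14}{p:<14}{mark}")
    return "\n".join(rows)
```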
KEY FINDINGS
============
1. PERFORMANCE:
• distilbert-base-uncased wins with F1 0.508 vs 0.500
   • Uncased has better recall (+2.8 points); cased has slightly better precision (+0.7 points)
2. SKILL vs KNOWLEDGE:
• Knowledge entities are easier to detect (F1 ~0.59) than Skills (F1 ~0.40)
• Knowledge: often single distinctive tokens ("python", "javascript", "aws")
• Skills: often multi-word phrases ("problem solving", "attention to detail")
3. ERROR PATTERNS:
   • Main error: confusing I-Skill ↔ O (missed or spurious skill-token continuations)
• Cased model misses more skill tokens (891 vs 793 I-Skill→O errors)
4. EFFICIENCY:
• Both models are similar size (~66M params)
• Uncased is slightly faster to train and at inference
5. OVERFITTING:
• Neither model shows overfitting (dev-test gap < 0.02)
RECOMMENDATION: Use distilbert-base-uncased for skill/knowledge extraction.
If you use this code or the SkillSpan dataset, please cite:
@inproceedings{zhang-etal-2022-skillspan,
title = "{S}kill{S}pan: Hard and Soft Skill Extraction from {E}nglish Job Postings",
author = "Zhang, Mike and
Jensen, Kristian N{\o}rgaard and
Sonniks, Sif and
Plank, Barbara",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
url = "https://aclanthology.org/2022.naacl-main.366",
pages = "4962--4984",
}

