What it is
A controlled head-to-head fine-tuning comparison of ModernBERT-base and RoBERTa-base on binary sentiment classification (IMDb dataset), using identical hyperparameters, training duration, and hardware. The goal: determine whether ModernBERT's architectural improvements (rotary positional embeddings, alternating local/global attention, a longer 8K-token context window, and a better training recipe) translate into measurable gains on a standard short-text classification task.
What I found
After 3 epochs, both models reach 95.31% accuracy; F1 is 95.31% for ModernBERT-base and 95.30% for RoBERTa-base. The architectural differences produce no measurable accuracy gap once both models are fine-tuned on the task. Without fine-tuning, neither model performs above chance on sentiment (49–51% accuracy), confirming that the pretrained representations are not directly usable for this task without task-specific adaptation.

| Model | Pretrained accuracy | Fine-tuned accuracy | Fine-tuned F1 |
|---|---|---|---|
| ModernBERT-base | 49.03% | 95.31% | 95.31% |
| RoBERTa-base | 51.17% | 95.31% | 95.30% |

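
For readers who want to reproduce the two fine-tuned columns from raw predictions, here is a minimal pure-Python sketch. It assumes labels are 0/1 and that F1 is reported for the positive class; the write-up does not say which F1 variant (binary, macro, or weighted) was used, so treat that choice as an assumption. The function name `accuracy_and_f1` is illustrative.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; F1 is for the `positive` class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 defined as 0 when there are no positives at all
    return accuracy, f1


# Toy example: balanced labels, one false negative and one false positive.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

In the toy example the two metrics are equal, because the classes are balanced and the errors are symmetric. The IMDb test split is also balanced, which is one reason accuracy and F1 land so close together in the table.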
Why it's interesting
ModernBERT is marketed on efficiency and long-context tasks. IMDb reviews average ~230 tokens, well within both models' context windows, so the 8K-token capacity confers no structural advantage here. The result is consistent with ModernBERT's gains being task-specific: they are expected in retrieval, long-context reasoning, and inference throughput, not in short-text classification, where RoBERTa has years of established fine-tuning playbooks behind it.
This is a useful negative result. It sets expectations for practitioners deciding whether to swap RoBERTa for ModernBERT: for short-text tasks, any justification for the switch has to come from inference efficiency, not accuracy.
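
The token-length argument above can be checked on any copy of the data. The sketch below is a hypothetical helper, not part of the original experiment: `length_stats` takes any tokenizer-like callable and reports the mean token count plus the share of texts longer than a limit. The default of 512 reflects RoBERTa-base's maximum input length.

```python
from statistics import mean


def length_stats(texts, tokenize, max_len=512):
    """Return (mean token count, fraction of texts longer than max_len)."""
    lengths = [len(tokenize(t)) for t in texts]
    if not lengths:
        raise ValueError("texts must be non-empty")
    over = sum(1 for n in lengths if n > max_len)
    return mean(lengths), over / len(lengths)


# Whitespace splitting stands in for a real subword tokenizer here.
reviews = ["a great film", "dull " * 600]
avg, frac_over = length_stats(reviews, str.split, max_len=512)
```

To measure real subword lengths, pass a callable that returns the `input_ids` of a Hugging Face tokenizer instead of `str.split`. Subword counts are typically higher than whitespace word counts.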
Setup
Both models were fine-tuned in Google Colab with identical settings:
- 3 epochs, batch size 16, AdamW optimizer, learning rate 2e-5
- Linear warmup scheduler
- Hugging Face `Trainer` API with default evaluation strategy
- IMDb dataset from the `datasets` library (25K train / 25K test)
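
The settings above map fairly directly onto the `Trainer` API. What follows is a minimal sketch under stated assumptions, not the exact script behind the reported numbers. The Hub checkpoint IDs, the 512-token truncation length, and the `fine_tune` helper are all assumptions. The warmup length is not given in the setup, so it is left at the library default. AdamW is the `Trainer` default optimizer and is not set explicitly.

```python
# Hyperparameters as listed in the setup; everything else stays at library defaults.
HPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",  # warmup length is not stated in the write-up
}

# Assumed Hub checkpoint IDs; the write-up names the models, not the IDs.
CHECKPOINTS = ["answerdotai/ModernBERT-base", "roberta-base"]


def fine_tune(checkpoint, max_length=512):
    """Fine-tune one checkpoint on IMDb and return the evaluation metrics."""
    # Imported lazily so the config above can be inspected without the heavy deps.
    import numpy as np
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    dataset = load_dataset("imdb")  # 25K train / 25K test

    def tokenize(batch):
        # Same truncation length for both models (an assumption, see lead-in).
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return {"accuracy": float((preds == labels).mean())}

    encoded = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    args = TrainingArguments(output_dir=f"out/{checkpoint.split('/')[-1]}", **HPARAMS)
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["test"],
        data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
        compute_metrics=compute_metrics,
    )
    trainer.train()
    return trainer.evaluate()


# Usage (downloads models and data, needs a GPU to finish in reasonable time):
# for ckpt in CHECKPOINTS:
#     print(ckpt, fine_tune(ckpt))
```

Keeping the shared hyperparameters in one dictionary makes it harder for the two runs to drift apart, which is the point of a controlled comparison.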