ModernBERT vs RoBERTa Fine-Tuning

What it is

A controlled head-to-head fine-tuning comparison of ModernBERT-base and RoBERTa-base on binary sentiment classification (IMDb dataset), using identical hyperparameters, training duration, and hardware. The goal: determine whether ModernBERT's architectural improvements (rotary positional embeddings, alternating local/global attention, a longer 8K-token context window) and its updated training recipe translate into measurable gains on a standard short-text classification task.

What I found

Both models fine-tune to 95.31% accuracy after 3 epochs, with an F1 of 95.31% for ModernBERT and 95.30% for RoBERTa. The architectural delta effectively disappears once both models are fine-tuned on the task. Without fine-tuning, both models score at chance level on sentiment (49–51% accuracy). This is the outcome one would expect if the classification head is freshly initialized, and it confirms that the pretrained checkpoints are not directly usable for this task without task-specific adaptation.

Figure: Accuracy and F1 for ModernBERT and RoBERTa, before and after fine-tuning. Both models converge to identical accuracy after 3 epochs with identical training settings.

Model	Pretrained accuracy	Fine-tuned accuracy	Fine-tuned F1
ModernBERT-base	49.03%	95.31%	95.31%
RoBERTa-base	51.17%	95.31%	95.30%
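
For reference, the two fine-tuned columns can be reproduced from predictions with a few lines of plain Python. This is a minimal sketch, not the evaluation code used for the table. It assumes F1 is the binary F1 on the positive class (label 1); the write-up does not say which averaging was used, and the function name accuracy_and_f1 is hypothetical.

```python
def accuracy_and_f1(preds, labels, positive=1):
    """Return (accuracy, F1 on the positive class) for two equal-length label sequences."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and the same length")

    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)  # true positives
    fp = sum(p == positive and y != positive for p, y in pairs)  # false positives
    fn = sum(p != positive and y == positive for p, y in pairs)  # false negatives
    correct = sum(p == y for p, y in pairs)

    accuracy = correct / len(labels)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy example: 8 predictions, 6 correct, one false positive and one false negative.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_and_f1(preds, labels))  # (0.75, 0.75)
```

On a balanced test set where errors split evenly between the two classes, accuracy and binary F1 land very close together, which fits the near-identical values in the table.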

Why it's interesting

ModernBERT is positioned around efficiency and long-context tasks. IMDb reviews average ~230 tokens, well within both models' context windows, so the 8K-token capacity confers no structural advantage here. The result is consistent with ModernBERT's gains being task-specific: they are reported for retrieval, long-context reasoning, and inference throughput, not for short-text classification, where RoBERTa already benefits from years of established fine-tuning practice.

This is a useful negative result. It sets expectations for practitioners deciding whether to swap RoBERTa for ModernBERT: for short-text tasks, the case for switching rests on inference efficiency, not accuracy.
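
The context-window argument can be checked directly by measuring review lengths under each tokenizer. The sketch below is illustrative and was not part of the original experiment. The helper names (length_summary, imdb_token_counts) are hypothetical, and the Hub IDs in the usage comment are assumptions. The limits to compare against are 512 tokens for RoBERTa-base and 8K (8192) for ModernBERT-base.

```python
def length_summary(token_counts, context_limit):
    """Return (mean length, fraction of examples longer than context_limit)."""
    n = len(token_counts)
    if n == 0:
        raise ValueError("token_counts must not be empty")
    mean = sum(token_counts) / n
    over = sum(c > context_limit for c in token_counts) / n
    return mean, over


def imdb_token_counts(model_id, split="test"):
    """Token count per IMDb review for the given tokenizer (downloads data and tokenizer)."""
    # Imported lazily so length_summary can be used without these packages installed.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    texts = list(load_dataset("imdb", split=split)["text"])
    # No truncation here, so counts reflect the full review plus special tokens.
    return [len(ids) for ids in tokenizer(texts)["input_ids"]]


# Synthetic lengths: one of four reviews exceeds a 512-token limit.
print(length_summary([100, 200, 300, 600], context_limit=512))  # (300.0, 0.25)

# Usage against the real data (assumed Hub IDs, requires network access):
# counts = imdb_token_counts("FacebookAI/roberta-base")
# print(length_summary(counts, context_limit=512))
```

The second value matters as much as the mean: any review longer than the shorter limit is truncated for that model, so the share of such reviews indicates how much the long-context model could in principle see that the other cannot.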

Setup

Both models were fine-tuned in Google Colab with identical settings:

3 epochs, batch size 16, AdamW optimizer, learning rate 2e-5
Linear learning-rate scheduler with warmup
Hugging Face Trainer API with default evaluation strategy
IMDb dataset from the Hugging Face datasets library (25K train / 25K test)
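
The settings above map onto the Trainer API roughly as follows. This is a sketch under stated assumptions, not the original notebook: the Hub model IDs, the max_length=512 truncation, and the helper names (optimizer_steps, build_trainer) are assumptions, and the warmup length is left at the library default because the write-up does not state one. AdamW is the Trainer's default optimizer, so it needs no explicit argument.

```python
import math

# Values from the Setup list; the model IDs are assumed Hugging Face Hub names.
MODEL_IDS = ["answerdotai/ModernBERT-base", "FacebookAI/roberta-base"]
EPOCHS, BATCH_SIZE, LEARNING_RATE = 3, 16, 2e-5
N_TRAIN = 25_000


def optimizer_steps(n_examples, batch_size, epochs):
    """Optimizer updates on a single device with no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs


def build_trainer(model_id, output_dir):
    """Build a Trainer for one model with the shared settings (downloads model and data)."""
    # Imported lazily so optimizer_steps runs without these packages installed.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The 2-class head is newly initialized on top of the pretrained encoder.
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

    def tokenize(batch):
        # max_length=512 is an assumption; the write-up gives no truncation length.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    data = load_dataset("imdb").map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=LEARNING_RATE,
        lr_scheduler_type="linear",  # warmup length not stated in the write-up
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=data["train"],
        eval_dataset=data["test"],
        data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
    )


print(optimizer_steps(N_TRAIN, BATCH_SIZE, EPOCHS))  # 4689

# Usage (requires a GPU runtime and network access):
# for model_id in MODEL_IDS:
#     trainer = build_trainer(model_id, output_dir=f"out/{model_id.split('/')[-1]}")
#     trainer.train()
#     print(model_id, trainer.evaluate())
```

Running both models through the same build_trainer function keeps the comparison controlled, since only the checkpoint name changes between runs. ModernBERT needs a recent transformers release, so the Colab environment may need an upgrade before the first model loads.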
