Project

ModernBERT vs RoBERTa Fine-Tuning

Head-to-head fine-tuning comparison of ModernBERT-base and RoBERTa-base on IMDb sentiment classification. Both converge to 95.31% accuracy — revealing where architectural improvements do and don't transfer.

What it is

A controlled head-to-head fine-tuning comparison of ModernBERT-base and RoBERTa-base on binary sentiment classification (IMDb dataset), using identical hyperparameters, training duration, and hardware. The goal: determine whether ModernBERT's architectural improvements — rotary positional embeddings, alternating local/global attention, longer context window (8K tokens), and better training recipe — translate into measurable gains on a standard short-text classification task.

What I found

Both models reach 95.31% accuracy after 3 epochs of fine-tuning, with F1 of 95.31% for ModernBERT and 95.30% for RoBERTa. The architectural differences produce no measurable gap once both models are fine-tuned on the task. Without fine-tuning, neither model performs meaningfully above chance on sentiment (49–51% accuracy), confirming that the pretrained representations are not directly usable for this task without task-specific adaptation.

Accuracy and F1 comparison: ModernBERT vs RoBERTa pretrained and fine-tuned
Pre- vs post-fine-tuning performance. Both models converge to identical accuracy after 3 epochs with identical training settings.
| Model | Pretrained accuracy | Fine-tuned accuracy | Fine-tuned F1 |
| --- | --- | --- | --- |
| ModernBERT-base | 49.03% | 95.31% | 95.31% |
| RoBERTa-base | 51.17% | 95.31% | 95.30% |
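
For reference, the two reported metrics could be computed from model predictions roughly as in the sketch below. It assumes binary labels (0 = negative, 1 = positive) and F1 measured on the positive class; the write-up does not say which F1 averaging was used, and the function names and toy labels are illustrative only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 on the positive class (assumed averaging; not stated in the source)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Toy, made-up labels: balanced classes, one false negative and one false positive.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
f1 = binary_f1(y_true, y_pred)
```

On a class-balanced evaluation set with errors split about evenly between the two classes, accuracy and F1 land very close together, which is consistent with the near-identical fine-tuned columns in the table.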

Why it's interesting

ModernBERT is marketed on efficiency and long-context tasks. IMDb reviews average ~230 tokens, well within both models' context windows, so the 8K-token capacity confers no structural advantage here. The result is consistent with ModernBERT's gains being real but task-specific: they are reported for retrieval, long-context reasoning, and inference throughput, not for short-text classification, where RoBERTa has years of well-tuned fine-tuning playbooks behind it.
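
One way to check the context-length argument is to measure what share of reviews fits inside each model's window. The sketch below assumes per-review token counts have already been produced by each model's tokenizer; the sample lengths are made up for illustration, and the limits are the published maximum sequence lengths for RoBERTa-base (512) and ModernBERT-base (8192).

```python
from typing import Sequence

ROBERTA_MAX_TOKENS = 512      # RoBERTa-base maximum sequence length
MODERNBERT_MAX_TOKENS = 8192  # ModernBERT-base maximum sequence length


def fraction_within(token_lengths: Sequence[int], max_tokens: int) -> float:
    """Share of documents whose token count fits in the window without truncation."""
    if len(token_lengths) == 0:
        raise ValueError("token_lengths must not be empty")
    fitting = sum(n <= max_tokens for n in token_lengths)
    return fitting / len(token_lengths)


# Hypothetical token counts, NOT measured on IMDb.
sample_lengths = [120, 230, 310, 480, 900]

roberta_share = fraction_within(sample_lengths, ROBERTA_MAX_TOKENS)
modernbert_share = fraction_within(sample_lengths, MODERNBERT_MAX_TOKENS)
```

An average near 230 tokens does not by itself show that every review fits in 512 tokens, so running a check like this on the real token counts would show how much truncation, if any, RoBERTa's inputs undergo.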

This is a useful negative result. It sets expectations for practitioners deciding whether to swap RoBERTa for ModernBERT: on short-text tasks, the case for switching rests on inference efficiency, not accuracy.

Setup

Both models were fine-tuned in Google Colab with identical settings: the same hyperparameters, the same hardware, and the same training duration (3 epochs).
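
A minimal sketch of how a shared configuration could be pinned down so that the model checkpoint is the only thing that varies between runs. Only the 3-epoch duration comes from this write-up; the learning rate, batch size, sequence length, and seed are placeholder values, and the Hugging Face checkpoint IDs are assumed, since the original settings are not listed here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    num_epochs: int = 3          # from the write-up
    learning_rate: float = 2e-5  # placeholder, not from the source
    batch_size: int = 16         # placeholder, not from the source
    max_seq_length: int = 512    # placeholder, not from the source
    seed: int = 42               # placeholder, not from the source


# Assumed checkpoint IDs; the write-up names the models but not the exact checkpoints.
MODEL_IDS = ("answerdotai/ModernBERT-base", "FacebookAI/roberta-base")


def build_runs(model_ids, config):
    """Return one run description per model, all sharing the same config."""
    return {
        model_id: {"model_id": model_id, **asdict(config)}
        for model_id in model_ids
    }


def differing_keys(run_a, run_b):
    """List the settings that differ between two runs."""
    return sorted(k for k in run_a if run_a[k] != run_b.get(k))


runs = build_runs(MODEL_IDS, FineTuneConfig())
diff = differing_keys(runs[MODEL_IDS[0]], runs[MODEL_IDS[1]])
```

Checking that `diff` contains only `model_id` before training is a cheap guard against accidentally comparing runs with mismatched settings.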
