# Tamil Nadu AI Public Benchmark (தமிழ்நாடு AI)
A public, reproducible benchmark scoring AI models on real Tamil Nadu government-service tasks.
## What this benchmarks

### Pre-W4 harness validation

#### Single-sample harness validation (Tamil → English)
A single sentence, the introduction of the AVGC-XR Policy 2026, translated by five Cloudflare Workers AI models. Single-sample only, so not statistically meaningful, but illustrative of the methodology. A per-sample scoring sketch follows the table.
| Model | BLEU ↑ | chrF ↑ | CER ↓ | WER ↓ | Latency (ms) |
|---|---|---|---|---|---|
| cf-gemma-3-12b 🥇 | 73.83 | 80.23 | 0.147 | 0.174 | 1015 |
| cf-mistral-small-24b 🥈 | 52.74 | 78.52 | 0.203 | 0.304 | 1733 |
| cf-llama-3.3-70b 🥉 | 36.61 | 63.74 | 0.357 | 0.609 | 1362 |
| cf-llama-3.2-3b | 14.63 | 52.57 | 0.860 | 1.174 | 615 |
| cf-mistral-7b | 8.97 | 35.45 | 1.035 | 1.304 | 3578 |
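The full harness lives in this repo; the block below is only a minimal sketch of how one sample can be scored, assuming the `sacrebleu` and `jiwer` packages and a placeholder `translate()` callable standing in for the Workers AI request (neither that name nor its shape is part of the published harness).

```python
import time

import jiwer
from sacrebleu.metrics import BLEU, CHRF


def score_sample(translate, tamil_src: str, english_ref: str) -> dict:
    """Score one Tamil -> English translation against its gold reference."""
    t0 = time.perf_counter()
    hypothesis = translate(tamil_src)  # placeholder for the Workers AI call
    latency_ms = (time.perf_counter() - t0) * 1000

    return {
        # effective_order keeps short single sentences from scoring zero BLEU
        "bleu": BLEU(effective_order=True).sentence_score(hypothesis, [english_ref]).score,
        "chrf": CHRF().sentence_score(hypothesis, [english_ref]).score,
        "cer": jiwer.cer(english_ref, hypothesis),  # character error rate, lower is better
        "wer": jiwer.wer(english_ref, hypothesis),  # word error rate, lower is better
        "latency_ms": round(latency_ms),
    }
```

Note that CER and WER can exceed 1.0 when a model emits far more text than the reference contains, which is exactly what the cf-mistral-7b row above shows.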
Notable finding: cf-mistral-7b hallucinated, rendering the Tamil sentence about TN's economy as "major political figures of India... possess net worth USD 1 trillion." That is a genuine Tamil-comprehension failure, documented as limitation #8 in methodology §10.
#### Multi-task harness validation (3 samples per task)
Mean primary metric (chrF) ± standard deviation across 3 samples per task. These are v0.5 preview numbers, not the published v1.0 results. A sketch of the aggregation follows the table.
| Model | 01-go-summarisation | 02-scheme-qa | 04-legal-translate |
|---|---|---|---|
| cf-llama-3.3-70b | 38.6 ±2.2 | 36.4 ±0.9 | 71.2 ±11.4 |
| cf-llama-3.2-3b | 63.3 ±8.6 | 37.5 ±4.8 | 70.1 ±13.8 |
| cf-gemma-3-12b | 53.8 ±7.1 | 41.0 ±4.2 | 93.4 ±9.3 |
| cf-mistral-7b | 57.7 ±6.8 | 39.7 ±4.8 | 29.8 ±4.1 |
| cf-mistral-small-24b | 64.3 ±5.3 | 44.4 ±7.2 | 83.1 ±12.3 |
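Each cell collapses per-sample chrF scores into a `mean ±stdev` string. A minimal sketch using Python's `statistics` module, assuming the sample (n-1) standard deviation; the per-sample scores below are made up for illustration:

```python
from statistics import mean, stdev


def cell(chrf_scores: list[float]) -> str:
    """Collapse per-sample chrF scores into the 'mean ±stdev' cell format."""
    return f"{mean(chrf_scores):.1f} ±{stdev(chrf_scores):.1f}"


# Three hypothetical per-sample chrF scores for one (model, task) pair:
print(cell([50.2, 61.0, 50.2]))  # -> 53.8 ±6.2
```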
## Why this benchmark exists
Tamil Nadu has ~80M people, ~85% of them on Android, daily life conducted largely in Tamil, and a state government rolling out digital citizen services at scale. General-purpose Indic benchmarks (AI4Bharat IndicGLUE, IndicNLG) measure Tamil NLP broadly. This benchmark answers a different question: which AI model should the TN government actually use for which workflow, at what real cost, with what failure modes, and can it run inside India?
## What we publish
- The full methodology (reproducibility contract, scoring axes, reviewer protocol)
- Open-source code (Apache-2.0): eval harness, ingest Worker, scoring scripts
- Open dataset (CC-BY-4.0): all curated gold sets per task
- The leaderboard CSVs (versioned, reproducible from the harness)
- Model-version pins, decode parameters, and dataset hashes, so every result is re-runnable (see the sketch below)
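As one illustration of that reproducibility contract, here is a sketch of a run pin. The field names, decode values, and gold-set path are hypothetical; the authoritative schema is whatever the published harness emits.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: str) -> str:
    """Hash a gold-set file so a published score is tied to exact data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Hypothetical run pin; field names, decode values, and the gold-set path
# are illustrative, not the harness's actual schema.
run_pin = {
    "model": "cf-gemma-3-12b",
    "decode": {"temperature": 0.0, "max_tokens": 512},
    "dataset_sha256": sha256_of("gold/04-legal-translate.jsonl"),
    "harness_version": "v0.5",
}
print(json.dumps(run_pin, indent=2))
```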
## What we don't claim
- That any one model is "best": the leaderboard speaks for itself
- An endorsement of any vendor or product
- A procurement recommendation: the TN government may use this as one input, but it is not the only input