# Tamil Nadu AI Public Benchmark (தமிழ்நாடு AI)
A public, reproducible benchmark scoring AI models on real Tamil Nadu government-service tasks.
## What this benchmarks

### Pre-W4 harness validation

#### Single-sample harness validation (Tamil → English)
A single sentence, the introduction of the AVGC-XR Policy 2026, translated by five Cloudflare Workers AI models. Single-sample only, so not statistically meaningful, but illustrative of the methodology. A per-sample scoring sketch follows the table.
| Model | BLEU ↑ | chrF ↑ | CER ↓ | WER ↓ | Latency (ms) |
|---|---|---|---|---|---|
| cf-gemma-3-12b 🥇 | 73.83 | 80.23 | 0.147 | 0.174 | 1015 |
| cf-mistral-small-24b 🥈 | 52.74 | 78.52 | 0.203 | 0.304 | 1733 |
| cf-llama-3.3-70b 🥉 | 36.61 | 63.74 | 0.357 | 0.609 | 1362 |
| cf-llama-3.2-3b | 14.63 | 52.57 | 0.860 | 1.174 | 615 |
| cf-mistral-7b | 8.97 | 35.45 | 1.035 | 1.304 | 3578 |
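The full harness lives in this repo; the block below is only a minimal sketch of how one sample can be scored, assuming the `sacrebleu` and `jiwer` packages and a placeholder `translate()` callable standing in for the Workers AI request (neither that name nor its shape is part of the published harness).

```python
import time

import jiwer
from sacrebleu.metrics import BLEU, CHRF


def score_sample(translate, tamil_src: str, english_ref: str) -> dict:
    """Score one Tamil -> English translation against its gold reference."""
    t0 = time.perf_counter()
    hypothesis = translate(tamil_src)  # placeholder for the Workers AI call
    latency_ms = (time.perf_counter() - t0) * 1000

    return {
        # effective_order keeps short single sentences from scoring zero BLEU
        "bleu": BLEU(effective_order=True).sentence_score(hypothesis, [english_ref]).score,
        "chrf": CHRF().sentence_score(hypothesis, [english_ref]).score,
        "cer": jiwer.cer(english_ref, hypothesis),  # character error rate, lower is better
        "wer": jiwer.wer(english_ref, hypothesis),  # word error rate, lower is better
        "latency_ms": round(latency_ms),
    }
```

Note that CER and WER can exceed 1.0 when a model emits far more text than the reference contains, which is exactly what the cf-mistral-7b row above shows.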
Notable finding: cf-mistral-7b hallucinated, rendering the Tamil sentence about TN's economy as "major political figures of India... possess net worth USD 1 trillion." That is a genuine Tamil-comprehension failure, documented as limitation #8 in methodology §10.
#### Multi-task harness validation (3 samples per task)
Mean primary metric (chrF) ± standard deviation across 3 samples per task. These are v0.5 preview numbers, not the published v1.0 results. A sketch of the aggregation follows the table.
| Model | 01-go-summarisation | 02-scheme-qa | 04-legal-translate |
|---|---|---|---|
| cf-llama-3.3-70b | 38.6 ±2.2 | 36.4 ±0.9 | 71.2 ±11.4 |
| cf-llama-3.2-3b | 63.3 ±8.6 | 37.5 ±4.8 | 70.1 ±13.8 |
| cf-gemma-3-12b | 53.8 ±7.1 | 41.0 ±4.2 | 93.4 ±9.3 |
| cf-mistral-7b | 57.7 ±6.8 | 39.7 ±4.8 | 29.8 ±4.1 |
| cf-mistral-small-24b | 64.3 ±5.3 | 44.4 ±7.2 | 83.1 ±12.3 |
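Each cell collapses per-sample chrF scores into a `mean ±stdev` string. A minimal sketch using Python's `statistics` module, assuming the sample (n-1) standard deviation; the per-sample scores below are made up for illustration:

```python
from statistics import mean, stdev


def cell(chrf_scores: list[float]) -> str:
    """Collapse per-sample chrF scores into the 'mean ±stdev' cell format."""
    return f"{mean(chrf_scores):.1f} ±{stdev(chrf_scores):.1f}"


# Three hypothetical per-sample chrF scores for one (model, task) pair:
print(cell([50.2, 61.0, 50.2]))  # -> 53.8 ±6.2
```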
## Why this benchmark exists
Tamil Nadu has ~80M people, ~85% of them on Android, daily life conducted largely in Tamil, and a state government rolling out digital citizen services at scale. General-purpose Indic benchmarks (AI4Bharat IndicGLUE, IndicNLG) measure Tamil NLP broadly. This benchmark answers a different question: which AI model should the TN government actually use for which workflow, at what real cost, with what failure modes, and can it run inside India?
## What we publish
- The full methodology (reproducibility contract, scoring axes, reviewer protocol)
- Open-source code (Apache-2.0): eval harness, ingest Worker, scoring scripts
- Open dataset (CC-BY-4.0): all curated gold sets per task
- The leaderboard CSVs (versioned, reproducible from the harness)
- Model-version pins, decode parameters, and dataset hashes, so every result is re-runnable (see the sketch below)
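As one illustration of that reproducibility contract, here is a sketch of a run pin. The field names, decode values, and gold-set path are hypothetical; the authoritative schema is whatever the published harness emits.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: str) -> str:
    """Hash a gold-set file so a published score is tied to exact data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Hypothetical run pin; field names, decode values, and the gold-set path
# are illustrative, not the harness's actual schema.
run_pin = {
    "model": "cf-gemma-3-12b",
    "decode": {"temperature": 0.0, "max_tokens": 512},
    "dataset_sha256": sha256_of("gold/04-legal-translate.jsonl"),
    "harness_version": "v0.5",
}
print(json.dumps(run_pin, indent=2))
```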
## What we don't claim
- That any one model is "best": the leaderboard speaks for itself
- An endorsement of any vendor or product
- A procurement recommendation: the TN government may use this as one input, but it is not the only input