Tamil Nadu AI Public Benchmark தமிழ்நாடு AI

A public, reproducible, government-services-focused benchmark scoring AI models on real Tamil Nadu government tasks.

Build status: v1.0 in progress. Target launch: 2026-07-16. Pre-W4 harness validation runs are previewed below. The final v1.0 leaderboard will be published at launch.

What this benchmarks

6 task suites: GO summary, scheme Q&A, grievance classification, legal translate, OCR, voice intent
8 models in v1.0: 3 Claude variants + 5 Cloudflare Workers AI models. v1.1 expands to 18+.
7 scoring axes: accuracy, Tamil fluency, cost, latency, on-device, residency, script edge cases
12.3M characters of TN govt corpus: Madras Legislative Council 1953-1968 + 4 TN Acts + AVGC-XR Policy 2026

Pre-W4 harness validation

Single-sample harness validation (Tamil → English)

The source text is the introductory sentence of the AVGC-XR Policy 2026, translated by five Cloudflare Workers AI models. This is a single sample, so the results are not statistically meaningful; they only illustrate the methodology.

Model | BLEU | chrF | CER | WER | Latency
cf-gemma-3-12b 🥇 | 73.83 | 80.23 | 0.147 | 0.174 | 1015 ms
cf-mistral-small-24b 🥈 | 52.74 | 78.52 | 0.203 | 0.304 | 1733 ms
cf-llama-3.3-70b 🥉 | 36.61 | 63.74 | 0.357 | 0.609 | 1362 ms
cf-llama-3.2-3b | 14.63 | 52.57 | 0.86 | 1.174 | 615 ms
cf-mistral-7b | 8.97 | 35.45 | 1.035 | 1.304 | 3578 ms

Notable finding: cf-mistral-7b hallucinated. It translated the Tamil sentence about TN's economy as "major political figures of India... possess net worth USD 1 trillion." This is a genuine Tamil-comprehension failure, documented in methodology §10, limitation #8.
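The page does not say how CER and WER are computed. As a minimal sketch, assuming the common definition (Levenshtein edit distance divided by the reference length, over characters for CER and whitespace-separated words for WER), the calculation could look like the following. The function names and the English toy strings are hypothetical and are not taken from the benchmark harness.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference unit
                curr[j - 1] + 1,     # insert a hypothesis unit
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical toy strings, not benchmark data.
reference = "the cat sat"
print(wer(reference, "the cat sat"))                   # identical output
print(wer(reference, "a dog ran over there quickly"))  # longer, unrelated output
```

Under this definition the error rate is not capped at 1: an output that shares nothing with the reference and is also longer than it needs more edits than the reference has units. That is one way a model can score above 1.0 on CER or WER, as the two lowest-ranked models do in the table above.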

Multi-task harness validation (3 samples per task)

Mean primary metric (chrF) ± stdev across 3 samples per task. This is a v0.5 preview, not the published v1.0 result.

Model | 01-go-summarisation | 02-scheme-qa | 04-legal-translate
cf-llama-3.3-70b | 38.6 ±2.2 | 36.4 ±0.9 | 71.2 ±11.4
cf-llama-3.2-3b | 63.3 ±8.6 | 37.5 ±4.8 | 70.1 ±13.8
cf-gemma-3-12b | 53.8 ±7.1 | 41.0 ±4.2 | 93.4 ±9.3
cf-mistral-7b | 57.7 ±6.8 | 39.7 ±4.8 | 29.8 ±4.1
cf-mistral-small-24b | 64.3 ±5.3 | 44.4 ±7.2 | 83.1 ±12.3
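A minimal sketch of how a "mean ± stdev" cell can be derived from per-sample scores. The page does not state whether the harness reports the sample or the population standard deviation, so the `sample` flag below is an assumption, as are the function name and the scores, which are made up for illustration.

```python
from statistics import mean, pstdev, stdev


def summarise(scores, sample=True):
    """Return (mean, stdev) rounded to one decimal place.

    sample=True uses the n-1 (sample) estimator; sample=False uses the
    population estimator. Which one the benchmark uses is not stated.
    """
    spread = stdev(scores) if sample else pstdev(scores)
    return round(mean(scores), 1), round(spread, 1)


# Hypothetical chrF scores for one model on one task (not benchmark data).
scores = [60.0, 65.0, 70.0]

m, s = summarise(scores)
print(f"{m} ±{s}")  # 65.0 ±5.0

m, s = summarise(scores, sample=False)
print(f"{m} ±{s}")  # 65.0 ±4.1
```

With only three samples the two estimators differ noticeably, and a single outlier moves both the mean and the spread, so the ± values in the table are best read as rough indicators of variability.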

Why this benchmark exists

Tamil Nadu has roughly 80 million people, an Android share of about 85%, daily life conducted largely in Tamil, and a state government rolling out digital citizen services at scale. General-purpose Indic benchmarks (AI4Bharat IndicGLUE, IndicNLG) measure Tamil NLP broadly. This benchmark answers a different question: which AI model should the TN government actually use for which workflow, at what real cost, with what failure modes, and can it run inside India?

What we publish

What we don't claim
