PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
AIPR assessment
Problem difficulty is high. Benchmarking and speech-LLM evaluation is a crowded, fast-moving area in which many groups are optimizing similar metrics. The paper’s strengths reinforce each other: large language coverage, open artifact release, and extensive cross-model tables make the benchmark easy to adopt and compare against. The weaknesses also compound: the synthetic dialect construction is a central enabler, but its validation is still relatively narrow, so the strongest results depend
Abstract
While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess "native-level" speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.