Abstract
Fun-ASR v1.5 is a 30B-parameter MoE-based end-to-end speech recognition large model from Alibaba Cloud's Tongyi Lab, trained on tens of millions of hours of real speech data. It delivers systematic advances in language coverage, dialect recognition depth, and text output quality: a single model supports high-precision recognition of 30 languages, fully covers the seven major Chinese dialect groups along with more than 20 regional accents, introduces dedicated optimization for classical Chinese poetry recitation for the first time, and significantly improves punctuation prediction and inverse text normalization (ITN). v1.5 focuses on three core objectives: hear more fully, recognize more accurately, and write more standardized text. This further narrows the performance gap of speech recognition in complex real-world scenarios, providing a smarter and more reliable speech transcription engine for global communication, regional services, and cultural digitization.
Open-Source Multilingual Benchmark
Industry Dialect Benchmark
ASR Capability Demonstrations
The examples below show Fun-ASR's recognition output in different scenarios; each line is the transcript of a demo audio clip.
Chinese Dialect Recognition
Fun-ASR v1.5 fully supports the seven major Chinese dialect groups: Mandarin, Wu, Cantonese, Min, Hakka, Gan, and Xiang. Its accented-Mandarin coverage spans more than 20 regions, including Zhongyuan, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan accents, and high-demand local dialects receive dedicated optimization. Trained on over 500,000 hours of real dialect speech data, v1.5 reduces the average character error rate (CER) by 56.2% relative to the previous version.
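For a concrete reading of that number: a relative reduction is measured against the previous version's error rate, not in absolute points. The sketch below uses a hypothetical baseline CER (the per-dialect figures are not published here) purely to illustrate the arithmetic.

```python
# Relative CER reduction: (old - new) / old.
# old_cer is a hypothetical baseline chosen for illustration only;
# it is not a published Fun-ASR figure.
def relative_reduction(old_cer: float, new_cer: float) -> float:
    return (old_cer - new_cer) / old_cer

old_cer = 0.160                      # assumed 16.0% CER for the previous version
new_cer = old_cer * (1 - 0.562)      # a 56.2% relative reduction
print(f"new CER: {new_cer:.1%}")     # -> 7.0%
print(f"relative reduction: {relative_reduction(old_cer, new_cer):.1%}")  # -> 56.2%
```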
移动呢价钿比较实惠但是网速现在还可以反正也勿是老卡个
可以买点辅导书来,自己假如说会的话也可以教一下小孩子,嗯,现在网络很发达可以。
喝姜茶呢可能有效果,但是如果发展成肺炎了,那你还是要用抗生素的噢。
本来画得挺投入的,结果楼上传来一阵电钻声,把我灵感全吓跑了,还是找邻居商量下吧。
平常辰光匣好教教嗯笃捺亨操作手机,因为倷跟得上时代,时代葛进步,倷再会方便。
但是一个人若是两三两百箍一百外箍安无算贵吧,吼自助餐啊,啊你也有肉咯也有菜咯也有水果咯也有甜点咯,啥物计有咯。
Multilingual Recognition
Fun-ASR v1.5 supports high-precision recognition of 30 mainstream languages, covering East and Southeast Asia (Chinese, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino), South Asia and the Middle East (Hindi, Arabic), and the major European languages (English, French, German, Spanish, Portuguese, Russian, etc.). It is especially strong on mixed-language conversation and free code-switching, with no need to preset a language tag.
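As context for the tag-free behavior, a minimal transcription call might look like the sketch below. It assumes the open-source FunASR toolkit's AutoModel interface; the model identifier is a placeholder, not a confirmed release name.

```python
from funasr import AutoModel

# "fun-asr-v1.5" is a placeholder model identifier, not a confirmed name.
model = AutoModel(model="fun-asr-v1.5")

# No language tag is passed: the model is expected to detect the language(s)
# on its own, including mid-utterance code-switching.
result = model.generate(input="code_switching_example.wav")
print(result[0]["text"])
```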
私に電話してください。番号は139-1234-5678です。
저는 이 주제에 따라 한 말씀 드리자면, 사실 저희도 이전에 비슷한 상황을 겪은 적이 있습니다.
Kejayaan projek ini tidak dapat dipisahkan daripada usaha pasukan, terutamanya kerja keras siang malam oleh jabatan penyelidikan dan pembangunan.
ความหลากหลายทางวัฒนธรรมเป็นทรัพย์สมบัติล้ำค่าของสังคมมนุษย์ และเราควรเคารพและปกป้องประเพณีวัฒนธรรมทั้งหมด
La diversidad cultural es un tesoro invaluable para la sociedad humana, y debemos respetar y proteger todas las tradiciones culturales.
Mesin pencari perusahaan Google memegang posisi dominan di pasar global.
We've all had that experience of finally visiting a place we've dreamed about for years, only to find that it doesn't quite live up to our expectations. There's even a term for this in one of the most visited cities in the world, Paris Syndrome.何年も前から行きたかった場所をやっと訪れてみたら、思っていたほどではなかったという経験は誰しもあることだと思います。
Classical Chinese Poetry Recognition
Fun-ASR v1.5 introduces dedicated optimization for recitation of classical Chinese poetry. Unlike modern speech, this material poses distinct challenges: terse classical grammar with omitted subjects, predicates, and objects; strict rhyme schemes and fixed metrical patterns (e.g., five- and seven-character regulated verse); frequent allusions, variant characters, and words whose meanings have shifted across eras; and non-natural delivery such as elongated tones, deliberate pauses, and chanting rhythms. On an internal evaluation set, v1.5 reaches 97% character-level accuracy on classical poetry; a sketch of that metric, followed by sample transcripts, appears below.
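Character-level accuracy can be read as one minus the normalized edit distance between hypothesis and reference. A minimal sketch of that standard metric (the internal evaluation set itself is not reproduced here):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ref: str, hyp: str) -> float:
    # Real scoring pipelines typically strip punctuation first; omitted here.
    return 1.0 - edit_distance(ref, hyp) / len(ref)

ref = "蓬山此去无多路,青鸟殷勤为探看。"
hyp = "蓬山此去无多路,青鸟殷勤为探看。"  # a perfect hypothesis scores 100%
print(f"{char_accuracy(ref, hyp):.1%}")
```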
蓬山此去无多路,青鸟殷勤为探看。
子夏曰,博学而笃志,切问而近思,仁在其中矣。
Smarter Punctuation & Enhanced ITN
1. Smarter Punctuation Prediction
The model automatically inserts commas, periods, question marks, exclamation marks, and other punctuation based on contextual semantics, bringing transcripts closer to written style. For example:
Input (speech): "今天天气怎么样啊我想出去走走但又怕下雨"
Output (text): "今天天气怎么样啊?我想出去走走,但又怕下雨。"
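Punctuation restoration can also be run on bare text. The sketch below uses the open-source FunASR toolkit's CT-Transformer punctuation model (ct-punc) as a stand-in; whether Fun-ASR v1.5's own punctuation module is exposed this way is an assumption.

```python
from funasr import AutoModel

# FunASR's CT-Transformer punctuation model; used here as a stand-in,
# not necessarily the module inside Fun-ASR v1.5 itself.
punc = AutoModel(model="ct-punc")
res = punc.generate(input="今天天气怎么样啊我想出去走走但又怕下雨")
print(res[0]["text"])  # expected along the lines of the output shown above
```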
2. Enhanced Inverse Text Normalization (ITN)
Colloquial, non-standard expressions are automatically converted into standard written formats (a toy rule-based sketch follows the examples):
Numbers: "三千五百六十二" → "3562"
Dates: "二零二六年三月二十九号" → "2026年3月29日"
Amounts: "五万八千块" → "58000元"
Phone: "幺三八零零幺三八零零零" → "13800138000"
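As a concrete illustration of the rule-based side of ITN, here is a minimal toy converter for two of the cases above: digit-string readings (phone numbers, where 幺 is read for "one") and positional numerals. It is a sketch only; a production ITN module would also need date, currency, and context-disambiguation logic.

```python
# Toy ITN rules for two of the cases above; not Fun-ASR's internal ITN module.
DIGITS = {"零": 0, "幺": 1, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000, "万": 10000}

def digit_string(text: str) -> str:
    """Read each character as one digit: phone numbers, IDs, etc."""
    return "".join(str(DIGITS[ch]) for ch in text)

def positional_number(text: str) -> int:
    """Convert positional numerals such as 三千五百六十二 to 3562."""
    total, current = 0, 0
    for ch in text:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            unit = UNITS[ch]
            if unit == 10000:                   # 万 scales everything so far
                total = (total + current) * unit
            else:
                total += (current or 1) * unit  # bare 十 means 10
            current = 0
    return total + current

print(digit_string("幺三八零零幺三八零零零"))   # -> 13800138000
print(positional_number("三千五百六十二"))      # -> 3562
print(positional_number("五万八千"))            # -> 58000
```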
These improvements substantially reduce post-editing costs, making the model especially suitable for meeting-minutes generation, news-interview transcription, legal records, and other scenarios with strict requirements on text standardization.
BibTeX
@misc{an2025funasrtechnicalreport,
      title={Fun-ASR Technical Report},
      author={Keyu An and Yanni Chen and Zhigao Chen and Chong Deng and Zhihao Du and Changfeng Gao and Zhifu Gao and Bo Gong and Xiangang Li and Yabin Li and Ying Liu and Xiang Lv and Yunjie Ji and Yiheng Jiang and Bin Ma and Haoneng Luo and Chongjia Ni and Zexu Pan and Yiping Peng and Zhendong Peng and Peiyao Wang and Hao Wang and Haoxu Wang and Wen Wang and Wupeng Wang and Yuzhong Wu and Biao Tian and Zhentao Tan and Nan Yang and Bin Yuan and Jieping Ye and Jixing Yu and Qinglin Zhang and Kun Zou and Han Zhao and Shengkui Zhao and Jingren Zhou and Yanqiao Zhu},
      year={2025},
      eprint={2509.12508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.12508},
}