Abstract 摘要
Fun-ASR is an end-to-end large speech recognition model released by Tongyi Laboratory and trained on tens of millions of hours of real speech data. It offers strong contextual understanding and industry adaptability, supports low-latency real-time transcription, and covers 31 languages. It performs especially well in vertical domains such as education and finance, accurately recognizing technical terms and industry-specific expressions while effectively suppressing hallucinated output and language confusion: it hears clearly, understands the meaning, and writes it down accurately. With high accuracy, low latency, and customization as its core strengths, Fun-ASR fits demanding speech processing scenarios such as AI meeting minutes and real-time subtitles.
Fun-ASR 是通义实验室推出的端到端语音识别大模型,是基于数千万小时真实语音数据训练而成,具备强大的上下文理解能力与行业适应性,支持低延迟实时听写,并且覆盖31个语种。在教育、金融等垂直领域表现出色,能准确识别专业术语与行业表达,有效应对"幻觉"生成和语种混淆等挑战,实现"听得清、懂其意、写得准"。Fun-ASR 广泛适用于多种高要求语音处理场景,凭借高精度、低延迟、可定制化等核心优势,可以应用于会议AI纪要、实时字幕等场景。
Fun-ASR Architecture Overview
Pre-training Pipeline for Audio Encoder
The FunRL Framework
Time Consumption Analysis
Performance Comparison in Different Scenarios
Hotword Performance Comparison in Different Domains
ASR Capability Demonstrations ASR 能力演示
Click to play and listen to Fun-ASR's recognition performance in different scenarios.
点击播放试听 Fun-ASR 在不同场景下的识别效果。
Chinese Dialect Recognition 中文方言识别
Supports the Wu, Cantonese, Min, Hakka, Gan, Xiang, and Jin dialect groups, as well as accented Mandarin from the Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan varieties, spanning more than 20 regions including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.
支持的方言包括吴语、粤语、闽语、客家话、赣语、湘语、晋语,支持的口音官话覆盖中原、西南、冀鲁、江淮、兰银、胶辽、东北、北京、港台等,包括河南、陕西、湖北、四川、重庆、云南、贵州、广东、广西、河北、天津、山东、安徽、南京、江苏、杭州、甘肃、宁夏 等 20 多个地区。
人跟狗,包括人跟动物接触长了,全有感情。葛末随了阿拉社会个富裕。
啲身体好劲啊,跟住咧佢哋有一个人咧就突然可能就有高原反应啦,突然间就啊窒息咗,即系晕晕咗。
但总来讲孙膑对兵法的理解运用比庞涓略胜一筹。
嗯,下摆若有机会吧,因为即久吼开了吼卷啊遮厉害,会倒贴钱啊。
Multilingual Recognition 多语言识别 (自由说)
Supports 31 languages, including Japanese and Vietnamese, and handles free code-switching between languages.
支持的语种包括31种语言:日语、越南语等,支持语言自由切换场景。
人民たちは、金欲しさに王をのけ者にしてしまって、何でもすべて商人のところへ持って行ってしまいました。
Đi cùng với tiếp tục kêu gọi người dân đã qua lại các ổ dịch này, khai báo y tế và yêu cầu liên hệ để được xét nghiệm.
このカフェのwi-fiがアン ステーブル 過ぎて、google meetでディスコネクトされて クライエントに悪い印象を与えてしまった。
Robust Far-Field Recognition 远场复杂背景识别
Optimized for far-field pickup and high-noise scenarios (such as conference rooms, in-vehicle environments, industrial sites), significantly improving speech recognition stability and accuracy in complex acoustic conditions.
针对远距离拾音及高噪声场景(如会议室、车载环境、工业现场等)进行深度优化,显著提升复杂声学条件下的语音识别稳定性与准确率。
然后被冠以了渣男线的称号,好了,不管这个,那么前方即将到达沈杜公路站,左边是8号线。
周末要不要去露营,最近天气超舒服,露营?我怕虫子咬,而且晚上睡帐篷会不会很冷啊?放心,我借了专业装备还有暖宝宝,再带点火锅食材,边吃边看星星超惬意。
<music>唯一的遗憾就是他那个八宝鸭还有烤鸭都没吃上 估计得提前预定吧 <impact_sounds></impact_sounds>只能怪我自己没有做好功课</music>
别紧张<breathing></breathing>我只是我是在这边逛街 然后看到你们在这边拍照 想跟你交个朋友<impact_sounds> 认识</impact_sounds>一下
So what's interesting here is I feel that you know brands knowing this when people sort of speak to the voice assistance at home and if you want to be the brand.
Industry Customization Capability 行业定制化能力
In industrial ASR deployments, personalized customization is an essential capability. Customization means applying an additional probability bias toward specific words or phrases (such as names, places, brands, and technical terms) during recognition, markedly improving their recall while minimizing the impact on general recognition accuracy.
在ASR的工业落地中,个性化定制是必不可少的技术。所谓定制化,是指在识别过程中对特定词/短语(如人名、地名、品牌、专业术语等)施加额外概率偏好,从而显著提高它们的识别召回率,同时尽量不损伤通用识别准确率。
Our solution introduces a RAG (Retrieval-Augmented Generation) mechanism: the user-configured custom words are built into a dedicated RAG store, relevant vocabulary is retrieved dynamically based on the CTC first-pass decoding result, and only that relevant vocabulary is injected into the LLM's prompt, avoiding interference from unrelated entries. This expands the usable custom-word context to over a thousand entries without increasing inference complexity, while maintaining strong customized recognition performance.
我们的解决方案:引入RAG(检索增强生成)机制,将用户配置的定制词构建成专属RAG库,依据CTC第一遍解码结果动态抽取相关词汇,仅将相关词汇精准注入LLM的Prompt中避免无关信息干扰。该方案在不增加推理复杂度的前提下,将定制化上文数量扩充到上千个以上,并且保持较高的定制化识别效果。
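A minimal sketch of this retrieve-then-inject flow is shown below, assuming a hypothetical CTC first-pass hypothesis string and a plain-text prompt template; the character-overlap retrieval used here (via Python's `difflib`) is only an illustrative stand-in for the actual RAG retrieval component, and all function names are our own.

```python
from difflib import SequenceMatcher

def retrieve_hotwords(first_pass_text: str, hotword_bank: list[str],
                      threshold: float = 0.6, top_k: int = 50) -> list[str]:
    """Select hotwords whose best local match in the CTC first-pass
    hypothesis exceeds a similarity threshold (illustrative heuristic)."""
    scored = []
    for word in hotword_bank:
        n = len(word)
        # Slide a window of the hotword's length over the hypothesis and
        # keep the best fuzzy-match score.
        best = max(
            (SequenceMatcher(None, word, first_pass_text[i:i + n]).ratio()
             for i in range(max(1, len(first_pass_text) - n + 1))),
            default=0.0,
        )
        if best >= threshold:
            scored.append((best, word))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_k]]

def build_prompt(first_pass_text: str, hotword_bank: list[str]) -> str:
    """Inject only the retrieved, relevant hotwords into the LLM prompt."""
    relevant = retrieve_hotwords(first_pass_text, hotword_bank)
    hint = "、".join(relevant) if relevant else "(none)"
    return (f"Relevant domain terms: {hint}\n"
            "Transcribe the audio precisely, preferring the listed terms "
            "when they match the pronunciation.")

# Example: a thousand-entry hotword bank, but only matching terms are injected.
bank = ["肾小球", "三磷酸腺苷", "微分形式", "大希律王"] + [f"术语{i}" for i in range(1000)]
print(build_prompt("肾脏中肾小球囊上的细胞膜孔隙很小", bank))
```

Because only the retrieved subset is injected, the prompt stays short even when the user-configured hotword bank grows to thousands of entries, which is what keeps inference cost flat.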
肾脏中肾小球囊上的细胞膜孔隙很小
对微分形式的积分是微分几何中的基本概念
由罗马皇帝钦点的犹地亚王大希律王统治期间
利用三磷酸腺苷的水解所产生的能量来驱动其他化学反应
根据碰撞理论月面样本缺少挥发性物质
比如说酯在当时被认为是一种含氧酸盐
Music & Lyrics Transcription 歌词识别
Enhanced speech recognition under background-music interference and, for the first time, accurate recognition of sung lyrics, expanding the potential of speech technology in entertainment and consumer scenarios such as music, short videos, and smart speakers.
强化在音乐背景干扰下的语音识别性能,并首次支持对歌曲中歌词内容的精准识别,拓展语音技术在音乐、短视频、智能音箱等娱乐与消费场景的应用潜力。
我看到我的身后盯着我的人群, 喜欢或恨不一样的神情, 我知道这可能就是所谓的成名, 我知道必须往前一步也不能停。
明明那么远,为何却感觉离他那么近? 闭上眼,你甚至能背出他所有押韵。 虽然不听说唱了,但你已学会自信。 我代表所有中文说唱歌手向你致敬。 如今面对困难的你,早已不再抱怨。
你听啊秋末的落叶, 你听它叹息着离别, 只剩我独自领略海与山风和月, 你听啊。
Hey diddle diddle, the cat and the fiddle, the cow jumped over the moon. The little dog laughed to see such sport, and the dish ran away with the spoon. Hey diddle diddle, the cat and the fiddle, the cow jumped over the moon. The little dog laughed to see such sport, and the dish ran away with the spoon.
I see your monsters. I see your pain. Tell me your problems; I'll chase them away. I'll be your lighthouse. I'll make it okay. When I see your monsters, I'll stand there so brave and chase them all away.
It may be a good idea for Joe, but it wouldn't be good for me to sit in a mortgaged bungalow with my little ones on my knee. I'd much rather go and blow my dough on a casual chickadee. I don't want a mark that I'll have to toe—my toe can go where it wants to go. It wants to go where the wild girls grow in extravagant quantity to bask in the warm and peaceful glow of connubial constancy. May be awfully good for good old Joe, but it wouldn't be good for me!
Streaming ASR 流式识别
Streaming ASR with low first-character latency: the theoretical first-character output delay has been optimized to 160 ms. While still accepting the audio stream chunk by chunk and returning intermediate results along with the final transcript in real time, it delivers a millisecond-level "text appears as you speak" experience.
流式识别低首字延迟,首字输出理论延迟优化至160ms,在保持音频流逐段输入、实时返回中间结果与最终转写文本能力的同时,真正实现"说话即出字"的毫秒级响应体验。
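A minimal client-side sketch of this chunk-by-chunk pattern, assuming a 16 kHz 16-bit mono WAV file and a hypothetical `recognize_chunk()` callable standing in for the real streaming API; the 160 ms figure above refers to the model's theoretical first-character latency, not to the chunk size chosen in this loop.

```python
import wave

CHUNK_MS = 100        # feed audio in 100 ms increments (illustrative choice)
SAMPLE_RATE = 16000   # assumed 16 kHz, 16-bit mono input

def recognize_chunk(pcm_bytes: bytes, is_final: bool) -> str:
    """Hypothetical stand-in for the streaming ASR call; a real client would
    send the chunk and receive the current partial (or final) transcript."""
    raise NotImplementedError("replace with the actual streaming ASR client")

def stream_file(path: str) -> None:
    frames_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000
    with wave.open(path, "rb") as wav:
        while True:
            pcm = wav.readframes(frames_per_chunk)
            if not pcm:
                break
            is_final = wav.tell() >= wav.getnframes()
            partial = recognize_chunk(pcm, is_final)  # intermediate result
            print(("FINAL   " if is_final else "PARTIAL ") + partial)

# stream_file("meeting_16k.wav")  # hypothetical input file
```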
Performance Benchmarks 性能基准测试
Fun-ASR achieves industry-leading performance on multiple public datasets and industrial test sets. Detailed performance comparisons are given below.
Fun-ASR 在多个公开数据集和工业测试集上均达到业界领先水平,以下为详细性能对比数据。
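All tables below report WER, i.e. the edit distance between hypothesis and reference divided by the reference length (computed over words for English and over characters for Chinese, which is our assumption about the evaluation setup). A minimal self-contained implementation of that metric might look like this:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference),
    computed with standard Levenshtein dynamic programming."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # deleting i reference tokens
    for j in range(cols):
        d[0][j] = j                      # inserting j hypothesis tokens
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / max(1, len(reference))

# English: split on whitespace; Chinese: split into characters.
print(wer("the cat sat".split(), "the cat sit".split()))  # 0.333...
print(wer(list("肾小球囊"), list("肾小球浪")))               # 0.25
```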
Open-Source Dataset Performance (WER %)
| Test Set | GLM-ASR-Nano | Whisper-large-v3 | Seed-ASR | Seed-ASR* | Kimi-Audio | Step-Audio2 | FireRed-ASR | Fun-ASR-nano | Fun-ASR | Fun-ASR (1126) |
|---|---|---|---|---|---|---|---|---|---|---|
| AIShell1 | 3.47 | 4.72 | 0.68 | 1.63 | 0.71 | 0.63 | 0.54 | 1.76 | 1.22 | 1.28 |
| AIShell2 | 3.47 | 4.68 | 2.27 | 2.76 | 2.86 | 2.10 | 2.58 | 2.80 | 2.30 | 2.35 |
| Fleurs-zh | 3.65 | 5.18 | 3.43 | 3.23 | 3.11 | 2.68 | 4.81 | 3.47 | 2.64 | 2.89 |
| Fleurs-en | 6.95 | 6.23 | 9.39 | 9.39 | 6.99 | 3.03 | 10.79 | 7.95 | 5.84 | 6.24 |
| Librispeech-clean | 2.17 | 1.86 | 1.58 | 2.8 | 1.32 | 1.17 | 1.84 | 1.75 | 1.57 | 1.51 |
| Librispeech-other | 4.43 | 3.43 | 2.84 | 5.69 | 2.63 | 2.42 | 4.52 | 4.37 | 3.24 | 3.13 |
| WenetSpeech Meeting | 8.21 | 18.39 | 5.69 | 7.07 | 6.24 | 4.75 | 4.95 | 8.78 | 6.49 | 6.53 |
| WenetSpeech Net | 6.33 | 11.89 | 4.66 | 4.84 | 6.45 | 4.67 | 4.94 | 6.28 | 5.46 | 5.50 |
Note: Seed-ASR* results are evaluated using the official API on volcengine. 注:Seed-ASR* 结果使用 volcengine 上的官方 API 评估。
Industry Dataset Performance (WER %)
| Test Set | GLM-ASR-Nano | Seed-ASR | Whisper-large-v3 | FireRed-ASR | Kimi-Audio | Paraformer v2 | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|---|
| Nearfield | 16.95 | 7.20 | 16.58 | 10.10 | 9.02 | 8.11 | 7.79 | 6.31 |
| Farfield | 9.44 | 4.59 | 22.21 | 7.49 | 10.95 | 9.55 | 5.79 | 4.34 |
| Complex Background | 23.79 | 12.90 | 32.57 | 15.56 | 15.56 | 15.19 | 14.59 | 11.45 |
| English General | 16.47 | 15.65 | 18.56 | 21.62 | 18.12 | 19.48 | 15.28 | 13.73 |
| Opensource | 4.67 | 3.83 | 7.05 | 5.31 | 3.79 | 6.23 | 4.22 | 3.68 |
| Dialect | 54.21 | 29.45 | 66.14 | 52.82 | 71.94 | 41.16 | 28.18 | 19.55 |
| Accent | 19.78 | 10.23 | 36.03 | 14.05 | 27.20 | 17.80 | 12.90 | 10.01 |
| Lyrics | 46.56 | 30.26 | 54.82 | 42.87 | 65.18 | 50.14 | 30.85 | 21.23 |
| Hiphop | 43.32 | 29.46 | 46.56 | 33.88 | 57.25 | 43.79 | 30.87 | 24.86 |
| Average | 26.13 | 15.95 | 33.39 | 22.63 | 31.00 | 23.49 | 16.72 | 12.80 |
Streaming ASR Performance (WER %)
| Test Set | Seed-ASR | Fun-ASR | Fun-ASR-nano |
|---|---|---|---|
| Nearfield | 8.64 | 7.00 | 7.97 |
| Farfield | 5.51 | 5.33 | 6.92 |
| Home Scenario | 9.70 | 5.33 | 6.51 |
| Complex Background | 15.48 | 12.50 | 14.83 |
| English General | 18.78 | 14.74 | 16.70 |
| OpenSource Test Sets | 3.80 | 3.60 | 5.13 |
| Average | 10.32 | 8.08 | 9.68 |
Code Switching Performance (WER %)
| Test Set | Offline w/o CS | Offline w/o RL | Offline w/ RL | Streaming w/o CS | Streaming w/o RL | Streaming w/ RL |
|---|---|---|---|---|---|---|
| A | 4.53 | 1.70 | 1.55 | 6.19 | 5.85 | 2.28 |
| B | 4.76 | 4.50 | 4.49 | 6.32 | 5.68 | 5.07 |
Noise Robustness Performance (WER %)
| Environment | Fun-ASR (w/o NRT) | Fun-ASR (w/ NRT) | Fun-ASR (NRT + RL) |
|---|---|---|---|
| Canteen | 20.67 | 19.66 | 19.28 |
| Dinner | 14.02 | 9.02 | 8.77 |
| Meeting | 6.45 | 6.25 | 6.05 |
| Office | 15.02 | 10.52 | 10.42 |
| Outdoor | 10.12 | 9.63 | 9.37 |
| Park | 13.67 | 11.04 | 10.89 |
| Shop | 12.22 | 10.57 | 10.46 |
| Street | 12.05 | 10.10 | 9.85 |
| Subway | 14.11 | 11.89 | 11.84 |
| Supermarket | 14.27 | 8.03 | 7.74 |
| Walk Street | 13.89 | 13.34 | 12.94 |
| Average | 13.32 | 10.91 | 10.69 |
Hotword Customization Performance
| Topic | Offline w/o RL: WER (%) | Offline w/o RL: Accuracy | Offline w/o RL: Recall | Offline w/ RL: WER (%) | Offline w/ RL: Accuracy | Offline w/ RL: Recall | Streaming w/o RL: WER (%) | Streaming w/o RL: Accuracy | Streaming w/o RL: Recall | Streaming w/ RL: WER (%) | Streaming w/ RL: Accuracy | Streaming w/ RL: Recall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Biology | 1.67 | 0.98 | 0.99 | 1.70 | 0.97 | 1.00 | 2.04 | 0.98 | 0.98 | 1.97 | 0.99 | 0.98 |
| Math | 0.86 | 0.99 | 0.99 | 0.86 | 0.99 | 0.99 | 1.29 | 0.99 | 0.99 | 1.01 | 0.99 | 1.00 |
| Religion | 3.20 | 0.98 | 0.98 | 2.87 | 0.99 | 0.99 | 3.71 | 0.99 | 0.98 | 3.35 | 0.99 | 0.97 |
| Food | 1.90 | 0.98 | 0.99 | 1.55 | 0.99 | 1.00 | 2.01 | 0.99 | 0.99 | 1.47 | 0.99 | 0.99 |
| Name | 0.53 | 1.00 | 0.95 | 0.35 | 1.00 | 1.00 | 1.29 | 1.00 | 0.95 | 0.88 | 1.00 | 0.98 |
| Brand | 0.41 | 1.00 | 0.99 | 0.33 | 1.00 | 0.99 | 1.08 | 0.99 | 0.95 | 0.38 | 1.00 | 1.00 |
| Astronomy | 2.11 | 1.00 | 0.97 | 1.97 | 0.99 | 0.97 | 2.28 | 0.98 | 0.95 | 2.39 | 1.00 | 0.98 |
| Chemistry | 1.76 | 0.99 | 0.97 | 1.91 | 0.99 | 0.98 | 2.81 | 0.98 | 0.97 | 1.83 | 0.99 | 0.97 |
| Philosophy | 3.03 | 0.99 | 0.96 | 2.84 | 0.99 | 0.97 | 3.31 | 0.99 | 0.96 | 3.03 | 0.99 | 0.95 |
| Physics | 1.72 | 0.99 | 1.00 | 1.82 | 0.98 | 1.00 | 2.31 | 0.99 | 0.98 | 1.80 | 0.99 | 0.99 |
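For orientation, one simple way to compute the hotword recall reported above is sketched below: a hotword counts as recalled when it appears in the reference and also appears in the hypothesis. The actual evaluation may align individual occurrences rather than test mere presence, so treat this definition as an assumption.

```python
def hotword_recall(reference: str, hypothesis: str, hotwords: list[str]) -> float:
    """Fraction of hotwords present in the reference that also show up
    in the hypothesis (simplified: presence-based, not position-aligned)."""
    present = [w for w in hotwords if w in reference]
    if not present:
        return 1.0                      # nothing to recall
    recalled = [w for w in present if w in hypothesis]
    return len(recalled) / len(present)

ref = "利用三磷酸腺苷的水解所产生的能量来驱动其他化学反应"
hyp = "利用三磷酸腺苷的水解所产生的能量驱动其他化学反应"
print(hotword_recall(ref, hyp, ["三磷酸腺苷", "微分形式"]))  # 1.0
```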
Multilingual ASR Performance (WER/CER %)
| Language | Test Set | Kimi-Audio | Whisper Large v3 | seamless-m4t-large-v2 | Fun-ASR-ML | Fun-ASR-ML-Nano |
|---|---|---|---|---|---|---|
| Chinese | Fleurs | 2.69 | 4.71 | 5.15 | 3.00 | 3.51 |
| Chinese | CommonVoice | 7.21 | 12.61 | 10.76 | 5.76 | 6.2 |
| Chinese | WeNetSpeech Test Net | 5.37 | 9.83 | 9.87 | 6.48 | 6.35 |
| Chinese | AIShell2 iOS Test | 2.56 | 4.83 | 4.79 | 2.60 | 2.74 |
| Chinese | Nearfield Test Set | 36.42 | 16.54 | 14.85 | 7.91 | 7.89 |
| English | Fleurs | 4.40 | 4.11 | 6.59 | 3.18 | 5.49 |
| English | CommonVoice | 10.31 | 9.66 | 7.63 | 7.67 | 9.90 |
| English | Librispeech Test Clean | 1.28 | 2.56 | 2.56 | 1.62 | 1.68 |
| English | Librispeech Test Other | 2.42 | 4.34 | 4.84 | 3.39 | 4.03 |
| English | Nearfield Test Set | 12.40 | 11.78 | 43.74 | 11.19 | 11.61 |
| Indonesian | Fleurs | - | 6.07 | 9.36 | 8.10 | 6.56 |
| Indonesian | CommonVoice | - | 7.27 | 6.10 | 5.49 | 8.09 |
| Indonesian | GigaSpeech2 Test | - | 19.11 | 22.30 | 16.71 | 18.52 |
| Thai | Fleurs | - | 8.48 | 9.25 | 6.00 | 8.05 |
| Thai | CommonVoice | - | 5.92 | 2.81 | 0.90 | 1.97 |
| Thai | GigaSpeech2 Test | - | 19.35 | 21.70 | 19.09 | 18.96 |
| Vietnamese | Fleurs | - | 6.51 | 8.07 | 5.50 | 7.75 |
| Vietnamese | CommonVoice | - | 13.51 | 13.85 | 7.41 | 9.26 |
| Vietnamese | GigaSpeech2 Test | - | 13.82 | 43.31 | 8.98 | 9.29 |
| Vietnamese | Nearfield Test Set | - | 11.46 | 32.10 | 7.06 | 7.72 |
BibTeX 文献引用
@article{YourPaperKey2024,
title={Your Paper Title Here},
author={First Author and Second Author and Third Author},
journal={Conference/Journal Name},
year={2024},
url={https://your-domain.com/your-project-page}
}