Multilingual Spoken Words

The Multilingual Spoken Words Corpus (MSWC) is a large and growing audio dataset of spoken words in 50 languages, intended for academic research and commercial applications in keyword spotting and spoken term search. It is licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).
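As a quick sanity check on the figures above, the arithmetic can be sketched in a few lines of Python. The constants are the rounded numbers quoted in the description; the function names are illustrative only and are not part of any MSWC tooling. Because "more than 340,000" is a lower bound on the keyword count, the per-keyword average is an upper estimate.

```python
# Rounded figures quoted in the dataset description (not exact counts).
NUM_EXAMPLES = 23_400_000  # "23.4 million" spoken examples
CLIP_SECONDS = 1           # each example is a 1-second clip
NUM_KEYWORDS = 340_000     # "more than 340,000" keywords (a lower bound)


def total_hours(num_examples: int, clip_seconds: float) -> float:
    """Total audio duration in hours for fixed-length clips."""
    return num_examples * clip_seconds / 3600


def mean_examples_per_keyword(num_examples: int, num_keywords: int) -> float:
    """Average number of spoken examples per keyword."""
    return num_examples / num_keywords


if __name__ == "__main__":
    print(f"{total_hours(NUM_EXAMPLES, CLIP_SECONDS):,.0f} hours")
    # prints "6,500 hours"
    print(f"{mean_examples_per_keyword(NUM_EXAMPLES, NUM_KEYWORDS):.1f} examples per keyword")
    # prints "68.8 examples per keyword"
```

The 6,500-hour result is consistent with the "over 6,000 hours" stated above. The average says nothing about how examples are distributed across keywords or languages; the paper covers that.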

The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. All alignments are included in the dataset. Please see our paper for a detailed analysis of the contents of the data and methods for detecting potential outliers. The paper also reports baseline accuracy metrics for keyword spotting models trained on our dataset, compared with models trained on a manually recorded keyword dataset.

Read our full paper here

Join the MSWC mailing list here

Connect with other MSWC users on the Datasets working group Discord server

Get started by trying out our introductory tutorial notebook here on Google Colab

Watch our NeurIPS talk here

License: CC-BY 4.0
Audio format: Opus
Size: 124 GB
Description: All 50 languages

Download

The dataset download, along with separate Audio, Splits, and Alignments downloads, is offered from two mirrors, each with its own terms of use:

Google mirror. By using this mirror, Google requires that you agree not to attempt to determine the identity of the speakers in the dataset.

Alibaba mirror. By using this mirror, MLCommons requires that you agree not to attempt to determine the identity of the speakers in the dataset.

Primary languages in our dataset by country

This map depicts 28 primary languages that are included in our 50-language dataset, highlighted by country. Our dataset contains keywords in the following 50 languages: Arabic, Assamese, Basque, Breton, Catalan, Chinese, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Indonesian, Interlingua, Irish, Italian, Kinyarwanda, Kyrgyz, Latvian, Lithuanian, Maltese, Mongolian, Oriya, Persian, Polish, Portuguese, Romanian, Russian, Sakha, Slovak, Slovenian, Spanish, Sursilvan, Swedish, Tamil, Tatar, Turkish, Ukrainian, Vallader, Vietnamese, and Welsh.