M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset

A large-scale speaker diarization dataset based on pseudo-labeling

Dataset Overview

Speaker diarization answers the question of "who spoke when". Existing data resources are often concentrated in specific scenarios such as meetings, which limits the generalization of speaker diarization models. M3SD is a carefully curated speaker diarization dataset with detailed metadata, intended to advance research on multi-modal, multi-scenario, and multi-language speaker diarization. The dataset contains 770+ hours of conversations covering scenarios such as online and offline meetings, home conversations, outdoor conversations, interviews, movie clips, and news broadcasts, in multiple languages including English and Chinese. The data is collected from YouTube and pseudo-labeled by an ensemble of speaker diarization systems. We provide audio files, annotation files, and video metadata including each video's uid; using this metadata, you can also download the original videos from YouTube for multimodal research. The data collection code is open source: https://github.com/slwu0209/M3SD .
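As a minimal sketch of how the metadata can be used for video retrieval, the helper below reconstructs a YouTube watch URL from a video uid. The uid value shown is a hypothetical placeholder, and the exact metadata column name may differ; the resulting URL can be passed to any downloader (e.g. the yt-dlp tool) to fetch the video for multimodal research.

```python
def youtube_url(uid: str) -> str:
    """Build a YouTube watch URL from a video uid taken from the metadata file."""
    return f"https://www.youtube.com/watch?v={uid}"

# Hypothetical uid for illustration; real uids are listed in the Excel metadata.
url = youtube_url("abc123XYZ")
print(url)
# The video can then be fetched with a downloader, e.g.:
#   yt-dlp <url>
```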

This dataset is freely available for academic and non-commercial research purposes.

Key Features

Diverse Data

1,372 recordings and 770+ hours of data from a large number of distinct speakers, covering multiple real-world scenarios and multiple languages, and supporting audio-visual multimodal research.

Rich Metadata

Each recording comes with an RTTM annotation plus its duration, title, description, video uid, and other information, making the data easy to download and use.

Novel Process

A strict data collection and cleaning pipeline, together with a carefully designed processing workflow and audio-visual pseudo-label generation, ensures the reliability of the annotations. The relevant code has been open sourced.

Download the Dataset

Get immediate access to the M3SD dataset by clicking the button below. The download includes:

  • High-quality audio files in WAV format (16kHz)
  • Speaker diarization pseudo-labels in RTTM format
  • An Excel file containing metadata, including each video's uid
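The RTTM pseudo-labels can be consumed with a few lines of standard-library Python. The sketch below assumes the standard RTTM `SPEAKER` line layout (type, file, channel, onset, duration, two `<NA>` fields, speaker label, two more `<NA>` fields); the example lines and speaker names are illustrative, not taken from the dataset.

```python
from collections import defaultdict

def parse_rttm(lines):
    """Parse RTTM 'SPEAKER' lines into (speaker, onset, duration) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # RTTM fields: type file chan onset duration <NA> <NA> speaker <NA> <NA>
        onset, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        segments.append((speaker, onset, duration))
    return segments

def speaking_time(segments):
    """Total speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for speaker, _, duration in segments:
        totals[speaker] += duration
    return dict(totals)

# Illustrative RTTM lines (not from the dataset):
example = [
    "SPEAKER rec1 1 0.50 2.00 <NA> <NA> spkA <NA> <NA>",
    "SPEAKER rec1 1 3.00 1.50 <NA> <NA> spkB <NA> <NA>",
]
segments = parse_rttm(example)
print(speaking_time(segments))
```

In practice you would read the lines from one of the distributed `.rttm` files and pair the segments with the corresponding 16 kHz WAV audio.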

Potential Applications

Speaker diarization model pre-training

Use this dataset to pre-train speaker diarization models with better generalization performance.

Audio-visual speaker diarization research

Study multimodal speaker diarization using both audio and video.

Further research on speech recognition

Transcribe the speech to support automatic speech recognition (ASR) research.

Explore better semi-supervised methods

Investigate improved semi-supervised methods for generating more accurate pseudo-labels.

Citation

If you find this dataset or code useful in your research, please consider citing the following paper:

@article{wu2025m3sd,
  title={M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset},
  author={Shilong Wu and Hang Chen and Jun Du},
  journal={arXiv preprint arXiv:2506.14427},
  year={2025}
}
