VoxCeleb

VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.

7,000+ speakers

VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages.

Utterance Lengths

1 million+ utterances

All speaking face-tracks are captured "in the wild", with background chatter, laughter, overlapping speech, pose variation and different lighting conditions.

Gender Distribution

2,000+ hours

VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.
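The 3-second minimum can be checked on a local copy of the audio. The sketch below is illustrative only: it assumes a segment has been exported as an uncompressed PCM WAV file, which is an assumption about your local setup and not a statement about the format the dataset ships in. The helper names `wav_duration_seconds` and `meets_minimum_length` are hypothetical and are not part of any VoxCeleb tooling.

```python
import wave

# Minimum segment length stated in the dataset description.
MIN_SEGMENT_SECONDS = 3.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


def meets_minimum_length(path, minimum=MIN_SEGMENT_SECONDS):
    """Return True if the clip at `path` lasts at least `minimum` seconds."""
    return wav_duration_seconds(path) >= minimum
```

A call such as `meets_minimum_length("segment.wav")` (with a path of your own) returns a boolean, so it can be used to filter a list of files before further processing.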

Nationality Distribution

Publications

Please cite the following if you make use of the dataset.

A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman

Voxceleb: Large-scale speaker verification in the wild

Computer Speech and Language, 2019


@Article{Nagrani19,
  author       = "Arsha Nagrani and Joon~Son Chung and Weidi Xie and Andrew Zisserman",
  title        = "Voxceleb: Large-scale speaker verification in the wild",
  journal      = "Computer Speech and Language",
  year         = "2019",
  publisher    = "Elsevier",
}

J. S. Chung*, A. Nagrani*, A. Zisserman

VoxCeleb2: Deep Speaker Recognition

INTERSPEECH, 2018.


@InProceedings{Chung18b,
  author       = "Chung, J.~S. and Nagrani, A. and Zisserman, A.",
  title        = "VoxCeleb2: Deep Speaker Recognition",
  booktitle    = "INTERSPEECH",
  year         = "2018",
}

A. Nagrani*, J. S. Chung*, A. Zisserman

VoxCeleb: a large-scale speaker identification dataset

INTERSPEECH, 2017.


@InProceedings{Nagrani17,
  author       = "Nagrani, A. and Chung, J.~S. and Zisserman, A.",
  title        = "VoxCeleb: a large-scale speaker identification dataset",
  booktitle    = "INTERSPEECH",
  year         = "2017",
}

* Equal Contribution

Applications

Speaker Identification

Speech Separation

Talking Face Synthesis

Cross-Modal Transfer Between Face and Voice

Emotion Recognition

Face Generation

Acknowledgements

This work is supported by the EPSRC programme grant Seebibyte EP/M013774/1: Visual Search for the Era of Big Data.
