
The VoxCeleb Speaker Recognition Challenge 2021
(VoxSRC-21)

Welcome to the 2021 VoxCeleb Speaker Recognition Challenge! The goal of this challenge is to probe how well current methods can recognize speakers from speech obtained 'in the wild'. The data is obtained from YouTube videos of celebrity interviews, as well as news shows, talk shows, and debates - consisting of audio from both professionally edited videos as well as more casual conversational audio in which background noise, laughter, and other artefacts are observed in a range of recording environments.

Evaluation servers are open now! Also, please check the FAQs section for the latest information about the challenge.
There has been a rule change for Tracks 1 & 2! There is now no restriction on the use of language tags on the training set, and participants may compute their own language tags on the test set for use during inference (participants must use the VoxLingua107 language classifier if they choose to do this).

Timeline

This is the rough timeline of the challenge. We will post the exact dates as soon as possible.

Mid July: Release of validation and test data.
Late July: Opening of the CodaLab evaluation server.
1st Sep (11:59PM AOE): Deadline for submission of results.
4th Sep (11:59PM AOE): Deadline for Technical Description.
7th Sep (9:00AM EDT, 15:00 CET): VoxSRC2021 workshop.

Tracks

VoxSRC-2021 features four tracks. Tracks 1, 2 and 3 are speaker verification tracks, where the task is to determine whether two samples of speech are from the same person. Track 4 is a speaker diarisation track, where the task is to break up multi-speaker audio into homogeneous single-speaker segments, effectively answering 'who spoke when'. Details about the evaluation metrics are provided below.

Track 1: Fully supervised speaker verification (closed)
Track 2: Fully supervised speaker verification (open)
Track 3: Self-supervised speaker verification (closed)
  • Participants can train only on the VoxCeleb2 dev dataset, without using speaker labels.
  • You CANNOT use models pre-trained on any other data, or with any identity labels from other datasets.
Track 4: Speaker diarisation (open)
  • Participants are allowed to use any data except the challenge test data.


New focus for speaker verification tracks

This year the verification tracks (1, 2 and 3) have a special focus on multi-lingual verification. It is important for fairness and accessibility that people from diverse language groups can use deep learning models. In order to promote this wider accessibility for speaker verification models, it is essential that benchmark datasets contain multi-lingual data and in turn reward models that perform well on this diverse range of data. To reflect this, the validation and test sets this year contain more multi-lingual data than in previous years.

In order to aid participants with the multi-lingual focus of this year's verification tracks, we provide language predictions and confidence scores for the VoxCeleb1 and VoxCeleb2 waveforms. The predictions were made using the VoxLingua107 language classifier. For each waveform, the model predicts a language label; we apply a softmax to the scores and provide, for each waveform, the resulting score and top predicted language. When predicting the language, we sampled one waveform at random from each YouTube video in the VoxCeleb datasets and assigned the same resulting language prediction to all waveforms from that video. This assumes that the speaker's language is consistent within each video.
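As an illustration of this labelling procedure, the sketch below samples one utterance per video, applies a softmax to its language scores, and propagates the top prediction to every utterance of that video. The input structure and function names are hypothetical and for illustration only; the official labels are the ones we distribute.

```python
import math
import random
from collections import defaultdict

def softmax(scores):
    """Convert raw per-language scores to probabilities."""
    m = max(scores.values())
    exps = {lang: math.exp(s - m) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: v / total for lang, v in exps.items()}

def predict_video_languages(utterance_scores, seed=0):
    """Sample one utterance per YouTube video, take its top softmaxed language,
    and assign that label (with its confidence) to every utterance of the video.
    `utterance_scores` maps an ID like "id00012/21Uxsk56VDQ/00001.wav" to a
    dict of language -> raw classifier score (hypothetical input format)."""
    random.seed(seed)
    by_video = defaultdict(list)
    for utt_id in utterance_scores:
        speaker, video, _ = utt_id.split("/")  # VoxCeleb IDs are speaker/video/segment
        by_video[(speaker, video)].append(utt_id)

    labels = {}
    for utt_ids in by_video.values():
        sampled = random.choice(utt_ids)            # one waveform per video
        probs = softmax(utterance_scores[sampled])
        top_lang = max(probs, key=probs.get)
        for utt_id in utt_ids:                      # same label for the whole video
            labels[utt_id] = (top_lang, probs[top_lang])
    return labels
```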

Please note that the provided language labels cannot be used in Track 3 (self-supervised).

Data


Speaker verification

For the speaker verification tracks, we use the VoxCeleb dataset.

Training data: There are two closed tracks (1 and 3) and one open track (Track 2) for speaker verification. For the closed tracks, participants can only use the VoxCeleb2 dev dataset. Please refer to this website to download the dataset. For the open track, participants can use any public data except the challenge test data.

Validation data: We provide the list of trial speech pairs from identities in the VoxCeleb1 dataset. Each trial pair consists of two single-speaker speech segments, of variable length. Unlike the 2020 challenge, this year we do not use out-of-domain data from VoxCeleb1 identities.

Please note that we utilised the predicted language label of each speech segment when constructing this year's challenging validation set. The same strategy was applied when constructing the VoxSRC2021 workshop test set. We provide participants with the top predicted language for each of the waveforms in VoxCeleb1 and VoxCeleb2 (see above).

Test data: The test set consists of a list of trial pairs and anonymised speech WAV files. Below are the links to download both the trial list and the speech segments.

File MD5 Checksum
Validation pairs (Tracks 1, 2 and 3) Download b640cd8f5d7bcd17c5b9cf59476098a5
Test WAV files (Tracks 1, 2 and 3) Download c8567eb27f509a8f8db410c891054240
Test pairs Download 566dfa218ce8fcf9ba50b43e1497d77c
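After downloading, you may want to verify that each file matches its listed MD5 checksum. A minimal check in Python (the file name below is illustrative):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical file name; compare against the checksum listed in the table above.
# assert md5sum("voxsrc2021_val_pairs.txt") == "b640cd8f5d7bcd17c5b9cf59476098a5"
```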


Speaker diarisation

Training data: Participants can use any existing public datasets for training models except the challenge test data.

Validation data: In VoxSRC2021 we provide both the dev set and the test set of VoxConverse for use in validation. VoxConverse is a large-scale diarisation dataset collected from YouTube videos, including talk shows, panel discussions, political debates and celebrity interviews. Please refer to this website to download the dataset.

Test data: Participants can download our test WAV files from the link below. Please note that you have to submit a single RTTM file containing all predicted segments for our test data.

File MD5 Checksum
Test WAV files (Track 4) Download ffa60a0e8243ea0e38d3c8026d86356b
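For reference, RTTM files used for diarisation scoring typically contain one SPEAKER line per predicted segment, following the standard NIST RTTM field layout. A minimal writer sketch (file IDs and speaker labels are illustrative):

```python
def write_rttm(segments, out_path):
    """Write diarisation output as RTTM SPEAKER lines.
    `segments` is a list of (file_id, onset_sec, duration_sec, speaker_label)."""
    with open(out_path, "w") as f:
        for file_id, onset, duration, speaker in segments:
            f.write(f"SPEAKER {file_id} 1 {onset:.3f} {duration:.3f} "
                    f"<NA> <NA> {speaker} <NA> <NA>\n")

# Illustrative usage: two segments from a single recording.
# write_rttm([("abcdefg", 0.000, 3.420, "spk0"),
#             ("abcdefg", 3.420, 1.100, "spk1")],
#            "track4_submission.rttm")
```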

Evaluation Metrics


Speaker Verification

For the Speaker Verification tracks, we will display both the Equal Error Rate (EER) and the Minimum Detection Cost (CDet). For tracks 1 and 2, the primary metric for the challenge will be the Detection Cost, and the final ranking of the leaderboard will be determined using this score alone. For track 3, the primary metric is EER, as this is a more forgiving metric.

Equal Error Rate
The Equal Error Rate (EER) is the error rate at the operating threshold for which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal.
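A minimal sketch of how EER can be computed from trial scores and target/non-target labels, assuming higher scores indicate the same speaker (this is not the official scoring code):

```python
import numpy as np

def compute_eer(scores, labels):
    """EER: the error rate at the threshold where FAR equals FRR.
    `scores` are similarity scores; `labels` are 1 for target (same-speaker)
    and 0 for non-target trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # threshold where the two rates cross
    return (far[idx] + frr[idx]) / 2
```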

Minimum Detection Cost
Compared to the equal error rate, which assigns equal weight to false negatives and false positives, this metric is usually used to assess performance in settings where achieving a low false positive rate is more important than achieving a low false negative rate. We follow the procedure outlined in Sec 3.1 of the NIST 2018 Speaker Recognition Evaluation Plan for the AfV trials. To avoid ambiguity, we will use the following parameters: C_Miss = 1, C_FalseAlarm = 1, and P_Target = 0.05.
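The sketch below computes a minimum normalised detection cost in the style of the NIST SRE cost function, with the challenge parameters above. It is an illustration, not the official scoring script:

```python
import numpy as np

def compute_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalised detection cost over all decision thresholds.
    `scores` are similarity scores; `labels` are 1 for target and 0 for
    non-target trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    min_cost = np.inf
    for t in np.sort(np.unique(scores)):
        p_miss = (scores[labels == 1] < t).mean()   # targets rejected
        p_fa = (scores[labels == 0] >= t).mean()    # non-targets accepted
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        min_cost = min(min_cost, cost)
    # Normalise by the cost of the best decision made without seeing the scores.
    c_default = min(c_miss * p_target, c_fa * (1 - p_target))
    return min_cost / c_default
```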

Speaker Diarisation

For the Speaker Diarisation track, we will display both the Diarisation Error Rate (DER) and the Jaccard Error Rate (JER), but the leaderboard will be ranked using the Diarisation Error Rate (DER) only.

Diarisation Error Rate
The Diarisation Error Rate (DER) is the sum of:
1. speaker error - the percentage of scored time for which the wrong speaker ID is assigned within a speech region.
2. false alarm speech - the percentage of scored time for which a non-speech region is incorrectly marked as containing speech.
3. missed speech - the percentage of scored time for which a speech region is incorrectly marked as not containing speech.

We use a collar of 0.25 seconds and include overlapping speech in the scoring. For more details, consult section 6.1 of the NIST RT-09 evaluation plan.
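Concretely, each component is a duration measured against the total scored speech time, so DER can be written as (speaker error + false alarm + missed speech) / total scored speech. A toy illustration with made-up durations (in practice, a scoring tool computes these quantities from reference and hypothesis RTTM files with the collar and overlap settings above):

```python
def diarisation_error_rate(speaker_error, false_alarm, missed_speech, total_speech):
    """DER as the sum of its three components, each a duration in seconds,
    relative to the total scored speech time."""
    return (speaker_error + false_alarm + missed_speech) / total_speech

# Made-up durations, for illustration only:
der = diarisation_error_rate(speaker_error=30.0, false_alarm=12.0,
                             missed_speech=18.0, total_speech=1000.0)
print(f"DER = {der:.1%}")  # 6.0%
```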

Jaccard Error Rate
We also report the Jaccard error rate (JER), a metric introduced for the DIHARD II challenge that is based on the Jaccard index. The Jaccard index is a similarity measure typically used to evaluate the output of image segmentation systems and is defined as the ratio between the intersection and union of two segmentations. To compute Jaccard error rate, an optimal mapping between reference and system speakers is determined and for each pair the Jaccard index of their segmentations is computed. The Jaccard error rate is then 1 minus the average of these scores. For more details please consult Sec 3 of the DIHARD Challenge Report.
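As a rough illustration, if each speaker's segmentation is represented as a set of scored frame indices, the per-pair Jaccard index and the resulting JER can be sketched as below. Finding the optimal reference-to-system speaker mapping (e.g. with the Hungarian algorithm) is omitted, and the input format is hypothetical:

```python
import numpy as np

def jaccard_index(ref_frames, sys_frames):
    """Jaccard index between two segmentations given as sets of frame indices:
    |intersection| / |union|."""
    ref, sys = set(ref_frames), set(sys_frames)
    union = ref | sys
    return len(ref & sys) / len(union) if union else 1.0

def jaccard_error_rate(mapped_pairs):
    """JER sketch: given an (already optimal) mapping of reference to system
    speakers as a list of (ref_frames, sys_frames) pairs, JER is one minus the
    mean Jaccard index over the pairs."""
    scores = [jaccard_index(r, s) for r, s in mapped_pairs]
    return 1.0 - float(np.mean(scores))
```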


Code for computing all metrics on the validation data has been provided in the development toolkit.

Challenge registration

All four tracks are hosted on the CodaLab platform. You need a CodaLab account to register, so please create one if you do not already have one. Any researcher, whether in academia or industry, can participate in our challenge, but we only accept institutional email addresses for registration. Please follow the instructions on each challenge website for submission.

The CodaLab evaluation servers are now active. Please visit the links below to participate.
  1. Track 1 - Speaker verification - Fixed training data
  2. Track 2 - Speaker verification - Open training data
  3. Track 3 - Speaker verification - Self-supervised
  4. Track 4 - Speaker diarisation - Open training data

Baselines

For reference, we have added baseline results to the leaderboard, submitted by the user vgg_oxford. Please visit the links below for more details.

  1. Fully supervised speaker verification (Track 1 & 2)
  2. Self-supervised speaker verification (Track 3)
  3. For track 4, we implemented our own system, mostly following this paper, except that we changed the speaker embedding extractor and the voice activity detector.

Previous Challenges

Details of the previous challenges can be found below. You can also find the slides and presentation videos of the winners on the workshop websites.

Challenge Links
VoxSRC-19 challenge / workshop
VoxSRC-20 challenge / workshop

Technical Description

All teams are required to submit a brief technical report describing their method. Please submit this using the latest Interspeech paper template. All reports must be a minimum of 1 page and a maximum of 4 pages excluding references. You can combine descriptions for multiple tracks into one report. Reports must be written in English.

The reports should be sent to us as a link to an arXiv document or as a PDF file. In both cases, we will link to the reports from the challenge website. The report may be used to form all or part of a submission to another conference or workshop; if you intend to do this, we recommend sending the report as a link to an arXiv document. The links and PDF files should be sent to voxsrc (at) gmail.com. The deadline for the report is 4th September 2021, 23:59 AOE.

See here, here and here for examples of reports.

FAQs

Note: the CodaLab evaluation servers are now active (see the links in the Challenge registration section above).

Q. Who is allowed to participate?
A. Any researcher, whether in academia or industry, is invited to participate in VoxSRC. We only request a valid official email address associated with an institution for registration. This ensures that we can limit the number of submissions per team.

Q: Can I use the provided language labels for track 3 (self-supervised speaker verification)?
A: No. These may only be used for the supervised tracks (1 and 2).

Q: In tracks 1 and 2, can I compute my own language tags for the test set and use them during inference?
A: Yes. You are allowed to compute language tags for the hidden test set using the VoxLingua107 language classifier. However, these tags from the test set may not be used for training.

Q: Do I need to use the name of my institution or my real name as the team name for a submission?
A: No, you do not have to. The name of the CodaLab user (or the team name, if you have set one up in CodaLab) that uploads the submission will be used in the public leaderboard. Hence, if you do not want your details to be public, you should anonymise them as appropriate. You must select a team name before the server's closing time.

Q. Can I participate in only some tracks?
A. Yes, you can participate in as many tracks as you like and be considered for each one independently.

Q: How many submissions can I make?
A: You can only make 1 submission per day. In total, you can make only 5 submissions to the test set for each track.

Q: Can I train on other external datasets (public, or not)?
A: Only for the OPEN tracks. Not for the CLOSED tracks.

Q: Can I use data augmentation?
A: Yes. For the CLOSED tracks, you can use any kind of noise or music as long as you are not training on additional speech data; you may also use the MUSAN noise dataset for augmentation (a minimal example of additive noise augmentation is sketched below). For the OPEN tracks, you can train on any data you see fit.
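The sketch below mixes a noise clip into a speech waveform at a chosen signal-to-noise ratio; the function name and array inputs are illustrative, and both signals are assumed to share the same sample rate:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the target SNR (in dB). Both are 1-D float
    arrays at the same sample rate; the noise is tiled or cropped to length."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```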

Q. Can I participate in the challenge but not submit a report describing my method?
A. We do not allow that option: entries to the challenge will only be considered if a technical report is submitted on time. This should not affect later publication of your method if you restrict your report to 2 pages including references. You can still submit to the leaderboard, however, even if you do not submit a technical report.

Q. Will the technical report submitted to this workshop be archived by Interspeech 2021?
A. No. We shall use the papers to select some authors to present their work at the workshop.

Q. Will there be prizes for the winners?
A. Yes, there will be cash prizes for the top 3 on the leaderboard for each track.

Q. For the CLOSED condition, can I use the validation set for training anything, eg. the PLDA parameters?
A. No, for the CLOSED condition you can use the validation set only to tune user-defined hyperparameters, for example selecting which convolutional model to use.

Q. For the CLOSED conditions, what can I use as the validation set?
A. For the closed conditions, participants may only use the provided pairs for this year's challenge, or the VoxCeleb1 pairs. These must strictly NOT be used for training. It is beneficial for participants to use this year's provided validation pairs, as their distribution matches that of the hidden test pairs.

Q. What kind of supervision can I use for the self-supervised track?
A. Self-supervision is an increasingly popular field of machine learning which does not use manually labelled training data for a particular task. The supervision for training instead comes from the data itself, for example from the future frames of a video or from another modality, such as faces.

Q. For the self-supervised track can I use the single-speaker audio segments that have been released as part of the VoxCeleb2 dev set?
A. Yes, you can use the already segmented single speaker segments. These were obtained using SyncNet, a self-supervised method. You cannot, however, use the speaker identity labels provided.

Q. For the self-supervised track can I use the total number of speakers in the VoxCeleb2 dev set as a hyperparameter?
A. No, you cannot use any speaker identity information at all. You cannot use the number of speakers in any way, e.g. to determine the number of clusters for a clustering algorithm.

Q. What if I have an additional question about the competition?
A. If you are registered in the CodaLab competition, please post your question in the competition forum (rather than contacting the organisers directly by e-mail) and we will answer it as soon as possible. The reason for this approach is that others may have similar questions; using the forum ensures that the answer can be useful to everyone. If you would rather ask your question before registering, please follow the procedure in the Organisers section below.

Organisers

Andrew Brown, VGG, University of Oxford,
Jaesung Huh, VGG, University of Oxford,
Arsha Nagrani, Google Research,
Joon Son Chung, Naver, South Korea,
Andrew Zisserman, VGG, University of Oxford,
Daniel Garcia-Romero, AWS AI

Advisors

Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA,
Douglas A Reynolds, Lincoln Laboratory, MIT.

Please contact abrown[at]robots[dot]ox[dot]ac[dot]uk or jaesung[at]robots[dot]ox[dot]ac[dot]uk if you have any queries, or if you would be interested in sponsoring this challenge.

Sponsors

VoxSRC is proudly sponsored by Amazon Web Services (AWS) and Naver/Line.

Acknowledgements

This work is supported by the EPSRC (Engineering and Physical Sciences Research Council) programme grant EP/T028572/1: Visual AI.