Data

Databases used in the Interspeech Computational Paralinguistics Challenge (ComParE) series are usually owned by individual donors. End User License Agreements (EULAs) are usually granted for participation in the Challenge. Usage of the databases outside of the Challenges always has to be negotiated with the data owners – not with the organisers of the Challenge. We aim to provide contact information per database; however, this requires the consent of the data owners, which we are currently collecting.

Below, we describe the current 2023 data. All of these corpora provide realistic data recorded in challenging acoustic conditions. They feature further rich annotation, such as subject metadata, transcripts, and segmentation, and are partitioned into training, development, and test sets, observing subject independence. As in previous years, reproducible benchmark results of the most popular approaches will be provided, based on open-source toolkits: we will release scripts for computing these results with auDeep, DeepSpectrum, openSMILE, and openXBOW.
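To illustrate how such baseline features are typically extracted, the following minimal sketch uses the opensmile Python package to compute ComParE functionals for a single audio file. The file name is a placeholder, and the choice of the ComParE 2016 feature set is an assumption for illustration; the feature sets used for the official baselines may differ.

    import opensmile

    # Configure openSMILE to extract the ComParE 2016 acoustic feature set
    # (6,373 functionals per utterance); using this particular set here is
    # an illustrative assumption, not the official baseline configuration.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # 'sample.wav' is a placeholder path; process_file returns a pandas
    # DataFrame with one row of functionals for the whole file.
    features = smile.process_file("sample.wav")
    print(features.shape)  # (1, 6373)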

Hume-Prosody is an audio-only subset of a large-scale dataset of emotionally rated spoken utterances of varied prosody, provided by Alan S. Cowen and colleagues from Hume.AI. The dataset consists of around 50k samples from 1k speakers aged 20 to 66 years. It was gathered in 3 countries with broadly differing cultures: the United States, South Africa, and Venezuela. Each sample has been rated using a ‘select-all-that-apply’ strategy; its label is a 9-class array of floats representing the proportion to which each emotion is expressed in the sample. The classes for the subset are ‘Anger’, ‘Boredom’, ‘Calmness’, ‘Concentration’, ‘Determination’, ‘Excitement’, ‘Interest’, ‘Sadness’, and ‘Tiredness’. The task will be 9-class multi-label regression.
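To make the label format concrete, the following sketch scores predictions for this 9-class multi-label regression task. The file names are placeholders, and the metric shown (per-class Pearson correlation averaged over the 9 classes) is an illustrative assumption, not the official evaluation protocol of the challenge.

    import numpy as np
    import pandas as pd

    CLASSES = ["Anger", "Boredom", "Calmness", "Concentration",
               "Determination", "Excitement", "Interest", "Sadness",
               "Tiredness"]

    # Hypothetical CSV files: one row per sample, one float column per
    # emotion giving the proportion to which that emotion is expressed.
    gold = pd.read_csv("labels_devel.csv")       # placeholder file name
    pred = pd.read_csv("predictions_devel.csv")  # placeholder file name

    # Per-class Pearson correlation, then the mean over all 9 classes;
    # the official challenge metric may differ from this choice.
    per_class = [np.corrcoef(gold[c], pred[c])[0, 1] for c in CLASSES]
    print({c: round(r, 3) for c, r in zip(CLASSES, per_class)})
    print("mean correlation:", np.mean(per_class))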

The HealthCall30 corpus is an audio dataset and subset of the HealthCall corpus, provided by Claude Montacié and colleagues. It is based on real audio interactions between call centre agents and customers who call to solve a problem or to request information. The corpus is designed to study natural spoken conversations and to predict Customer Relationship Management (CRM) annotations made by human agents from various vocal interaction, audio, and linguistic features. It consists of 13,409 chunks of spoken conversations, each lasting 30 seconds. Each conversation was recorded on two separate audio channels: the first channel corresponds to the customer’s audio, and the second to the agent’s audio. More information can be found in: Nikola Lackovic, Claude Montacié, Gauthier Lalande, and Marie-José Caraty: “Prediction of User Request and Complaint in Spoken Customer-Agent Conversations”, arXiv, 2022.
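Since customer and agent speech arrive on separate channels, a natural first preprocessing step is to split each two-channel recording into its two mono streams. The sketch below does this with the soundfile library; the channel order (customer first, agent second) follows the description above, while the file names are placeholder assumptions.

    import soundfile as sf

    # Placeholder path to one 30-second, two-channel chunk of the corpus.
    audio, sample_rate = sf.read("healthcall30_chunk.wav")

    # As described above, channel 0 carries the customer and channel 1 the
    # agent; for a stereo file, audio has shape (num_frames, 2).
    customer = audio[:, 0]
    agent = audio[:, 1]

    # Write each speaker to a separate mono file for downstream feature
    # extraction (e.g., with openSMILE).
    sf.write("customer.wav", customer, sample_rate)
    sf.write("agent.wav", agent, sample_rate)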