Abstract: Highlights
• Extraction of reliable acoustic data from originally misaligned videos.
• Extraction of face images without prior annotation of the speaker’s location.
• Reliable audiovisual data provision for emotion recognition in multiparty dialogues.
• Dataset refinement improves audiovisual data used for emotion recognition.
• Refinement of the MELD dataset using CTC segmentation and active speaker detection.