Face-Voice Matching using Cross-modal Embeddings
Summary
Using data collected from MS-Celeb-1M and VoxCeleb, the authors train a model that, given one voice clip, decides which of two face images belongs to the same person (and vice versa). The results are close to human performance.
Abstract (Chinese)
Research Objective
Background and Problems
Background
Speaker diarization, which identifies "who spoke when," is an essential process in dialogue systems and robotics services. There are two major groups of speaker diarization techniques: one uses the direction of arrival (DOA) of audio signals [13, 35, 44] and another uses lip motion [36].
previous methods brief introduction
Nevertheless, there are cases where neither DOA-based nor lip-motion-based speaker diarization can be used; consider, for example, a situation in which someone is speaking behind you. Only one study was proposed very recently for forced-choice tasks [29]. Its authors trained neural network-based classifiers that take two voices (or faces) as candidates and one face (or voice) as a query, and output which candidate has the same identity as the query.
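The forced-choice task above reduces to a nearest-embedding comparison once both modalities live in a shared space. A minimal illustrative sketch (the function name and the toy 2-D vectors are my own, not from the paper):

```python
import numpy as np

def forced_choice(query, candidate_a, candidate_b):
    """Return 0 if candidate_a is closer to the query embedding, else 1.

    All three inputs are assumed to be embeddings in the same shared space.
    """
    d_a = np.linalg.norm(query - candidate_a)
    d_b = np.linalg.norm(query - candidate_b)
    return 0 if d_a <= d_b else 1

# Toy example: candidate_a is much nearer to the query than candidate_b.
query = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far = np.array([1.0, 0.0])
print(forced_choice(query, near, far))  # → 0
```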
Problem Statement
The problem statement is not made explicit in the introduction.
main work
The method consists of two branches: a face branch and a voice branch. The face branch extracts 100-dimensional vectors from an input image using the ResNet50 convolutional neural network architecture [12] followed by four fully connected layers. The voice branch first extracts an i-vector [5] using a pre-trained acoustic model; the i-vector is then converted into a 100-dimensional vector by four fully connected layers. Two datasets are used: one large dataset, and one restricted to the 18–30 age range. The authors propose a cross-modal embedding-based face-voice matching model, which extracts features from face images and voices and optimizes the distance metric between the features extracted from the two modalities.
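A minimal NumPy sketch of the two-branch structure, using untrained random-weight layers just to show the shapes. The input dimensions (2048 for pooled ResNet50 features, 400 for the i-vector) and the hidden sizes are assumptions for illustration, not values stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_stack(dims):
    """Build a random-weight fully connected stack (illustrative, untrained)."""
    return [(rng.standard_normal((d_in, d_out)) / np.sqrt(d_in),
             np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """Apply the stack; ReLU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

# Face branch: pooled ResNet50 features (2048-d, assumed) -> 4 FC -> 100-d
face_branch = fc_stack([2048, 512, 256, 128, 100])
# Voice branch: i-vector (400-d, assumed) -> 4 FC -> 100-d
voice_branch = fc_stack([400, 512, 256, 128, 100])

face_emb = forward(face_branch, rng.standard_normal(2048))
voice_emb = forward(voice_branch, rng.standard_normal(400))
assert face_emb.shape == voice_emb.shape == (100,)
```

Both branches end in the same 100-dimensional space, so a single distance metric can compare faces with voices directly.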
loss:
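The exact loss is not restated in these notes; since the model "optimizes the distance metric" between the two modalities, a generic triplet margin loss serves here as a hedged stand-in (the margin value is an arbitrary assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared Euclidean distances: pull a matched face/voice
    pair together, push a mismatched pair apart by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Matched pair already far closer than the mismatch: loss is zero.
a = np.zeros(3)            # e.g. a voice embedding
p = np.zeros(3)            # matched face embedding
n = np.ones(3)             # mismatched face embedding
print(triplet_loss(a, p, n))  # → 0.0
```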
Related work
Human capability for face-voice association:
Method(s)
Method 1: Study on Human Performance
Conclusion
Reference(optional)
Inspiration for me