Paper fast review: Face-Voice Matching using Cross-modal Embeddings


Table of Contents

- Face-Voice Matching using Cross-modal Embeddings
- Summary
- Abstract (Chinese)
- Research Objective
- Background and Problems
- Main work
- Related work
- Method(s)
- Conclusion
- Reference (optional)
- What it arouses in me

Face-Voice Matching using Cross-modal Embeddings

Summary

The authors study a forced-choice task: given one audio clip, decide which of two face pictures belongs to the speaker, using data collected from MS-Celeb-1M and VoxCeleb. The results are very similar to human performance.

Abstract (Chinese)

Research Objective

Background and Problems

Background

Speaker diarization, which identifies “who spoke when,” is an essential process in dialogue systems and robotics services. There are two major groups of speaker diarization techniques: one uses the direction of arrival (DOA) of audio signals [13, 35, 44], and the other uses lip motion [36].

Brief introduction to previous methods

Nevertheless, there are cases where neither DOA-based nor lip-motion-based speaker diarization can be used; for example, imagine a situation where someone is speaking behind you. In such cases, matching the voice directly to face images is a natural alternative, yet only one study was proposed very recently for forced-choice tasks [29]. Its authors trained neural network-based classifiers that take two voices (faces) as candidates and one face (voice) as a query, and output which candidate has the same identity as the query.
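For concreteness, here is a minimal sketch of such a forced-choice classifier, assuming pre-extracted face and voice features and a two-way classification head; the feature dimensions and layer sizes are placeholders for illustration only and are not the architecture used in [29].

```python
import torch
import torch.nn as nn

class ForcedChoiceClassifier(nn.Module):
    """Given one query face and two candidate voices, predict which voice matches."""
    def __init__(self, face_dim=512, voice_dim=512):
        super().__init__()
        # Hypothetical two-way classification head over concatenated features;
        # all dimensions here are placeholders, not those of [29].
        self.head = nn.Sequential(
            nn.Linear(face_dim + 2 * voice_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits over candidate A vs. candidate B
        )

    def forward(self, face_feat, voice_feat_a, voice_feat_b):
        x = torch.cat([face_feat, voice_feat_a, voice_feat_b], dim=-1)
        return self.head(x)  # argmax picks the candidate with the same identity
```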

Problem Statement

Not explicitly stated in the introduction.

Main work

The method consists of two branches: a face branch and a voice branch. The face branch extracts 100-dimensional vectors from an input image using the ResNet50 convolutional neural network architecture [12] followed by four fully connected layers. The voice branch first extracts an i-vector [5] using a pre-trained acoustic model; the i-vector is then converted into a 100-dimensional vector by four fully connected layers.

There are two evaluation datasets: a large one, and another restricted to subjects aged 18 to 30.

The authors propose a cross-modal embedding-based face-voice matching model, which extracts features from face images and voices and optimizes the distance metric between the features extracted from the two modalities.
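A minimal PyTorch sketch of the two branches described above, assuming torchvision's stock ResNet50 as the image backbone and an externally computed i-vector as the voice input; the hidden-layer widths and the i-vector dimension (400 here) are assumptions not given in this note, while the 100-dimensional output and the four fully connected layers per branch follow the description.

```python
import torch.nn as nn
import torchvision

class FaceBranch(nn.Module):
    """ResNet50 backbone followed by four fully connected layers -> 100-d embedding."""
    def __init__(self, embed_dim=100):
        super().__init__()
        backbone = torchvision.models.resnet50()
        backbone.fc = nn.Identity()            # expose the 2048-d pooled features
        self.backbone = backbone
        self.fc = nn.Sequential(               # four fully connected layers (widths assumed)
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, images):                 # images: (B, 3, H, W)
        return self.fc(self.backbone(images))

class VoiceBranch(nn.Module):
    """Four fully connected layers mapping a pre-extracted i-vector -> 100-d embedding."""
    def __init__(self, ivector_dim=400, embed_dim=100):
        super().__init__()
        self.fc = nn.Sequential(               # four fully connected layers (widths assumed)
            nn.Linear(ivector_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, ivectors):               # i-vectors from a pre-trained acoustic model
        return self.fc(ivectors)
```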

loss:
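The specific training loss is not recorded in this note. Purely as a hedged illustration of "optimizing the distance metric between the two modalities", the sketch below uses a symmetric cross-entropy over cosine similarities of in-batch face-voice pairs; the temperature and the in-batch negative-sampling scheme are assumptions, not necessarily the paper's objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_matching_loss(face_emb, voice_emb, temperature=0.1):
    """face_emb, voice_emb: (B, 100) embeddings of matched face-voice pairs."""
    f = F.normalize(face_emb, dim=-1)
    v = F.normalize(voice_emb, dim=-1)
    logits = f @ v.t() / temperature                     # (B, B) cosine-similarity matrix
    targets = torch.arange(f.size(0), device=f.device)   # the diagonal holds the true pairs
    # Pull matched pairs together and push in-batch mismatches apart, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

At test time, the forced-choice decision then reduces to picking the candidate whose embedding lies closer (for example, by cosine distance) to the query's embedding.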

Related work

Human capability for face-voice association:

Method(s)

Method one: Study on Human Performance

Conclusion

Reference (optional)

What it arouses in me
