Columns: TASKS | SOURCES (note level, lyrics, audio, images, general text, video, other, comment) | PRE-PROCESSING (synchronization, feature extraction, other) | FUSION APPROACHES (early fusion; late fusion: ML decision, other, rule based) | DESCRIPTION. In the source columns, an "x" marks a modality that is used without further detail.

Synchronization
Audio-to-score alignment (A2S)
  Sources: note level, audio
  Features: chroma features
  Fusion: DTW
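A minimal sketch of the chroma + DTW recipe shared by the alignment tasks in this group, assuming librosa; the file names are placeholders (e.g. a performance and a synthesized score, or any two renditions of the same piece):

    # Hedged sketch: align a recording to a synthesized score via chroma + DTW.
    import librosa

    y_perf, sr = librosa.load("performance.wav")
    y_score, _ = librosa.load("score_synthesis.wav", sr=sr)

    # Chroma features are robust to timbre, which is why they are the
    # standard front end for A2S and A2A alignment.
    C_perf = librosa.feature.chroma_cqt(y=y_perf, sr=sr)
    C_score = librosa.feature.chroma_cqt(y=y_score, sr=sr)

    # DTW returns the accumulated cost matrix and the optimal warping path,
    # i.e. the frame-level correspondence between the two versions.
    D, wp = librosa.sequence.dtw(X=C_score, Y=C_perf, metric="cosine")
    print(wp[::-1][:10])  # first few (score_frame, performance_frame) pairs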
Lyrics-to-audio synchronization
  Sources: lyrics, audio
  Pre-processing: A2S or source separation (SS)
  Features: MFCCs, ad hoc features (in tonal languages, F0 can be compared to word tones)
  Fusion: HMMs, DTW
Audio-to-audio synchronization (A2A)
  Sources: audio (two or more recordings)
  Features: chroma features
  Fusion: DTW
Audio-to-image alignment
  Reference: 2017, Dorfer et al. (b)
  Sources: audio; other: score images
  Features: spectrogram for audio -> CNN
  Fusion: DTW
  Description: Audio spectrograms and score images are mapped to a common space by CNNs trained with a cost function that brings the vectors of corresponding image and audio excerpts close together. On top of this space, a system for audio-to-image alignment and another for score retrieval from audio examples are presented.
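A sketch of the common-space idea, not Dorfer et al.'s exact architecture (which the row does not specify): two tiny CNN encoders, one per modality, trained with an InfoNCE-style contrastive loss that pulls matching audio/score pairs together; all layer sizes are invented.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Encoder(nn.Module):
        """Tiny CNN mapping a 1-channel excerpt (spectrogram or score image)
        to an L2-normalised embedding. Sizes are illustrative only."""
        def __init__(self, dim=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(16, dim)

        def forward(self, x):
            z = self.fc(self.conv(x).flatten(1))
            return F.normalize(z, dim=1)

    audio_enc, score_enc = Encoder(), Encoder()

    # Dummy batch of matching (audio excerpt, score snippet) pairs.
    audio = torch.randn(4, 1, 64, 64)
    score = torch.randn(4, 1, 64, 64)
    za, zs = audio_enc(audio), score_enc(score)

    # Contrastive objective: each audio excerpt should be closer to its own
    # score snippet than to the other snippets in the batch.
    sim = za @ zs.t()                      # cosine similarities
    target = torch.arange(sim.size(0))
    loss = F.cross_entropy(sim / 0.1, target)
    loss.backward()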
Similarity

Recommendation/playlist generation
  Sources: audio; other: location
  Comment: to be checked; the surveyed reviews do not discuss this
  Fusion: no fusion in similarity tasks
Multimodal queries
  Sources: note level, lyrics, audio, images; general text: meta tags; video
  Pre-processing: mid-level description
  Fusion: all modalities together
  Description: Queries can be built by merging different modalities (2013, Zhonghua).
Query-by-humming
  Sources: note level, audio
Audio queries for score image retrieval
  Reference: 2017, Dorfer et al. (a)
  Sources: audio; other: score images
  Features: spectrogram for audio -> CNN
  Description: Same common-space CNN approach as in 2017, Dorfer et al. (b) above: corresponding audio excerpts and score images are embedded near each other, and the shared space is used to retrieve score images from audio queries.
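Once such a shared space exists, retrieval reduces to nearest-neighbour search. A minimal sketch with synthetic embeddings standing in for the CNN outputs (dimension and collection size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 32
    # Embeddings of the score-image collection and of one audio query,
    # assumed to come from the trained encoders (synthetic here).
    collection = rng.normal(size=(1000, dim))
    collection /= np.linalg.norm(collection, axis=1, keepdims=True)
    query = rng.normal(size=dim)
    query /= np.linalg.norm(query)

    # Cosine similarity is a dot product on unit vectors; take the top hits.
    scores = collection @ query
    top5 = np.argsort(scores)[::-1][:5]
    print(top5, scores[top5])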
Audio queries for video retrieval
  Reference: 2007, Gillet
  Sources: audio; video: music video
  Features: audio: spectral centroid, width, asymmetry and kurtosis, MFCC, ZCR, variance, skewness and kurtosis of the waveform, other perceptual features; video: distance of color and luminosity features
  Pre-processing: LDA feature selection on audio, then segmentation of each modality
  Description: Music video and audio are segmented separately, then correlation measures between the two segmentations are computed. Finally, music audio is used to retrieve video, with these correlation measures acting as a distance function.
Cover-based retrieval
  Reference: 2018, Correya et al. (arXiv only)
  Sources: lyrics, audio; general text: meta tags
  Features: tf-idf for text, HPCPs for audio
  Fusion (late): top results from one modality are re-ranked according to the other modality.
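A sketch of that two-stage late-fusion re-ranking with synthetic similarity scores; the shortlist size and score names are assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    n_tracks = 100
    text_scores = rng.random(n_tracks)    # e.g. tf-idf similarity to the query
    audio_scores = rng.random(n_tracks)   # e.g. HPCP-based similarity

    # Stage 1: shortlist the best candidates by the first modality.
    shortlist = np.argsort(text_scores)[::-1][:20]
    # Stage 2: re-rank the shortlist by the second modality only.
    reranked = shortlist[np.argsort(audio_scores[shortlist])[::-1]]
    print(reranked[:5])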
Symbolic queries for audio retrieval
  Reference: 2016, Balke et al. & 2008, Suyoto et al.
  Sources: note level, audio
  Pre-processing: A2S; transcription of the collection to note level (2008, Suyoto)
  Features: CENS (chroma features) for audio, piano roll mod 12 for notes (2016, Balke)
  Description: Audio recordings are retrieved through symbolic queries. In the first work, symbolic themes and audio recordings are both converted to chroma features and the minimum-cost match is found with subsequence DTW (SDTW). The second work instead simply transcribes the audio to scores and transposes them.
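A sketch of the SDTW matching step, assuming librosa; in the real pipeline the query chroma would come from the symbolic theme (e.g. a piano roll folded mod 12), but both matrices are synthetic here:

    import numpy as np
    import librosa

    rng = np.random.default_rng(2)
    query = rng.random((12, 20))       # chroma of a short symbolic theme
    recording = rng.random((12, 500))  # chroma (CENS) of a full recording

    # Subsequence DTW: the query may match anywhere inside the recording.
    D, wp = librosa.sequence.dtw(X=query, Y=recording,
                                 metric="cosine", subsequence=True)
    # The matched region in the recording, and its cost for ranking.
    start, end = wp[-1, 1], wp[0, 1]
    print(f"best match: frames {start}-{end}, cost {D[-1, end]:.3f}")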
Classification

Emotion/mood
  Sources: lyrics, audio; images: cover arts; general text: meta tags; other: EEG
  Comment: usually just one of {images, lyrics, text, EEG}
  Features: audio: BPM, MPEG-7 descriptors, MFCC, spectrogram; text: bag-of-words, LSA, LMD, BSTI
  Fusion: Hough forest (early); SVM -> ? (late, ML decision)
Genre
  Sources: lyrics, audio; images: cover arts; general text: meta tags, album reviews, synonyms, Wikipedia tags, etc.; video: music video
  Comment: usually just one of {images, lyrics, text, video}; exception: Oramas, 2017
  Features:
    audio: MFCC, spectral centroid, rolloff, flux, zero crossing, low energy, DWCH, rhythm histograms, rhythm patterns, statistical spectrum descriptors, spectrogram -> CNN
    lyrics: bag-of-words -> CNN, LDA, POS, orthography (capitalization, etc.), rhyme, word statistics
    visual: Global Color Statistics, Global Emotion Values, Colorfulness, Wang Emotional Factors, Itten’s Contrasts, Color Names, Lightness Fluctuation Patterns, CNN
  Fusion: early: SVM, k-NN, random forests, naïve Bayes, CNN; late: SVM, k-NN, random forests, naïve Bayes -> [cartesian ensemble] -> weighted or unweighted rules
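A minimal late-fusion sketch in the spirit of the weighted-rule ensembles above, with synthetic features standing in for the audio and lyrics descriptors; the classifier choices and fusion weights are assumptions:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # One synthetic dataset split into two "views" of the same tracks.
    X, y = make_classification(n_samples=300, n_features=30, n_informative=12,
                               n_classes=3, random_state=0)
    X_audio, X_lyrics = X[:, :20], X[:, 20:]

    idx_train, idx_test = train_test_split(np.arange(len(y)), random_state=0)

    # Late fusion: one classifier per modality, decisions merged afterwards.
    clf_audio = SVC(probability=True).fit(X_audio[idx_train], y[idx_train])
    clf_lyrics = KNeighborsClassifier().fit(X_lyrics[idx_train], y[idx_train])

    # Weighted rule over the class probabilities; weights are assumptions,
    # tuned on validation data in practice.
    p = 0.6 * clf_audio.predict_proba(X_audio[idx_test]) \
      + 0.4 * clf_lyrics.predict_proba(X_lyrics[idx_test])
    y_pred = p.argmax(axis=1)
    print("fused accuracy:", (y_pred == y[idx_test]).mean())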
Artist
  Reference: 2014, Aryafar & Shokoufandeh
  Sources: lyrics, audio
  Features: tf-idf for lyrics, tf-idf of MFCCs for audio; other: ESA representation
  Fusion: fusion happens while building the SLIM, then k-NN
  Description: Text and MFCCs are represented in a tf-idf fashion using Extended Semantic Analysis (ESA) matrices. The features are then merged using Sparse LInear Models (SLIM) and classified with k-NN.
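A rough sketch of the pipeline shape; plain feature concatenation deliberately stands in for the SLIM merging step, the toy corpus is invented, and the audio tf-idf vectors (which would come from quantised MFCC "words") are random:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    lyrics = ["love love me do", "smoke on the water",
              "hey jude", "highway star"]
    artists = ["beatles", "deep purple", "beatles", "deep purple"]

    # tf-idf over lyrics; synthetic stand-in for the audio tf-idf side.
    X_text = TfidfVectorizer().fit_transform(lyrics).toarray()
    rng = np.random.default_rng(3)
    X_audio = rng.random((len(lyrics), 8))

    # SLIM merging replaced by simple concatenation for brevity.
    X = np.hstack([X_text, X_audio])
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, artists)
    print(clf.predict(X[:1]))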
Derivative works
  Reference: 2017, Smith et al.
  Sources: audio; general text: title, authors; other: YouTube
  Features: text: LDA and Knowledge Boosting; video: lighting key, estimate of variance in colour, magnitude of frame-to-frame change in colour, magnitude and direction of the optical flow, a flag indicating contrary motion; audio: landmarks
  Fusion (late, ML decision): SVM
  Description: Classification of videos retrieved from YouTube. Classification is performed with an SVM, with no feature selection and no conversion to a common space.
Instrument
  Reference: 2017, Slizovskaia
  Sources: audio; video: performer
  Features: spectrogram representation -> CNN
  Fusion: CNN
  Description: Instrument recognition is performed by analysing video and audio frames. Separate neural network models are trained for the audio and video modalities, followed by two fully connected layers for the prediction. The models trained on single and multiple modalities are compared.
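A sketch of that two-branch pattern (all layer sizes are invented): one encoder per modality, concatenation of the branch outputs, then two fully connected layers for the class prediction.

    import torch
    import torch.nn as nn

    class TwoBranchNet(nn.Module):
        """Audio and video branches fused by two fully connected layers."""
        def __init__(self, n_instruments=10):
            super().__init__()
            self.audio = nn.Sequential(   # operates on spectrogram patches
                nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.video = nn.Sequential(   # operates on RGB video frames
                nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head = nn.Sequential(    # the two FC fusion layers
                nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, n_instruments))

        def forward(self, spec, frame):
            fused = torch.cat([self.audio(spec), self.video(frame)], dim=1)
            return self.head(fused)

    net = TwoBranchNet()
    logits = net(torch.randn(2, 1, 64, 64), torch.randn(2, 3, 64, 64))
    print(logits.shape)  # (2, n_instruments)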
Instrument
  Reference: 2011, Lim et al.
  Sources: audio; video: performer
  Features: Mel-scale spectrum, delta values from linear regression, and logarithmic power for audio; HOG for video
  Fusion (late): GMM -> linear combination -> max
  Description: A robot orchestra conductor needs to recognize the different instruments in the orchestra. Audio and video are processed separately and the results are fused with a linear weighted sum.
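A sketch of that decision rule, with invented per-class scores standing in for the GMM log-likelihoods of each modality; the weight is an assumption:

    import numpy as np

    instruments = ["violin", "flute", "trumpet"]
    # Log-likelihoods per instrument class, one set per modality
    # (in the real system these come from GMMs on audio and HOG features).
    ll_audio = np.array([-10.2, -12.5, -11.1])
    ll_video = np.array([-8.7, -9.9, -13.0])

    w = 0.7  # audio weight; assumption, tuned on validation data in practice
    fused = w * ll_audio + (1 - w) * ll_video
    print(instruments[int(np.argmax(fused))])  # argmax over the fused scores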
Tonic
  Reference: 2013, Sentürk et al.
  Sources: note level, audio
  Features: kernel-density pitch-class distribution from audio and score
  Fusion: complicated
  Description: Audio and score representations are used to identify the tonic (karar) in Turkish music.
Expressive musical annotation
  Reference: 2015, Li et al.
  Sources: note level, audio
  Pre-processing: A2S; feature selection through ReliefF
  Features: baseline (MIRtoolbox) plus dynamics, duration and vibrato features
  Fusion: fusion happens at the feature extraction stage, then SVM
  Description: An algorithm for automatic classification based on expressive musical terms, using SVMs.
Time-dependent representation

Score-informed source separation (SS)
  Sources: note level, audio
  Pre-processing: A2S
  Features: spectrogram; other: GMM
  Fusion: NMF (early); error correction (rule based)
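A compact sketch of score-informed NMF, a common recipe for this task (the sizes, the random mask standing in for the aligned score, and the Euclidean update rule are assumptions): the score constrains which pitch activations may be non-zero, and multiplicative updates refine the rest.

    import numpy as np

    rng = np.random.default_rng(4)
    n_bins, n_frames, n_pitches = 64, 40, 5

    V = rng.random((n_bins, n_frames)) + 1e-9   # magnitude spectrogram (toy)

    # W: one spectral template per pitch; H: its activation over time.
    W = rng.random((n_bins, n_pitches)) + 1e-9
    H = rng.random((n_pitches, n_frames)) + 1e-9

    # Score information: zero out activations where the aligned score says
    # a pitch is not playing. Multiplicative updates keep zeros at zero.
    score_mask = rng.random((n_pitches, n_frames)) > 0.5  # stand-in for A2S
    H *= score_mask

    for _ in range(100):   # multiplicative updates for Euclidean NMF
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

    print("reconstruction error:", np.linalg.norm(V - W @ H))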
Beat tracking
  Sources: audio; video: dancer or performer
  Features: video: Hough lines, Hough transform, optical flow, mean shift, skeleton features; audio: onset detection, STPM, onset likelihood, instantaneous tempo
  Pre-processing (other): normalization (only in Ohkita et al., 2015)
  Fusion: particle filter (early); complex rules (rule based)
Piano tutoring
  Sources: note level, audio
  Pre-processing: A2S; synthesis
  Fusion: HMMs, NMF; complicated
Onset detection
  Reference: 2010, Degara et al.
  Comment: not really multimodal
  Features: sub-band peak scores
  Fusion: linear combination -> max
  Description: A note onset algorithm in which the rhythmic structure is detected from the audio to help the processing. The rhythmic structure could also be provided by other sources (e.g. symbolic ones).
Segmentation
  Reference: 2016, Gregorio et al.; 2009, Cheng et al.; 2005, Zhu et al.
  Sources: note level, audio; other: karaoke video
  Pre-processing: A2S; audio beat detection, timbre recognition, chord labeling
  Features: audio: spectral flux, centroid features, spectral entropy, power spectral density, spectral contrast; notes: IOI; text: longest common subsequence between paragraphs
  Fusion: HMMs (early); clustering for audio IOIs and rules for lyrics paragraphs -> complex rules for matching lyrics paragraphs to audio segments (rule based)
  Description: The task consists of subdividing music into unitary and coherent segments.
Spatial guitar fingering
  Reference: 2008, Paleari & 2010, Hrybyk
  Sources: audio; video: performance video
  Features: specmurt for audio, homography for video
  Pre-processing (other): video tracking
  Fusion: not clear
  Description: A video of the guitarist's performance and the audio recording are used together to improve transcription and to annotate fingering.
Multipitch estimation
  Reference: 2017, Dinesh et al.
  Sources: audio; video: performance video
  Features: motion features from video
  Pre-processing (other): multipitch estimation on audio
  Fusion: SVM -> ad hoc algorithm
  Description: Motion features feed an SVM that detects play/non-play activity. Multipitch estimation (MPE) is performed on the audio, and the SVM results are then used as constraints.
Source association
  Reference: 2017, Li et al. (2 papers)
  Sources: note level, audio; video: performance video
  Pre-processing: A2S; audio source separation, feature selection (in part)
  Features: motion features from video
  Fusion: best onset or vibrato match sequence
  Description: Videos of string quartet players, the recording, and the scores are analyzed to detect onsets or vibrato. Onsets are matched against bow motion detected in the video, while vibrato is matched against hand movements. All player-track permutations are considered and the best one is taken. The audio is used to align the scores to the video.
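A sketch of the exhaustive permutation step with invented scores: given a matrix of match scores between separated audio tracks and players, pick the track-to-player assignment with the highest total score.

    import numpy as np
    from itertools import permutations

    rng = np.random.default_rng(5)
    n = 4  # string quartet: 4 separated tracks, 4 players
    # match_score[i, j]: how well audio track i matches player j's motion
    match_score = rng.random((n, n))

    best_perm, best_total = None, -np.inf
    for perm in permutations(range(n)):  # all n! track-to-player assignments
        total = sum(match_score[i, perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    print("track -> player:", best_perm, "score:", round(best_total, 3))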
Percussion onset
  Reference: 2015, Marenco et al.
  Sources: audio; video: performer
  Features: a set of spectral-shape features, MFCC and spectral-flux features on audio; distance of hand and sticks from the drumhead, max and min vertical speed, zero-crossing vertical speed, and DCT of the vertical position on video
  Pre-processing (other): feature selection
  Fusion: SVM
  Description: Onset detection for Afro percussion music. Audio onsets are detected and used to select the video frames to analyze; features from both the video and the audio are then fed to an SVM.
Chords
  Reference: 2012, Konz & Müller
  Sources: audio (multiple recordings)
  Pre-processing: A2A; chord labeling
  Fusion: voting or constraints
  Description: Multiple audio recordings of the same song are synchronized and then exploited to stabilize the result of a chord labeling procedure.
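A sketch of the voting idea with invented chord sequences, assuming the recordings have already been frame-synchronized by A2A:

    from collections import Counter

    # Chord labels per frame from three synchronized performances (toy data).
    perf_a = ["C", "C", "F", "G", "C"]
    perf_b = ["C", "Am", "F", "G", "C"]
    perf_c = ["C", "C", "F", "G7", "C"]

    # Majority vote per frame stabilizes the labeling.
    fused = [Counter(frame).most_common(1)[0][0]
             for frame in zip(perf_a, perf_b, perf_c)]
    print(fused)  # ['C', 'C', 'F', 'G', 'C']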