Each row below describes one task: the **sources** it combines (note level, lyrics, audio, images, general text, video, other), its **pre-processing** (synchronization, feature extraction, other), and its **fusion approach** (early fusion, or late fusion by ML decision, rule-based combination, or other), plus representative references and a short description. Tasks are grouped into four families: synchronization, similarity, classification, and time-dependent representation.
**Synchronization**

| Task | Reference(s) | Sources | Pre-processing | Fusion approach | Description |
|---|---|---|---|---|---|
| Audio-to-score alignment (A2S) | | Note level, audio | Feature extraction: chroma features | Rule-based: DTW (see sketch below) | - |
| Lyrics-to-audio synchronization | | Lyrics, audio | Synchronization: A2S/SS; feature extraction: MFCCs, ad hoc features (in tonal languages, F0 can be compared to word tones) | ML decision: HMMs; rule-based: DTW | - |
| Audio-to-audio synchronization (A2A) | | Audio (two recordings) | Feature extraction: chroma features | Rule-based: DTW | - |
| Audio-to-image alignment | 2017, Dorfer et al. (b) | Audio, score images | Feature extraction: spectrogram for audio -> CNN | Early: common embedding space; rule-based: DTW | Audio spectrogram excerpts and score images are mapped into a common space by CNNs trained with a cost function that places corresponding audio and image vectors close together; on top of this, a system for audio-to-image alignment and one for score retrieval from audio examples are presented. |
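The first three rows share one skeleton: extract chroma features from both streams, then align them with DTW. Below is a minimal sketch of that skeleton for the A2A row using librosa; the file names are placeholders, and the same code covers A2S if one of the two streams is a synthesized rendering of the score.

```python
import librosa
import numpy as np

# Placeholder file names: two recordings of the same piece (A2A row).
y1, sr1 = librosa.load("performance_a.wav")
y2, sr2 = librosa.load("performance_b.wav")

# Chroma features: robust to timbre, suited to alignment.
hop = 512
c1 = librosa.feature.chroma_cqt(y=y1, sr=sr1, hop_length=hop)
c2 = librosa.feature.chroma_cqt(y=y2, sr=sr2, hop_length=hop)

# DTW over the cosine distance between chroma frames.
D, wp = librosa.sequence.dtw(X=c1, Y=c2, metric="cosine")

# wp is the optimal warping path (frame-index pairs, returned end to start);
# converting it to seconds maps positions in one recording onto the other.
path_sec = np.asarray(wp)[::-1] * hop / sr1
print(path_sec[:5])  # first few corresponding time pairs
```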
**Similarity**

| Task | Reference(s) | Sources | Pre-processing | Fusion approach | Description |
|---|---|---|---|---|---|
| Recommendation/playlist generation | | Audio; other: location | | No fusion in similarity tasks | To be checked: the surveyed reviews do not discuss this. |
| Multimodal queries | | Note level, lyrics, audio, images, meta tags, video | Feature extraction: mid-level description | Early: all modalities together | Queries can be built by merging different modalities (2013, Zhonghua). |
| Query-by-humming | | Note level, audio | | | - |
| Audio queries for score image retrieval | 2017, Dorfer et al. (a) | Audio, score images | Feature extraction: spectrogram for audio -> CNN | Early: common embedding space | Same CNN common space as in the audio-to-image alignment row above, here used to retrieve score images from audio queries. |
| Audio queries for video retrieval | 2007, Gillet | Audio, music video | Feature extraction (audio): spectral centroid, width, asymmetry and kurtosis, MFCC, ZCR, variance, skewness and kurtosis of the waveform, other perceptual features; (video): distance of color and luminosity features; other: LDA feature selection on audio, then segmentation of each modality | Rule-based: correlation measures as distance | Music video and audio are segmented separately and correlation measures between the two segmentations are computed; these measures are then used as a distance function to retrieve video from music audio. |
| Cover-based retrieval | 2018, Correya et al. (arXiv only) | Audio; general text: metatags | Feature extraction: tf-idf for text, HPCPs for audio | Late: re-ranking | Top results from one modality are re-ranked according to the other modality. |
| Symbolic queries for audio retrieval | 2016, Balke et al.; 2008, Suyoto et al. | Note level, audio | Synchronization: A2S; feature extraction: CENS (chroma features) for audio, piano roll mod 12 for notes (2016, Balke); transcription of the collection to the note level (2008, Suyoto) | Rule-based: minimum-cost SDTW match (see sketch below) | Audio recordings are retrieved through symbolic queries: symbolic themes and audio recordings are converted to chroma features and the minimum-cost match is found with subsequence DTW (2016, Balke). The second approach instead simply transcribes the audio to scores and transposes them (2008, Suyoto). |
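The symbolic-queries row (2016, Balke et al.) boils down to matching a short chroma sequence inside a longer one with subsequence DTW. A rough sketch of that idea follows; the query theme and file names are hypothetical, with a binary "piano roll mod 12" matrix standing in for the symbolic side and CENS features for the audio side.

```python
import librosa
import numpy as np

def theme_to_chroma(midi_pitches, frames_per_note=4):
    """Binary chroma ('piano roll mod 12') for a monophonic symbolic theme."""
    chroma = np.zeros((12, len(midi_pitches) * frames_per_note))
    for i, p in enumerate(midi_pitches):
        chroma[p % 12, i * frames_per_note:(i + 1) * frames_per_note] = 1.0
    return chroma

# Hypothetical query: the opening notes of a theme, as MIDI pitch numbers.
query = theme_to_chroma([67, 67, 67, 63, 65, 65, 65, 62])

def match_cost(audio_path, query):
    """Minimum subsequence-DTW cost of the query inside one recording."""
    y, sr = librosa.load(audio_path)
    cens = librosa.feature.chroma_cens(y=y, sr=sr)  # smoothed chroma
    D, _ = librosa.sequence.dtw(X=query, Y=cens, subseq=True,
                                metric="euclidean")
    return D[-1].min()  # best-scoring end point of the subsequence match

# Rank a (placeholder) collection by match cost, best match first.
collection = ["rec1.wav", "rec2.wav", "rec3.wav"]
ranking = sorted(collection, key=lambda f: match_cost(f, query))
print(ranking)
```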
**Classification**

| Task | Reference(s) | Sources | Pre-processing | Fusion approach | Description |
|---|---|---|---|---|---|
| Emotion/mood | | Lyrics, audio; images: cover arts; general text: metatags; other: EEG (usually just one of {images, lyrics, text, EEG}) | Feature extraction (audio): BPM, MPEG-7 descriptors, MFCC, spectrogram; (text): bag-of-words, LSA, LMD, BSTI | Early: Hough forest; ML decision: SVM -> ? | - |
| Genre | | Lyrics, audio; images: cover arts; general text: meta tags, album reviews, synonyms, Wikipedia tags, etc.; video: music video (usually just one of {images, lyrics, text, video}; exception: 2017, Oramas) | Feature extraction (audio): MFCC, spectral centroid, rolloff, flux, zero crossings, low energy, DWCH, rhythm histograms, rhythm patterns, statistical spectrum descriptors, spectrogram -> CNN; (lyrics): bag-of-words -> CNN, LDA, POS, orthography (capitalization, etc.), rhyme, word statistics; (visual): global color statistics, global emotion values, colorfulness, Wang emotional factors, Itten's contrasts, color names, lightness fluctuation patterns, CNN | Early: SVM, k-NN, random forests, naïve Bayes, CNN; ML decision: SVM, k-NN, random forests, naïve Bayes -> [cartesian ensemble] -> weighted or unweighted rules (see sketch below) | - |
| Artist | 2014, Aryafar & Shokoufandeh | Lyrics, audio | Feature extraction: tf-idf for lyrics, tf-idf of MFCCs for audio; other: ESA representation | Early: fusion inside the SLIM building step, then k-NN | Text and MFCCs are represented in a tf-idf fashion using Extended Semantic Analysis matrices; the features are then merged using Sparse LInear Models (SLIM) and classified with k-NN. |
| Derivative works | 2017, Smith et al. | Audio; general text: title, authors; video: YouTube | Feature extraction (text): LDA and knowledge boosting; (video): lighting key, estimated variance in color, magnitude of frame-to-frame change in color, magnitude and direction of the optical flow, a flag indicating contrary motion; (audio): landmarks | ML decision: SVM | Classification of videos retrieved from YouTube. Classification is performed with an SVM, with no feature selection and no conversion to a common space. |
| Instrument | 2017, Slizovskaia | Audio; video: performer | Feature extraction: spectrogram representation -> CNN | ML decision: CNN | Instrument recognition from video and audio frames: separate NN models are trained for the audio and video modalities, followed by two fully connected layers for the prediction; models trained on single and on multiple modalities are compared. |
| Instrument | 2011, Lim et al. | Audio; video: performer | Feature extraction (audio): mel-scale spectrum, delta values from linear regression, logarithmic power; (video): HOG | Late: GMM -> linear combination -> max | A robot orchestra conductor needs to recognize the different instruments in the orchestra; audio and video are processed separately and the results are fused with a linearly weighted sum. |
| Tonic | 2013, Sentürk et al. | Note level, audio | Feature extraction: kernel-density pitch-class distribution from audio and score | Rule-based: complex ad hoc scheme | Audio and score representations are combined to identify the tonic (karar) in Turkish music. |
| Expressive musical annotation | 2015, Li et al. | Note level, audio | Synchronization: A2S; feature extraction: baseline (MIRtoolbox), dynamics, duration and vibrato features; other: feature selection through ReliefF | Early: fusion at the feature-extraction stage, then SVM | Automatic classification based on expressive musical terms, using SVMs. |
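The late-fusion column in this group mostly reduces to: train one classifier per modality, then combine the class posteriors with weighted rules. A minimal sketch with scikit-learn follows, on hypothetical toy data (random matrices standing in for MFCC statistics and a lyrics bag-of-words); the fusion weights are arbitrary here and would normally be tuned on a validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical toy data: 200 tracks, 40 audio features, 500 lyric terms, 4 genres.
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 40))
X_lyrics = rng.integers(0, 5, size=(200, 500))
y = rng.integers(0, 4, size=200)

idx_train, idx_test = train_test_split(np.arange(200), random_state=0)

# One classifier per modality (SVM on audio, naive Bayes on lyric counts),
# as in the late-fusion column above.
clf_a = SVC(probability=True).fit(X_audio[idx_train], y[idx_train])
clf_t = MultinomialNB().fit(X_lyrics[idx_train], y[idx_train])

# Weighted late fusion of the class posteriors (a simple weighted rule).
w_audio, w_lyrics = 0.6, 0.4
proba = (w_audio * clf_a.predict_proba(X_audio[idx_test])
         + w_lyrics * clf_t.predict_proba(X_lyrics[idx_test]))
y_pred = proba.argmax(axis=1)
print((y_pred == y[idx_test]).mean())  # chance-level accuracy on random data
```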
**Time-dependent representation**

| Task | Reference(s) | Sources | Pre-processing | Fusion approach | Description |
|---|---|---|---|---|---|
| Score-informed source separation (SS) | | Note level, audio | Synchronization: A2S; feature extraction: spectrogram; other: error correction | Early: NMF, (GMM) (see sketch below) | - |
| Beat tracking | | Audio; video: dancer or performer | Feature extraction (video): Hough lines, Hough transform, optical flow, mean shift, skeleton features; (audio): onset detection, STPM, onset likelihood, instantaneous tempo; other: normalization (only in 2015, Ohkita et al.) | ML decision: particle filter; rule-based: complex rules | - |
| Piano tutoring | | Note level, audio | Synchronization: A2S; other: synthesis | ML decision: HMMs; early: NMF; rule-based: complex rules | - |
| Onset detection (not really multimodal) | 2010, Degara et al. | * | Feature extraction: sub-band peak scores | Rule-based: linear combination -> max | Note-onset detection in which the rhythmic structure is estimated from the audio itself to guide the processing; the rhythmic structure could equally be provided by other sources (e.g. symbolic). |
| Segmentation | 2016, Gregorio et al.; 2009, Cheng et al.; 2005, Zhu et al. | Lyrics, audio; video: karaoke video | Synchronization: A2S; feature extraction (audio): spectral flux, centroid features, spectral entropy, power spectral density, spectral contrast; (notes): IOI; (text): longest common subsequence between paragraphs; other: audio beat detection, timbre recognition, chord labeling | ML decision: HMMs; rule-based: clustering for audio IOIs and rules for lyrics paragraphs -> complex rules for matching lyrics paragraphs to audio segments | The task consists of subdividing music into unitary, coherent segments. |
| Spatial guitar fingering | 2008, Paleari; 2010, Hrybyk | Audio; video: performance | Feature extraction: specmurt for audio, homography for video; other: video tracking | Unclear | A guitarist's video performance and the audio recording are combined to obtain a better transcription and to annotate fingering. |
| Multipitch estimation | 2017, Dinesh et al. | Audio; video: performance | Feature extraction: motion features from video; other: multipitch estimation on audio | ML decision: SVM -> ad hoc algorithm | Motion features feed an SVM that detects play/no-play activity; multipitch estimation (MPE) is run on the audio, and the SVM results are used as constraints on it. |
| Source association | 2017, Li et al. (2 papers) | Note level, audio; video: performance | Synchronization: A2S; feature extraction: motion features from video; other: audio source separation, feature selection (in part) | Rule-based: best onset or vibrato match sequence | Video of a string quartet performance, the recording, and the scores are analysed to detect onsets or vibrato; onsets are matched against bow motion in the video, vibrato against hand movements. All player-to-source permutations are considered and the best one is kept. Audio is used to align the scores to the video. |
| Percussion onset | 2015, Marenco et al. | Audio; video: performer | Feature extraction (audio): a set of spectral-shape features, MFCC, spectral-flux features; (video): distance of hands and sticks from the drumhead, maximum and minimum vertical speed, zero crossings of the vertical speed, DCT of the vertical position; other: feature selection | ML decision: SVM | Onset detection for Afro percussion music: audio onsets are detected and used to select the video frames to analyse, then features from both video and audio feed an SVM. |
| Chords | 2012, Konz & Müller | Audio (multiple recordings) | Synchronization: A2A; other: chord labeling | Rule-based: voting or constraints | Multiple audio recordings of the same song are synchronized and then exploited to stabilize the result of a chord-labeling procedure. |
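The score-informed SS and piano tutoring rows both rest on NMF whose activations are constrained by an aligned score. A compact sketch of that idea, assuming the score has already been aligned (A2S) and converted into a binary pitch/frame mask; V is a magnitude spectrogram, and the updates are plain Euclidean multiplicative rules rather than any specific paper's variant.

```python
import numpy as np

def score_informed_nmf(V, W, score_mask, n_iter=100, eps=1e-9):
    """Euclidean NMF V ~ W @ H. W holds one spectral template per pitch
    (freqs x pitches); score_mask (pitches x frames) is 1 where the aligned
    score allows a pitch to sound. The mask keeps H consistent with the score."""
    rng = np.random.default_rng(0)
    H = rng.random(score_mask.shape) * score_mask  # start inside the constraint
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
        H *= score_mask                        # re-impose the score constraint
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
    return W, H

# Tiny synthetic example: 2 'pitches' with fixed spectral templates.
freqs, frames = 64, 50
W = np.abs(np.random.default_rng(1).normal(size=(freqs, 2)))
mask = np.zeros((2, frames))
mask[0, :25] = 1          # score: pitch 1 sounds in the first half...
mask[1, 20:] = 1          # ...pitch 2 from frame 20 onwards
V = W @ (mask * 0.8)      # 'observed' spectrogram consistent with the score
W_est, H_est = score_informed_nmf(V, W.copy(), mask)
print(np.abs(W_est @ H_est - V).max())  # small reconstruction error
```

Per-source spectrograms then follow by keeping only one pitch's column of W and row of H (soft masking of V in practice), which is what makes the separation "score-informed".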