CSC475 Music Information Retrieval Tags and Music George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 53
Table of Contents I 1 Indexing music with tags 2 Tag acquisition 3 Autotagging 4 Evaluation 5 Ideas for future work G. Tzanetakis 2 / 53
Tags
Definition: A tag is a short phrase or word that can be used to characterize a piece of music. Examples: bouncy, heavy metal, or hand drums.
Tags can be related to instruments, genres, emotions, moods, usages, geographic origins, musicological terms, or anything the users decide.
Similar to a text index, a music index associates music documents with tags. A document can be a song, an album, an artist, a record label, etc. We consider songs/tracks to be our musical documents.
G. Tzanetakis 3 / 53
Music Index

Vocabulary    s1    s2    s3
happy        0.8   0.2   0.6
pop          0.7   0.0   0.1
a capella    0.1   0.1   0.5
saxophone    0.0   0.7   0.9

A query can either be a list of tags or a song. Using the music index, the system can return a playlist of songs that somehow match the specified tags.
G. Tzanetakis 4 / 53
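A minimal sketch (not from the slides) of how such an index could answer a tag query: songs are scored by summing their affinities for the query tags. The summed-affinity scoring rule is an illustrative assumption, not the method of any particular system; the numbers mirror the example matrix above.

```python
# Tag index: tag -> {song: affinity}, values follow the example matrix
index = {
    "happy":     {"s1": 0.8, "s2": 0.2, "s3": 0.6},
    "pop":       {"s1": 0.7, "s2": 0.0, "s3": 0.1},
    "a capella": {"s1": 0.1, "s2": 0.1, "s3": 0.5},
    "saxophone": {"s1": 0.0, "s2": 0.7, "s3": 0.9},
}

def rank_songs(query_tags, index):
    """Score each song by its summed affinity for the query tags."""
    scores = {}
    for tag in query_tags:
        for song, affinity in index.get(tag, {}).items():
            scores[song] = scores.get(song, 0.0) + affinity
    # Highest combined affinity first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_songs(["happy", "saxophone"], index))  # s3 ranks first (0.6 + 0.9)
```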
Tag research terminology
Cold-start problem: songs that are not annotated cannot be retrieved.
Popularity bias: popular songs (in the short head) tend to be annotated more thoroughly than unpopular songs (in the long tail).
Strong labeling versus weak labeling.
Extensible or fixed vocabulary.
Structured or unstructured vocabulary.
Evaluation is a big challenge due to subjectivity.
Tags generalize classification labels.
G. Tzanetakis 5 / 53
Many thanks to
Material for these slides was generously provided by: Mohamed Sordo, Emanuele Coviello, Doug Turnbull
G. Tzanetakis 6 / 53
Tagging a song G. Tzanetakis 7 / 53
Tagging multiple songs G. Tzanetakis 8 / 53
Text query G. Tzanetakis 9 / 53
Table of Contents I 1 Indexing music with tags 2 Tag acquisition 3 Autotagging 4 Evaluation 5 Ideas for future work G. Tzanetakis 10 / 53
Sources of Tags
Human participation: surveys, social tags, games.
Automatic: text mining, autotagging.
G. Tzanetakis 11 / 53
Survey
Pandora: a team of approximately 50 expert music reviewers (each with a degree in music and 200 hours of training) annotates songs using a structured vocabulary of between 150 and 200 tags. Tags are objective, i.e., there is a high degree of inter-reviewer agreement. Between 2000 and 2010, Pandora annotated about 750,000 songs. Annotating a song takes approximately 20-30 minutes.
CAL500: one song from each of 500 unique artists, each annotated by a minimum of 3 non-expert reviewers using a structured vocabulary of 174 tags. Standard dataset for training and evaluating tag-based retrieval systems.
G. Tzanetakis 12 / 53
Harvesting social tags
Last.fm is a music discovery Web site that allows users to contribute social tags through a text box in their audio player interface. It is an example of crowdsourcing. In 2007, 40 million active users had built up a vocabulary of 960,000 free-text tags and used it to annotate millions of songs. All data is available through a public web API.
Tags typically annotate artists rather than songs. Problems with multiple spellings and polysemous tags (such as progressive).
G. Tzanetakis 13 / 53
Last.fm tags for Adele G. Tzanetakis 14 / 53
Playing Annotation Games In ISMIR 2007, music annotation games were presented for the first time: ListenGame, Tag-a-Tune, and MajorMiner. ListenGame uses a structured vocabulary and is real time. Tag-a-Tune and MajorMiner are inspired by the ESP Game for image tagging. In this approach the players listen to a track and are asked to enter free text tags until they both enter the same tag. This results in an extensible vocabulary. G. Tzanetakis 15 / 53
Tag-a-tune G. Tzanetakis 16 / 53
Mining web documents There are many text sources of information associated with a music track. These include artist biographies, album reviews, song reviews, social media posts, and personal blogs. The set of documents associated with a song is typically processed by text mining techniques resulting in a vector space representation which can then be used as input to data mining/machine learning techniques (text mining will be covered in more detail in a future lecture). G. Tzanetakis 17 / 53
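Below is a hedged sketch of the vector-space step, assuming scikit-learn's TfidfVectorizer; the example documents are invented, and the exact text-mining pipeline (stop-word handling, weighting, vocabulary size) varies between systems.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example documents associated with one song
docs_about_song = [
    "A bouncy pop track with a memorable saxophone hook and upbeat lyrics.",
    "The singer's breathy vocals recall early soul records, over hand drums.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs_about_song)   # documents x terms sparse matrix
print(X.shape)                                  # the vector-space representation
```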
Table of Contents I 1 Indexing music with tags 2 Tag acquisition 3 Autotagging 4 Evaluation 5 Ideas for future work G. Tzanetakis 18 / 53
cal500.sness.net G. Tzanetakis 19 / 53
Audio feature extraction
Audio features for tagging are typically very similar to the ones used for audio classification, i.e., statistics of the short-time magnitude spectrum computed over different time scales.
G. Tzanetakis 20 / 53
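As an illustration (the slides do not prescribe a toolkit), a minimal sketch using librosa to compute MFCCs and summarize them with their mean and standard deviation over frames, a common "bag of frames" track-level feature vector; the file name and parameter choices are assumptions.

```python
import numpy as np
import librosa  # assumption: librosa is not named in the slides

# Hypothetical audio file
y, sr = librosa.load("track.wav")

# Short-time spectral features: 13 MFCCs per frame (13 x n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarize over time: mean and standard deviation per coefficient
track_features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # 26-dim
```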
Bag of words for text G. Tzanetakis 21 / 53
Bag of words for audio G. Tzanetakis 22 / 53
Multi-label classification (with twists)
Classic classification is single-label and multi-class. In multi-label classification each instance can be assigned more than one label. Tag annotation can be viewed as multi-label classification with some additional twists:
Synonyms (female voice, woman singing)
Subpart relations (string quartet, classical)
Sparse (only a small subset of tags applies to each song)
Noisy
Useful because: cold-start problem, query-by-keywords.
G. Tzanetakis 23 / 53
Machine Learning for Tag Annotation A straightforward approach is to treat each tag independently as a binary classification problem. G. Tzanetakis 24 / 53
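A hedged sketch of that per-tag approach, assuming scikit-learn and logistic regression as the base binary classifier (any binary classifier could be substituted); X and Y are generic placeholders for track features and binary tag ground truth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_tag(X, Y):
    """X: (n_tracks, n_features) features, Y: (n_tracks, n_tags) binary tags."""
    models = []
    for t in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, t])          # one independent binary problem per tag
        models.append(clf)
    return models

def tag_affinities(models, X):
    """Return a (n_tracks, n_tags) matrix of estimated P(tag | track)."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])
```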
Tag models
Identify songs associated with tag t.
Merge all features, either directly or by model merging.
Estimate p(x | t).
G. Tzanetakis 25 / 53
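One way to realize "estimate p(x | t)" is a Gaussian mixture over the frame features of the songs carrying tag t. The sketch below assumes scikit-learn's GaussianMixture and only illustrates the general generative recipe, not any specific published model.

```python
from sklearn.mixture import GaussianMixture

def fit_tag_model(frames_with_tag, n_components=8):
    """frames_with_tag: (n_frames, n_dims) features pooled from songs with tag t."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames_with_tag)
    return gmm                      # rough stand-in for p(x | t)

def score_song(gmm, song_frames):
    """Average log-likelihood of a song's frames under the tag model."""
    return gmm.score(song_frames)
```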
Direct multi-label classifiers
Alternatives to individual tag classifiers:
K-NN multi-label classifier: a straightforward extension that requires a strategy for label merging (union or intersection are possibilities).
Multi-layer perceptron: simply train directly with multi-label ground truth.
G. Tzanetakis 26 / 53
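A hedged sketch of the multi-layer-perceptron variant: scikit-learn's MLPClassifier accepts a binary indicator matrix as targets and trains one network for all tags at once. The random data is purely a placeholder so the snippet runs on its own.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 26))            # placeholder track features
Y_train = rng.integers(0, 2, size=(200, 10))    # placeholder binary tag matrix
X_test = rng.normal(size=(5, 26))

mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
mlp.fit(X_train, Y_train)             # trained directly on multi-label ground truth
Y_pred = mlp.predict(X_test)          # binary tag predictions, shape (5, 10)
P_pred = mlp.predict_proba(X_test)    # tag affinities in [0, 1]
```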
Tag co-occurrence G. Tzanetakis 27 / 53
Stacking G. Tzanetakis 28 / 53
Stacking II G. Tzanetakis 29 / 53
How can stacking help? G. Tzanetakis 30 / 53
Other terms/variants
The main idea behind stacking, i.e., using the output of a classification stage as the input to a subsequent classification stage, has been proposed under several different names:
Correction approach (using binary outputs)
Anchor classification (for example, classification into artists used as a feature for genre classification)
Semantic space retrieval
Cascaded classification (in computer vision)
Stacked generalization (in the classification literature)
Context modeling (in autotagging)
Cost-sensitive stacking (variant)
G. Tzanetakis 31 / 53
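A minimal stacking sketch under the same assumptions as before (scikit-learn, logistic regression): the first-stage affinities for all tags become the input features of the second stage, which can exploit tag co-occurrence to correct first-stage errors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_stacked(first_stage_affinities, Y):
    """first_stage_affinities: (n_tracks, n_tags) outputs of the first stage
    (ideally produced with cross-validation so the second stage sees honest
    scores); Y: (n_tracks, n_tags) binary ground truth."""
    second_stage = []
    for t in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(first_stage_affinities, Y[:, t])   # all tags' scores as features
        second_stage.append(clf)
    return second_stage

def stacked_affinities(second_stage, first_stage_affinities):
    return np.column_stack(
        [m.predict_proba(first_stage_affinities)[:, 1] for m in second_stage])
```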
Combining taggers/bag of systems G. Tzanetakis 32 / 53
Table of Contents I 1 Indexing music with tags 2 Tag acquisition 3 Autotagging 4 Evaluation 5 Ideas for future work G. Tzanetakis 33 / 53
Datasets
There are several datasets that have been used to train and evaluate auto-tagging. They differ in the amount of data they contain and the source of the ground-truth tag information:
MajorMiner
Magnatagatune
CAL500 (the most widely used one)
CAL10K
MediaEval
Reproducibility: a common dataset is not enough; ideally exact details about the cross-validation folding process and evaluation scripts should also be included.
G. Tzanetakis 34 / 53
Magnatagatune
26K sound clips from magnatune.com
Human annotation from the Tag-a-tune game
Audio features from the Echo Nest
230 artists
183 tags
G. Tzanetakis 35 / 53
CAL-10K Dataset
Number of tracks: 10866
Tags: 1053 (genre and acoustic tags)
Tags/Track: min = 2, max = 25, µ = 10.9, σ = 4.57, median = 11
Most used tags: major key tonality (4547), acoustic rhythm guitars (2296), a vocal-centric aesthetic (2163), extensive vamping (2130)
Least used tags: cocky lyrics (1), psychedelic rock influences (1), breathy vocal sound (1), well-articulated trombone solo (1), lead flute (1)
Tags collected using a survey
Available at: http://cosmal.ucsd.edu/cal/projects/annret/
G. Tzanetakis 36 / 53
Tagging evaluation metrics
The inputs to an autotagging evaluation metric are the predicted tags (a #tags by #tracks binary matrix) or tag affinities (a #tags by #tracks matrix of reals) and the associated ground truth (a binary matrix). The asymmetry between positives and negatives makes classification accuracy not a very good metric; retrieval metrics are better choices. If the output of the auto-tagging system is affinities, then many metrics require binarization. Common binarization variants: select the k top-scoring tags for each track, or threshold each tag's affinities so that the tag's positive rate matches its prior in the training set.
G. Tzanetakis 37 / 53
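A hedged sketch of the two binarization variants, assuming the affinity matrix A is oriented #tags by #tracks as described above; tie handling and rounding are arbitrary choices made for the sketch.

```python
import numpy as np

def top_k_per_track(A, k):
    """A: (n_tags, n_tracks) affinities. Keep the k highest-scoring tags per track."""
    B = np.zeros_like(A, dtype=int)
    top = np.argsort(-A, axis=0)[:k, :]            # top-k tag indices per track
    B[top, np.arange(A.shape[1])] = 1
    return B

def threshold_to_prior(A, tag_priors):
    """Binarize each tag so its positive rate matches its training-set prior."""
    B = np.zeros_like(A, dtype=int)
    n_tracks = A.shape[1]
    for t, p in enumerate(tag_priors):
        n_pos = int(round(p * n_tracks))
        if n_pos > 0:
            thresh = np.sort(A[t])[::-1][n_pos - 1]   # score of the n_pos-th track
            B[t] = (A[t] >= thresh).astype(int)
    return B
```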
Annotation vs retrieval
One possibility would be to convert the matrices into vectors and then use classification evaluation metrics. This approach has the disadvantage that popular tags will dominate, and performance on less-frequent tags (which one could argue are more important) will have little influence. Therefore the common approach is to treat each tag column separately and then average across tags (retrieval), or alternatively treat each track row separately and average across tracks (annotation). Validation schemes are similar to classification: cross-validation, repeated cross-validation, and bootstrapping.
G. Tzanetakis 38 / 53
Annotation Metrics
Based on counting TP, FP, TN, FN: precision, recall, F-measure.
G. Tzanetakis 39 / 53
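A small sketch computing these counting-based metrics per track and averaging across tracks (the annotation view from the previous slide); transposing the matrices gives the per-tag retrieval view. The matrix orientation is an assumption made for the sketch.

```python
import numpy as np

def annotation_prf(Y_true, Y_pred):
    """Y_true, Y_pred: (n_tracks, n_tags) binary matrices (orientation assumed)."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=1).astype(float)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=1).astype(float)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=1).astype(float)
    precision = tp / np.maximum(tp + fp, 1)     # avoid division by zero
    recall = tp / np.maximum(tp + fn, 1)
    f_measure = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision.mean(), recall.mean(), f_measure.mean()
```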
Annotation Metrics based on rank
When using affinities it is possible to use rank correlation metrics: Spearman's rank correlation coefficient ρ and Kendall's tau τ.
G. Tzanetakis 40 / 53
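A sketch using SciPy's spearmanr and kendalltau on made-up affinities and ground truth for a single track; in practice the correlations would be averaged over tracks.

```python
from scipy.stats import spearmanr, kendalltau

# Made-up affinities and ground-truth relevance for four tags of one track
predicted = [0.9, 0.1, 0.4, 0.7]
truth = [1, 0, 0, 1]

rho, _ = spearmanr(predicted, truth)
tau, _ = kendalltau(predicted, truth)
print(rho, tau)
```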
Retrieval measures - Mean Average Precision
Precision at N is the number of relevant songs among the top N retrieved, divided by N. Rather than choosing a single N, one can average the precision over different values of N and then take the mean over a set of queries (tags).
G. Tzanetakis 41 / 53
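A hedged sketch of average precision for a single tag query, using the standard definition that averages precision at the ranks where relevant songs appear; the mean of this quantity over a set of tag queries gives MAP. The example numbers are invented.

```python
import numpy as np

def average_precision(affinities, relevant):
    """affinities: scores for every song; relevant: binary relevance per song."""
    order = np.argsort(-np.asarray(affinities, dtype=float))
    rel = np.asarray(relevant)[order]              # relevance in ranked order
    hits = np.cumsum(rel)                          # relevant songs found so far
    ranks = np.arange(1, len(rel) + 1)
    if rel.sum() == 0:
        return 0.0
    return float(np.mean(hits[rel == 1] / ranks[rel == 1]))

print(average_precision([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))  # approx. 0.83
```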
Retrieval measures - AUC-ROC G. Tzanetakis 42 / 53
Stacking results I G. Tzanetakis 43 / 53
Stacking results II G. Tzanetakis 44 / 53
Stacking results III G. Tzanetakis 45 / 53
Stacking results IV G. Tzanetakis 46 / 53
Stacking results V G. Tzanetakis 47 / 53
MIREX Tag Annotation Task
The Music Information Retrieval Evaluation Exchange (MIREX) audio tag annotation task started in 2008.
MajorMiner dataset (2300 tracks, 45 tags)
Mood tag dataset (6490 tracks, 135 tags)
10-second clips
3-fold cross-validation
Binary relevance (F-measure, precision, recall)
Affinity ranking (AUC-ROC, Precision at 3, 6, 9, 12, 15)
G. Tzanetakis 48 / 53
MIREX 2012 F-measure G. Tzanetakis 49 / 53
MIREX 2012 AUC-ROC G. Tzanetakis 50 / 53
History of MIREX tagging G. Tzanetakis 51 / 53
Table of Contents I 1 Indexing music with tags 2 Tag acquisition 3 Autotagging 4 Evaluation 5 Ideas for future work G. Tzanetakis 52 / 53
Open questions
Should the tag annotations be sanitized, or should the machine learning part handle it?
Do auto-taggers generalize outside their collections?
Stacking seems to improve results (even though one paper has shown no improvement). How does stacking perform when dealing with synonyms, antonyms, noisy annotations? Why?
How can multiple sources of tags be combined?
G. Tzanetakis 53 / 53
Future work
Weak labeling: in most cases the absence of a tag does NOT imply that the tag would not be considered valid by most users.
Explore a continuous grading of semi-supervised learning where the distinction between supervised and unsupervised is not binary.
Explore feature clustering of untagged instances.
Include additional sources of information (separate from tags) such as artist, genre, or album; multiple-instance learning approaches (for example, if genre information is available at the album level).
Statistical relational learning.
G. Tzanetakis 54 / 53
Future work
The lukewarm start problem: what if some tags are known for the testing data, but not all?
Missing-label types of approaches such as EM.
Markov logic inference in structured data.
Other ideas:
Online learning, where tags enter the system incrementally and individually rather than all at the same time or all at once for a particular instance.
Taking into account user behavior when interacting with a tag system.
Personalization vs the crowd: would clustering users based on their tagging make sense?
G. Tzanetakis 55 / 53