Codaich, Bodhidharma MIDI and SLAC

Overview of Codaich

Codaich (Gaelic for “share”) is a large and diverse collection of MP3s, designed as a working prototype that demonstrates how the training and testing data needs of music information retrieval (MIR) research can be addressed. Its design follows fifteen specific guidelines devised to meet the requirements of MIR research, as outlined in this paper presented at the 2006 International Conference on Music Information Retrieval (ISMIR).

The original release version of Codaich consists of 26,420 carefully labeled MP3 encodings of music, although the current working version is much larger. Efforts were made to achieve as stylistically diverse a collection as possible, and this collection includes music from 55 different musical genres, which are distributed among the coarse categories of popular, world, classical and jazz. The details of the database and the metadata of its recordings can be accessed via iTunes XML, ACE XML, Weka ARFF or jMusicMetaManager HTML profiling files.

The 19 metadata fields that have been annotated in Codaich were originally extracted from the Gracenote CD Database and pre-existing ID3 tags. These were then cleaned using jMusicMetaManager, and final manual corrections were made when necessary, in consultation with the All Music Guide and other references. A special emphasis was placed on the correctness of the Title, Artist, Composer, Album and Genre fields.

Codaich is profiled and managed with jMusicMetaManager. There are plans to add support for other types of music files to jMusicMetaManager, including other audio formats and symbolic files such as those in the Bodhidharma MIDI database.

Codaich is ultimately intended to be integrated with Daniel McEnnis' On-demand Metadata Extraction Network (OMEN). The idea behind OMEN is to store sets of musical recordings privately on individual servers that hold the legal rights to their respective recordings. A coordinating server keeps track of which recordings are available at which sites. Researchers send customized feature extraction requests to this coordinating server, which then contacts each relevant server that stores music. Features are extracted automatically and locally at each site using jAudio. The extracted features, rather than the actual audio samples, are then automatically distributed to researchers in order to avoid copyright violations.
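The request flow described above can be sketched as a small in-memory simulation. This is only an illustration of the idea, not OMEN's actual design or API: the `Site` and `Coordinator` classes, the `request` method and the `mean_abs_amplitude` feature are all hypothetical names, and a real deployment would use networked servers and jAudio for extraction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from OMEN or jAudio.
Audio = List[float]
FeatureExtractor = Callable[[Audio], float]


@dataclass
class Site:
    """A server that stores recordings it has the legal right to hold."""
    name: str
    recordings: Dict[str, Audio]  # recording ID -> locally stored samples

    def extract(self, recording_id: str, extractor: FeatureExtractor) -> float:
        # The audio is read here and never leaves the site; only the feature value is returned.
        return extractor(self.recordings[recording_id])


@dataclass
class Coordinator:
    """Keeps track of which recordings are available at which site."""
    sites: List[Site] = field(default_factory=list)

    def request(self, recording_ids: List[str], extractor: FeatureExtractor) -> Dict[str, float]:
        # Build the recording -> site index, then forward the request to each relevant site.
        index = {rid: site for site in self.sites for rid in site.recordings}
        return {rid: index[rid].extract(rid, extractor) for rid in recording_ids if rid in index}


def mean_abs_amplitude(samples: Audio) -> float:
    """A stand-in feature (assumption); real features would come from jAudio."""
    return sum(abs(s) for s in samples) / len(samples)


site_a = Site("site_a", {"rec1": [0.5, -0.5, 1.0, -1.0]})
site_b = Site("site_b", {"rec2": [0.0, 0.25, -0.25, 0.0]})
coordinator = Coordinator([site_a, site_b])

# "rec3" is held by no site, so it is simply absent from the result.
features = coordinator.request(["rec1", "rec2", "rec3"], mean_abs_amplitude)
```

The researcher only ever receives the `features` dictionary; the sample lists stay inside each `Site` object, which mirrors the copyright motivation in the text.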

Overview of the Bodhidharma MIDI Dataset

The Bodhidharma MIDI dataset is a set of 950 MIDI recordings belonging to 38 different musical genres. It was originally assembled as part of this research on automatic genre classification, but can be adapted for other types of MIR research.

Overview of the SLAC Dataset

The SLAC (Symbolic Lyrical Audio Cultural) dataset is an expansion of the SAC Dataset that now includes lyrics. The specific purpose of this dataset is to facilitate experiments comparing the relative performance of features extracted from different types of musical data. SLAC consists of 250 MP3 recordings, 250 matching MIDI recordings, 250 matching sets of lyrics and identifying metadata for each recording. This metadata is stored in an iTunes XML file that can be parsed by software such as jWebMiner in order to extract cultural features from the web. The MIDI files were acquired separately from the MP3 files, such that neither type of file was generated from the other. The lyrics for each piece are stored in a text file.
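As a rough sketch of how such a metadata file might be read, the following uses Python's standard `plistlib` module. It assumes the SLAC file follows the usual iTunes library layout (a property list with a top-level `Tracks` dictionary whose entries carry keys such as `Name`, `Artist` and `Genre`); the sample data below is invented for illustration and is not taken from SLAC.

```python
import plistlib
from collections import Counter

# Invented two-track sample in the standard iTunes library layout (assumption).
SAMPLE_LIBRARY = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>1</key>
    <dict>
      <key>Name</key><string>Example Piece A</string>
      <key>Artist</key><string>Example Artist</string>
      <key>Genre</key><string>Bebop</string>
    </dict>
    <key>2</key>
    <dict>
      <key>Name</key><string>Example Piece B</string>
      <key>Artist</key><string>Example Artist</string>
      <key>Genre</key><string>Swing</string>
    </dict>
  </dict>
</dict>
</plist>
"""


def load_tracks(itunes_xml: bytes) -> list:
    """Return the per-track metadata dictionaries from an iTunes XML library."""
    library = plistlib.loads(itunes_xml)
    return list(library.get("Tracks", {}).values())


def genre_counts(tracks: list) -> Counter:
    """Count tracks per genre, using "Unknown" when the Genre key is missing."""
    return Counter(track.get("Genre", "Unknown") for track in tracks)


tracks = load_tracks(SAMPLE_LIBRARY)
counts = genre_counts(tracks)
```

For a file on disk, the same function can be fed with the file's bytes, for example `load_tracks(open(path, "rb").read())`, where `path` is wherever the metadata file has been saved.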

SLAC is divided into 10 genres, with 25 pieces of music per genre. These 10 genres consist of 5 pairs of relatively similar genres (Modern Blues and Traditional Blues; Baroque and Romantic; Bebop and Swing; Hardcore Rap and Pop Rap; and Alternative Rock and Metal). This arrangement makes it possible to perform 10-class genre classification experiments as well as 5-class experiments, the latter simply by combining each pair of related genres into one class. It thus provides an indication of how well systems perform on both small and moderately sized genre taxonomies.
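The pairing can be expressed as a simple label mapping. In the sketch below the ten fine genre names come from the text, while the five coarse class names (`Blues`, `Classical`, `Jazz`, `Rap`, `Rock`) and the function name `to_five_class` are assumptions chosen for illustration.

```python
from collections import Counter

# Coarse class names are assumed labels; the genre pairs are those listed in the text.
GENRE_PAIRS = {
    "Blues": ("Modern Blues", "Traditional Blues"),
    "Classical": ("Baroque", "Romantic"),
    "Jazz": ("Bebop", "Swing"),
    "Rap": ("Hardcore Rap", "Pop Rap"),
    "Rock": ("Alternative Rock", "Metal"),
}

# Invert the pairs into a lookup from each of the 10 genres to its 5-class label.
FINE_TO_COARSE = {fine: coarse for coarse, pair in GENRE_PAIRS.items() for fine in pair}


def to_five_class(labels):
    """Map 10-class genre labels onto the 5-class taxonomy."""
    return [FINE_TO_COARSE[label] for label in labels]


# 25 pieces per genre, as in SLAC: 10 genres x 25 = 250 labels.
ten_class_labels = [genre for genre in FINE_TO_COARSE for _ in range(25)]
five_class_labels = to_five_class(ten_class_labels)
five_class_sizes = Counter(five_class_labels)  # 50 pieces in each merged class
```

Because the mapping only relabels instances, the same extracted features can be reused unchanged for both the 10-class and the 5-class experiments.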

SLAC is designed to be a particularly difficult dataset to classify. The similarity of the two genres in each pair makes 10-genre classification especially challenging, and the instances in each genre were specifically chosen to span a diverse range of sub-genres. Different versions of some pieces in different genres are included, as well as different pieces by the same artist in different genres, in order to help test classification performance realistically.

Related Publications

McKay, C. 2010. Automatic music classification with jMIR. Ph.D. Thesis. McGill University, Canada.

McKay, C., J. A. Burgoyne, J. Hockman, J. B. L. Smith, G. Vigliensoni, and I. Fujinaga. 2010. Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features. Proceedings of the International Society for Music Information Retrieval Conference. 213–8.

McKay, C., and I. Fujinaga. 2010. Improving automatic music classification performance by extracting features from different types of data. Proceedings of the ACM SIGMM International Conference on Multimedia Information Retrieval. 257–66.

McKay, C., and I. Fujinaga. 2009. jMIR: Tools for automatic music classification. Proceedings of the International Computer Music Conference. 65–8.

McKay, C., and I. Fujinaga. 2008. Combining features extracted from audio, symbolic and cultural sources. Proceedings of the International Conference on Music Information Retrieval. 597–602.

McEnnis, D., C. McKay, and I. Fujinaga. 2006. Overview of OMEN. Proceedings of the International Conference on Music Information Retrieval. 7–12.

McKay, C., D. McEnnis, and I. Fujinaga. 2006. A large publicly accessible prototype audio database for music research. Proceedings of the International Conference on Music Information Retrieval. 160–3.

McKay, C. 2004. Automatic genre classification of MIDI recordings. M.A. Thesis. McGill University, Canada.

McKay, C., and I. Fujinaga. 2004. Automatic genre classification using large high-level musical feature sets. Proceedings of the International Conference on Music Information Retrieval. 525–30.

Questions and Comments

Codaich, Bodhidharma MIDI and SLAC: Cory McKay at
OMEN: Daniel McEnnis at

