jLyrics is a set of software tools for automatically mining lyrics from the web and extracting features from the lyrics once they have been acquired. Although jLyrics has not yet been as neatly packaged as the other jMIR components, due to its status as a new project, it is nonetheless fully functional and available for use. jLyrics currently consists of the following three components:
- jLyrics (Java component): The primary framework into which all of the jLyrics functionality will eventually be ported. It includes many of the standard jMIR feature extractor advantages, such as automatic feature extraction scheduling and resolution, and a highly extensible architecture. It does not yet include a GUI, however. Nineteen features are currently implemented directly in the jLyrics Java framework, and more will be added soon. This software also includes functionality for collecting lists of the most commonly occurring words in sets of lyrics, which can be a useful tool for developing new features.
- jLyrics (Python component): A set of additional features that have been prototyped using Python and various existing third-party software libraries. The types of features extracted include readability statistics, part-of-speech statistics, topic models, and letter bigrams.
- lyricFetcher: Ruby scripts for mining lyrics from the Internet. One script mines the lyrics from LyricWiki, and the other mines them from LyricsFly (although a key must be acquired for the latter). Lyrics are mined based on queries consisting of artist names and song names, and lyrics are pre-processed upon retrieval. For example, many lyrics are abridged by providing a label for the first occurrence of a section (e.g., “chorus,” “hook,” “refrain,” etc.) and repeating only this label when the section reoccurs. lyricFetcher automatically searches for and expands such sections.
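The section expansion performed by lyricFetcher can be illustrated with a minimal sketch. lyricFetcher itself is written in Ruby; the Python below is not its actual code. The function name `expand_sections`, the label pattern, and the assumption that segments are separated by blank lines are all hypothetical choices made for illustration.

```python
import re

# Hypothetical label pattern: a line such as "[Chorus]", "Chorus:" or "(hook)".
LABEL = re.compile(r"^\W*(chorus|hook|refrain)\W*$", re.IGNORECASE)


def expand_sections(lyrics: str) -> str:
    """Replace bare section labels with the text of the labelled section.

    Assumes segments are separated by blank lines and that a labelled
    section starts with a line containing only the label.
    """
    segments = re.split(r"\n\s*\n", lyrics.strip())
    known = {}      # label -> body lines from its first labelled occurrence
    expanded = []
    for segment in segments:
        lines = segment.splitlines()
        match = LABEL.match(lines[0]) if lines else None
        if match is None:
            expanded.append(segment)          # ordinary segment: keep as-is
            continue
        label = match.group(1).lower()
        body = lines[1:]
        if body:
            known[label] = body               # labelled section with text: remember it
            expanded.append("\n".join(body))  # and drop the label line
        elif label in known:
            expanded.append("\n".join(known[label]))  # bare label: expand
        else:
            expanded.append(segment)          # bare label never defined: keep
    return "\n\n".join(expanded)
```

In this sketch the label line is dropped from the output so that it is not counted as a lyric line during feature extraction; whether lyricFetcher does the same is not stated above.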
Both one-dimensional and multi-dimensional features can be extracted with jLyrics. Mined lyrics are saved as simple text files, and extracted features may be saved as ACE XML files.
The following is a list of all of the features extracted by both the Java and Python parts of the jLyrics software:
- Automated Readability Index: This index is designed to provide a measure of the understandability of a text, and provides a result that is roughly equivalent to a U.S. grade level. More information is available at en.wikipedia.org/wiki/Automated_Readability_Index.
- Average Syllable Count per Word: Average number of syllables per word (using the Flesh algorithm).
- Contains Words: Whether or not there is at least one word consisting of non-whitespace characters in a text. Set to 1.0 if so and to 0.0 if not.
- Flesch-Kincaid Grade Level: The Flesch-Kincaid grade level, a standard readability statistic. See http://flesh.sourceforge.net.
- Flesch Reading Ease: The Flesch reading ease, an alternative readability statistic. See http://flesh.sourceforge.net.
- Function Word Frequencies: The relative frequencies (i.e., the number of instances of each word divided by the total word count) of a list of “function words” that are commonly used for style analysis:
- Letter-Bigram Components: The first 20 principal components of the relative frequencies of all letter bigrams. The full vectors of relative frequencies of letter bigrams (all 2-permutations, for 729 values) for all pieces in the training set were used to compute the principal components.
- Letter Frequencies: The relative frequencies of each of the 26 letters in the Western alphabet.
- Letters Per Word Average: The average number of characters per word. Punctuation and numerical characters are included in the count of characters for a given word. Whitespace characters are not included in calculations.
- Letters Per Word Variance: The variance of the number of characters per word. Punctuation and numerical characters are included in the count of characters for a given word. Whitespace characters are not included in calculations.
- Lines Per Segment Average: The average number of lines per segment (e.g. verses, choruses, etc.) in the text. Segments are assumed to be separated by blank lines, and lines by line breaks. Lines consisting only of line breaks are not counted.
- Lines Per Segment Variance: The variance of the number of lines per segment (e.g. verses, choruses, etc.) in the text. Segments are assumed to be separated by blank lines, and lines by line breaks. Lines consisting only of line breaks are not counted.
- Number of Lines: The total number of lines in the text. This count does not include lines consisting only of line breaks.
- Number of Segments: The total number of segments (e.g. verses, choruses, etc.) in the text. Segments are assumed to be separated by blank lines, and lines by line breaks. Lines consisting only of line breaks are not counted.
- Number of Words: The total number of words in the text. This count is not of unique words, so words that occur more than once are counted more than once. Whitespace characters are not counted as words.
- Part-of-Speech Frequencies: Relative frequencies of the following 20 parts of speech, extracted using the Stanford part-of-speech tagger. See Wei, B., C. Zhang, and M. Ogihara. 2007. Keyword generation for lyrics. Proceedings of the International Conference on Music Information Retrieval. 121–2.
- coordinating conjunction (CC)
- cardinal number (CD)
- determiner (DT)
- existential “there” (EX)
- foreign word (FW)
- preposition or subordinating conjunction (IN)
- adjective (JJ, JJR, and JJS)
- list item marker (LS)
- modal (MD)
- noun (NN, NNS, NNP, NNPS)
- predeterminer (PDT)
- possessive ending (POS)
- personal pronoun (PRP, PRP$)
- adverb (RB, RBR, RBS)
- particle (RP)
- mathematical or scientific symbol (SYM)
- “to” (TO)
- interjection (UH)
- verb (VB, VBD, VBG, VBN, VBP, VBZ)
- “wh”-determiner (WDT, WP, WP$, WRB)
- Punctuation Frequencies: The relative frequencies of the 32 punctuation marks available in ASCII.
- Rate of Misspelling: The proportion of words that are misspelled according to the English dictionary of GNU Aspell.
- Sentence Count: Number of sentences.
- Sentence Length Average: Average word count per sentence.
- Topic Membership Probabilities: The posterior probability of membership in each of a set number of topics fit to the training set, found using latent Dirichlet allocation. Computed using the R package topicmodels. See Li, T., and M. Ogihara. 2004. Semi-supervised learning from different information sources. Knowledge and Information Systems 7 (3): 289–309.
- Vocabulary Richness: Vocabulary size divided by word count.
- Vocabulary Size: The total number of unique words in the text. This means that words that occur more than once are not counted twice. Whitespace characters are not counted as words.
- Word Profile Match: The percentage of words in the text that are found in a list of keywords that is known or assumed to be relevant to the task at hand (based on training data). This percentage is based on all words present in the text, not just the unique words. Case is ignored.
- Words Per Line Average: The average number of words per line. Blank lines are not included in this calculation.
- Words Per Line Variance: The variance of the number of words per line. Blank lines are not included in this calculation.
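Several of the structural features listed above follow directly from their definitions. The following Python sketch shows one way they might be computed; it is not the jLyrics implementation. The function name `structural_features`, the use of population variance, whitespace-based word splitting, and case-insensitive vocabulary counting are all assumptions made for illustration.

```python
import re
from statistics import mean, pvariance


def _mean(values):
    """Mean that returns 0.0 for an empty list instead of raising."""
    return mean(values) if values else 0.0


def _variance(values):
    """Population variance (an assumption), 0.0 for an empty list."""
    return pvariance(values) if values else 0.0


def structural_features(lyrics: str) -> dict:
    # Segments are separated by blank lines; lines consisting only of
    # line breaks (or other whitespace) are not counted as lines.
    blocks = re.split(r"\n\s*\n", lyrics.strip())
    segments = [[ln for ln in block.splitlines() if ln.strip()] for block in blocks]
    segments = [seg for seg in segments if seg]
    lines = [ln for seg in segments for ln in seg]

    # Words are whitespace-delimited tokens, so punctuation and digits
    # count towards the length of the word they are attached to.
    words = [word for ln in lines for word in ln.split()]
    vocabulary = {word.lower() for word in words}  # assumption: ignore case

    lines_per_segment = [len(seg) for seg in segments]
    words_per_line = [len(ln.split()) for ln in lines]
    letters_per_word = [len(word) for word in words]

    return {
        "Number of Segments": len(segments),
        "Number of Lines": len(lines),
        "Number of Words": len(words),
        "Lines Per Segment Average": _mean(lines_per_segment),
        "Lines Per Segment Variance": _variance(lines_per_segment),
        "Words Per Line Average": _mean(words_per_line),
        "Words Per Line Variance": _variance(words_per_line),
        "Letters Per Word Average": _mean(letters_per_word),
        "Letters Per Word Variance": _variance(letters_per_word),
        "Vocabulary Size": len(vocabulary),
        "Vocabulary Richness": len(vocabulary) / len(words) if words else 0.0,
    }
```

Calling `structural_features` on the contents of a mined lyrics text file returns a dictionary keyed by the feature names used above. If sample variance is preferred, `pvariance` can be swapped for `statistics.variance`, which requires at least two values.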
McKay, C., J. A. Burgoyne, J. Hockman, J. B. L. Smith, G. Vigliensoni, and I. Fujinaga. 2010. Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features. Proceedings of the International Society for Music Information Retrieval Conference. 213–8.
McKay, C. 2010. Automatic music classification with jMIR. Ph.D. Thesis. McGill University, Canada.