jSymbolic Tutorial

jSymbolic Tutorial - Training a Classification Model and Using It

JOSQUIN ATTRIBUTION DATA

There is significant debate on the proper attribution of a number of pieces sometimes credited to Josquin
Jesse Rodin has separated Josquin's music into six levels of attribution security, based on historical research
- The Level 1 pieces are the most secure (i.e. are most likely to have truly been composed by Josquin), and attribution becomes less secure as the levels increase
- Pieces belonging to levels 1 and 2 can reasonably be expected to really be by Josquin; the other four levels are more ambiguous
In this experiment, we will use machine learning to help us get a rough overall idea of how reasonable (in general) the Rodin attribution levels are, based only on the musical content itself
- i.e. not taking any external historical knowledge into account
To do this, we will train a classification model to distinguish the music of the two most secure Josquin levels from the music of a variety of other composers who were roughly his contemporaries
We will then use this model to classify the music from the four least secure Josquin levels
- If the music belonging to the more secure of these four levels is classified as being by Josquin more often by this model than the music from the less secure levels, then this is a good sign that there is musical backing for Rodin's proposed attribution security levels
This is different from what we did in previous stages of this tutorial
- Previously, we did cross-validation on all the data at once, which means that Weka automatically broke the data into its own training and testing sets, for the purpose of seeing how well it was able to distinguish the music
- Here, we are going to manually separate the training data from the testing data (i.e. the music we are going to classify)
- This will allow us to see how each piece in the testing data is actually classified individually
IMPORTANT: the training data must never share any pieces with the testing data, otherwise results may be biased
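The rule above can be checked mechanically before any features are extracted. The following is a minimal Python sketch, assuming that a piece is identified by its file name alone; the function name shared_pieces and the example file names are hypothetical, and a piece stored under two different file names would not be detected.

```python
from pathlib import PurePath


def shared_pieces(training_paths, testing_paths):
    """Return the file names (lower-cased) that occur in both collections."""
    training_names = {PurePath(p).name.lower() for p in training_paths}
    testing_names = {PurePath(p).name.lower() for p in testing_paths}
    return sorted(training_names & testing_names)


# Hypothetical file names, for illustration only
training = ["JosquinNotJosquin_MIDI/PieceA.mid", "JosquinNotJosquin_MIDI/PieceB.mid"]
testing = ["MaybeJosquin_MIDI/PieceC.mid", "MaybeJosquin_MIDI/pieceb.mid"]

overlap = shared_pieces(training, testing)
print(overlap)  # any name listed here appears in both sets
```

An empty list means no file name is shared. To run the check on real folders, the two lists could be built from the contents of the training and testing MIDI directories.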

EXTRACTING AND PRE-PROCESSING THE FEATURES

We will be working in the 03_Josquin_Attribution directory, which you downloaded as part of this tutorial
- Using the approaches covered in the previous sections of this tutorial, use jSymbolic (and the "FeaturesThatAvoidBiasInRenMusictConfigs.txt" config file) to extract features from all the files in the "JosquinNotJosquin_MIDI" folder
  - Save the features as "JosquinNotJosquinFeatureValues.xml" (and the associated ARFF and CSV files)
  - These will be the features we will use to train the model
- Using jSymbolic (and the "FeaturesThatAvoidBiasInRenMusictConfigs.txt" config file), extract features from all the files in the "MaybeJosquin_MIDI" folder
  - Save the features as "MaybeJosquinFeatureValues.xml" (and the associated ARFF and CSV files)
  - These will be the features we will classify with the model after we train it
Use the spreadsheet editing approaches covered earlier in this tutorial to create versions of the two CSV files jSymbolic just generated
- Save a version of "JosquinNotJosquinFeatureValues.csv" called "JosquinNotJosquinFeatureValuesWekaReady.csv" where:
  - A column has been added to the end called "COMPOSER"
  - This column has been filled with entries of either "Josquin" or "NotJosquin" for each piece, as appropriate for the given piece
  - The first column (where file paths are listed) has been deleted
- Save a version of "MaybeJosquinFeatureValues.csv" called "MaybeJosquinFeatureValuesWekaReady.csv" where:
  - A column has been added to the end called "COMPOSER"
  - This column has been filled with entries of "?" for each and every piece
    - Since we do not actually know who composed these pieces (although it is of course suspected that many of them actually are by Josquin, especially those nearer the top, which have a greater security level)
  - The first column (where file paths are listed) has been deleted
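If you prefer a script to a spreadsheet editor, the same two edits (dropping the file path column and appending a COMPOSER column) can be sketched in Python. This is only an illustration under stated assumptions: the helper names make_weka_ready and convert_file are hypothetical, the feature names in the example are placeholders, and the rule that a training piece is by Josquin when its file name starts with "Josquin" is an assumption that must be replaced by however your files are actually labelled.

```python
import csv
from pathlib import PurePath


def make_weka_ready(rows, label_for_path):
    """Drop the first (file path) column and append a COMPOSER column.

    rows: list of CSV rows with the header row first.
    label_for_path: function that maps a file path to a class label.
    """
    header, *data = rows
    result = [header[1:] + ["COMPOSER"]]
    for row in data:
        result.append(row[1:] + [label_for_path(row[0])])
    return result


def convert_file(source_csv, target_csv, label_for_path):
    """Read source_csv, apply make_weka_ready, and write target_csv."""
    with open(source_csv, newline="") as src:
        rows = [row for row in csv.reader(src) if row]  # skip blank lines
    with open(target_csv, "w", newline="") as dst:
        csv.writer(dst).writerows(make_weka_ready(rows, label_for_path))


def training_label(path):
    # ASSUMPTION: file names of Josquin pieces start with "Josquin"
    return "Josquin" if PurePath(path).name.startswith("Josquin") else "NotJosquin"


def unknown_label(path):
    # Test pieces all get "?", since their composer is not known
    return "?"


# Placeholder rows, for illustration only
example = [
    ["File", "Feature_1", "Feature_2"],
    ["JosquinNotJosquin_MIDI/Josquin_PieceA.mid", "0.1", "0.2"],
    ["JosquinNotJosquin_MIDI/Other_PieceB.mid", "0.3", "0.4"],
]
training_rows = make_weka_ready(example, training_label)
testing_rows = make_weka_ready(example, unknown_label)
print(training_rows)
print(testing_rows)
```

With real data, convert_file("JosquinNotJosquinFeatureValues.csv", "JosquinNotJosquinFeatureValuesWekaReady.csv", training_label) would produce the training file, and the same call with unknown_label and the "MaybeJosquin" file names would produce the testing file.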
Using the Weka skills you learned in previous parts of this tutorial, start the Weka Explorer
Using Weka, we will save the "MaybeJosquinFeatureValuesWekaReady.csv" testing data CSV file we just edited as a Weka ARFF file
- This will differ from the ARFF file jSymbolic created directly, since we have since edited the CSV file for this set to include class labels under the header of "COMPOSER"
- In Weka, under the Preprocess tab, click on the Open File button, and open the "MaybeJosquinFeatureValuesWekaReady.csv" file
  - Then, still in the Weka Preprocess tab, press the Save button
    - Make sure the Files of Type dropdown menu says "Arff data files (*.arff)"
    - Save the file as "MaybeJosquinFeatureValuesComposerFieldAdded.arff"
We now need to perform a bit of a hack in order to make Weka willing to accept this file as test data
- In a text editor, open the "MaybeJosquinFeatureValuesComposerFieldAdded.arff" file we just saved
- In the text editor, find the line that says "@attribute COMPOSER string"
- Replace this line with a line that says "@attribute COMPOSER {NotJosquin,Josquin}"
  - If you were using other potential labels than "NotJosquin" and "Josquin", then you would have to include them here
  - These labels must match exactly the labels that are used in your training data
    - Even though you do not need to generate an ARFF file for your training data (you can just use the CSV file you already saved), it can be useful to save one anyway and copy its relevant "@attribute . . . " line for its candidate classes to your test pieces ARFF file
- This step is necessary for Weka to know what possible classes (composers, in this case) each of the pieces could have
- Save the file as "MaybeJosquinFeatureValuesComposerHeaderAdjusted.arff"
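The header change described above can also be made with a few lines of Python instead of a text editor. This is a minimal sketch, not part of jSymbolic or Weka: the function name set_class_labels is hypothetical, and the sample ARFF text uses a placeholder relation name and feature name. It rewrites only the declaration of the COMPOSER attribute and leaves everything after "@data" untouched.

```python
def set_class_labels(arff_text, attribute="COMPOSER", labels=("NotJosquin", "Josquin")):
    """Redeclare one ARFF attribute as nominal with the given candidate labels."""
    nominal = "@attribute %s {%s}" % (attribute, ",".join(labels))
    out = []
    in_header = True
    for line in arff_text.splitlines():
        parts = line.split()
        if (in_header and len(parts) >= 2
                and parts[0].lower() == "@attribute" and parts[1] == attribute):
            line = nominal  # replace e.g. "@attribute COMPOSER string"
        elif line.strip().lower() == "@data":
            in_header = False  # never touch the data rows
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder ARFF content, for illustration only
sample = (
    "@relation MaybeJosquin\n"
    "\n"
    "@attribute Feature_1 numeric\n"
    "@attribute COMPOSER string\n"
    "\n"
    "@data\n"
    "0.5,?\n"
)
adjusted = set_class_labels(sample)
print(adjusted)
```

The labels argument must list exactly the labels used in the training data, as noted above. For the real file, read "MaybeJosquinFeatureValuesComposerFieldAdded.arff" as text, pass it through the function, and write the result to "MaybeJosquinFeatureValuesComposerHeaderAdjusted.arff".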

TRAINING AND SAVING A CLASSIFICATION MODEL

We are now ready to train and save a classification model
In Weka's Preprocess tab, click the Open file button
- Open the "JosquinNotJosquinFeatureValuesWekaReady.csv" file
- Go to Weka's Classify tab
- In the Classifier area, press the Choose button
  - Select weka > classifiers > functions > SMO
We can do a quick cross-validation test (as we did in previous sections of this tutorial) to make sure that this data reasonably separates pieces by Josquin from pieces not by Josquin
- In the Test options area, make sure Cross-validation is selected
- Press the Start button
- You should get an average classification accuracy of around 95%
- If so, we are happy, since it shows that these features do a good job of separating pieces by Josquin from pieces not by Josquin
Now let's actually train and save a model
- In the Test options area, select Use training set
- Press the Start button
- You will see an even higher classification accuracy than before
  - This is meaningless, since we just classified the training data itself
- What is important is that we now have a model ready to classify other separate test data
Save the model you just trained
- In the Results list area, right-click on the last entry there and select Save model
- Save it as "JosquinNotJosquinTrainedModel.model"
- We do not have to save it to use it, but it is good to keep a record of it

USING OUR TRAINED MODEL TO CLASSIFY TEST PIECES

Now we are ready to classify our uncertain pieces using the classification model we just trained
In the Test options area, click on Supplied test set, and then click the Set button next to it
- Click the Open file button in the dialog box that appears
- In the file chooser dialog that appears, choose Arff data files (*.arff) from the Files of Type dropdown menu
- Select "MaybeJosquinFeatureValuesComposerHeaderAdjusted.arff"
- Press the Open button
- Press the Close button on the dialog box you are then returned to
In the Test options area, press the More options button
- Press the Choose button next to Output predictions and select Plain Text
- Press the OK button
In the Results list area, make sure that the last entry is selected
- If you wanted to load an already saved model, you could right-click on the white space of the Results list area and select Load model
- We do not need to open a saved model now, since we still have the one we just trained already loaded
Press the Start button
- This generates a big list of output data
- If you want, you can copy and paste it into a text editor, where it can be saved, maybe as "MaybeJosquinClassificationResults.txt"
In the big list of output data that was generated, scroll up to where it says "=== Predictions on test set ==="

This lists, for each piece, whether it was classified as being by Josquin or NotJosquin

The pieces are listed in the same order as in the "MaybeJosquinFeatureValues.csv" file jSymbolic originally generated
- You can cross-reference the two to identify which piece is which
  - This is necessary because Weka sadly does not keep track of external instance identifiers
- Note that there will be an offset of one row number between the number in the list Weka generates and the row number in the "MaybeJosquinFeatureValues.csv" file
  - This is because the "MaybeJosquinFeatureValues.csv" file has an extra row at the top specifying feature headings
  - You can temporarily delete this feature heading row if you want the two sets of row numbers to match
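The cross-referencing described above can be sketched in Python. This is an illustration under assumptions: the function names are hypothetical, the file names in the example are placeholders, and the layout of the prediction lines (instance number, actual class, then predicted class written as "index:label") is assumed from typical Weka plain-text output and may differ between Weka versions. The sample output below is invented purely to exercise the code.

```python
import re

# Instance number, the "actual" token, then "index:label" for the predicted class
PREDICTION_LINE = re.compile(r"^\s*(\d+)\s+\S+\s+\d+:(\S+)")


def parse_predictions(weka_output):
    """Return {instance number: predicted label} from the predictions section."""
    predictions = {}
    in_section = False
    for line in weka_output.splitlines():
        if line.startswith("==="):
            in_section = "Predictions on test set" in line
            continue
        if in_section:
            match = PREDICTION_LINE.match(line)
            if match:
                predictions[int(match.group(1))] = match.group(2)
    return predictions


def label_pieces(csv_rows, predictions):
    """Pair each file path (first CSV column) with its predicted label.

    csv_rows still contains the header row, so instance n is csv_rows[n];
    this is the one-row offset described above.
    """
    return [(csv_rows[n][0], label) for n, label in sorted(predictions.items())]


# Invented sample output and placeholder CSV rows, for illustration only
sample_output = """=== Predictions on test set ===

    inst#     actual  predicted error prediction
        1        1:?  2:Josquin       0.667
        2        1:?  1:NotJosquin    0.667

=== Summary ===
"""
sample_rows = [
    ["File", "Feature_1"],
    ["MaybeJosquin_MIDI/PieceA.mid", "0.1"],
    ["MaybeJosquin_MIDI/PieceB.mid", "0.2"],
]
labelled = label_pieces(sample_rows, parse_predictions(sample_output))
print(labelled)
```

With real data, sample_rows would be read from "MaybeJosquinFeatureValues.csv" (with its header row left in place) and sample_output would be the text copied from Weka's output panel.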
If you scroll down the list of classifications, you will see that fewer and fewer pieces are classified as being by Josquin as you go
- This makes sense: the pieces further down the list have less secure attribution levels (according to Jesse Rodin), so the probability of a given piece actually being by Josquin decreases as we go down the list
- This is great news for Jesse Rodin, as it means that the patterns in the music tend to correspond overall with the results of the historical research that led him to specify these particular security levels
  - Congratulations, Jesse!
This does not prove definitively which individual pieces really are by Josquin and which are not, however
- We would need more training data from more diverse composers who are not Josquin to do that
It does, however, demonstrate and confirm a musicologically meaningful general pattern
- And it shows how historical musicological research and statistical research using jSymbolic features and machine learning can mutually confirm or, less pleasingly but arguably even more importantly, challenge each other
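The downward trend described above can be made a little more concrete by computing, for each attribution level, the fraction of pieces that the model classified as Josquin. The following Python sketch assumes that you have recorded the Rodin level of each test piece yourself (Weka's output does not contain it); the function name is hypothetical and the example pairs are invented placeholders, not results from the tutorial data.

```python
from collections import defaultdict


def josquin_share_by_level(classified):
    """Return {level: fraction of pieces classified as Josquin}.

    classified: iterable of (attribution level, predicted label) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [Josquin count, total count]
    for level, label in classified:
        counts[level][1] += 1
        if label == "Josquin":
            counts[level][0] += 1
    return {level: hits / total for level, (hits, total) in sorted(counts.items())}


# Invented placeholder pairs, for illustration only (not real results)
example = [
    (3, "Josquin"), (3, "Josquin"),
    (4, "Josquin"), (4, "NotJosquin"),
    (6, "NotJosquin"),
]
shares = josquin_share_by_level(example)
print(shares)
```

If the fractions computed from the real classifications fall as the level number rises, that matches the pattern noted above; with only a handful of pieces per level, small differences should not be over-interpreted.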

Now we are ready to do some entirely original work to practice the fantastic skills we have learned with jSymbolic and Weka . . .
