jSymbolic
Tutorial - Training a Classification Model and Using It
JOSQUIN ATTRIBUTION DATA
- There is significant debate on the proper attribution of a number of pieces
sometimes credited to Josquin
- Jesse Rodin has separated Josquin's music into six levels of attribution
security, based on historical research
- The Level 1 pieces are the most secure (i.e. are most likely to have
truly been composed by Josquin), and attribution becomes less secure as
the levels increase
- Pieces belonging to levels 1 and 2 can be reasonably expected to really
be by Josquin; the other four levels are more ambiguous
- In this experiment, we will use machine learning to help us get a rough
overall idea of how reasonable (in general) the Rodin attribution levels are,
based only on the musical content itself
- i.e. not taking any external historical knowledge into account
- To do this, we will train a classification model to distinguish the music
of the two most secure Josquin levels from the music of a variety of
other composers that were roughly his contemporaries
- We will then use this model to classify the music from the four least secure
Josquin levels
- If the music belonging to the more secure of these four levels is classified
as being by Josquin more often by this model than the music from the less
secure levels, then this is a good sign that there is musical backing
for Rodin's proposed attribution security levels
- This is different from what we did in previous stages of this tutorial
- Previously, we did cross-validation on all the data at once, which means
that Weka automatically broke the data into its own training and testing
sets, for the purpose of seeing how well it was able to distinguish the
music
- Here, we are going to manually separate the training data from the testing
data (i.e. the music we are going to classify)
- This will allow us to see how each piece in the testing data is actually
classified individually
- IMPORTANT: the training data must never share any pieces with the
testing data, otherwise results may be biased
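A quick way to guard against accidental overlap is to compare the file names in the two folders before extracting features. The following is a minimal sketch, not part of jSymbolic or Weka: the helper name `shared_pieces` is hypothetical, and it assumes that a piece appearing in both sets would carry the same file name in both. The same piece saved under two different names would not be caught.

```python
from pathlib import Path


def shared_pieces(train_files, test_files):
    """Return the sorted, lower-cased file names found in both collections.

    Hypothetical helper: it compares file names only, so it assumes a
    duplicated piece has the same file name in both sets.
    """
    train_names = {Path(f).name.lower() for f in train_files}
    test_names = {Path(f).name.lower() for f in test_files}
    return sorted(train_names & test_names)


if __name__ == "__main__":
    # Folder names as used in this tutorial; adjust the paths as needed.
    train_dir = Path("JosquinNotJosquin_MIDI")
    test_dir = Path("MaybeJosquin_MIDI")
    if train_dir.is_dir() and test_dir.is_dir():
        overlap = shared_pieces(train_dir.iterdir(), test_dir.iterdir())
        print("Shared pieces:", overlap if overlap else "none")
```

An empty result is what we want here; any name that is listed should be removed from one of the two folders before continuing.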
EXTRACTING AND PRE-PROCESSING THE FEATURES
- We will be working in the 03_Josquin_Attribution directory, which you downloaded
as part of this tutorial
- Using the approaches covered in the previous sections of this tutorial,
use jSymbolic (and the "FeaturesThatAvoidBiasInRenMusictConfigs.txt"
config file), to extract features from all the files in the "JosquinNotJosquin_MIDI"
folder
- Save the features as "JosquinNotJosquinFeatureValues.xml"
(and the associated ARFF and CSV files)
- These will be the features we will use to train the model
- Using jSymbolic (and the "FeaturesThatAvoidBiasInRenMusictConfigs.txt"
config file), extract features from all the files in the "MaybeJosquin_MIDI"
folder
- Save the features as "MaybeJosquinFeatureValues.xml" (and
the associated ARFF and CSV files)
- These will be the features we will classify with the model after
we train it
- Use the spreadsheet editing approaches covered earlier in this tutorial
to create versions of the two CSV files jSymbolic just generated
- Save a version of "JosquinNotJosquinFeatureValues.csv" called
"JosquinNotJosquinFeatureValuesWekaReady.csv" where:
- A column has been added to the end called "COMPOSER"
- This column has been filled with entries of either "Josquin"
or "NotJosquin" for each piece, as appropriate for the given
piece
- The first column (where file paths are listed) has been deleted
- Save a version of "MaybeJosquinFeatureValues.csv" called "MaybeJosquinFeatureValuesWekaReady.csv"
where:
- A column has been added to the end called "COMPOSER"
- This column has been filled with entries of "?" for each
and every piece
- We use "?" because we do not actually know who composed these pieces (although
it is of course suspected that many of them actually are by Josquin,
especially those nearer the top, which have a greater security
level)
- The first column (where file paths are listed) has been deleted
- Using the Weka skills you learned in previous parts of this tutorial, start
the Weka Explorer
- Using Weka, we will save the "MaybeJosquinFeatureValuesWekaReady.csv" testing data CSV file we just edited as a Weka
ARFF file
- This will differ from the ARFF file jSymbolic created directly, since
we have since edited the CSV file for this set to include class labels
under the header of "COMPOSER"
- In Weka, under the Preprocess tab, click on the Open File
button, and open the "MaybeJosquinFeatureValuesWekaReady.csv"
file
- Then, still in the Weka Preprocess tab, press the Save
button
- Make sure the Files of Type dropdown menu says "Arff
data files (*.arff)"
- Save the file as "MaybeJosquinFeatureValuesComposerFieldAdded.arff"
- We now need to perform a bit of a hack in order to get Weka to
accept this file
- In a text editor, open the "MaybeJosquinFeatureValuesComposerFieldAdded.arff"
file we just saved
- In the text editor, find the line that says "@attribute COMPOSER
string"
- Replace this line with a line that says "@attribute COMPOSER {NotJosquin,Josquin}"
- If you were using other potential labels than "NotJosquin"
and "Josquin", then you would have to include them here
- These labels must match exactly the labels that are used
in your training data
- Even though you do not need to generate an ARFF file for your
training data (you can just use the CSV file you already saved),
it can be useful to save one anyway and copy its relevant "@attribute
. . . " line for its candidate classes to your test pieces
ARFF file
- This step is necessary for Weka to know what possible classes (composers,
in this case) each of the pieces could have
- Normally Weka would know implicitly from the composer labels in the file
if they were included, but we marked these ones with "?" earlier
because the truth is unknown
- Save the file as "MaybeJosquinFeatureValuesComposerHeaderAdjusted.arff"
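The two manual editing steps in this section (the spreadsheet edits and the ARFF header hack) can also be scripted. The following is a minimal sketch using only the Python standard library, offered as an alternative to the manual steps rather than as how jSymbolic or Weka does it. The function names `make_weka_ready` and `fix_composer_attribute` are hypothetical. The sketch assumes the first CSV column holds the file paths and that the ARFF file Weka saved contains the line "@attribute COMPOSER string", as described above.

```python
import csv
from pathlib import Path


def make_weka_ready(rows, label_for):
    """Drop the first (file path) column and append a COMPOSER column.

    rows: list of CSV rows, the first being the header row.
    label_for: function mapping a file path to a label. This is an
    assumption of the sketch; for the test pieces it always returns "?".
    """
    header, *data = rows
    ready = [header[1:] + ["COMPOSER"]]
    for row in data:
        ready.append(row[1:] + [label_for(row[0])])
    return ready


def fix_composer_attribute(arff_text, labels=("NotJosquin", "Josquin")):
    """Replace the string-typed COMPOSER attribute with a nominal one."""
    nominal = "@attribute COMPOSER {" + ",".join(labels) + "}"
    fixed = []
    for line in arff_text.splitlines():
        tokens = [token.lower() for token in line.split()]
        if tokens == ["@attribute", "composer", "string"]:
            line = nominal
        fixed.append(line)
    return "\n".join(fixed) + "\n"


if __name__ == "__main__":
    # File names as used in this tutorial; each step runs only if its input exists.
    source_csv = Path("MaybeJosquinFeatureValues.csv")
    if source_csv.is_file():
        with open(source_csv, newline="") as f:
            rows = list(csv.reader(f))
        with open("MaybeJosquinFeatureValuesWekaReady.csv", "w", newline="") as f:
            csv.writer(f).writerows(make_weka_ready(rows, lambda path: "?"))

    source_arff = Path("MaybeJosquinFeatureValuesComposerFieldAdded.arff")
    if source_arff.is_file():
        fixed = fix_composer_attribute(source_arff.read_text())
        Path("MaybeJosquinFeatureValuesComposerHeaderAdjusted.arff").write_text(fixed)
```

For the training CSV you would pass your own `label_for` function returning "Josquin" or "NotJosquin" for each piece. How the composer is determined from a file path depends on how your files are named, so no rule is assumed here.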
TRAINING AND SAVING A CLASSIFICATION MODEL
- We are now ready to train and save a classification model
- In Weka's Preprocess tab, click the Open file button
- Open the "JosquinNotJosquinFeatureValuesWekaReady.csv" file
- Go to Weka's Classify tab
- In the Classifier area, press the Choose button
- Select weka > classifiers > functions > SMO
- We can do a quick cross-validation test (as we did in previous sections
of this tutorial) to make sure that this data reasonably separates pieces
by Josquin from pieces not by Josquin
- In the Test options area, make sure Cross-validation
is selected
- Press the Start button
- You should get an average classification accuracy of around 95%
- If so, we are happy, since this data does a good job at doing what we
want it to
- Now let's actually train and save a model
- In the Test options area, select Use training set
- Press the Start button
- You will see an even higher classification accuracy than before
- This is meaningless, since we just classified the training data
itself
- What is important is that we now have a model ready to classify other
separate test data
- Save the model you just trained
- In the Results list area, right-click on the last entry
there and select Save model
- Save it as "JosquinNotJosquinTrainedModel.model"
- We do not have to save it to use it, but it is good to keep a record
of it
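To see why the cross-validation accuracy is informative while the "Use training set" accuracy is not, it can help to look at how cross-validation divides the data. The following is a minimal illustrative sketch and not Weka's actual implementation (Weka also shuffles the instances and stratifies the folds by class, which this sketch does not do). The function name `k_fold_indices` is hypothetical; the default of 10 folds matches the default in the Weka Explorer.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) pairs, one pair per fold.

    Illustrative only: instances are dealt into folds round-robin,
    with no shuffling or stratification.
    """
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        yield train, held_out


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        # In every fold, the test instances are excluded from training.
        print(len(train), "training instances;", "held out:", test)
```

In every fold, the held-out instances play no part in training, so the averaged accuracy estimates performance on unseen music. With "Use training set", every instance is both trained on and tested on, which is why that accuracy is higher and tells us nothing about unseen pieces.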
USING OUR TRAINED MODEL TO CLASSIFY TEST PIECES
- Now we are ready to classify our uncertain pieces using the classification
model we just trained
- In the Test options area, click on Supplied test set, and
then click the Set button next to it
- Click the Open file button in the dialog box that appears
- In the file chooser dialog that appears, choose Arff data files
(*.arff) from the Files of Type dropdown menu
- Select "MaybeJosquinFeatureValuesComposerHeaderAdjusted.arff"
- Press the Open button
- Press the Close button on the dialog box you are then
returned to
- In the Test options area, press the More options button
- Press the Choose button and select Plain Text
- Press the OK button
- In the Results list area, make sure that the last entry
is selected
- If you wanted to load an already saved model, you could right-click
on the white space of the Results list area and select Load
model
- We do not need to open a saved model now, since we still have the one
we just trained already loaded
- Press the Start button
- This generates a big list of output data
- If you want, you can copy and paste it into a text editor, where it
can be saved, maybe as "MaybeJosquinClassificationResults.txt"
- In the big list of output data that was generated, scroll up to where it
says "=== Predictions on test set ==="
- This lists, for each piece, whether it was classified as being by Josquin
or NotJosquin
- The pieces are listed in the same order as in the "MaybeJosquinFeatureValues.csv"
file jSymbolic originally generated
- You can cross-reference the two to identify which piece is which
- This is necessary because Weka sadly does not keep track of external
instance identifiers
- Note that there will be an offset of one row number between the number
in the list Weka generates and the row number in the "MaybeJosquinFeatureValues.csv"
file
- This is because the "MaybeJosquinFeatureValues.csv" file
has an extra row at the top specifying feature headings
- You can temporarily delete this feature heading row if you want
the two sets of row numbers to match
- If you scroll down the list of classifications, you will see that, as we
go, fewer and fewer pieces are classified as being by Josquin
- This makes sense, because pieces further down the list have less
secure attribution levels (according to Jesse Rodin), so the probability
of a given piece actually being by Josquin decreases as we go down
- This is great news for Jesse Rodin, as it means that the patterns in
the music tend to correspond overall with the results of the historical
research that led him to specify these particular security levels
- This does not prove definitively which individual pieces really are by Josquin
and which are not, however
- We would need more training data from more diverse composers who are
not Josquin to do that
- It does, however, demonstrate and confirm a musicologically meaningful general
pattern
- And it shows how historical musicological research and statistical research
using jSymbolic features and machine learning can mutually confirm or,
less pleasingly but arguably even more importantly, challenge each other
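The cross-referencing described above (matching each Weka prediction to its piece, allowing for the one-row offset) can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the parser assumes the plain-text prediction lines have the form "inst#  actual  predicted  ..." with entries such as "1:?" and "2:Josquin". Check this against your own saved output, since the exact layout may differ between Weka versions.

```python
import re

# Assumed line layout: instance number, actual class, then "index:label".
PREDICTION_LINE = re.compile(r"^\s*(\d+)\s+\S+\s+\d+:(\S+)")


def parse_predictions(weka_output):
    """Return (instance_number, predicted_label) pairs from Weka's output text."""
    predictions = []
    in_section = False
    for line in weka_output.splitlines():
        if line.startswith("=== Predictions on test set ==="):
            in_section = True
            continue
        if in_section and line.startswith("==="):
            break  # the next section of the output has started
        if in_section:
            match = PREDICTION_LINE.match(line)
            if match:
                predictions.append((int(match.group(1)), match.group(2)))
    return predictions


def label_pieces(predictions, csv_rows):
    """Pair each prediction with the file path in the original CSV.

    csv_rows includes the header row at index 0, so Weka's instance n
    corresponds to csv_rows[n]; this is the one-row offset noted above.
    """
    return [(csv_rows[n][0], label) for n, label in predictions]


if __name__ == "__main__":
    import csv
    from pathlib import Path

    results = Path("MaybeJosquinClassificationResults.txt")
    features = Path("MaybeJosquinFeatureValues.csv")
    if results.is_file() and features.is_file():
        with open(features, newline="") as f:
            rows = list(csv.reader(f))
        for path, label in label_pieces(parse_predictions(results.read_text()), rows):
            print(label, path)
```

This uses the original "MaybeJosquinFeatureValues.csv" file rather than the Weka-ready version, because only the original still contains the file path column.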
Now we are ready to do some entirely original work to practice the fantastic
skills we have learned with jSymbolic and Weka . . .