jSymbolic Tutorial

jSymbolic Tutorial - Using Weka

PREPARING FEATURES TO BE USED BY WEKA

Do these steps once the JosquinVsOckeghem features have finished being extracted
- As instructed earlier in this tutorial
We could just open the ARFF file jSymbolic generated in Weka
- This would not, however, tell Weka which pieces are by Josquin and which are by Ockeghem
Let's instead make a modified version of the "BigJosqOckFeatureValues.csv" jSymbolic just created
- Open the "BigJosqOckFeatureValues.csv" file in your spreadsheet application (I'll be using Excel)
- In the spreadsheet, go to the first column on the right that does not have any data in it
  - Column ADW, in this case
  - In the first row, type "COMPOSER"
    - To let Weka know that this will identify the composer
  - In each row of this column that corresponds to Josquin, type "Josquin"
  - In each row of this column that corresponds to Ockeghem, type "Ockeghem"
  - Use copy and paste to speed up these operations
  - Look at the paths in the first column to see where the Ockeghem files end and the Josquin files start
- Now delete the entire first column (which holds the file paths)
  - Weka does not use this information
- Save this edited file as "BigJosqOckFeatureValuesWekaReady.csv"
- This file now, in the column you added, indicates which pieces are by Josquin and which by Ockeghem
  - And no longer indicates which file each set of features was extracted from, since you deleted the column that provided that information
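
The spreadsheet steps above can also be scripted. The following is a minimal sketch, assuming (this is not stated in the tutorial) that each file path in the first column contains the composer's name, for example in a folder called "Josquin" or "Ockeghem". The function names are hypothetical and the demonstration rows are made up. If your paths are organized differently, adjust `composer_from_path` accordingly.

```python
import csv

COMPOSERS = ("Josquin", "Ockeghem")


def composer_from_path(path):
    """Guess the composer from a file path (assumes the name appears in the path)."""
    lowered = path.lower()
    for name in COMPOSERS:
        if name.lower() in lowered:
            return name
    raise ValueError("Cannot tell the composer from path: " + path)


def label_rows(rows):
    """Drop the first (file path) column and append a COMPOSER column.

    rows[0] is the header row; every other row holds one piece's features.
    """
    header, data = rows[0], rows[1:]
    labeled = [header[1:] + ["COMPOSER"]]
    for row in data:
        labeled.append(row[1:] + [composer_from_path(row[0])])
    return labeled


def make_weka_ready(in_path, out_path):
    """Read a jSymbolic CSV file and write a labeled copy for Weka."""
    with open(in_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]  # skip blank lines
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(label_rows(rows))


if __name__ == "__main__":
    # Made-up rows with only two feature columns, for illustration only
    demo = [
        ["path", "Vertical_Sixths", "Range"],
        ["C:/music/Ockeghem/a.mid", "0.21", "30"],
        ["C:/music/Josquin/b.mid", "0.12", "34"],
    ]
    for row in label_rows(demo):
        print(row)
```

On the real data this would be called as `make_weka_ready("BigJosqOckFeatureValues.csv", "BigJosqOckFeatureValuesWekaReady.csv")`. Doing it by hand once is still worthwhile for the reason given below.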
If you are going to be doing lots of labeling of instances, there is a jMIR tool for speeding the process up:
- jMIRUtilities includes a GUI for easily labeling music whose features have been extracted into ACE XML files
- For now, it is better to do this in Excel, since it lets us get a better low-level feel for what the data looks like

USING WEKA TO MANUALLY EXPLORE THE DATA AND SEARCH FOR MUSICOLOGICALLY MEANINGFUL PATTERNS

Weka is a data mining package from the University of Waikato, New Zealand
- We can use it to explore statistical patterns in the features
- We can use it to do machine learning
If you have not yet installed Weka, then please do so now
- See the installation instructions from earlier
Run Weka
- Macintosh:
  - Double click the "weka-3-8-2" drive icon on your Desktop and, in the window that opens, double click on "weka-3-8-2-oracle-jvm" to run Weka
- Windows:
  - Double click on the "Weka 3.8" icon on your desktop
Click on the Explorer button
This opens the Weka Explorer interface, which we will be working with
- You will be starting in the Preprocess tab of the Weka Explorer
Click the "Open file" button, and open the "BigJosqOckFeatureValuesWekaReady.csv" file we just created
- In the file chooser dialog box that appears, you will first need to select "CSV data files" from the Files of Type dropdown menu so that CSV files are shown
All of the features in the file are now listed in the Attributes area of the Weka Explorer
This includes the "COMPOSER" field that we created in our spreadsheet
- Scroll to the bottom of the attributes and select "COMPOSER"
- This shows that there are 98 Ockeghem pieces (displayed in blue) and 131 Josquin pieces (displayed in red)
Click, for example, on the Vertical_Sixths feature (number 506)
- The Vertical Sixths feature definition, from the manual:
  - "Fraction of all wrapped vertical intervals that are minor or major sixths. This is weighted by how long intervals are held (e.g. an interval lasting a whole note will be weighted four times as strongly as an interval lasting a quarter note)."
- A number of statistics about the feature's distribution across all the files we extracted are shown on the right
  - Minimum, maximum, mean and standard deviation
- Of particular interest, note the histogram graph
  - In this graph, red corresponds to pieces by Josquin and blue to pieces by Ockeghem
  - The horizontal scale on the bottom of the histogram indicates the feature value range
  - The height of each bar indicates how many pieces had a feature value within the range of that bar
    - e.g. 6 pieces (all by Josquin) had a vertical sixths value of near 0.1
  - The colours show which pieces are associated with which composer
  - This particular graph shows that Josquin tends to have proportionately fewer vertical sixths than Ockeghem, overall
    - Although there is certainly some overlap
    - e.g. a few Ockeghem pieces have fewer vertical sixths than a few Josquin pieces, but this is rare
This may be an important musicological insight!
- The other features can be examined one-by-one for other interesting and potentially meaningful similar patterns
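
The kind of histogram discussed above can be reproduced in a few lines, which may help clarify what each bar means. This is only a sketch: the equal-width binning is an assumption about how the bars are formed, the function name `class_histogram` is hypothetical, and the feature values in the demonstration are made up rather than taken from the extracted data.

```python
from collections import Counter


def class_histogram(values, labels, n_bins=10):
    """Count pieces per equal-width bin, separately for each class label.

    Returns a list of (bin_start, bin_end, Counter of labels) tuples.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all values are equal
    bins = [Counter() for _ in range(n_bins)]
    for value, label in zip(values, labels):
        # The maximum value is placed in the last bin rather than past it
        index = min(int((value - lo) / width), n_bins - 1)
        bins[index][label] += 1
    return [(lo + i * width, lo + (i + 1) * width, bins[i]) for i in range(n_bins)]


if __name__ == "__main__":
    # Made-up Vertical_Sixths values, for illustration only
    values = [0.10, 0.12, 0.15, 0.18, 0.22, 0.25, 0.27, 0.30]
    labels = ["Josquin"] * 4 + ["Ockeghem"] * 4
    for start, end, counts in class_histogram(values, labels, n_bins=4):
        print(f"{start:.2f}-{end:.2f}: {dict(counts)}")
```

Each printed line corresponds to one bar: its range on the horizontal scale, and how many pieces by each composer fall inside it.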
It can be even more meaningful to see how different features vary together
- Rather than just looking at one feature at a time, as we just did
To see a scatterplot showing how any two features are distributed compared to one another:
- Click on the Visualize tab of the Weka Explorer
- Click on any of the boxes in the Plot Matrix area of the Visualize tab to see a scatterplot of the two features corresponding to the box you clicked
  - Each blue "x" corresponds to a piece by Ockeghem, and each red "x" to a piece by Josquin
- Of course, not all of these will be interesting, and we need to find feature pairs that separate the composers out well
- Let's look at two features that experimentation has revealed to be interesting in the context of these two composers
  - In the scatterplot window that you just opened, choose "X: Average_Length_of_Melodic_Arcs (Num)" in the X dropdown box (top left), and "Y: Vertical_Sixths (Num)" in the Y dropdown box (top right)
  - Notice how one could draw an imaginary curve on the graph that would separate the composers relatively well
    - It seems that just these two features are quite useful in discriminating between the two composers!
    - Perhaps there is some musically meaningful relationship between the two features?
  - Close the scatterplot window when you are done
It can be useful to experiment by looking at many such graphs in order to look for meaningful relationships
If the number of features in the Visualize tab is overwhelming, you can hide the ones you do not want to look at by leaving them unselected in the box that appears when you press the Select Attributes button
- Press the Update button when you are done in order to change the scatterplots shown

USING WEKA TO AUTOMATICALLY FIND FEATURES THAT DISTINGUISH MUSICAL CLASSES (COMPOSERS IN THIS CASE)

Manually examining histograms and scatterplots like we did in the last section can be a very powerful and revealing approach
- I highly recommend it!
However, it can also be time-consuming and sometimes overwhelming
We will now look at a technique called "feature selection" that uses statistical processing to find the features that are most likely to be significant in discriminating between the musical classes under consideration (composers, in this case)
- Note that I wrote "most likely" above
- This is because feature selection and automatic classification are very complex operations, and they tend to give you results that are good but not necessarily optimal
Click on the Select attributes tab in Weka
Under Attribute Evaluator, press the Choose button
- Select weka > attributeSelection > CfsSubsetEval
- This powerful algorithm evaluates the worth of a subset of features by considering the individual predictive ability of each feature along with the degree of redundancy between them
Under Search Method, press the Choose button
- Select weka > attributeSelection > BestFirst
- This sets the way that the CfsSubsetEval algorithm will explore and select features
Press the Start button
- Wait a few moments for processing to occur
The list that comes up is the list of features that this algorithm found to most effectively distinguish between the composers
- For CfsSubsetEval, the order of the list does not indicate priority
There is therefore statistical evidence that the reported features could be musicologically meaningful in connection with these two composers!
However, remember that no feature selection methodology is perfect
- There might be other features that are also important that CfsSubsetEval missed
In particular, certain combinations of features might be especially important
- A feature that has no discriminatory power alone might turn out to be very useful when combined with another feature
- This is why we need many candidate features!
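
To make the balance between "predictive ability" and "redundancy" mentioned above more concrete, here is a sketch of the merit heuristic from Hall's correlation-based feature selection (CFS), the method CfsSubsetEval is based on. For a subset of k features, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The correlation values below are made up, and Weka's implementation differs in details such as how the correlations are measured, so treat this as an illustration rather than a re-implementation.

```python
import math


def cfs_merit(feature_class_corrs, mean_feature_feature_corr):
    """Merit of a feature subset under the CFS heuristic.

    feature_class_corrs: correlation of each feature in the subset with the class.
    mean_feature_feature_corr: average correlation between pairs of features.
    """
    k = len(feature_class_corrs)
    r_cf = sum(feature_class_corrs) / k
    r_ff = mean_feature_feature_corr
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


if __name__ == "__main__":
    # Two equally predictive features (made-up correlations of 0.5 with the class)
    print(cfs_merit([0.5, 0.5], 0.0))  # unrelated to each other: merit rises
    print(cfs_merit([0.5, 0.5], 1.0))  # fully redundant: no better than one feature
    print(cfs_merit([0.5], 0.0))       # a single feature on its own
```

The design point is that adding a feature only raises the score if it brings predictive information that the features already in the subset do not.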
Let's try another algorithm, and see what happens:
- Under Attribute Evaluator, press the Choose button
  - Select weka > attributeSelection > CorrelationAttributeEval
  - Click Yes on the dialog box that comes up
  - This algorithm evaluates the worth of a feature by measuring the correlation (Pearson's) between it and a piece's class
- Press the Start button
- The list of features that comes up is ranked from the ones estimated to be the most meaningful to the ones estimated to be the least meaningful
  - Unlike CfsSubsetEval
- Note that the best features suggested by CorrelationAttributeEval are similar but not identical to those suggested by CfsSubsetEval
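
The ranking idea behind CorrelationAttributeEval can be sketched as follows. This is an illustration under stated assumptions, not Weka's code: the two composers are coded as 1 and 0, features are ranked by the absolute value of Pearson's correlation with that code, and the feature values are made up. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no information about the class
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, labels, positive="Josquin"):
    """Rank features by |correlation| with a 0/1 coding of the class labels.

    features: dict mapping feature name -> list of values (one per piece).
    """
    coded = [1.0 if label == positive else 0.0 for label in labels]
    scores = {name: abs(pearson(vals, coded)) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    labels = ["Josquin", "Josquin", "Ockeghem", "Ockeghem"]
    features = {  # made-up values, for illustration only
        "Vertical_Sixths": [0.10, 0.12, 0.25, 0.27],
        "Range": [30.0, 36.0, 31.0, 35.0],
    }
    for name, score in rank_features(features, labels):
        print(f"{name}: {score:.3f}")
```

Because each feature is scored on its own, a ranking of this kind says nothing about redundancy between features, which is one reason its results differ from those of CfsSubsetEval.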
It is useful to experiment with a range of such algorithms and look for similarities amongst the features they suggest
- Those features selected by a variety of algorithms are likely to be the most interesting ones in connection with the problem at hand
- Although other ones can certainly be interesting as well
- And, of course, the results of just one algorithm can certainly still be very musicologically meaningful
There are also many alternative statistical analysis techniques that can be used to determine which features seem to be most relevant to distinguishing between certain classes
- e.g. ANOVA is quite well-known, but sadly not implemented by Weka
- There are many advanced software packages that can be used to perform such analyses
  - e.g. SAS, Matlab, etc.
- They are beyond the scope of this tutorial, however
- For our purposes, Weka alone offers more than enough useful functionality to do meaningful analyses

USING MACHINE LEARNING IN WEKA TO CLASSIFY CLASSES (COMPOSERS IN THIS CASE)

Machine learning is a powerful set of techniques that, among other things, can train on sets of extracted features to automatically "learn" to distinguish between different classes (i.e. composers, in this case)
- e.g. recognize tumors or handwriting
- e.g. recognize composers
To use machine learning in Weka, click on the Classify tab
In the Classifier area, click on the Choose button
- Select weka > classifiers > functions > SMO
- This is an implementation of the Support Vector Machine (SVM) machine learning algorithm
  - A simple but highly effective (and fast!) algorithm
- Make sure the Cross-validation option is selected in the Test options area
- Press the Start button
  - Wait a little while for the processing to occur
- A long report will be displayed when processing is complete
- Scroll down this report and look for:
  - The Correctly Classified Instances percentage:
    - Indicates how accurately the algorithm was able to classify test pieces by composer after it was trained on separately partitioned training pieces
  - The Confusion Matrix:
    - Shows what kinds of misclassifications occurred
In just a few moments, the system was able to learn to correctly distinguish Josquin's music from Ockeghem's music about 93% of the time!
- With no feature selection pre-processing
  - The feature selection we did in the Select attributes tab was not carried through here
  - If we wanted to do machine learning with an automatically reduced feature set, we would use the Filter area of the Preprocess tab
- With no tweaking of classifier parameters
  - If we wanted to do this, we could click on where it says SMO
Even better results could be achieved if we had spent time setting up the classifier more carefully and pre-selecting features using statistically valid techniques
- However, this requires some expertise in machine learning, so we will not go more into it here
- Special care has to be taken when doing this not to accidentally inflate results by "overfitting"
- But this would only give us an extra one or two percentage points
- This basic approach is more than good enough for achieving meaningful results
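
To show what the Correctly Classified Instances percentage and the Confusion Matrix measure, here is a minimal sketch of k-fold cross-validation. It deliberately swaps in a much simpler nearest-centroid classifier in place of SMO, assigns pieces to folds in a simple interleaved way (an assumption made for brevity; Weka assigns folds differently), and runs on made-up data, so its numbers say nothing about the Josquin and Ockeghem results reported above. All function names are hypothetical.

```python
from collections import defaultdict


def centroid(rows):
    """Mean feature vector of a list of equal-length feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def train(X, y):
    """Nearest-centroid 'model': one mean feature vector per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    return {label: centroid(rows) for label, rows in by_class.items()}


def predict(model, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(row, model[label]))
    return min(model, key=distance)


def cross_validate(X, y, k=10):
    """Return (accuracy, confusion), where confusion[(actual, predicted)] is a count.

    Requires k >= 2 so that every fold has some training pieces.
    """
    confusion = defaultdict(int)
    for fold in range(k):
        # Piece i is tested in fold i % k and used for training in all other folds
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        if not test_idx:
            continue
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            confusion[(y[i], predict(model, X[i]))] += 1
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return correct / len(X), dict(confusion)


if __name__ == "__main__":
    # Made-up two-feature data, for illustration only
    X = [[0.10, 5.0], [0.12, 5.5], [0.14, 6.0], [0.25, 3.0], [0.27, 3.5], [0.29, 2.5]]
    y = ["Josquin"] * 3 + ["Ockeghem"] * 3
    accuracy, confusion = cross_validate(X, y, k=3)
    print(f"Correctly classified: {accuracy:.0%}")
    print(confusion)
```

The key point is that every piece is classified by a model that never saw it during training, which is what makes the reported percentage a fair estimate.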

Next we will learn to train and save a classification model using Weka, and then use it to classify pieces of music . . .
