jSymbolic
Tutorial - Using Weka
PREPARING FEATURES TO BE USED BY WEKA
- Do these steps once the JosquinVsOckeghem features have finished being extracted
- As instructed earlier in this tutorial
- We could just open in Weka the ARFF file that jSymbolic generated
- This would not, however, tell Weka which pieces are by Josquin and which
are by Ockeghem
- Let's instead make a modified version of the "BigJosqOckFeatureValues.csv"
file that jSymbolic just created
- Open the "BigJosqOckFeatureValues.csv" file in your spreadsheet
(I'll be using Excel)
- In the spreadsheet, go to the first column on the right that does not
have any data in it
- Column ADW, in this case
- In the first row, type "COMPOSER"
- To let Weka know that this will identify the composer
- In each row of this column that corresponds to Josquin, type "Josquin"
- In each row of this column that corresponds to Ockeghem, type "Ockeghem"
- Use copy and pasting to speed up these operations
- Look at the paths in the first column to see where the Ockeghem
files end and the Josquin files start
- Now delete the entire first column (which holds the file paths)
- Weka does not use this information
- Save this edited file as "BigJosqOckFeatureValuesWekaReady.csv"
- This file now, in the column you added, indicates which pieces are by
Josquin and which by Ockeghem
- And no longer indicates which file each set of features was extracted
from, since you deleted the column that provided that information
- If you are going to be doing lots of labeling of instances, there is a jMIR
tool for speeding the process up:
- jMIRUtilities
includes a GUI for easily labeling music whose features have been extracted
into ACE XML files
- For now, it is better to do this in Excel, since it lets us get a better
low-level feel for what the data looks like
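For those who later want to script the spreadsheet steps above rather than do them by hand, here is a minimal Python sketch. It assumes that the first row of the CSV file is a header, that the first column holds the file paths, and that each path contains the text "Josquin" or "Ockeghem". That last rule (the `composer_from_path` function) is hypothetical, so adjust it to match how your own files are organized.

```python
import csv

def composer_from_path(path):
    # Hypothetical rule: assumes the composer's name appears in the file path.
    if "Josquin" in path:
        return "Josquin"
    if "Ockeghem" in path:
        return "Ockeghem"
    raise ValueError(f"Cannot infer composer from path: {path}")

def label_rows(rows, label_for_path):
    """Drop the first (path) column and append a COMPOSER column."""
    header, *data = rows
    labeled = [header[1:] + ["COMPOSER"]]
    for row in data:
        if not row:
            continue  # skip blank lines
        labeled.append(row[1:] + [label_for_path(row[0])])
    return labeled

def convert(in_path, out_path):
    """Read the jSymbolic CSV file and write a Weka-ready copy."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(label_rows(rows, composer_from_path))
```

Calling `convert("BigJosqOckFeatureValues.csv", "BigJosqOckFeatureValuesWekaReady.csv")` would then produce the same kind of file as the manual Excel steps, under the assumptions stated above.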
USING WEKA TO MANUALLY EXPLORE THE DATA AND SEARCH FOR MUSICOLOGICALLY
MEANINGFUL PATTERNS
- Weka is a data mining package from the University of Waikato, New Zealand
- We can use it to explore statistical patterns in the features
- We can use it to do machine learning
- If you have not yet installed Weka, then please do so now
- Run Weka
- Macintosh:
- Double click the "weka-3-8-2" drive icon on your Desktop,
and in the window that opens double click on the "weka-3-8-2-oracle-jvm"
icon to run Weka
- Windows:
- Double click on the "Weka 3.8" icon on your desktop
- Click on the Explorer button
- This opens the Weka Explorer interface, which we will be working with
- You will be starting in the Preprocess tab of the Weka Explorer
- Click the "Open file" button, and open the "BigJosqOckFeatureValuesWekaReady.csv"
file we just created
- You will first need to select "CSV data files" from the Files of Type dropdown
menu found in the file chooser dialog box that appears so that you can
see CSV files
- All of the features in the file are now listed in the Attributes
area of the Weka Explorer
- This includes the "COMPOSER" field that we created in our spreadsheet
- Scroll to the bottom of the attributes and select "COMPOSER"
- This shows that there are 98 Ockeghem pieces, which are displayed in blue,
and 131 Josquin pieces, which are displayed in red
- Click, for example, on the Vertical_Sixths feature (number 506)
- The Vertical Sixths feature definition, from the manual:
- "Fraction of all wrapped vertical intervals that are minor
or major sixths. This is weighted by how long intervals are held (e.g.
an interval lasting a whole note will be weighted four times as strongly
as an interval lasting a quarter note)."
- A number of statistics about the feature's distribution across all the files
we extracted are shown on the right
- Minimum, maximum, mean and standard deviation
- Of particular interest, note the histogram graph
- In this graph, red corresponds to pieces by Josquin and blue to
pieces by Ockeghem
- The horizontal scale on the bottom of the histogram indicates the
feature value range
- The height of each bar indicates how many pieces had a feature value
within the range of that bar
- e.g. 6 pieces (all by Josquin) had a vertical sixths value of
near 0.1
- The colours show which pieces are associated with which composer
- This particular graph shows that Josquin tends to have proportionately fewer
vertical sixths than Ockeghem, overall
- Although there is certainly some overlap
- e.g. some Ockeghem pieces have fewer vertical sixths than
a few Josquin pieces, but this is rare
- This may be an important musicological insight!
- The other features can be examined one-by-one for other interesting
and potentially meaningful similar patterns
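To make concrete what the coloured histogram is counting, here is a rough Python sketch that bins one feature's values and counts the pieces per composer in each bin. It uses equal-width bins between the minimum and maximum value. This is an assumption for illustration, and Weka's own binning may differ.

```python
def class_histogram(values, labels, bins=10):
    """Count, for each class label, how many values fall in each equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values are equal
    counts = {label: [0] * bins for label in sorted(set(labels))}
    for value, label in zip(values, labels):
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((value - lo) / width), bins - 1)
        counts[label][idx] += 1
    return counts
```

Each list in the result corresponds to one colour in the histogram: the entry at position i is the height of that composer's portion of bar i.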
- It can be even more meaningful to see how different features vary together
- Rather than just looking at one feature at a time, as we just did
- To see a scatterplot showing how any two features are distributed compared
to one another:
- Click on the Visualize tab of the Weka Explorer
- Click on any of the boxes in the Plot Matrix area of the Visualize
tab to see a scatterplot of the two features corresponding to the box
you clicked
- Each blue "x" corresponds to a piece by Ockeghem, and
each red "x" to a piece by Josquin
- Of course, not all of these will be interesting, and we need to find
feature pairs that separate the composers out well
- Let's look at two features that experimentation has revealed to be interesting
in the context of these two composers
- In the scatterplot window that you just opened, choose "X:
Average_Length_of_Melodic_Arcs (Num)" in the X dropdown box (top
left), and "Y: Vertical_Sixths (Num)" in the Y dropdown
box (top right)
- Notice how one could draw an imaginary curve on the graph that would
separate the composers relatively well
- It seems that just these two features are quite useful in discriminating
between the two composers!
- Perhaps there is some musically meaningful relationship between
the two features?
- Close the scatterplot window when you are done
- It can be useful to experiment by looking at many such graphs in order to
look for meaningful relationships
- If the number of features is too overwhelming in the Visualize
tab, you can filter ones out that you do not want to look at by not selecting
them in the box that appears when you press the Select Attribute button
- Press the Update button when you are done in order to change
the scatterplots shown
USING WEKA TO AUTOMATICALLY FIND FEATURES THAT DISTINGUISH MUSICAL
CLASSES (COMPOSERS IN THIS CASE)
- Manually examining histograms and scatterplots like we did in the last section
can be a very powerful and revealing approach
- However, it can also be time-consuming and sometimes overwhelming
- We will now look at a statistical technique called "feature selection"
that uses statistical processing to find the features that are most likely
to be significant in discriminating between the musical classes under consideration
(composers, in this case)
- Note that I wrote "most likely" above
- This is because feature selection and automatic classification are very
complex operations, and they tend to give you results that are good but
not necessarily optimal
- Click on the Select attributes tab in Weka
- Under Attribute Evaluator, press the Choose button
- Select weka > attributeSelection > CfsSubsetEval
- This powerful algorithm evaluates the worth of a subset of features
by considering the individual predictive ability of each feature along
with the degree of redundancy between them
- Under Search Method, press the Choose button
- Select weka > attributeSelection > BestFirst
- This sets the way that the CfsSubsetEval algorithm will explore and
select features
- Press the Start button
- Wait a few moments for processing to occur
- The list that comes up is the list of features that this algorithm found
to most effectively distinguish between the composers
- For CfsSubsetEval, the order of the list does not indicate priority
- There is therefore statistical evidence that these particular reported features
could be particularly musicologically meaningful when considered in connection
to these two composers!
- However, remember that no feature selection methodology is perfect
- There might be other features that are also important that CfsSubsetEval
missed
- In particular, there might be particular feature combinations that
are particularly important
- A feature that has no discriminatory power alone might turn out to be
very useful when combined with another feature
- This is why we need many candidate features!
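The trade-off between predictive ability and redundancy that CfsSubsetEval weighs can be illustrated with the merit heuristic from the correlation-based feature selection (CFS) literature: merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff). The Python sketch below only evaluates this formula. The parameter names are my own, and how Weka computes the underlying correlations is not covered here.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features.

    mean_feature_class_corr: average correlation between each feature and the class
    mean_feature_feature_corr: average correlation between pairs of features
    """
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator
```

For example, four features that each correlate 0.5 with the class score 1.0 if they are completely uncorrelated with one another, but only 0.5 if they are perfectly redundant, which is no better than using one of them alone.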
- Let's try another algorithm, and see what happens:
- Under Attribute Evaluator, press the Choose button
- Select weka > attributeSelection > CorrelationAttributeEval
- Click Yes on the dialog box that comes up
- This algorithm evaluates the worth of a feature by measuring the
correlation (Pearson's) between it and a piece's class
- Press the Start button
- The list of features that comes up is ranked from the ones estimated
to be the most meaningful to the ones estimated to be the least meaningful
- Note that the best features suggested by CorrelationAttributeEval are
similar but not identical to those suggested by CfsSubsetEval
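The idea behind ranking features by their correlation with the class can be sketched in a few lines of Python. This is a simplified illustration and not Weka's implementation: it assumes exactly two classes, encodes one of them as 1 and the other as 0, and ranks features by the absolute value of Pearson's correlation. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, labels, positive):
    """Rank features by |correlation| with a 0/1 encoding of the class labels.

    features: dict mapping feature name to a list of values, one per piece
    labels: list of class labels, one per piece
    positive: the label to encode as 1 (all others are encoded as 0)
    """
    target = [1.0 if label == positive else 0.0 for label in labels]
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The absolute value is used because a strong negative correlation separates the two composers just as well as a strong positive one.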
- It is useful to experiment with a range of such algorithms and look for
similarities amongst the features they suggest
- Those features selected by a variety of algorithms are likely to be
the most interesting ones in connection with the problem at hand
- Although other ones can certainly be interesting as well
- And, of course, the results of just one algorithm can certainly still
be very musicologically meaningful
- There are also many alternative statistical analysis techniques that can
be used to determine which features seem to be most relevant to distinguishing
between certain classes
- e.g. ANOVA is quite well-known, but sadly not implemented by Weka
- There are many advanced software packages that can be used to perform
such analyses
- They are beyond the scope of this tutorial, however
- For our purposes, Weka alone offers more than enough useful functionality
to do meaningful analyses
USING MACHINE LEARNING IN WEKA TO CLASSIFY CLASSES (COMPOSERS IN THIS
CASE)
- Machine learning is a powerful set of techniques that, among other things,
can train on sets of extracted features to automatically "learn"
to distinguish between different classes (i.e. composers, in this case)
- e.g. recognize tumors or handwriting
- e.g. recognize composers
- To use machine learning in Weka, click on the Classify tab
- In the Classifier area, click on the Choose button
- Select weka > classifiers > functions > SMO
- This is an implementation of the Support Vector Machine (SVM) machine
learning algorithm
- A simple but highly effective (and fast!) algorithm
- Make sure the Cross-validation option is selected in the Test
options area
- Press the Start button
- Wait a little while for the processing to occur
- A long report will be displayed when processing is complete
- Scroll down this report and look for:
- The Correctly Classified Instances percentage:
- Indicates how accurately the algorithm was able to classify
test pieces by composer after it was trained on separately partitioned
training pieces
- The Confusion Matrix:
- Shows what kinds of misclassifications occurred
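To clarify how these two parts of the report relate, here is a small Python sketch that builds a confusion matrix from lists of actual and predicted labels and derives the fraction classified correctly from it. Rows are actual classes and columns are predicted classes. The function names are my own, and the sketch does not reproduce the rest of Weka's report.

```python
def confusion_matrix(actual, predicted, classes):
    """Return a matrix where entry [i][j] counts pieces of class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of correctly classified instances (the diagonal over the total)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

The off-diagonal entries are the misclassifications: in a two-composer matrix they show how many Josquin pieces were mistaken for Ockeghem and vice versa.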
- In just a few moments, the system was able to learn to correctly distinguish
Josquin's music from Ockeghem's music about 93% of the time!
- With no feature selection pre-processing
- The feature selection we did in the Select attributes
tab was not carried through here
- If we wanted to do machine learning with an automatically reduced
feature set, we would use the Filter area of the Preprocess
tab
- With no tweaking of classifier parameters
- If we wanted to do this, we could click on where it says SMO
- Even better results could be achieved if we had spent time setting up the
classifier more carefully and pre-selecting features using statistically valid
techniques
- However, this requires some expertise in machine learning, so we will
not go more into it here
- Special care has to be taken when doing this not to accidentally inflate
results by "overfitting"
- But this would only give us an extra one or two percentage points
- This basic approach is more than good enough for achieving meaningful
results
Next we will learn to train and save a classification model using Weka,
and then use it to classify pieces of music...