This section of the manual provides tips for improving the quality of feature
extraction results. jWebMiner allows network searches to be performed using
a wide variety of possible settings, and some of these settings will be better
suited for particular types of tasks than others.
Material will be added to this section as the current version continues to
be used and tested and as particularly effective practices become evident. The
tips available so far are as follows:
- Consistency of Results: The exact hit counts reported by web services for a given query will change over time as the web changes. Reported hit counts can also vary slightly from moment to moment, because they are often approximations and the algorithms that produce them do not always emphasize consistency. Slight differences between two consecutive identical queries should therefore not be a cause for alarm: this is the nature of the algorithms used by many web services, and it should not significantly affect final results.
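As a rough illustration of what a "slight" difference might mean in practice, the following Python sketch compares two hit counts using a relative tolerance. It is not part of jWebMiner; the function name `hit_counts_consistent` and the 5% default tolerance are assumptions chosen only for the example.

```python
def hit_counts_consistent(first: int, second: int, tolerance: float = 0.05) -> bool:
    """Return True if two hit counts differ by at most `tolerance`, relative to the larger one.

    The 5% default is an arbitrary assumption, not a value taken from jWebMiner.
    """
    largest = max(first, second)
    if largest == 0:
        return True  # both counts are zero, so they agree exactly
    return abs(first - second) / largest <= tolerance


# Two consecutive runs of the same query (made-up numbers).
print(hit_counts_consistent(1_250_000, 1_230_000))  # small drift
print(hit_counts_consistent(1_250_000, 600_000))    # large drift, worth investigating
```

A check like this can help decide whether a difference between two runs is ordinary approximation noise or a reason to look more closely at the search settings.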
- Performing Initial Test Runs: Feature extractions involving
many search strings can be time consuming, depending on the time that the
particular web services used take to respond to individual queries. It can
therefore be useful to perform small test runs initially using a subset of
the total search strings in order to verify that results are reasonable before
performing a full feature extraction. If results are surprising, see the debugging
and fine-tuning sections below.
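One way to choose the subset for a test run is to sample the search strings at random, so that the trial is not biased towards the first entries in the list. The sketch below is only an illustration and is not a jWebMiner feature; the helper name `pick_test_subset` and the example strings are hypothetical.

```python
import random


def pick_test_subset(search_strings, size, seed=0):
    """Return up to `size` search strings chosen at random, kept in their original order.

    A fixed seed makes the same subset come back on every run, which is
    convenient when comparing test runs made with different settings.
    """
    if size >= len(search_strings):
        return list(search_strings)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(search_strings)), size))
    return [search_strings[i] for i in chosen]


all_strings = ["The Doors", "The Allman Brothers Band", "heavy metal", "R and B", "jazz"]
print(pick_test_subset(all_strings, 2))
```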
- Debugging Surprising Results: If the numerical results
displayed in the Results Panel are surprising,
it can be useful to use the Network Search
Dialog to duplicate searches so that the specific hits that produced the
feature values can be examined in detail. Ways of addressing what this reveals
are described below under Fine-tuning Results. Enabling the generation of
all available reports can also be helpful (which reports are generated
can be set at the bottom right of the Options
Panel).
- Fine-tuning Results: jWebMiner allows many settings which
can be used to fine-tune results. Several of the key ways in which this can
be done are described below:
  - Filter Strings: It can sometimes be useful to require
    that all hits must include or not include particular strings in order
    to be counted as hits. For example, it may be useful to require that all
    hits contain the word "music" in order to ensure relevance to
    a musical feature extraction. Alternatively, it may be beneficial to require
    that certain words be excluded. For example, it may be desirable to exclude
    the string "construction" in order to avoid non-relevant hits
    between "The Doors" and "heavy metal". Filter strings
    may be set in the Required Filter Words Panel and the Excluded
    Filter Words Panel. The required filter strings may also be set to
    vary based on the particular search terms being used, as described in
    the Required Filter Words Panel.
  - Search String Synonyms: Sometimes it is useful to combine
    the results for multiple search strings that are effectively equivalent.
    For example, the class names "R and B" and "RnB" can
    be usefully combined. Synonyms may be defined in the Search
    Words Panel.
  - Weighting Sites: It can sometimes be useful to pay
    special attention to particular sites as well as or instead of the whole
    network when searching. Site weightings and limitations can be set in
    the Site Weightings Panel.
  - Feature Calculation Settings: The right side of the
    Options Panel allows users to set the
    ways in which raw hit counts are manipulated to generate feature scores.
    Although changing these does not allow fine-tuning as customized as the
    above three approaches, it does allow global changes that can be appropriate
    for particular types of feature extraction.
  - Additional Settings: The left side of the Options
    Panel allows a variety of additional search parameters to be set.
    However, it is advised that users first attempt to customize filter strings,
    synonyms and site weightings, as the defaults set in the left side of
    the Options Panel are usually best left
    as they are.
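The effect of filter strings and synonyms on hit counting can be illustrated with a small, self-contained Python sketch that operates on a few made-up page texts rather than on a real web service. The names `is_hit` and `count_hits` are hypothetical, and the way synonyms are combined here (a page counts once if it matches any synonym) is an assumption; this section does not specify how jWebMiner itself merges synonym results.

```python
def is_hit(page_text, search_string, required=(), excluded=()):
    """Return True if the page contains the search string and passes both filters."""
    text = page_text.lower()
    if search_string.lower() not in text:
        return False
    # Every required filter string must be present.
    if any(word.lower() not in text for word in required):
        return False
    # No excluded filter string may be present.
    return not any(word.lower() in text for word in excluded)


def count_hits(pages, synonyms, required=(), excluded=()):
    """Count pages that match at least one synonym and pass the filters."""
    return sum(
        1
        for page in pages
        if any(is_hit(page, s, required, excluded) for s in synonyms)
    )


# Made-up page texts, for illustration only.
pages = [
    "RnB music charts for this week",
    "R and B music history",
    "Heavy metal doors for building construction",
]

print(count_hits(pages, ["R and B", "RnB"], required=["music"]))   # synonyms combined
print(count_hits(pages, ["heavy metal"]))                          # no filtering
print(count_hits(pages, ["heavy metal"], excluded=["construction"]))  # noise filtered out
```

In this toy data set, the excluded string "construction" removes the page about metal doors, which is the kind of non-relevant hit that filter strings are meant to suppress.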
- Phrasing Queries: jWebMiner automatically phrases and formats all queries
as text-based queries before sending them to web services. Since this is the
ultimate form that they will take, users should exercise some basic caution
when entering search strings. More specifically:
  - Escape Characters: Do not use special or escape characters
    (e.g. slashes) in search strings entered in the Search
    Words Panel.
  - Long Queries: Do not make queries too complex or use
    too many words in them, as this may cause the queries to be truncated
    by web services with maximum query lengths (see Limitations of Supported
    Web Services below).
  - Literal Searches: The default set under the Options
    Panel is that all search strings and filter strings are searched literally.
    This means, for example, that a site searched for the string "The
    Allman Brothers Band" must contain all of these words exactly and
    sequentially to be counted as a hit, and that a site containing "The
    Allman Brothers is a great band" will not be counted if the search
    is literal. Although the literal search option can be disabled, literal
    searches typically produce less noisy results. If literal searches
    are used, however, excessively long strings should be avoided, as overly
    long strings may reduce the number of appropriate sites that are counted
    as hits.
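The difference between literal and non-literal matching can be sketched in Python using the example strings above. This is only an illustration of the idea, not the matching logic of jWebMiner or of any web service. In particular, the non-literal behaviour shown here (every word must appear somewhere on the page, in any order) is an assumption, and the function names are hypothetical.

```python
import re


def _words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def literal_match(page_text, query):
    """True if the query words occur consecutively and in order on the page."""
    page, wanted = _words(page_text), _words(query)
    if not wanted:
        return False
    return any(
        page[i:i + len(wanted)] == wanted
        for i in range(len(page) - len(wanted) + 1)
    )


def non_literal_match(page_text, query):
    """True if every query word appears somewhere on the page (assumed behaviour)."""
    page = set(_words(page_text))
    return all(word in page for word in _words(query))


query = "The Allman Brothers Band"
page = "The Allman Brothers is a great band"

print(literal_match(page, query))      # words are not consecutive
print(non_literal_match(page, query))  # all words are present somewhere
```

The longer the literal query, the fewer pages will contain it as one unbroken sequence of words, which is why very long strings tend to reduce the number of hits.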
- Limitations of Supported Web Services: The web services
accessed by jWebMiner have some inherent limitations of their own. In particular,
not all of the functionality provided by the left side of the Options
Panel is available in all services. Specifically:
  - Yahoo!:
    - Suppressing similar hits affects the returned search results but
      not the returned hit count.
    - There is no functionality for searching for similar but non-matching
      strings.
    - Searches cannot be performed using a service located specifically
      in Turkey.
    - Only up to 5000 queries may be performed per day per IP address.
  - Google:
    - There is no functionality for searching for similar but non-matching
      strings.
    - There is no functionality for performing searches using a service
      located in a specific country.
    - There is a maximum of 10 words per query.
    - Only up to 1000 queries may be performed per day. This quota refers
      to queries from all IP addresses combined.
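The query limits listed above can be checked before a long extraction is started. The Python sketch below is a hypothetical helper and not part of jWebMiner; it uses the word and daily query limits stated in this section. Counting words by splitting on whitespace is an assumption, since the way each service counts words (for example when filter strings are appended) is not described here.

```python
# Limits taken from the list above. None means no limit is stated in this section.
SERVICE_LIMITS = {
    "Yahoo!": {"max_words": None, "max_queries_per_day": 5000},
    "Google": {"max_words": 10, "max_queries_per_day": 1000},
}


def check_plan(service, queries):
    """Return a list of warnings for a planned batch of queries to one service."""
    limits = SERVICE_LIMITS[service]
    warnings = []

    max_words = limits["max_words"]
    if max_words is not None:
        for query in queries:
            # Assumption: words are separated by whitespace.
            if len(query.split()) > max_words:
                warnings.append(f"query exceeds {max_words} words: {query!r}")

    max_queries = limits["max_queries_per_day"]
    if len(queries) > max_queries:
        warnings.append(
            f"{len(queries)} queries planned, but only {max_queries} are allowed per day"
        )
    return warnings


long_query = "one two three four five six seven eight nine ten eleven"
for warning in check_plan("Google", ["The Doors heavy metal", long_query]):
    print(warning)
```

Note that the Yahoo! quota applies per IP address while the Google quota applies to all IP addresses combined, so a batch that fits within the daily figure may still fail if other queries have already been made that day.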