NAME

supervised-disambiguate


SYNOPSIS

This is a wrapper program for supervised, unsupervised and semi-supervised word sense disambiguation using CuiTools programs.


DESCRIPTION

The program takes as input a data set (SOURCE), which can be either a single file in the CuiTools xml-like format or a directory containing such files, and stores the results in a directory (DESTINATION).

The program extracts the specified features based on the user-defined options and outputs arff-formatted text to be used by the WEKA data mining package for supervised word sense disambiguation.


USAGE

perl supervised-disambiguate.pl [OPTIONS] SOURCE


INPUT

Required Arguments:

SOURCE

A directory containing the CuiTools xml-like .mm formatted training file(s) or a single file in the CuiTools xml-like .mm format.

Optional Arguments:

SUPERVISED DATA OPTIONS :

--wekacv NUMBER

This will perform NUMBER-fold cross validation on the training data using the cross validation in the WEKA data mining package.

NOTE: This is also the default option. If none of --test, --split or --cv is specified, 10-fold cross validation will be performed.

--cv NUMBER

This will perform NUMBER-fold cross validation in which the test fold is held out of feature identification. With WEKA (the --wekacv option), the features are identified from all NUMBER folds, one of which is the test fold. With --cv, the features are identified from only the NUMBER-1 training folds.

--seed NUMBER

This is used with the --cv option to seed the random number generator for the cross validation.

Default: --seed 1
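
The difference between --wekacv and --cv described above can be illustrated with a small sketch. This is not the program's own code: the toy instances, the make_folds and vocabulary helpers, and the use of unigrams as features are all assumptions made for illustration.

```python
import random

def make_folds(instances, k, seed=1):
    """Shuffle instances with a fixed seed and deal them into k folds."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def vocabulary(instances):
    """Collect the set of unigram features seen in the given instances."""
    return {word for text, _sense in instances for word in text.split()}

# Hypothetical toy data: (context, sense) pairs.
data = [
    ("cold weather front", "temperature"),
    ("common cold virus", "illness"),
    ("cold air mass", "temperature"),
    ("caught a cold", "illness"),
]

folds = make_folds(data, k=2, seed=1)
test_fold = folds[0]
train_folds = [inst for fold in folds[1:] for inst in fold]

# --wekacv style: features come from every fold, including the test fold.
features_all = vocabulary(data)
# --cv style: features come only from the training folds.
features_train_only = vocabulary(train_folds)

# Features that only the test fold contributes.
print(sorted(features_all - features_train_only))
```

Reusing the same seed reproduces the same fold assignment, which is the role the --seed option plays for --cv.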

--test FILE

The program will train on the complete training data and test on a given test data FILE.

--nokey

The test data does not contain the sense tags. The id and sense tag assigned by the classifier will be in the file test.answers in the log directory. This option can only be used with the --test option.

--key FILE

The program will obtain the answers to the given test file from the key file. The key file is expected to be in the same format as the Senseval-2 key file.

Example:

    art art.40003 pop_art%1:06:00::

where the fields are:

    <target word> <instance id> <sense>
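
A minimal sketch of reading such a key file follows. It assumes the fields are separated by whitespace and treats any fields after the instance id as sense tags; the parse_key_line and read_key helpers are hypothetical names and are not part of CuiTools.

```python
def parse_key_line(line):
    """Split one key line into (target word, instance id, list of senses)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    target, instance_id, *senses = fields
    return target, instance_id, senses

def read_key(lines):
    """Map each instance id to its sense tag(s), skipping blank lines."""
    key = {}
    for line in lines:
        if line.strip():
            _target, instance_id, senses = parse_key_line(line)
            key[instance_id] = senses
    return key

key = read_key(["art art.40003 pop_art%1:06:00::"])
print(key["art.40003"])
```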

--split NUMBER

This option will split the training data into two sections. NUMBER specifies the percentage of the training data that will be put aside for testing if the --test option is set.

Example: with --split 25, 25% of the data will be used to test the algorithm and 75% of the data will be used to train it.
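
The arithmetic of a percentage split can be sketched as follows. Whether the program shuffles the data before splitting, and how it rounds, is not stated here; the shuffling, the rounding and the split_training_data helper are assumptions for illustration.

```python
import random

def split_training_data(instances, test_percent, seed=1):
    """Set aside test_percent of the instances for testing; train on the rest."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_percent / 100)
    return shuffled[n_test:], shuffled[:n_test]

# 100 hypothetical instances with a 25% test split.
train, test = split_training_data(range(100), 25)
print(len(train), len(test))
```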

FEATURE OPTIONS :

If no feature option is chosen, the default feature setting is used:


    --ngramcount "--ngram 1 --frequency 2"

--pos

The part of speech of the target word

--head

Whether the target word is the head word of its phrase.

--mesh

Uses the MeSH terms assigned to the abstract containing the target word as features

--mmi

Uses the top X MMI terms assigned to the abstract containing the target word as features

--ngramcount ``OPTIONS''

The words surrounding the target word as features.

The Ngram Statistics Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to the count.pl program.


Note: All the options must be within double quotes, as
      would be passed to the program at command line.

Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
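
The reason for the double quotes can be seen by looking at how a POSIX-style shell splits the command line. The sketch below uses Python's shlex module as a stand-in for the shell; the command strings are illustrative and nothing is executed.

```python
import shlex

# With quotes, the count.pl options arrive as a single argument to --ngramcount.
quoted = 'perl supervised-disambiguate.pl --ngramcount "--ngram 2 --frequency 2" SOURCE'
argv = shlex.split(quoted)
print(argv)

# Without quotes, the same options are split into separate arguments.
unquoted = 'perl supervised-disambiguate.pl --ngramcount --ngram 2 --frequency 2 SOURCE'
print(len(argv), len(shlex.split(unquoted)))
```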

--ngrammeasure ``OPTIONS''

Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package, which is used in combination with the --ngramstat option to perform feature selection for the --ngramcount option.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--ngramstat ``OPTIONS''


Specifies any additional options to be passed to the statistic.pl program when using the --ngramcount and --ngrammeasure options.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--cuicount ``OPTIONS''

Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) of the words surrounding the target word.

The Ngram Statistics Package (NSP) is used to obtain the counts.

The options specify any additional options to be passed to count.pl program.


Note: All the options must be within double quotes, as
      would be passed to the program at command line.

Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.

--cuimeasure ``OPTIONS''

Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package which is used in combination with the --cuistat option to perform feature selection for the --cuicount option

Note: All the options must be within double quotes, as would be passed to the program at command line.

--cuistat ``OPTIONS''


Specifies any additional options to be passed to the statistic.pl program when using the --cuicount and --cuimeasure options.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--firstranked

The first CUI in the set of one or more CUIs found in the mapping (that have the highest MTI score) is used as the feature rather than all of them (which is the default).

--conflatedcuis

Combine all the CUIS in the set of one or more CUIs found in the mapping into one sorted string and use the string as a feature

--stcount ``OPTIONS''

The semantic types of the words surrounding the target word as features.

The Ngram Statistics Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to the count.pl program.


Note: All the options must be within double quotes, as
      would be passed to the program at command line.

Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.

--stmeasure ``OPTIONS''

Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package which is used in combination with the --ststat option to perform feature selection for the --stcount option

Note: All the options must be within double quotes, as would be passed to the program at command line.

--ststat ``OPTIONS''


Specifies any additional options to be passed to the statistic.pl program when using the --stcount and --stmeasure options.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--nspconfig FILE

This can be used in place of or in conjunction with the ngramcount, ngrammeasure, ngramstat, cuicount, cuimeasure, cuistat, stcount, stmeasure and ststat options. It allows multiple instances of ngramcount, cuicount and stcount to be called so that multiple ngrams can be used as features.

An example of the format required for the configuration file is as follows:

    ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841
    ngramcount:: count:: --ngram 2 --remove 1
    cuicount:: count:: --ngram 2 --remove 3 statistic:: ll.pm --score 3.841
    stcount:: count:: --ngram 1 --remove 3
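
The structure of a configuration line can be sketched with a small parser. This is an interpretation of the example lines, not the program's own parser: it assumes the first token after statistic:: is the measure and the remainder are statistic.pl options, and parse_nsp_config_line is a hypothetical helper.

```python
import re

def parse_nsp_config_line(line):
    """Break one configuration line into feature, count options, measure, statistic options."""
    feature, sep, rest = line.partition("::")
    if not sep:
        raise ValueError(f"missing '::' after feature name: {line!r}")
    # Split the remainder on the 'count::' and 'statistic::' labels, keeping the labels.
    parts = re.split(r"\b(count|statistic)::", rest)
    sections = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))
    entry = {
        "feature": feature.strip(),
        "count": sections.get("count", ""),
        "measure": None,
        "statistic": "",
    }
    if "statistic" in sections:
        measure, _, options = sections["statistic"].partition(" ")
        entry["measure"] = measure
        entry["statistic"] = options.strip()
    return entry

entry = parse_nsp_config_line(
    "ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841"
)
print(entry)
```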

--relation

The semantic relation between the semantic type of the target word and the semantic types of the words surrounding the target word. The relations are read from a file named 'relation', which is located in the 'default_options' directory. The relations come from the UMLS 2007AC documentation.

If you would like to provide your own, the format of the file is as follows:

    semantic_type relation semantic_type

where each semantic type pair and relation are on their own line.

For example:

    acab affects virs
    acab affects vtbt
    acap affects rept
    ...
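
A file in this format could be read as sketched below. The sketch assumes whitespace-separated fields and skips any line that does not have exactly three of them; parse_relations is a hypothetical helper, and storing a set of relations per pair is a design choice that allows a pair to carry more than one relation.

```python
def parse_relations(lines):
    """Map each (semantic type, semantic type) pair to the set of relations between them."""
    relations = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, relation, target = fields
        relations.setdefault((source, target), set()).add(relation)
    return relations

relations = parse_relations([
    "acab affects virs",
    "acab affects vtbt",
    "acap affects rept",
])
print(relations[("acab", "virs")])
```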

--punc

Removes punctuation in the --unigram, --term, --bigram or --trigram options

--phrase

The context considered is only the features that are in the same phrase as the target word. The phrases are identified by MetaMap.

Note: If the phrase option is set then the --sentence and --line options need to be off.

--sentence

The context considered is only the features that are in the same sentence as the target word.

Note: If the sentence option is set then the --phrase and --line options need to be off.

--line

The context considered is the features that are in the same complete line, as defined by the .mm file, as the target word.

Note: If the line option is set then the --sentence and --phrase options need to be off.

Note: This is the default option.

PROGRAM OPTIONS:

--weka

The weka machine learning algorithm that you would like to run the feature set on. The program is expecting the java command that would be used to run weka.

Default:

weka.classifiers.bayes.NaiveBayes

--wekaparams PARAMETERS

Allows the user to pass WEKA parameters directly through the command line.

The main reason we added this was to allow for the -s NUM option to be used. This option sets the random number seed for cross-validation. Its default is 1.

--javaparams PARAMETERS

Allows the user to pass java parameters directly through the command line.

--directory

The prefix for the directories that hold the weka, arff, results and log files.

Default: <timestamp>.dir

Therefore the weka files would be located in <timestamp>.dir.weka, the arff files in <timestamp>.dir.arff, the results in <timestamp>.dir.results and the log files in <timestamp>.dir.log.
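
The naming scheme can be sketched as follows. The prefix value "run1" stands for a hypothetical value passed to --directory, and output_directories is an illustrative helper rather than part of the program.

```python
def output_directories(prefix):
    """Return the directory name for each kind of output under the given prefix."""
    return {kind: f"{prefix}.{kind}" for kind in ("weka", "arff", "results", "log")}

dirs = output_directories("run1")
print(dirs["arff"])
```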

--confidence

Print out the confidence values with the accuracy

HELP OPTIONS:

--help

Displays the quick summary of program options.

--version

Displays the version information.


OUTPUT

supervised-disambiguate.pl creates two directories: one containing the arff files and the other containing the weka files. In the weka directory, the overall averages are stored in the OverallAverage file.


PROGRAM REQUIREMENTS


AUTHOR

 Bridget T. McInnes, University of Minnesota


COPYRIGHT

Copyright (c) 2007,

 Bridget T. McInnes, University of Minnesota
 bthomson at cs.umn.edu

 Ted Pedersen, University of Minnesota Duluth
 tpederse at d.umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.