supervised-disambiguate
This is a wrapper program for supervised, unsupervised and semi-supervised word sense disambiguation using CuiTools programs.
The program takes as input the data set (SOURCE) which can either be a directory containing files in the CuiTools xml-like format or a single file in the CuiTools xml-like format and returns the results stored in a directory (DESTINATION).
The program extracts the specified features based on the user-defined options and outputs arff-formatted text to be used by the WEKA data mining package for supervised word sense disambiguation.
perl supervised-disambiguate.pl [OPTIONS] SOURCE
A directory containing the CuiTools xml-like .mm formatted
training file(s) or a single file in the CuiTools xml-like
.mm format.
Optional Arguments:
This will perform NUMBER-fold cross validation on the training data using the cross validation in the WEKA data mining package.
NOTE: This is also the default option. If none of --test, --split or --cv is specified, 10-fold cross validation will be performed.
With weka (--wekacv option) the features are identified from all of the X folds, one of which is the test fold. In --cv, the features are identified from only the X-1 training folds, where the test fold is held out of feature identification.
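The difference between the two modes can be illustrated with a small sketch. This is not the program's own code: the function names, the toy data and the use of plain unigrams as features are all assumptions made for the illustration.

```python
from collections import Counter

def identify_features(instances, min_frequency=1):
    """Return the set of unigrams seen at least min_frequency times."""
    counts = Counter(word for text, _sense in instances for word in text.split())
    return {word for word, n in counts.items() if n >= min_frequency}

def folds(instances, k):
    """Split instances into k folds by striding (no shuffling, for clarity)."""
    return [instances[i::k] for i in range(k)]

def held_out_features(instances, k, test_index):
    """Features identified from the k-1 training folds only (the --cv behaviour)."""
    parts = folds(instances, k)
    training = [inst for i, part in enumerate(parts) if i != test_index for inst in part]
    return identify_features(training)

def all_fold_features(instances):
    """Features identified from every fold, test fold included (the --wekacv behaviour)."""
    return identify_features(instances)

# Hypothetical toy data: (context, sense) pairs.
data = [
    ("cold weather", "temperature"),
    ("common cold virus", "illness"),
    ("cold symptoms", "illness"),
    ("cold front", "temperature"),
]
```

With four folds and the second instance held out, "virus" is a feature under all_fold_features(data) but not under held_out_features(data, 4, 1), because it occurs only in the test fold.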
This is used with the --cv option to seed the random number generator for the cross validation.
Default: --seed 1
The program will train on the complete training data and test on a given test data FILE.
The test data does not contain the sense tags. The id and sense tag assigned by the classifier will be in the file, test.answers, in the log directory. This option can only be used with the --test option.
The program will obtain the answers to the given test file from the key file. The key file is expected to be in the same format as the Senseval-2 key file.
Example:
art art.40003 pop_art%1:06:00::
<target word> <instance id> <sense>
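As a rough sketch of how a key line in this format could be read, the following splits a line on whitespace into the three fields shown above. The KeyEntry type and parse_key_line function are hypothetical names, and keeping only the first sense on the line is an assumption.

```python
from typing import NamedTuple

class KeyEntry(NamedTuple):
    target_word: str
    instance_id: str
    sense: str

def parse_key_line(line):
    """Parse one '<target word> <instance id> <sense>' line.

    Assumption: fields are whitespace-separated and only the first
    sense on the line is kept.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    return KeyEntry(fields[0], fields[1], fields[2])

entry = parse_key_line("art art.40003 pop_art%1:06:00::")
```

Here entry.instance_id is "art.40003" and entry.sense is "pop_art%1:06:00::".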
This option will split the training data into two sections. The NUMBER specifies the percentage of the training data that will be put aside for testing if the --test option is set. Example: with --split 25, 25% of the data will be used to test the algorithm and 75% of the data will be used to train.
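The arithmetic of the split can be sketched as follows. The function name is hypothetical, and both the rounding rule and the choice of the last instances as the test portion are assumptions; the program itself may select the test instances differently.

```python
def split_instances(instances, percent):
    """Set aside `percent` percent of the instances for testing.

    Assumptions: the test size is rounded down, and the test portion
    is taken from the end of the list.
    """
    test_size = len(instances) * percent // 100
    cut = len(instances) - test_size
    return instances[:cut], instances[cut:]

# 200 hypothetical instance ids, split as with --split 25.
train, test = split_instances(list(range(200)), 25)
```

With 200 instances and a percentage of 25, this leaves 150 instances for training and 50 for testing.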
If no feature option is chosen, the default feature setting is used:
--ngramcount "--ngram 1 --frequency 2"
The part of speech of the target word
Whether the target word is the head word of its phrase.
Uses the MeSH terms assigned to the abstract containing the target word as features
Uses the top X MMI terms assigned to the abstract containing the target word as features
The words surrounding the target word as features
The Ngram Statistic Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to count.pl program.
Note: All the options must be within double quotes, as
would be passed to the program at command line.
Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package which is used in combination with the --ngramstat option to perform feature selection for the --ngramcount option
Note: All the options must be within double quotes, as would be passed to the program at command line.
Specifies any additional options to be passed to statistic.pl program when using --ngramcount and --ngrammeasure options
Note: All the options must be within double quotes, as would be passed to the program at command line.
Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) of the words surrounding the target word.
The Ngram Statistic Package (NSP) is used to obtain the counts.
The options specify any additional options to be passed to count.pl program.
Note: All the options must be within double quotes, as
would be passed to the program at command line.
Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package which is used in combination with the --cuistat option to perform feature selection for the --cuicount option
Note: All the options must be within double quotes, as would be passed to the program at command line.
Specifies any additional options to be passed to statistic.pl program when using --cuicount and --cuimeasure options
Note: All the options must be within double quotes, as would be passed to the program at command line.
The first CUI in the set of one or more CUIs found in the mapping (that have the highest MTI score) is used as the feature rather than all of them (which is the default).
Combine all the CUIS in the set of one or more CUIs found in the mapping into one sorted string and use the string as a feature
The semantic types of the words surrounding the target word as features
The Ngram Statistic Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to count.pl program.
Note: All the options must be within double quotes, as
would be passed to the program at command line.
Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package which is used in combination with the --ststat option to perform feature selection for the --stcount option
Note: All the options must be within double quotes, as would be passed to the program at command line.
Specifies any additional options to be passed to statistic.pl program when using --stcount and --stmeasure options
Note: All the options must be within double quotes, as would be passed to the program at command line.
This can be used in place of or in conjunction with the ngramcount, ngrammeasure, ngramstat, cuicount, cuimeasure, cuistat, stcount, stmeasure and ststat options. It allows for multiple instances of ngramcount, cuicount and stcount to be called in order to allow for multiple ngrams to be used as features.
An example of the format required for the configuration file is as follows:
ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841
ngramcount:: count:: --ngram 2 --remove 1
cuicount:: count:: --ngram 2 --remove 3 statistic:: ll.pm --score 3.841
stcount:: count:: --ngram 1 --remove 3
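One way to read the lines above is as a feature name, a "count::" section holding the count.pl options, and an optional "statistic::" section holding the measure followed by the statistic.pl options. The sketch below parses a line under that reading. The function name and the dictionary keys are hypothetical, and the interpretation of the first token after "statistic::" as the measure is an assumption.

```python
def parse_config_line(line):
    """Parse one configuration line into its parts.

    Assumed layout:
      <feature>:: count:: <count options> [statistic:: <measure> <statistic options>]
    """
    feature, _, rest = line.partition("::")
    count_part, _, stat_part = rest.partition("statistic::")
    count_marker, _, count_options = count_part.partition("::")
    if count_marker.strip() != "count":
        raise ValueError(f"missing 'count::' section in: {line!r}")

    entry = {
        "feature": feature.strip(),
        "count_options": count_options.split(),
        "measure": None,
        "statistic_options": [],
    }
    stat_tokens = stat_part.split()
    if stat_tokens:
        entry["measure"] = stat_tokens[0]
        entry["statistic_options"] = stat_tokens[1:]
    return entry

with_stat = parse_config_line(
    "ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841"
)
without_stat = parse_config_line("ngramcount:: count:: --ngram 2 --remove 1")
```

For the first line this gives the measure "ll.pm" with statistic options ["--score", "3.841"]; for the second line the measure is None.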
The semantic relation between the semantic type of the target word and the semantic types of the words surrounding the target word. The relations are read in from a file which is located in the 'default_options' directory. The file is named 'relation'. The relations come from the UMLS 2007AC documentation.
If you would like to provide your own, the format of the file is as follows:
semantic_type relation semantic_type
where each semantic type pair and relation are on their own line.
For example:
acab affects virs
acab affects vtbt
acap affects rept
...
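A file in this format could be read with something like the sketch below. The function name is hypothetical, and storing the relations in a dictionary keyed by the semantic type pair is a design choice made for the illustration, not a description of how the program stores them.

```python
def parse_relations(lines):
    """Map (semantic_type, semantic_type) pairs to their set of relations.

    Each non-blank line is expected to hold exactly three
    whitespace-separated fields: semantic_type relation semantic_type.
    """
    relations = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {line!r}")
        source, relation, target = fields
        relations.setdefault((source, target), set()).add(relation)
    return relations

relations = parse_relations([
    "acab affects virs",
    "acab affects vtbt",
    "acap affects rept",
])
```

A set is used per pair so that a pair of semantic types linked by more than one relation keeps all of them.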
Removes punctuation in the --unigram, --term, --bigram or --trigram options
The context considered is only the features that are in the same phrase as the target word. The phrases are identified by MetaMap.
Note: If the phrase option is set then the --sentence and --line options need to be off.
The context considered is only the features that are in the same sentence as the target word.
Note: If the sentence option is set then the --phrase and --line options need to be off.
The context considered is the features that are in the same complete line, as defined by the .mm file, as the target word.
Note: If the line option is set then the --sentence and --phrase options need to be off.
Note: This is the default option.
The weka machine learning algorithm that you would like to run the feature set on. The program is expecting the java command that would be used to run weka.
Default:
weka.classifiers.bayes.NaiveBayes
Option so that the user can directly pass weka parameters through the command line
The main reason we added this was to allow for the -s NUM option to be used. This option sets the random number seed for cross-validation. Its default is 1.
Option so that the user can directly pass java parameters through the command line
The prefix of the directories for the weka, arff, results and log files.
Default: <timestamp>.dir
Therefore the weka files would be located in <timestamp>.dir.weka, the arff files would be located in <timestamp>.dir.arff, results in the <timestamp>.dir.results and log files in the <timestamp>.dir.log.
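The naming scheme can be sketched as follows. The function name and the example prefix "run1.dir" are hypothetical; the sketch only mirrors the prefix-plus-suffix pattern described above and does not create any directories.

```python
def output_directories(prefix):
    """Return the directory name for each kind of output, given the prefix."""
    return {kind: f"{prefix}.{kind}" for kind in ("weka", "arff", "results", "log")}

dirs = output_directories("run1.dir")
```

With the prefix "run1.dir", the arff files would go to run1.dir.arff and the log files to run1.dir.log.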
Print out the confidence values with the accuracy
Displays the quick summary of program options.
Displays the version information.
supervised-disambiguate.pl creates two directories. One containing the arff files and the other containing the weka files. In the weka directory, the overall averages are stored in the OverallAverage file.
Bridget T. McInnes, University of Minnesota
Copyright (c) 2007,
Bridget T. McInnes, University of Minnesota
bthomson at cs.umn.edu
Ted Pedersen, University of Minnesota Duluth
tpederse at d.umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.