NAME

unsupervised-disambiguate.pl

SYNOPSIS

This is a wrapper program for unsupervised word sense disambiguation using CuiTools programs.

DESCRIPTION

The program takes as input the data directory (SOURCE) and returns the results stored in the results directory.

For unsupervised WSD, the program extracts the specified features and the context of each possible concept, and builds vectors from them using SenseClusters functions. The vectors are compared using the cosine measure, and each instance is assigned the concept whose vector is closest to the instance vector.
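The assignment step described above can be sketched in a few lines of Python. This is only an illustration of cosine-based nearest-concept assignment, not the program's own code (which is Perl and relies on SenseClusters). The function names, the dictionary representation of vectors, and the toy feature counts are assumptions made for the example; "OTHER_CUI" is a hypothetical placeholder, not a real identifier.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_u * norm_v)


def assign_concept(instance_vector, concept_vectors):
    """Return the concept whose vector has the highest cosine with the instance."""
    return max(
        concept_vectors,
        key=lambda cui: cosine(instance_vector, concept_vectors[cui]),
    )


# Hypothetical toy data: feature counts for one instance and two candidate concepts.
instance = {"patient": 2.0, "behavior": 1.0}
concepts = {
    "C0376209": {"behavior": 3.0, "individual": 1.0},
    "OTHER_CUI": {"dose": 2.0, "drug": 1.0},  # placeholder concept
}
best = assign_concept(instance, concepts)
```

Here the instance shares the feature "behavior" only with the first concept, so that concept is the one selected.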


USAGE

perl unsupervised-disambiguate.pl [OPTIONS] SOURCE

INPUT

Required Arguments:

SOURCE

Directory containing the CuiTools xml-like .mm formatted file.

Optional Arguments:

OPTIONS :

--training [DIR|FILE]

Directory containing the training data that corresponds to each of the data sets in the SOURCE directory, or a single file containing the training data.

The format of the file(s), whether found in the directory or passed on the command line, is as follows:

The expected name for each of the corresponding training files is :

    <target word>.trainingdata

For example, the target word adjustment would have the file "adjustment.trainingdata" located in the training directory.
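As a rough sketch of the naming convention, the Python helper below resolves the training file for a target word. It assumes that a directory argument means "look for <target word>.trainingdata inside it" and that any other argument is the training file itself. The function name is hypothetical and the program's actual lookup logic may differ.

```python
import os


def training_file_for(target_word, training_arg):
    """Resolve the training file for a target word.

    If training_arg is a directory, look for <target word>.trainingdata in it;
    otherwise treat training_arg itself as the training file.
    """
    if os.path.isdir(training_arg):
        return os.path.join(training_arg, target_word + ".trainingdata")
    return training_arg
```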

--senses [DIR|FILE]

Directory (or file) containing the sense information that corresponds to each of the datasets in the SOURCE directory. The sense information for the NLM-WSD dataset is located in the default_options directory, in a directory called:

    NLM-WSD.target_word.choices

It can also be downloaded from:

    http://wsd.nih.gov

If you are using another dataset, the expected format is:

For example, one of the possible concepts for the target word adjustment is:

    Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209

The expected name of the file is <target word>.choices. For our adjustment example, the name of the sense file is "adjustment.choices".
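The example sense line can be read as three fields separated by "|". The Python sketch below splits such a line on that assumption. The field names (label, semantic_type, cui) are inferred from the single example above and are not taken from a format specification, so treat them as hypothetical.

```python
from typing import NamedTuple


class Sense(NamedTuple):
    label: str          # e.g. "Adjustment <1> (Individual Adjustment)"
    semantic_type: str  # e.g. "inbe, Individual Behavior"
    cui: str            # e.g. "C0376209"


def parse_choice_line(line):
    """Split one sense line into its three pipe-delimited fields."""
    fields = [field.strip() for field in line.strip().split("|")]
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 '|'-separated fields, got {len(fields)}: {line!r}"
        )
    return Sense(*fields)


sense = parse_choice_line(
    "Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209"
)
```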

--stop STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from the features.

The STOPFILE can declare one of two modes on its first line:

AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE. Ignores word pairs in which both words are stop words.

OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE. Ignores word pairs in which either word is a stop word.

Both modes exclude stop words from unigram features.

Default is OR mode.
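The difference between the two modes can be illustrated with a short Python sketch. It assumes one regex per line, optionally wrapped in /slashes/ in the Perl style. That layout is an assumption for the example, not something this document specifies. The function names are hypothetical, and Python's regex syntax is close to Perl's but not identical.

```python
import re


def load_stoplist(lines):
    """Return (mode, compiled regexes) from the lines of a stop file."""
    mode = "OR"  # documented default
    patterns = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        if i == 0 and line.startswith("@stop.mode="):
            mode = line.split("=", 1)[1].strip().upper()
            continue
        # Assumption: a regex may be wrapped in /slashes/, Perl style.
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            line = line[1:-1]
        patterns.append(re.compile(line))
    return mode, patterns


def is_stop(word, patterns):
    """True if any stop regex matches the word."""
    return any(p.search(word) for p in patterns)


def keep_unigram(word, patterns):
    """Both modes drop a unigram that is a stop word."""
    return not is_stop(word, patterns)


def keep_pair(w1, w2, mode, patterns):
    """AND drops a pair only if both words are stop words; OR if either is."""
    s1, s2 = is_stop(w1, patterns), is_stop(w2, patterns)
    if mode == "AND":
        return not (s1 and s2)
    return not (s1 or s2)
```

With a stop list containing "the" and "of", the pair "the dose" is kept in AND mode and dropped in OR mode, while "the of" is dropped in both.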

--cuidef

Use the UMLS CUI extended definition of the possible concepts as context. [Default]

--config CONFIGURATION

The UMLS-Interface configuration file where the CUI definitions are obtained.

--directory

The prefix for the names of the directories that hold the log and results files.

Default: dir

The results file location: <directory>.results

The vector files location: <directory>.logs

The log files directory also contains any log file associated with the unsupervised-disambiguate program.

--help

Displays a quick summary of the program options.

--version

Displays the version information.

OUTPUT

The program writes its results to the <directory>.results directory and its vector and log files to the <directory>.logs directory (see the --directory option).

THE .mm FORMAT

See README.format.mm.html

PROGRAM REQUIREMENTS

AUTHOR

 Bridget T. McInnes, University of Minnesota

COPYRIGHT

 Bridget T. McInnes, University of Minnesota
 bthomson at cs.umn.edu

 Ted Pedersen, University of Minnesota Duluth
 tpederse at d.umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.