NAME

unsupervised-disambiguate.pl


SYNOPSIS

This is a wrapper program for unsupervised word sense disambiguation using CuiTools programs.


DESCRIPTION

The program takes a data directory (SOURCE) as input and writes its output to the results directory.

For unsupervised WSD, the program extracts the specified features, together with the context of each possible concept, and builds vectors from them (using SenseClusters functions). Each instance vector is compared with the vectors of the possible concepts using the cosine measure, and the instance is assigned the concept whose vector is closest.
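The assignment step described above can be illustrated with a minimal Python sketch. It is not the CuiTools or SenseClusters implementation: the sparse dictionary representation, the function names `cosine` and `assign_concept`, and the toy CUIs and feature counts are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def assign_concept(instance_vec, concept_vecs):
    """Return the concept whose vector has the highest cosine with the instance."""
    return max(concept_vecs, key=lambda cui: cosine(instance_vec, concept_vecs[cui]))


# Toy data: the CUIs and feature counts below are made up for illustration.
instance = {"dose": 2.0, "insulin": 1.0}
concepts = {
    "C0000001": {"dose": 1.0, "insulin": 1.0, "drug": 1.0},
    "C0000002": {"behavior": 2.0, "social": 1.0},
}
best = assign_concept(instance, concepts)
```

Here the instance shares features only with the first concept vector, so that concept is selected. Ties are broken by dictionary order in this sketch, which is an arbitrary choice.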


USAGE

perl unsupervised-disambiguate.pl [OPTIONS] SOURCE


INPUT

Required Arguments:

SOURCE

Directory containing the CuiTools xml-like .mm formatted file.

Optional Arguments:

OPTIONS :

--training [DIR|FILE]

Directory containing the training data that corresponds to each of the data sets in the SOURCE directory, or a single file containing the training data.

The format for the file(s) (whether in the directory or passed on the command line) is as follows:

The expected name for each of the corresponding training files is:

    <target word>.trainingdata

For example, the target word adjustment would have the file "adjustment.trainingdata" located in the training directory.
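As a rough illustration of the naming convention, the Python sketch below resolves the training file for a target word. The helper name `training_path` is hypothetical, and the rule that a non-directory argument is used as the training file itself is an assumption about how the [DIR|FILE] argument is handled.

```python
import os


def training_path(training, target_word):
    """Locate the training data for one target word.

    Assumption: when --training names a directory, the file is
    <target word>.trainingdata inside it; otherwise the argument
    itself is taken to be the training file.
    """
    if os.path.isdir(training):
        return os.path.join(training, target_word + ".trainingdata")
    return training
```

For example, `training_path("training", "adjustment")` would give `training/adjustment.trainingdata` on a POSIX system when `training` is an existing directory.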

--senses [DIR|FILE]

Directory containing the sense information that corresponds to each of the data sets in the SOURCE directory. The default sense information is located in the default_options directory, in a directory called:

    NLM-WSD.target_word.choices

It can also be downloaded from:

    http://wsd.nih.gov

If you are using another dataset, the expected format is:

    <TAG>|<TERM>|<Semantic Type>|<CUI>

For example, one of the possible concepts for the target word adjustment is:

    Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209

The expected name of the file is <target word>.choices. For our adjustment example, the name of the sense file is "adjustment.choices".
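A line in this format can be split on the pipe character. The Python sketch below is illustrative only: the names `Sense` and `parse_choice` are hypothetical, and accepting a three-field line without a tag (as in the adjustment example above) is an assumption rather than documented behavior. The tag value "M1" in the usage note is also made up.

```python
from typing import NamedTuple, Optional


class Sense(NamedTuple):
    tag: Optional[str]   # None when the line carries no tag field
    term: str
    semantic_type: str
    cui: str


def parse_choice(line):
    """Parse one line of a <target word>.choices file."""
    fields = line.rstrip("\n").split("|")
    if len(fields) == 4:
        return Sense(*fields)
    if len(fields) == 3:
        # Assumption: a line without a tag, as in the example above.
        return Sense(None, *fields)
    raise ValueError("expected 3 or 4 pipe-separated fields: %r" % line)


sense = parse_choice(
    "Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209"
)
```

With a four-field line such as `M1|Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209`, the first field would be returned as the tag.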

--stop STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from the features.

STOPFILE can be specified in one of two modes:

AND mode: declared by including '@stop.mode=AND' on the first line of the STOPFILE. Ignores word pairs in which both words are stop words.

OR mode: declared by including '@stop.mode=OR' on the first line of the STOPFILE. Ignores word pairs in which either word is a stop word.

Both modes exclude stop words from unigram features.

Default is OR mode.
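The difference between the two modes can be sketched in Python as follows. This is not the program's own stop list code: the function names are hypothetical, the one-regex-per-line layout with optional surrounding slashes is an assumption about the STOPFILE, and Python's `re` module only approximates Perl regex syntax.

```python
import re


def load_stoplist(lines):
    """Return (mode, compiled regexes) from the lines of a STOPFILE."""
    mode = "OR"  # documented default
    patterns = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        # The mode declaration is only recognized on the first line.
        m = re.fullmatch(r"@stop\.mode\s*=\s*(AND|OR)", line)
        if i == 0 and m:
            mode = m.group(1)
            continue
        # Assumption: one regex per line, optionally wrapped in /slashes/.
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            line = line[1:-1]
        patterns.append(re.compile(line))
    return mode, patterns


def is_stop(word, patterns):
    return any(p.search(word) for p in patterns)


def keep_unigram(word, patterns):
    # Both modes exclude stop words from unigram features.
    return not is_stop(word, patterns)


def keep_bigram(w1, w2, mode, patterns):
    s1, s2 = is_stop(w1, patterns), is_stop(w2, patterns)
    if mode == "AND":
        return not (s1 and s2)  # drop only if both words are stop words
    return not (s1 or s2)       # OR: drop if either word is a stop word


mode, patterns = load_stoplist(["@stop.mode=AND", "/^the$/", "/^of$/"])
```

With this stop list, the pair "the dose" is kept in AND mode but would be dropped in OR mode, while "the of" is dropped in both.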

--cuidef

Use the UMLS CUI extended definition of the possible concepts as context. [Default]

--config CONFIGURATION

The UMLS-Interface configuration file that determines where the CUI definitions are obtained from.

--directory

The prefix for the directories that hold the log and results files.

Default: dir

The results file location: <directory>.results

The vector files location: <directory>.logs

The log files directory also contains any log file associated with the unsupervised-disambiguate program.

--help

Displays a quick summary of program options.

--version

Displays the version information.


OUTPUT

mm2arff.pl creates one file in arff format. This file can be used as input training data to the WEKA data mining package.


THE .mm FORMAT

See README.format.mm.html


PROGRAM REQUIREMENTS


AUTHOR

 Bridget T. McInnes, University of Minnesota


COPYRIGHT

Copyright (c) 2007-2011

 Bridget T. McInnes, University of Minnesota
 bthomson at cs.umn.edu
 Ted Pedersen, University of Minnesota Duluth
 tpederse at d.umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.