unsupervised-disambiguate.pl
This is a wrapper program for unsupervised word sense disambiguation using CuiTools programs.
The program takes the data directory (SOURCE) as input and writes its output to the results directory.
For unsupervised WSD, the program extracts the specified features, along with the context of each possible concept, and builds vectors from them (using SenseClusters functions). Each instance vector is compared with the possible concept vectors using the cosine measure, and the instance is assigned the concept whose vector is closest.
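The assignment step described above can be illustrated with a minimal sketch. It is not the CuiTools or SenseClusters implementation: the sparse-dictionary vector representation, the function names, and the feature and concept names in the example are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def assign_concept(instance_vector, concept_vectors):
    """Return the key of the concept vector closest to the instance vector."""
    return max(
        concept_vectors,
        key=lambda concept: cosine(instance_vector, concept_vectors[concept]),
    )


# Hypothetical feature counts, for illustration only.
instance = {"behavior": 1.0, "social": 1.0}
concepts = {
    "concept_a": {"behavior": 2.0, "social": 1.0, "individual": 1.0},
    "concept_b": {"dose": 3.0, "social": 1.0},
}
best = assign_concept(instance, concepts)
```

Here `best` is `"concept_a"`, because the instance shares more of its weight with that vector than with `concept_b`.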
=head1 USAGE
perl unsupervised-disambiguate.pl [OPTIONS] SOURCE
Directory containing the CuiTools xml-like .mm formatted file.
Directory containing the training data that corresponds to each of the data sets in the SOURCE directory or a file containing the training data
The file(s) (either in the directory
or passed on the command line) are expected to have the following format:
The expected name for each of the corresponding training files is:
<target word>.trainingdata
For example, the target word adjustment would have the file ``adjustment.trainingdata'' located in the training directory.
Directory containing the sense information that corresponds to each of the datasets in the SOURCE directory. This is located in the default_options directory. This directory is called:
NLM-WSD.target_word.choices
It can also be downloaded from:
http://wsd.nih.gov
If you are using another dataset, the expected format is:
<TAG>|<TERM>|<Semantic Type>|<CUI>
For example, one of the possible concepts for the target word adjustment is:
Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209
The expected name of the file is <target word>.choices. For our adjustment example, the name of the sense file is ``adjustment.choices''
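As a rough illustration of reading a line in this format, the sketch below splits on the pipe character. The example line above shows three pipe-separated fields while the format lists four, so this sketch reads the fields from the right and treats the tag as optional. That handling, the function name `parse_choice`, and the dictionary keys are assumptions, not part of CuiTools.

```python
def parse_choice(line):
    """Split one sense line into its fields, reading from the right.

    The CUI is taken as the last field, the semantic type as the one before
    it, and the term as the one before that. A leading tag is kept if present.
    """
    fields = line.rstrip("\n").split("|")
    if len(fields) < 3:
        raise ValueError("expected at least 3 pipe-separated fields: %r" % line)
    return {
        "tag": fields[-4] if len(fields) >= 4 else None,
        "term": fields[-3],
        "semantic_type": fields[-2],
        "cui": fields[-1],
    }


# The example line from the text above.
example = "Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209"
choice = parse_choice(example)
```

For the adjustment example, `choice["cui"]` is `C0376209` and `choice["tag"]` is `None`, since that line carries no leading tag field.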
A file of Perl regexes that define the stop list of words to be excluded from the features.
STOPFILE can be specified in one of two modes:
AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE. Ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE. Ignores word pairs in which either word is a stop word.
Both modes exclude stop words from unigram features.
Default is OR mode.
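The difference between the two modes can be sketched as follows. This is an illustration only, not the program's own code. It assumes each remaining line of the stop file is a simple pattern wrapped in slashes (for example `/\bthe\b/`) that Python's `re` module can also compile. The helper names `load_stoplist`, `is_stop` and `keep_pair` are hypothetical.

```python
import re


def load_stoplist(lines):
    """Return (mode, compiled patterns) from the lines of a stop file."""
    mode = "OR"  # default mode when no @stop.mode line is given
    patterns = []
    for index, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        if index == 0 and line.startswith("@stop.mode="):
            mode = line.split("=", 1)[1].strip().upper()
            if mode not in ("AND", "OR"):
                raise ValueError("unknown stop mode: %s" % mode)
            continue
        # Assumption: patterns are written as /regex/; drop the slashes.
        patterns.append(re.compile(line.strip("/")))
    return mode, patterns


def is_stop(word, patterns):
    """True if any stop pattern matches the word."""
    return any(p.search(word) for p in patterns)


def keep_pair(first, second, mode, patterns):
    """Decide whether a word pair survives stop-list filtering."""
    stop_first = is_stop(first, patterns)
    stop_second = is_stop(second, patterns)
    if mode == "AND":
        return not (stop_first and stop_second)
    return not (stop_first or stop_second)


and_mode, and_patterns = load_stoplist(["@stop.mode=AND", r"/\bthe\b/", r"/\bof\b/"])
or_mode, or_patterns = load_stoplist([r"/\bthe\b/", r"/\bof\b/"])
```

With these lists, the pair ("the", "dose") is kept in AND mode but dropped in OR mode, while ("the", "of") is dropped in both.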
Use the UMLS CUI extended definition of the possible concepts as context [Default]
The UMLS-Interface configuration file where the CUI definitions are obtained.
The prefix for the directories that hold the log and results files.
Default: dir
The results file location: <directory>.results
The vector files location: <directory>.logs
The log files directory also contains any log file associated with the unsupervised-disambiguate program.
Displays the quick summary of program options.
Displays the version information.
mm2arff.pl creates one file in arff format. This file can be used as input training data to the WEKA data mining package.
Bridget T. McInnes, University of Minnesota
Copyright (c) 2007-2011
Bridget T. McInnes, University of Minnesota bthomson at cs.umn.edu
Ted Pedersen, University of Minnesota Duluth tpederse at d.umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.