mm2arff.pl
Creates arff files required by the WEKA data mining package.
Takes as input a file in the CuiTools xml-like .mm format, extracts the features specified by the options below from the data set, and outputs an arff file to be used by the WEKA data mining package for supervised word sense disambiguation.
perl mm2arff.pl [OPTIONS] DESTINATION SOURCE
SOURCE: The CuiTools xml-like .mm formatted file.
DESTINATION: The output file.
--test
This option allows arff files to be created for both the test data and the training data. The features for this will be extracted only from the test data.
Example:
art art.40003 pop_art%1:06:00::
<target word> <instance id> <sense>
There is no key for the test file. The answers will be replaced with a question mark (?).
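The key line format shown above (<target word> <instance id> <sense>) can be read with a few lines of code. The following Python is only a minimal sketch: the names KeyEntry and parse_key_line are hypothetical and are not part of mm2arff.pl, and it assumes that the three fields are separated by whitespace and never contain whitespace themselves.

```python
from typing import NamedTuple


class KeyEntry(NamedTuple):
    """One line of a key file (hypothetical helper type)."""
    target_word: str
    instance_id: str
    sense: str


def parse_key_line(line: str) -> KeyEntry:
    """Split one '<target word> <instance id> <sense>' line into its fields."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return KeyEntry(*fields)


# The example line from this document.
entry = parse_key_line("art art.40003 pop_art%1:06:00::")
print(entry.target_word, entry.instance_id, entry.sense)
```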
The part of speech of the target word.
Whether the target word is the head word of its phrase.
The words surrounding the target word as features.
The Ngram Statistics Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to the count.pl program.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
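The quoting requirement can be illustrated with Python's standard shlex module, which splits a command line the way a POSIX shell does. The command lines below are illustrative only: they assume that --ngramcount takes the count.pl options as its value, they reuse the --ngram and --remove count.pl options from the configuration file example in this document, and DESTINATION and SOURCE are placeholders.

```python
import shlex

# Illustrative command lines; DESTINATION and SOURCE are placeholders.
quoted = 'perl mm2arff.pl --ngramcount "--ngram 2 --remove 1" DESTINATION SOURCE'
unquoted = 'perl mm2arff.pl --ngramcount --ngram 2 --remove 1 DESTINATION SOURCE'

quoted_args = shlex.split(quoted)
unquoted_args = shlex.split(unquoted)

# With double quotes, the count.pl options stay together as one argument.
print(quoted_args[3])

# Without quotes, the shell splits them into four separate arguments.
print(unquoted_args[3:7])
```

With the quotes, the whole option string reaches the program as a single value, which is what the note above asks for.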
Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package. It is used in combination with the --ngramstat option to perform feature selection for the --ngramcount option.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Specifies any additional options to be passed to the statistic.pl program when using the --ngramcount and --ngrammeasure options.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) of the words surrounding the target word.
The Ngram Statistics Package (NSP) is used to obtain the counts.
The options specify any additional options to be passed to the count.pl program.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package. It is used in combination with the --cuistat option to perform feature selection for the --cuicount option.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Specifies any additional options to be passed to the statistic.pl program when using the --cuicount and --cuimeasure options.
Note: All the options must be within double quotes, as would be passed to the program at command line.
The first CUI in the set of one or more CUIs found in the mapping (that have the highest MTI score) is used as the feature rather than all of them (which is the default).
Combines all the CUIs in the set of one or more CUIs found in the mapping into one sorted string and uses the string as a feature.
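The difference between using only the first CUI and combining all CUIs into one sorted string can be sketched as follows. This is an illustration under assumptions, not the program's implementation: the function names are hypothetical, the underscore separator is an assumption (the documentation does not say how the CUIs are joined), and the CUI values are made-up placeholders.

```python
from typing import List


def first_cui(cuis: List[str]) -> str:
    """Use only the first CUI of the mapping as the feature."""
    return cuis[0]


def combined_cui(cuis: List[str], sep: str = "_") -> str:
    """Join all CUIs of the mapping into one sorted string.

    The separator is an assumption made for this sketch.
    """
    return sep.join(sorted(cuis))


# Placeholder identifiers, not real UMLS CUIs.
mapping = ["C0000002", "C0000001"]

print(first_cui(mapping))
print(combined_cui(mapping))
```

Sorting before joining means that two mappings with the same CUIs in a different order produce the same feature string.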
The semantic types of the words surrounding the target word as features.
The Ngram Statistics Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to the count.pl program.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package. It is used in combination with the --ststat option to perform feature selection for the --stcount option.
Note: All the options must be within double quotes, as would be passed to the program at command line.
Specifies any additional options to be passed to the statistic.pl program when using the --stcount and --stmeasure options.
Note: All the options must be within double quotes, as would be passed to the program at command line.
This can be used in place of, or in conjunction with, the --ngramcount, --ngrammeasure, --ngramstat, --cuicount, --cuimeasure, --cuistat, --stcount, --stmeasure and --ststat options. It allows multiple instances of --ngramcount, --cuicount and --stcount to be specified so that multiple ngrams can be used as features.
An example of the format required for the configuration file is as follows:
ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841
ngramcount:: count:: --ngram 2 --remove 1
cuicount:: count:: --ngram 2 --remove 3 statistic:: ll.pm --score 3.841
stcount:: count:: --ngram 1 --remove 3
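To make the structure of these lines explicit, here is a minimal Python sketch that splits one configuration line into its parts. It is not the parser used by mm2arff.pl: the function name parse_config_line is hypothetical, and the sketch assumes that every section is introduced by a name followed by "::", that the first name is the feature type, and that the statistic section is optional, as in the examples above.

```python
import re
from typing import Dict


def parse_config_line(line: str) -> Dict[str, str]:
    """Split 'feature:: count:: OPTS [statistic:: OPTS]' into a dict."""
    # re.split with a capture group keeps the section names in the result:
    # ['', 'ngramcount', ' ', 'count', ' --ngram 3 ... ', 'statistic', ' ll.pm ...']
    parts = re.split(r"(\w+)::", line.strip())
    names = parts[1::2]
    values = [value.strip() for value in parts[2::2]]
    if not names:
        raise ValueError(f"no 'name::' section found in: {line!r}")
    entry = {"feature": names[0]}
    # The remaining sections (count, statistic) carry the option strings.
    for name, value in zip(names[1:], values[1:]):
        entry[name] = value
    return entry


with_stat = parse_config_line(
    "ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841"
)
without_stat = parse_config_line("stcount:: count:: --ngram 1 --remove 3")

print(with_stat)
print(without_stat)
```

Each line yields one feature definition, which is how the file can hold several ngramcount, cuicount or stcount entries at once.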
The semantic relations between the semantic type of the target word and the semantic types of the words surrounding the target word. The relations are read from a file named 'relation', which is located in the 'default_options' directory. The relations come from the UMLS 2007AC documentation.
If you would like to provide your own, the format of the file is as follows:
semantic_type relation semantic_type
where each semantic type pair and relation are on their own line.
For example:
acab affects virs
acab affects vtbt
acap affects rept
...
...
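A file in this format can be loaded into a lookup table keyed by the pair of semantic types. The Python below is a sketch only: the function name parse_relations is hypothetical, and it assumes whitespace-separated fields, skips any line that does not have exactly three fields, and allows more than one relation per pair.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_relations(lines: Iterable[str]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (semantic_type, semantic_type) pairs to their relation names."""
    relations: Dict[Tuple[str, str], Set[str]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, relation, target = fields
        relations.setdefault((source, target), set()).add(relation)
    return relations


# The three example lines from this document.
example = [
    "acab affects virs",
    "acab affects vtbt",
    "acap affects rept",
]
table = parse_relations(example)
print(table[("acab", "virs")])
```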
Uses the MeSH terms assigned to the abstract containing the target word as features.
Removes unigrams that contain only punctuation from the list of features.
Lowercases all the words for the --ngramcount option.
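The effect of the two preceding options, dropping punctuation-only unigrams and lowercasing words, can be sketched in Python as follows. This is an illustration, not the program's code: the helper name is hypothetical, "punctuation" is assumed to mean the ASCII punctuation characters in Python's string.punctuation, and the token list is invented for the example. The two options are independent; the sketch simply applies both.

```python
import string
from typing import List


def is_punctuation_only(token: str) -> bool:
    """True if the token is non-empty and made up solely of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


# Invented tokens for illustration.
unigrams: List[str] = ["The", "--", "Patient", ".", "U.S."]

# Drop punctuation-only unigrams, then lowercase what remains.
kept = [token.lower() for token in unigrams if not is_punctuation_only(token)]
print(kept)
```

Note that a token such as "U.S." is kept because it contains letters as well as punctuation.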
The context considered is only the features that are in the same phrase as the target word. The phrases are identified by MetaMap.
Note: If the phrase option is set then the --sentence and --line options need to be off.
The context considered is only the features that are in the same sentence as the target word.
Note: If the sentence option is set then the --phrase and --line options need to be off.
The context considered is the features that are in the same complete line, as defined by the .mm file, as the target word.
Note: If the line option is set then the --sentence and --phrase options need to be off.
Note: This is the default option.
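The three notes above describe mutually exclusive context options with the line context as the default. That rule can be sketched in Python as below. The function name resolve_context is hypothetical, and raising an error when more than one option is set is an assumption of this sketch; the documentation only says the other options need to be off.

```python
def resolve_context(phrase: bool = False,
                    sentence: bool = False,
                    line: bool = False) -> str:
    """Return the single context to use, defaulting to 'line'."""
    flags = (("phrase", phrase), ("sentence", sentence), ("line", line))
    chosen = [name for name, is_set in flags if is_set]
    if len(chosen) > 1:
        raise ValueError(
            "only one of --phrase, --sentence and --line may be set, got: "
            + ", ".join("--" + name for name in chosen)
        )
    # No option set: fall back to the documented default.
    return chosen[0] if chosen else "line"


print(resolve_context())
print(resolve_context(sentence=True))
```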
Directory to contain temporary and log files. DEFAULT: mm2arff.log
Displays a quick summary of program options.
Displays the version information.
mm2arff.pl creates one file in arff format. This file can be used as input training data to the WEKA data mining package.
See README.arff.format.pod
See README.mm.format.pod
Bridget T. McInnes, University of Minnesota, Twin Cities
Copyright (c) 2007-2008,
Bridget T. McInnes, University of Minnesota, Twin Cities bthomson at cs.umn.edu
Ted Pedersen, University of Minnesota Duluth tpederse at d.umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.