NAME

mm2arff.pl


SYNOPSIS

Creates arff files required by the WEKA data mining package.


DESCRIPTION

Takes as input a file in the CuiTools xml-like .mm format, extracts the features specified by the options below from the data set, and outputs an arff file to be used by the WEKA data mining package for supervised word sense disambiguation.


USAGE

perl mm2arff.pl [OPTIONS] DESTINATION SOURCE


INPUT

Required Arguments:

SOURCE

The CuiTools xml-like .mm formatted file.

DESTINATION

The output file.

Optional Arguments:

--test

This option allows arff files to be created for both the test data and the training data. The features for this will only be extracted from the test data.

--key FILE

The program will obtain the answers to the given test file from the key file. The key file is expected to be in the same format as the Senseval-2 key file.

    Example:
    art art.40003 pop_art%1:06:00::
        <target word> <instance id> <sense>
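
As a rough illustration of the key format above, the following Python sketch splits one key line into its target word, instance id, and sense. The function name parse_key_line is hypothetical and is not part of CuiTools. Treating every token after the instance id as an additional sense is an assumption, not documented behavior.

```python
def parse_key_line(line):
    """Split one key-file line into (target_word, instance_id, senses)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected: <target word> <instance id> <sense>")
    target_word, instance_id = fields[0], fields[1]
    # Assumption: any tokens after the instance id are senses.
    senses = fields[2:]
    return target_word, instance_id, senses


word, instance, senses = parse_key_line("art art.40003 pop_art%1:06:00::")
print(word, instance, senses)
```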

--nokey

There is no key for the test file. The answers will be replaced with a question mark (?).

--pos

The part of speech of the target word.

--head

Whether the target word is the head word of its phrase.

--ngramcount ``OPTIONS'' or FILE

The words surrounding the target word are used as features.

The Ngram Statistics Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to the count.pl program.


Note: All the options must be within double quotes, as would be passed to the program at command line.

Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.
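
To show why the double quotes matter, here is a small Python sketch that tokenizes a command line the way a POSIX shell would. The file names train.arff and train.mm are hypothetical placeholders; the count.pl options are the ones used in the configuration file example in this document. The sketch only builds the argument list and does not run mm2arff.pl.

```python
import shlex

# Hypothetical invocation: DESTINATION first, then SOURCE, as in USAGE.
command = 'perl mm2arff.pl --ngramcount "--ngram 2 --remove 1" train.arff train.mm'

# shlex.split applies POSIX shell quoting rules, so the quoted options
# stay together as a single argument to --ngramcount.
argv = shlex.split(command)

for position, argument in enumerate(argv):
    print(position, repr(argument))
```

Without the quotes, --ngram and --remove would be split into separate arguments and seen by mm2arff.pl itself rather than being passed through as one option string.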

--ngrammeasure ``OPTIONS''

Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package. It is used in combination with the --ngramstat option to perform feature selection for the --ngramcount option.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--ngramstat ``OPTIONS''


Specifies any additional options to be passed to the statistic.pl program when using the --ngramcount and --ngrammeasure options.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--cuicount ``OPTIONS''

The Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) of the words surrounding the target word are used as features.

The Ngram Statistics Package (NSP) is used to obtain the counts.

The options specify any additional options to be passed to the count.pl program.


Note: All the options must be within double quotes, as would be passed to the program at command line.

Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.

--cuimeasure ``OPTIONS''

Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package. It is used in combination with the --cuistat option to perform feature selection for the --cuicount option.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--cuistat ``OPTIONS''


Specifies any additional options to be passed to the statistic.pl program when using the --cuicount and --cuimeasure options.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--firstranked

The first CUI in the set of one or more CUIs found in the mapping (that have the highest MTI score) is used as the feature rather than all of them (which is the default).

--conflatedcuis

Combines all the CUIs in the set of one or more CUIs found in the mapping into one sorted string and uses the string as a feature.
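
The difference between the default behavior, --firstranked and --conflatedcuis can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name cui_features, the underscore separator, and the placeholder CUI strings are all hypothetical and are not taken from the CuiTools source.

```python
def cui_features(cuis, firstranked=False, conflated=False):
    """Return the feature strings derived from one mapping's set of CUIs."""
    if firstranked:
        # Use only the first CUI in the set.
        return [cuis[0]]
    if conflated:
        # Assumption: the sorted CUIs are joined with an underscore.
        return ["_".join(sorted(cuis))]
    # Default: every CUI in the set is its own feature.
    return list(cuis)


mapping = ["C0000002", "C0000001"]  # placeholder identifiers
print(cui_features(mapping))
print(cui_features(mapping, firstranked=True))
print(cui_features(mapping, conflated=True))
```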

--stcount ``OPTIONS''

The UMLS semantic types of the words surrounding the target word are used as features.

The Ngram Statistics Package (NSP) count.pl program is used to obtain the counts. The options specify any additional options to be passed to the count.pl program.


Note: All the options must be within double quotes, as would be passed to the program at command line.

Note: If the --sentence option is defined, the surrounding words will come from the sentence that contains the target word. If the --phrase option is defined, the surrounding words will come only from the phrase that contains the target word. Otherwise they come from the abstract.

--stmeasure ``OPTIONS''

Specifies a measure of association to be used with the statistic.pl program in the Ngram Statistics Package. It is used in combination with the --ststat option to perform feature selection for the --stcount option.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--ststat ``OPTIONS''


Specifies any additional options to be passed to the statistic.pl program when using the --stcount and --stmeasure options.

Note: All the options must be within double quotes, as would be passed to the program at command line.

--nspconfig FILE

This can be used in place of, or in conjunction with, the ngramcount, ngrammeasure, ngramstat, cuicount, cuimeasure, cuistat, stcount, stmeasure and ststat options. It allows multiple instances of ngramcount, cuicount and stcount to be specified so that multiple ngrams can be used as features.

An example of the format required for the configuration file is as follows:

ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841

ngramcount:: count:: --ngram 2 --remove 1

cuicount:: count:: --ngram 2 --remove 3 statistic:: ll.pm --score 3.841

stcount:: count:: --ngram 1 --remove 3
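
The layout of the configuration lines above can be read as a feature name, a count:: section, and an optional statistic:: section. The Python sketch below parses one line under that reading. The function name parse_nspconfig_line and the dictionary layout are hypothetical, and the grammar is inferred from the examples rather than taken from the mm2arff.pl source.

```python
def parse_nspconfig_line(line):
    """Split one configuration line into its feature name and option lists."""
    tokens = line.split()
    if not tokens or not tokens[0].endswith("::"):
        raise ValueError("line must start with a feature name followed by '::'")
    entry = {"feature": tokens[0][:-2], "count": [], "statistic": []}
    section = None
    for token in tokens[1:]:
        if token in ("count::", "statistic::"):
            # A section marker switches where later tokens are stored.
            section = token[:-2]
        elif section is None:
            raise ValueError("unexpected token before a section marker: " + token)
        else:
            entry[section].append(token)
    return entry


example = "ngramcount:: count:: --ngram 3 --remove 3 statistic:: ll.pm --score 3.841"
print(parse_nspconfig_line(example))
```

In this reading, the first token of the statistic section (ll.pm in the example) is the measure of association and the remaining tokens are options for statistic.pl.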

--relation

The semantic relation between the semantic type of the target word and the semantic types of the words surrounding the target word. The relations are read in from a file named 'relation', which is located in the 'default_options' directory. The relations come from the UMLS 2007AC documentation.

If you would like to provide your own, the format of the file is as follows:

    semantic_type relation semantic_type

where each semantic type pair and relation are on their own line.

For example:

    acab affects virs
    acab affects vtbt
    acap affects rept
    ...
    ...
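
A file in this format could be loaded into a lookup table along the following lines. This Python sketch is an illustration only: the function name parse_relations is hypothetical, and storing a set of relations per semantic type pair (in case a pair appears with more than one relation) is an assumption about how such a file might be used.

```python
def parse_relations(lines):
    """Map (semantic_type, semantic_type) pairs to their set of relations."""
    relations = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source_type, relation, target_type = fields
        relations.setdefault((source_type, target_type), set()).add(relation)
    return relations


sample = [
    "acab affects virs",
    "acab affects vtbt",
    "acap affects rept",
]
table = parse_relations(sample)
print(table[("acab", "virs")])
```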

--mesh

Uses the MeSH terms assigned to the abstract containing the target word as features.

--punc

Removes unigrams that contain only punctuation from the list of features.

--lc

Lowercases all the words for the --ngramcount option.

--phrase

The context considered is only the features that are in the same phrase as the target word. The phrases are identified by MetaMap.

Note: If the phrase option is set then the --sentence and --line options need to be off.

--sentence

The context considered is only the features that are in the same sentence as the target word.

Note: If the sentence option is set then the --phrase and --line options need to be off.

--line

The context considered is the features that are in the same complete line, as defined by the .mm file, as the target word.

Note: If the line option is set then the --sentence and --phrase options need to be off.

Note: This is the default option.
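
The rule that --phrase, --sentence and --line exclude one another, with --line as the default, can be sketched in Python as below. The function name resolve_context is hypothetical. Whether mm2arff.pl itself reports an error when more than one of these options is set is not stated here, so raising an exception is an assumption.

```python
def resolve_context(phrase=False, sentence=False, line=False):
    """Return the single context scope selected by the three options."""
    flags = (("phrase", phrase), ("sentence", sentence), ("line", line))
    chosen = [name for name, enabled in flags if enabled]
    if len(chosen) > 1:
        # Assumption: conflicting options are treated as an error.
        raise ValueError("only one context option may be set: " + ", ".join(chosen))
    # With no option given, fall back to the documented default.
    return chosen[0] if chosen else "line"


print(resolve_context())
print(resolve_context(sentence=True))
```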

--log DIRECTORY

Directory to contain temporary and log files. DEFAULT: mm2arff.log

--help

Displays a quick summary of program options.

--version

Displays the version information.


OUTPUT

mm2arff.pl creates one file in arff format. This file can be used as input training data to the WEKA data mining package.


THE .arff FORMAT

See README.arff.format.pod


THE .mm FORMAT

See README.mm.format.pod


PROGRAM REQUIREMENTS


AUTHOR

 Bridget T. McInnes, University of Minnesota, Twin Cities


COPYRIGHT

Copyright (c) 2007-2008,

 Bridget T. McInnes, University of Minnesota, Twin Cities
 bthomson at cs.umn.edu
 Ted Pedersen, University of Minnesota Duluth
 tpederse at d.umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.