Training n-gram NER with Stanford NLP

July 15, 2010

I have recently been trying to train n-gram entities with Stanford CoreNLP. I have followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I am able to specify only unigram tokens and the class each one belongs to. Can anyone guide me on how to extend this to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please correct me if I have misinterpreted the Stanford tutorials and the same approach can in fact be used for n-gram training.

What I am stuck on is the following property:

#structure of training file; tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

Here the first column is the word (a unigram) and the second column is the entity, for example:

CHAPTER	O
I	O
Emma	PERS
Woodhouse	PERS

Now I need to train known entities (say, movie names). For single-word titles such as Hulk or Titanic, this approach is easy. But if I need to train multi-word titles such as "I Know What You Did Last Summer" or "Baby's Day Out", what is the best approach?
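To make the question concrete: one common convention for token-per-line training data is to give every token of a multi-word entity the same label. The sketch below only produces such two-column lines; it does not train anything, and whether this works well with the Stanford classifier on chat data is not verified in this post. The class name ColumnTrainingData, the label MOVIE, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnTrainingData {

    // Returns one "token<TAB>label" line per whitespace-separated token.
    // Tokens that are part of a known title get entityLabel, all others get "O".
    // Matching is case-insensitive and works on whole tokens only (an assumption).
    static List<String> label(String sentence, List<String> titles, String entityLabel) {
        String[] tokens = sentence.trim().split("\\s+");
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "O");

        for (String title : titles) {
            String[] titleTokens = title.trim().split("\\s+");
            // Slide the title over the sentence and compare token by token.
            for (int i = 0; i + titleTokens.length <= tokens.length; i++) {
                boolean match = true;
                for (int j = 0; j < titleTokens.length; j++) {
                    if (!tokens[i + j].equalsIgnoreCase(titleTokens[j])) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    for (int j = 0; j < titleTokens.length; j++) {
                        labels[i + j] = entityLabel;
                    }
                }
            }
        }

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            lines.add(tokens[i] + "\t" + labels[i]);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Baby's Day Out", "Titanic");
        for (String line : label("we watched Baby's Day Out yesterday", titles, "MOVIE")) {
            System.out.println(line);
        }
    }
}
```

Punctuation attached to a token (for example "Titanic!") would not match here; a real tokenizer would be needed for chat text.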

It had been a long wait here for an answer. I have not been able to figure out a way to get this done using Stanford CoreNLP. However, mission accomplished: I have used the LingPipe NLP libraries for the same. I am quoting the answer here because I think someone else may benefit from it.

Please check out the LingPipe licensing before diving into an implementation, whether you are a developer, a researcher, or whatever.

LingPipe provides various NER methods:

1) Dictionary-based NER

2) Statistical NER (HMM-based)

3) Rule-based NER, etc.

I have used the dictionary and the statistical approaches.

The first one is a direct lookup methodology and the second one is training-based.

An example of dictionary-based NER can be found here
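To illustrate the idea behind the dictionary approach, here is a minimal plain-Java sketch of phrase lookup. It is not LingPipe's dictionary chunker and does not use any LingPipe class; the class DictionaryNerSketch, its Match type, and the word-boundary rule are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class DictionaryNerSketch {

    // One match: character offsets into the input text plus the entity type.
    static final class Match {
        final int start;
        final int end;
        final String type;

        Match(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Finds every case-insensitive occurrence of each dictionary phrase that is
    // delimited by non-alphanumeric characters (or the text boundaries).
    // Assumes lowercasing does not change the string length, which holds for
    // ordinary English text.
    static List<Match> find(String text, Map<String, String> dictionary) {
        List<Match> matches = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String phrase = entry.getKey().toLowerCase(Locale.ROOT);
            if (phrase.isEmpty()) {
                continue; // an empty phrase would match everywhere
            }
            int from = lower.indexOf(phrase);
            while (from >= 0) {
                int end = from + phrase.length();
                if (isBoundary(lower, from - 1) && isBoundary(lower, end)) {
                    matches.add(new Match(from, end, entry.getValue()));
                }
                from = lower.indexOf(phrase, from + 1);
            }
        }
        return matches;
    }

    // True if index i is outside the string or is not a letter or digit.
    static boolean isBoundary(String s, int i) {
        return i < 0 || i >= s.length() || !Character.isLetterOrDigit(s.charAt(i));
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("Baby's Day Out", "MOVIE");
        dictionary.put("Titanic", "MOVIE");

        String text = "have you seen baby's day out or titanic?";
        for (Match m : find(text, dictionary)) {
            System.out.println(text.substring(m.start, m.end) + " >> " + m.type);
        }
    }
}
```

Because the whole phrase is matched as one string, multi-word titles need no special handling, which is why this approach is the direct one.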

The statistical approach requires a training file. I have used a file in the following format:

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX> to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX> annotated </s>
</root>
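Writing such a file by hand is tedious when the entities are already known. The following is a rough sketch, in plain Java, of how sentences could be annotated automatically from a list of known names. The class Muc6Annotator and its methods are hypothetical helpers, not part of LingPipe, and the tags are written in uppercase (ENAMEX, TYPE) following the MUC-6 convention. The type value is assumed to be a simple name that needs no escaping.

```java
import java.util.Arrays;
import java.util.List;

class Muc6Annotator {

    // Escapes the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // True if index i is outside the string or is not a letter or digit.
    static boolean isBoundary(String s, int i) {
        return i < 0 || i >= s.length() || !Character.isLetterOrDigit(s.charAt(i));
    }

    // Wraps each case-insensitive occurrence of a known entity in an ENAMEX tag
    // and the whole sentence in <s>...</s>. The longest entity wins at each position.
    static String annotate(String sentence, List<String> entities, String type) {
        StringBuilder out = new StringBuilder("<s>");
        int i = 0;
        while (i < sentence.length()) {
            String hit = null;
            for (String entity : entities) {
                boolean longer = hit == null || entity.length() > hit.length();
                if (longer && !entity.isEmpty()
                        && sentence.regionMatches(true, i, entity, 0, entity.length())
                        && isBoundary(sentence, i - 1)
                        && isBoundary(sentence, i + entity.length())) {
                    hit = entity;
                }
            }
            if (hit == null) {
                // No entity starts here: copy one escaped character.
                out.append(escape(String.valueOf(sentence.charAt(i))));
                i++;
            } else {
                // Keep the surface form exactly as it appears in the sentence.
                String surface = sentence.substring(i, i + hit.length());
                out.append("<ENAMEX TYPE=\"").append(type).append("\">")
                   .append(escape(surface))
                   .append("</ENAMEX>");
                i += hit.length();
            }
        }
        return out.append("</s>").toString();
    }

    public static void main(String[] args) {
        List<String> movies = Arrays.asList("Titanic", "Baby's Day Out");
        System.out.println("<root>");
        System.out.println(annotate("we saw Titanic & Baby's Day Out", movies, "MOVIE"));
        System.out.println("</root>");
    }
}
```

Escaping matters here because chat text often contains characters like & or <, which would otherwise make the training file invalid XML.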

I used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt"); // annotated file
        File modelFile = new File("outputmodelfile.model");

        // Build the HMM chunker that will be trained on the annotated sentences.
        System.out.println("Setting up chunker estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM, NUM_CHARS, LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory, hmmEstimator);

        // The parser reads the ENAMEX-annotated file and feeds each chunking to the estimator.
        System.out.println("Setting up data parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with data from file=" + corpusFile);
        parser.parse(corpusFile);

        // Serialize the trained model so it can be loaded later for recognition.
        System.out.println("Compiling and writing model to file=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator, modelFile);
    }
}

And to test the NER I used the following class:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        // Load the model written by the training class.
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable
                .readObject(modelFile);
        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> test = chunking.chunkSet();
        // Each chunk carries character offsets into the input and an entity type.
        for (Chunk c : test) {
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}

Code courtesy: Google :)
