Changes between Initial Version and Version 1 of ContentAnalyser/ContentAnalyserImplementation

07/21/10 13:31:59 (12 years ago)



= Content Analyser Implementation / OrganikTextAnalyzerServlet =
OrganikTextAnalyzerServlet is the content analyser implementation for OrganiK. It is a Java component that reads plain text and suggests tags for it. The tags are generated from words that appear in the text. To improve the quality of the suggestions, tags are sorted by importance: words that seem statistically relevant to a text are returned first. The component can therefore be used to extract tag candidates from texts. It does not suggest keywords based on a semantic understanding of the text; for example, it cannot suggest "medicine" for a medical text unless that word occurs in it.
== How to use it ==
The content analyser runs as a servlet (a server-side Java component without a user interface) on the same server as the OrganiK installation. It is embedded into Drupal through the [wiki:Drupal/OrganikNLP OrganikNLP module]. When configured correctly, it suggests tags whenever text is modified.
== How it works ==
Inside the project, the only exposed servlet is the !TextAnalyzerServlet, at the address {{{/TextAnalyzerServlet}}} relative to the web application. The servlet takes two arguments:
 * content - the plain-text content to analyse
 * defIDF - the default inverse document frequency (a floating-point number, optional)
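The two arguments above can be put into a request roughly as follows. This is only a sketch: the class {{{AnalyzerRequest}}} and its methods are hypothetical helpers, and this page does not state whether the servlet expects GET or POST or what the response looks like, so only the parameter encoding and the servlet path are shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the OrganiK code base.
final class AnalyzerRequest {

    /** Form-encodes the two documented parameters; defIDF is left out when null. */
    static String encodeParams(String content, Double defIDF) {
        try {
            StringBuilder sb = new StringBuilder();
            sb.append("content=").append(URLEncoder.encode(content, "UTF-8"));
            if (defIDF != null) {
                // defIDF is optional, so it is only sent when a value is given
                sb.append("&defIDF=").append(URLEncoder.encode(defIDF.toString(), "UTF-8"));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Appends the servlet path named on this page to the web application's base URL. */
    static String servletUrl(String webAppBase) {
        String base = webAppBase.endsWith("/")
                ? webAppBase.substring(0, webAppBase.length() - 1)
                : webAppBase;
        return base + "/TextAnalyzerServlet";
    }
}
```

The encoded string can be used as a query string or as a form body, depending on what the deployed servlet accepts.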
It extracts terms from this text using the class {{{eu.organik.ontolearn.TermExtractor}}} from TaxonomyLearningImplementation. The text is parsed for the linguistic patterns 'NN', 'NP' and 'JJ NN'. The resulting keywords can be single-term or multi-term strings. Each keyword is ranked using the rank returned by the !TermExtractor and is additionally weighted by the keyword's inverse document frequency (in Drupal's index, the IDF is the "count" in table "word_list").
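The ranking step can be illustrated with a small sketch. The page does not say how the extractor rank and the IDF are combined, so the multiplication below is an assumption, as is the use of defIDF as the fallback for keywords that have no IDF entry. The class {{{KeywordRanking}}} is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real servlet may combine the two values differently.
final class KeywordRanking {

    /** Assumed score: extractor rank times IDF, with defIDF used when no IDF is known. */
    static double score(double extractorRank, Double idf, double defIDF) {
        return extractorRank * (idf != null ? idf : defIDF);
    }

    /** Returns the keywords ordered from highest to lowest score. */
    static List<String> rank(Map<String, Double> extractorRanks,
                             Map<String, Double> idfByKeyword,
                             double defIDF) {
        List<String> keywords = new ArrayList<>(extractorRanks.keySet());
        keywords.sort(Comparator.comparingDouble(
                (String k) -> score(extractorRanks.get(k), idfByKeyword.get(k), defIDF))
                .reversed());
        return keywords;
    }
}
```

With this weighting, a keyword that the extractor ranks highly but that is very common in the index can still end up below a rarer keyword.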
The term extractor from TaxonomyLearningImplementation is also explained here, as it is primarily a !ContentAnalyser component and only secondarily used for taxonomy learning.
=== !TermExtractor ===
The term extractor takes plain text as input and returns terms. It first uses an OpenNLP sentence detector and tokenizer to tokenize the string. An OpenNLP parser then dissects the sentence structure into individual tokens in a parse tree. In that form, the tokens can be filtered for the configured patterns, which are 'NN', 'NP' and 'JJ NN'. Internally, the parser is based on trained language models, which have to be loaded beforehand. The language models can be trained automatically from annotated text corpora, which Gunnar Grimnes did for DFKI for a German corpus. An English model is part of the normal OpenNLP distribution, and this model is used.
Note that the !TermExtractor depends on a configured language. In our tests, the extractor is hardcoded to English, but with a trained German model it is also possible to analyse German texts.
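The pattern filtering can be sketched on a flat sequence of tagged tokens. This is a simplification under stated assumptions: the real !TermExtractor works on an OpenNLP parse tree, where 'NP' is a phrase-level label, while the hypothetical class {{{PatternFilter}}} below only compares tag sequences and does not call OpenNLP at all. The tags are assumed to be supplied by an earlier tagging step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the pattern filter; not the OrganiK implementation.
final class PatternFilter {

    /** The patterns named on this page, each written as a sequence of tags. */
    static final List<List<String>> PATTERNS = Arrays.asList(
            Arrays.asList("NN"),
            Arrays.asList("NP"),
            Arrays.asList("JJ", "NN"));

    /** Returns every run of tokens whose tag sequence equals one of the patterns. */
    static List<String> extract(String[] tokens, String[] tags) {
        List<String> terms = new ArrayList<>();
        List<String> tagList = Arrays.asList(tags);
        for (int start = 0; start < tokens.length; start++) {
            for (List<String> pattern : PATTERNS) {
                int end = start + pattern.size();
                if (end <= tokens.length && tagList.subList(start, end).equals(pattern)) {
                    // join the matched tokens into a single- or multi-term keyword
                    terms.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
                }
            }
        }
        return terms;
    }
}
```

For the tokens "The semantic wiki supports tagging" with the tags DT, JJ, NN, VBZ, NN, this sketch yields the multi-term keyword "semantic wiki" as well as the single terms "wiki" and "tagging".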
== Configuration ==
Configuration of the content analyser is described at AdministratorTutorial/Installation. The most important points are:
 * The OrganikTextAnalyzerServlet runs as a web application. It is usually reachable at a port and web address such as http://localhost:8180/OrganikTextAnalyzerServlet/ (replace localhost with your server name).
The web.xml contains the parameters needed to connect to the MySQL database used by Drupal. These need to be adapted to your installation. Example:
{{{
<init-param>
    <param-name>db.url</param-name>
    <param-value>jdbc:mysql://localhost:3306/organik</param-value>
</init-param>
<init-param>
    <param-name>db.user</param-name>
    <param-value>organik</param-value>
</init-param>
<init-param>
    <param-name>db.pass</param-name>
    <param-value>secret</param-value>
</init-param>
}}}
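A servlet could consume these three parameters roughly as sketched below. The class {{{DbSettings}}} is hypothetical and is not taken from the OrganiK source code. The lookup function stands in for {{{ServletConfig.getInitParameter}}}, so that the sketch does not depend on the servlet API, and {{{open()}}} assumes that a MySQL JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.function.Function;

// Hypothetical sketch of reading the db.* init parameters shown above.
final class DbSettings {
    final String url;
    final String user;
    final String pass;

    private DbSettings(String url, String user, String pass) {
        this.url = url;
        this.user = user;
        this.pass = pass;
    }

    /** Reads db.url, db.user and db.pass through the given lookup function. */
    static DbSettings fromInitParams(Function<String, String> initParam) {
        return new DbSettings(
                required(initParam, "db.url"),
                required(initParam, "db.user"),
                required(initParam, "db.pass"));
    }

    private static String required(Function<String, String> initParam, String name) {
        String value = initParam.apply(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing init-param: " + name);
        }
        // tolerate stray whitespace around the configured value
        return value.trim();
    }

    /** Opens a JDBC connection with the configured values. */
    Connection open() throws SQLException {
        return DriverManager.getConnection(url, user, pass);
    }
}
```

Inside a servlet's {{{init()}}} method, the lookup could be passed as {{{getServletConfig()::getInitParameter}}}. Failing early on a missing parameter makes a misconfigured web.xml visible at startup instead of at the first request.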
== Source code ==