Content Analyser Implementation / OrganikTextAnalyzerServlet
OrganikTextAnalyzerServlet is the content analyzer implementation for OrganiK. It is a Java component that reads plain text and suggests tags for it. The tags are generated from words that appear in the text. To improve quality, tags are sorted by importance: words that appear statistically relevant to the text are returned first. The component can therefore be used to extract tag candidates from texts. It does not suggest keywords based on a semantic understanding of the text; for example, it cannot suggest "medicine" for a medical text in which that word does not appear.
How to use it
The content analyser runs as a servlet (a Java web component without a user interface of its own) on the same server as the OrganiK installation. It is embedded into Drupal through the OrganikNLP module. When configured correctly, it suggests tags whenever text is modified.
How it works
Inside the project, the only exposed servlet is the TextAnalyzerServlet at address /TextAnalyzerServlet relative to the web-application. The servlet takes two arguments:
- content - the plain-text content to analyse
- defIDF - the default inverse document frequency (a floating-point number, optional!)
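As an illustration of the two arguments, the following minimal sketch builds a form-encoded parameter string for the servlet. The class name TextAnalyzerRequest and the helper buildQuery are hypothetical, the address assumes the default port shown in the configuration notes on this page, and whether the servlet expects GET or POST is not stated here, so only the parameter encoding is shown.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TextAnalyzerRequest {
    // Assumed address: web-application URL from this page plus /TextAnalyzerServlet.
    static final String SERVLET_URL =
        "http://localhost:8180/OrganikTextAnalyzerServlet/TextAnalyzerServlet";

    // Encodes the mandatory "content" argument and the optional "defIDF" argument.
    // Pass null for defIdf to leave the optional argument out.
    static String buildQuery(String content, Double defIdf) {
        StringBuilder query = new StringBuilder("content=")
            .append(URLEncoder.encode(content, StandardCharsets.UTF_8));
        if (defIdf != null) {
            query.append("&defIDF=")
                 .append(URLEncoder.encode(defIdf.toString(), StandardCharsets.UTF_8));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Prints the full request address for a short sample text.
        System.out.println(SERVLET_URL + "?" + buildQuery("Wiki pages about tagging", 1.5));
    }
}
```

The parameter string can be appended to the address as a query string or sent as a request body, depending on how the servlet is deployed.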
The servlet extracts terms from the content using the class eu.organik.ontolearn.TermExtractor from the TaxonomyLearningImplementation. The text is parsed for the linguistic patterns 'NN', 'NP' and 'JJ NN'. The resulting keywords can be single-term or multi-term strings. Each keyword is ranked using the rank returned by the TermExtractor and is additionally weighted by the inverse document frequency (IDF) of the keyword (in Drupal's index, the IDF is the "count" in table "word_list").
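The weighting step can be sketched as follows. This is an assumption-laden illustration, not the servlet's actual code: it assumes that the weighting is a simple multiplication of rank and IDF, and that defIDF is used for keywords missing from the index. The class KeywordRanker and the method rankKeywords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordRanker {
    // Returns the keywords ordered by (extractor rank * IDF), highest score first.
    // Assumption: keywords without an IDF entry fall back to defIdf.
    static List<String> rankKeywords(Map<String, Double> extractorRank,
                                     Map<String, Double> idfByKeyword,
                                     double defIdf) {
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Double> entry : extractorRank.entrySet()) {
            double idf = idfByKeyword.getOrDefault(entry.getKey(), defIdf);
            score.put(entry.getKey(), entry.getValue() * idf);
        }
        List<String> keywords = new ArrayList<>(score.keySet());
        keywords.sort(Comparator.comparingDouble((String k) -> score.get(k)).reversed());
        return keywords;
    }

    public static void main(String[] args) {
        Map<String, Double> ranks = new HashMap<>();
        ranks.put("wiki", 2.0);
        ranks.put("page", 3.0);
        ranks.put("taxonomy learning", 1.0);
        Map<String, Double> idf = new HashMap<>();
        idf.put("wiki", 1.0);
        idf.put("page", 0.5);
        // "taxonomy learning" is not in the index, so it uses the default IDF of 4.0.
        System.out.println(rankKeywords(ranks, idf, 4.0));
    }
}
```

With these sample values, a keyword that is rare in the index can outrank one that the extractor scored higher, which is the purpose of the IDF weighting.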
The term extractor from TaxonomyLearningImplementation is also explained here, because it is primarily a content analyser and only secondarily used for taxonomy learning.
The term extractor takes plain text as input and returns terms. It first uses an OpenNLP sentence detector and tokenizer to tokenize the string. With an OpenNLP parser, the sentence structure is then dissected into individual tokens in a parse tree. In that form, the tokens can be filtered for the configured patterns, which are 'NN', 'NP' and 'JJ NN'. Internally, the parser is based on trained language models, which have to be loaded beforehand. The language models can be trained automatically from annotated text corpora, which Gunnar Grimnes did for DFKI with a German corpus. An English model is part of the normal OpenNLP distribution, and that model is used.
Note that the TermExtractor depends on a configured language. In our tests the extractor is hardcoded to English, but with a trained German model it is also possible to analyse German texts.
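The pattern-filtering step can be illustrated with a simplified sketch. It does not call OpenNLP and is not the TermExtractor's code: it assumes the text has already been tokenized and labelled, and it matches the patterns against a flat sequence of labels instead of a parse tree. The class PatternFilter and the method extractTerms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PatternFilter {
    // The patterns named on this page, each written as a sequence of labels.
    static final List<String[]> PATTERNS = Arrays.asList(
        new String[] {"JJ", "NN"},
        new String[] {"NN"},
        new String[] {"NP"});

    // Returns every token span whose labels match one of the patterns exactly.
    // tokens[i] is the i-th word and tags[i] is its label.
    static List<String> extractTerms(String[] tokens, String[] tags) {
        List<String> terms = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (String[] pattern : PATTERNS) {
                int end = start + pattern.length;
                if (end <= tokens.length
                        && Arrays.equals(Arrays.copyOfRange(tags, start, end), pattern)) {
                    terms.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "semantic", "wiki", "suggests", "tags"};
        String[] tags = {"DT", "JJ", "NN", "VBZ", "NNS"};
        System.out.println(extractTerms(tokens, tags));
    }
}
```

The sample sentence yields both a multi-term keyword (matching 'JJ NN') and a single-term keyword (matching 'NN'), which mirrors the single-term and multi-term results described above.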
Configuration of the content analyser is described at AdministratorTutorial/Installation. Most important is:
- the OrganikTextAnalyzerServlet runs as a web-application. It is usually reachable on a port and web address, such as this: http://localhost:8180/OrganikTextAnalyzerServlet/ (replace localhost with your server name)
The web.xml contains the parameters needed to connect to the MySQL database used by Drupal; these need to be adapted to your installation. Example:
<init-param>
  <param-name>db.url</param-name>
  <param-value>jdbc:mysql://localhost:3306/organik</param-value>
</init-param>
<init-param>
  <param-name>db.user</param-name>
  <param-value>organik</param-value>
</init-param>
<init-param>
  <param-name>db.pass</param-name>
  <param-value>secret</param-value>
</init-param>