Version 6 (modified by sauermann, 9 years ago)

Taxonomy Learning Component Implementation

The taxonomy learning component creates a dictionary (a taxonomy) of words that are useful for tagging documents. Taking existing documents as input, it analyzes their text and compares it with existing taxonomies or existing data. It then suggests new terms for the SME's taxonomy, which is managed in the SemanticApi. Because the trained taxonomy is based on documents provided by the SME, it is specific to that company: it reflects the words typically used inside the company, and thus helps employees find the right tags for new documents and find existing documents by their tags.

The relations between terms (broader/narrower terms, hierarchical relations, related terms) and possible alternative spellings can be extracted from existing data sources, for example DBPedia and other linked open data.

How to use it

The component is run from the command line. It is a separate application written in Java, Python, and bash. To run the application on a Unix/Linux server, use the command

runAll.sh

It reads the configuration file in config2. To configure it to run on a dataset called "mydataset", use

NAME="mydataset"

This will analyze text files inside the data-file-folder ./data/mydataset. The algorithm will do the following:

  • read all files in the data-file-folder and extract relevant terms using eu.organik.ontolearn.TermExtractor
  • match the terms against the public linked open data source DBPedia using eu.organik.ontolearn.DBpediaTagger
  • rank and filter the terms according to relevance
  • call eu.organik.ontolearn.DBpediaLookUp to create a SKOS output file

The result is a SKOS thesaurus containing relevant terms that can be used as tags. Take the resulting SKOS thesaurus from results/mydataset/taxonomy_skos.rdf and upload it to OrganiK using Drupal/TaxonomyImport. After uploading, the terms are available in the system.
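Before uploading, it can help to check which terms ended up in the thesaurus. The following is a minimal sketch, assuming the output file is RDF/XML and stores each term as a skos:prefLabel element; neither detail is confirmed on this page. The function name pref_labels and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard SKOS core namespace.
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

def pref_labels(rdf_xml):
    """Return the sorted text of all skos:prefLabel elements in an RDF/XML document."""
    root = ET.fromstring(rdf_xml)
    labels = []
    for element in root.iter("{%s}prefLabel" % SKOS_NS):
        if element.text and element.text.strip():
            labels.append(element.text.strip())
    return sorted(labels)

# Hypothetical sample; the real file layout is an assumption.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://dbpedia.org/resource/Julius_Caesar">
    <skos:prefLabel xml:lang="en">Julius Caesar</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://dbpedia.org/resource/Mark_Antony">
    <skos:prefLabel xml:lang="en">Mark Antony</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

print(pref_labels(SAMPLE))
```

To inspect a real result, the contents of results/mydataset/taxonomy_skos.rdf could be read as bytes and passed to pref_labels in place of SAMPLE.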

How it works

Before going into details: the term extraction used by this component is described at ContentAnalyserImplementation#TermExtractor.

The bootstrapping process takes as input all content existing in OrganiK (or imported into OrganiK). The text sources are analysed with natural language processing techniques and post-processed to create a taxonomy. Overall, taxonomy learning includes the following steps: term extraction, finding synonyms, identification of concepts, and placing concepts into a hierarchy.

In the first step, all sentences in the text are parsed to identify noun phrases ("NP"). OpenNLP is used to generate parse trees of the sentences. In the next step, matching DBPedia resources are identified for each noun phrase. Here, DBPedia serves as background knowledge for evaluating how useful a found string would be in a taxonomy. The string representation of a noun phrase is matched against labels from DBPedia, and partial matches are included. The assumption behind this approach is that noun phrases mentioned on DBPedia are known to a broad audience and can serve as taxonomy terms for the SME in question. For example, the noun phrase "Julius Caesar's Conquests" matches both "Julius Caesar" and "Military career of Julius Caesar" on DBPedia. This fuzzy matching may find too many candidates in the background knowledge, so the list of candidates is ranked by what we refer to as the DBPedia index.
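The partial matching described above can be illustrated with a short sketch. It is not the component's own code, and it does not reproduce the DBPedia index: token overlap (Jaccard similarity) is used here as a stand-in measure, and the function names and the small label list are hypothetical.

```python
import re

def tokens(text):
    """Lower-case a string, drop possessive 's, and split it into word tokens."""
    text = re.sub(r"'s\b", "", text.lower())
    return re.findall(r"[a-z0-9]+", text)

def match_index(noun_phrase, label):
    """Jaccard overlap of the two token sets, a value in [0, 1].

    Stand-in measure: an assumption, not the original formula.
    """
    a, b = set(tokens(noun_phrase)), set(tokens(label))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def candidates(noun_phrase, labels):
    """Return (label, score) pairs for all labels sharing at least one token, best first."""
    scored = [(label, match_index(noun_phrase, label)) for label in labels]
    hits = [(label, score) for label, score in scored if score > 0.0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for labels fetched from DBPedia.
LABELS = ["Julius Caesar", "Military career of Julius Caesar", "Mark Antony"]

for label, score in candidates("Julius Caesar's Conquests", LABELS):
    print(f"{score:.2f}  {label}")
```

With this stand-in measure, both labels containing "Julius Caesar" are returned as partial matches, the shorter one first, and "Mark Antony" is dropped because it shares no token with the noun phrase.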

The rank of the resources found on DBPedia is a weighted combination of three scores:

  • percentile rank of the term
  • inverse length (1/(number of words in the term))
  • DBPedia match index (how well a term matches any DBPedia term)

Only terms with a score higher than a threshold K are chosen as concepts to be included in the taxonomy. The result of the ranking step is a set of DBPedia concepts that possibly match keywords found in the text.
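The ranking and filtering can be sketched as follows. The page does not state the weights or the value of the threshold, so the equal weights, the threshold of 0.5, and all input numbers below are assumptions chosen only to make the example concrete; the function names are hypothetical as well.

```python
def rank_score(percentile_rank, term, match_index, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the three scores; each input is expected in [0, 1].

    The equal default weights are an assumption, not taken from the component.
    """
    n_words = len(term.split())
    inverse_length = 1.0 / n_words if n_words else 0.0
    w_percentile, w_length, w_match = weights
    return (w_percentile * percentile_rank
            + w_length * inverse_length
            + w_match * match_index)

def select_concepts(scored_terms, threshold_k):
    """Keep only the terms whose score is strictly higher than the threshold K."""
    return [term for term, score in scored_terms if score > threshold_k]

# Illustrative inputs: (term, percentile rank, match index). Made-up values.
EXAMPLES = [
    ("Julius Caesar", 0.9, 1.0),
    ("Military career of Julius Caesar", 0.4, 0.5),
]

scored = [(term, rank_score(p, term, m)) for term, p, m in EXAMPLES]
print(select_concepts(scored, threshold_k=0.5))
```

Under these assumed values the two-word term scores about 0.8 and the five-word term about 0.37, so only the former passes the threshold. The inverse-length score is what favours short terms here.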

After ranking, terms are positioned in a hierarchy based on narrower and broader relations in DBPedia. These existing SKOS relations (broader and narrower are properties defined in the SKOS vocabulary that express taxonomic structure) already form a hierarchy in DBPedia, which can be adopted for the SME case. All broader and narrower relations {r} of the found DBPedia resources are collected and kept in a buffer. When a relation r connects two noun phrases found in the overall SME corpus, r is a good candidate for the hierarchy and is adopted with a high rank. Indirect relations can also be used, when two hierarchical relations {r1, r2} involve two noun phrases from the corpus and an additional new resource n. The new resource n can be either a broader topic of both phrases or an intermediate topic in a larger hierarchy. For example, when the resources Julius Caesar and Mark Antony are both mentioned, their common broader resource Ancient Roman generals will be included in the resulting taxonomy.
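A minimal sketch of this step, assuming the broader relations have already been collected into a dictionary. It covers direct relations and the case where a new resource n is a broader topic of two or more corpus terms; the intermediate-topic case is left out for brevity. The function name, the data structure, and the sample relations are hypothetical.

```python
def build_hierarchy(corpus_terms, broader):
    """Split buffered broader relations into direct and indirect candidates.

    corpus_terms: resources matched by noun phrases in the corpus.
    broader: dict mapping a resource to the set of its broader resources.
    Returns (direct, indirect) where
      direct   is a set of (narrower, broader) pairs with both ends in the corpus,
      indirect maps a new resource n to the sorted corpus terms it is broader than,
               kept only if n covers at least two corpus terms.
    """
    corpus = set(corpus_terms)
    direct = set()
    shared = {}
    for term in corpus:
        for parent in broader.get(term, ()):
            if parent in corpus:
                direct.add((term, parent))
            else:
                shared.setdefault(parent, set()).add(term)
    indirect = {n: sorted(children) for n, children in shared.items()
                if len(children) >= 2}
    return direct, indirect

# Sample relations modelled on the example in the text (illustrative only).
BROADER = {
    "Julius Caesar": {"Ancient Roman generals"},
    "Mark Antony": {"Ancient Roman generals"},
}

direct, indirect = build_hierarchy(["Julius Caesar", "Mark Antony"], BROADER)
print(direct)
print(indirect)
```

Here no direct relation exists between the two corpus terms, but "Ancient Roman generals" is broader than both and is therefore proposed as a new resource for the taxonomy.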

Source code