
Taxonomy Learning Component Implementation

The taxonomy learning component creates a dictionary (a taxonomy) of words that are useful for tagging documents. Taking existing documents as input, it analyzes their text and compares it with existing taxonomies or existing data, then suggests new terms for the SME's taxonomy managed in the SemanticApi. The trained taxonomy is specialized for the SME: because it is based on documents provided by the SME, it reflects the words typically used inside the company. This helps employees find the right tags for new documents and find existing documents by their tags.

The relations between terms (broader/narrower terms, hierarchical relations, related terms) and possible alternative spellings can be extracted from existing data sources. For example, DBpedia and other linked open data can be used.

First-time Taxonomy Learning

The taxonomy learning component is separate from Drupal and is written in Java. For details on how to download and run it, see: http://organik.opendfki.de/wiki/AdministratorTutorial/TaxonomyLearning

Running it analyzes the text of the nodes in your Drupal installation. The result is a SKOS thesaurus containing interesting terms that can be used as tags. Upload the resulting SKOS thesaurus to OrganiK using Drupal/TaxonomyImport. After uploading, the terms are available in the system.
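To give an idea of what a term in such a SKOS thesaurus looks like, the following minimal sketch renders one concept in Turtle syntax. Only the SKOS properties (skos:Concept, skos:prefLabel, skos:altLabel, skos:broader) come from the SKOS vocabulary; the class SkosSketch, its methods, and the example.org namespace are assumptions made for illustration, and the actual serialization written by the taxonomy learner may differ.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of OrganiK): renders one taxonomy term as a SKOS concept in Turtle. */
public class SkosSketch {

    // ASSUMPTION: placeholder namespace, chosen for illustration only.
    static final String BASE = "http://example.org/taxonomy/";

    /** Turns a label into a URI-safe local name by replacing runs of other characters with "_". */
    static String localName(String label) {
        return label.trim().replaceAll("[^A-Za-z0-9]+", "_");
    }

    /** Escapes backslashes and double quotes for use inside a Turtle string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Serializes a concept with its preferred label, alternative labels, and an optional broader term. */
    static String toTurtle(String prefLabel, List<String> altLabels, String broaderLabel) {
        StringBuilder sb = new StringBuilder();
        sb.append("<").append(BASE).append(localName(prefLabel)).append("> a skos:Concept ;\n");
        sb.append("    skos:prefLabel \"").append(escape(prefLabel)).append("\"@en");
        for (String alt : altLabels) {
            sb.append(" ;\n    skos:altLabel \"").append(escape(alt)).append("\"@en");
        }
        if (broaderLabel != null) {
            sb.append(" ;\n    skos:broader <").append(BASE).append(localName(broaderLabel)).append(">");
        }
        return sb.append(" .\n").toString();
    }

    public static void main(String[] args) {
        System.out.println("@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n");
        System.out.print(toTurtle("Julius Caesar", Arrays.asList("Caesar"), "Ancient Roman generals"));
    }
}
```

The broader link is what carries the hierarchy, and alternative spellings map naturally onto skos:altLabel.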

Ontology Refinement

Ontology refinement looks at terms added by users and suggests places in the taxonomy where each term can be placed. It is run from the command line and works with the Drupal database, which must be specified through command-line arguments.

The result is a set of new hierarchical structures in the taxonomy.

The refiner can be run nightly by configuring a cron job on the server.

Sourcecode: source:trunk/OrganikOntologyLearning/src/eu/organik/ontolearn/Refiner.java

How it works

  • The input text is either read from files in a folder or extracted from Drupal nodes. HTML tags and similar markup are stripped.
  • The candidate terms are matched against the public linked open data project DBpedia (source:trunk/OrganikOntologyLearning/src/eu/organik/ontolearn/dbpedia/DBpediaTagger.java). Here, DBpedia serves as background knowledge for evaluating how useful the found strings would be in a taxonomy. The string representation of a noun phrase is matched against labels from DBpedia; partial matches are included. The assumption behind this approach is that noun phrases mentioned on DBpedia are known to a broad audience and can serve as taxonomy terms for the SME in question.
  • The terms with DBpedia matches are ranked by their frequency combined with how well the term matched the DBpedia article label.
  • Terms with a score lower than some threshold K are discarded. (The threshold can be set with the command-line option --taxonomyRankingThreshold.)
  • Next, implicit terms are added, i.e. terms that are connected to two or more of our terms in DBpedia but were not found in the original texts. For example, when both the resources Julius Caesar and Mark Antony are mentioned, their shared broader resource Ancient Roman generals is included in the resulting taxonomy. The new term can be either a broader topic of both phrases or an intermediate topic in a larger hierarchy.
  • Next, some spreading activation is performed. This helps focus on the parts of the found taxonomy that are well connected. The degree of spreading activation can be adjusted with the options spreadingChildFlow, spreadingParentFlow, spreadingRelatedFlow, and spreadingIterations.
  • Finally, a SKOS file with the top-ranked terms is output. The number of terms to output can be configured with the --taxonomyNoTerms option.
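The ranking, threshold, and implicit-term steps in the list above can be illustrated with the following minimal sketch. It is not the code from the OrganiK repository: the class and method names are hypothetical, the broader-term map stands in for a DBpedia lookup, and multiplying frequency by match quality is an assumption, since this page does not state how the two are combined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of ranking, thresholding and implicit-term discovery. */
public class TaxonomyPipelineSketch {

    /** A candidate term with its text frequency and its label match quality in [0, 1]. */
    static final class Candidate {
        final String term;
        final int frequency;
        final double matchQuality;

        Candidate(String term, int frequency, double matchQuality) {
            this.term = term;
            this.frequency = frequency;
            this.matchQuality = matchQuality;
        }

        // ASSUMPTION: a simple product; the real combination is not described here.
        double score() {
            return frequency * matchQuality;
        }
    }

    /** Discards candidates scoring below the threshold and sorts the rest, best first. */
    static List<Candidate> rankAndFilter(List<Candidate> candidates, double threshold) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.score() >= threshold) {
                kept.add(c);
            }
        }
        kept.sort((a, b) -> Double.compare(b.score(), a.score()));
        return kept;
    }

    /** Returns broader resources shared by at least two kept terms that are not kept terms themselves. */
    static Set<String> implicitTerms(Set<String> keptTerms, Map<String, Set<String>> broaderOf) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : keptTerms) {
            for (String broader : broaderOf.getOrDefault(term, Collections.<String>emptySet())) {
                counts.merge(broader, 1, Integer::sum);
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= 2 && !keptTerms.contains(e.getKey())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Candidate> kept = rankAndFilter(Arrays.asList(
                new Candidate("Julius Caesar", 4, 1.0),
                new Candidate("Mark Antony", 3, 0.9),
                new Candidate("thing", 1, 0.2)), 1.0);

        Set<String> keptTerms = new HashSet<>();
        for (Candidate c : kept) {
            keptTerms.add(c.term);
        }

        // Stand-in for the broader resources that DBpedia would return for each term.
        Map<String, Set<String>> broaderOf = new HashMap<>();
        broaderOf.put("Julius Caesar", new HashSet<>(Arrays.asList("Ancient Roman generals", "Roman dictators")));
        broaderOf.put("Mark Antony", new HashSet<>(Arrays.asList("Ancient Roman generals")));

        System.out.println(implicitTerms(keptTerms, broaderOf));
    }
}
```

In this sketch only Ancient Roman generals qualifies as an implicit term, because it is the only broader resource shared by two kept terms.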

For a more technical description of the process, see our recent paper at i-Semantics'10: Gunnar Aastrand Grimnes, Remzi Celebi and Leo Sauermann, Using Linked Open Data to bootstrap corporate Knowledge Management in the OrganiK Project.
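As a rough illustration of the spreading activation step, the sketch below lets every term pass a fraction of its score along its links for a fixed number of rounds, so that well-connected terms end up with higher scores. The parameter names mirror the options spreadingChildFlow, spreadingParentFlow, spreadingRelatedFlow, and spreadingIterations, but how the real implementation uses them is not described on this page; the update rule, the class, and the edge model here are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of spreading activation over a small term graph. */
public class SpreadingActivationSketch {

    /** Link type as seen from the source term: to a parent, to a child, or to a related term. */
    enum Link { PARENT, CHILD, RELATED }

    /** A directed edge from one term to another. */
    static final class Edge {
        final String from;
        final String to;
        final Link type;

        Edge(String from, String to, Link type) {
            this.from = from;
            this.to = to;
            this.type = type;
        }
    }

    /**
     * ASSUMED update rule: in each round a term keeps its activation and additionally
     * receives, over every incoming edge, the source's activation from the previous
     * round multiplied by the flow factor of that edge's link type.
     */
    static Map<String, Double> spread(Map<String, Double> initial, List<Edge> edges,
                                      double childFlow, double parentFlow,
                                      double relatedFlow, int iterations) {
        Map<String, Double> current = new HashMap<>(initial);
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>(current);
            for (Edge e : edges) {
                double flow;
                switch (e.type) {
                    case CHILD:  flow = childFlow;  break;
                    case PARENT: flow = parentFlow; break;
                    default:     flow = relatedFlow;
                }
                double source = current.getOrDefault(e.from, 0.0);
                next.merge(e.to, source * flow, Double::sum);
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("Julius Caesar", 1.0);
        scores.put("Mark Antony", 1.0);
        scores.put("Ancient Roman generals", 0.5);
        scores.put("Unconnected term", 0.8);

        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("Julius Caesar", "Ancient Roman generals", Link.PARENT));
        edges.add(new Edge("Mark Antony", "Ancient Roman generals", Link.PARENT));

        Map<String, Double> result = spread(scores, edges, 0.2, 0.5, 0.1, 1);
        for (Map.Entry<String, Double> e : result.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With these example values, the shared parent overtakes the unconnected term after one round, which is the focusing effect the step is meant to achieve.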

Sourcecode

Last modified on 07/28/10 13:56:57