
Taxonomy Learning Component Implementation

The taxonomy learning component creates a dictionary (a taxonomy) of words that are useful for tagging documents. Taking existing documents as input, the component analyzes the text inside the documents, compares it with existing taxonomies and data sources, and suggests new terms for the SME's taxonomy managed in the SemanticApi. Because the trained taxonomy is based on documents provided by the SME, it is specialized for that company: it reflects the words typically used inside the company, which helps employees find the right tags for new documents and find existing documents based on those tags.

Relations between terms (broader/narrower terms, hierarchical relations, related terms) and possible alternative spellings can be extracted from existing data sources; for example, DBpedia and other linked open data can be used.
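As an illustration of this kind of lookup (not the component's actual code), the following minimal sketch uses Apache Jena to ask DBpedia's public SPARQL endpoint for the broader categories of a DBpedia category. The example category URI is a made-up starting point; in practice the category would come from matching an extracted term, and the real component may use different queries or services.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class DbpediaBroaderLookup {
    public static void main(String[] args) {
        // Example category chosen for illustration only; in practice it would
        // come from matching an extracted term against DBpedia.
        String category = "http://dbpedia.org/resource/Category:Knowledge_management";
        String queryString =
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " +
                "SELECT ?broader WHERE { <" + category + "> skos:broader ?broader }";
        Query query = QueryFactory.create(queryString);
        try (QueryExecution qexec =
                     QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println("broader category: " + row.getResource("broader").getURI());
            }
        }
    }
}

Similar queries against labels and redirect information can be used to collect alternative spellings.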

How to use it

The component is a separate application written in Java and is run from the command line. To run the application, use the command

runAll.sh

It will read its configuration from config2. To configure it to run on a dataset called "mydataset", use

NAME="mydataset"

This will analyze the text files inside the data-file-folder ./data/mydataset. The algorithm does the following (a simplified sketch of the extraction and ranking steps follows the list):

  • read all files in the data-file-folder and extract relevant terms using eu.organik.ontolearn.TermExtractor
  • match the terms against the public linked open data source DBpedia using eu.organik.ontolearn.DBpediaTagger
  • rank and filter the terms according to relevance
  • call eu.organik.ontolearn.DBpediaLookUp to create a SKOS output file
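The eu.organik.ontolearn classes above are the component's own; their interfaces are not documented on this page. Purely as an illustration of what the extraction and ranking steps amount to, here is a minimal, self-contained sketch that reads the files in ./data/mydataset and ranks terms by frequency. The length and frequency thresholds are arbitrary choices for the example, not the component's actual relevance measure.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class SimpleTermRanker {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Read every text file in the data-file-folder used in the example above.
        try (Stream<Path> files = Files.walk(Paths.get("./data/mydataset"))) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    for (String token : text.toLowerCase().split("\\W+")) {
                        if (token.length() > 3) {            // crude noise filter, illustrative only
                            counts.merge(token, 1, Integer::sum);
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Skipping unreadable file: " + file);
                }
            });
        }
        // Keep only terms that occur often enough to be interesting tag candidates.
        counts.entrySet().stream()
              .filter(entry -> entry.getValue() >= 5)
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(50)
              .forEach(entry -> System.out.println(entry.getKey() + "\t" + entry.getValue()));
    }
}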

The result is a SKOS thesaurus containing interesting terms that can now be used as tags. Take the resulting SKOS thesaurus from results/mydataset/taxonomy_skos.rdf and upload it to OrganiK using Drupal/TaxonomyImport. After uploading, the terms are available in the system.
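For readers unfamiliar with SKOS, the sketch below shows roughly the kind of data such a thesaurus file contains, written with Apache Jena. The concept URIs, labels, and the broader link are invented for illustration; the file actually produced by DBpediaLookUp will of course contain the terms extracted from your dataset.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.SKOS;

public class SkosWriter {
    public static void main(String[] args) throws IOException {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("skos", SKOS.getURI());

        // Hypothetical namespace for the example; the real component defines its own URIs.
        String ns = "http://example.org/mydataset/taxonomy#";

        // Two invented concepts linked by skos:broader, each with a preferred label.
        Resource broader = model.createResource(ns + "knowledge_management")
                                .addProperty(RDF.type, SKOS.Concept)
                                .addProperty(SKOS.prefLabel, "knowledge management", "en");

        model.createResource(ns + "taxonomy_learning")
             .addProperty(RDF.type, SKOS.Concept)
             .addProperty(SKOS.prefLabel, "taxonomy learning", "en")
             .addProperty(SKOS.broader, broader);

        Files.createDirectories(Paths.get("results/mydataset"));
        try (FileOutputStream out = new FileOutputStream("results/mydataset/taxonomy_skos.rdf")) {
            model.write(out, "RDF/XML");
        }
    }
}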

How it works

The details of the algorithm are currently being written up for publication.

Source code