# Changes between Version 7 and Version 8 of TaxonomyLearningImplementation

Timestamp:
06/08/10 18:48:24
The relations between terms (broader/narrower terms, hierarchical relations, related terms), and possible alternative spellings, can be extracted from existing data sources, for example DBPedia and linked open data.

== How to use it ==

The component is run from the command line. It is a separate application written in Java, Python, and bash. To run the application on a Unix/Linux server, use the command

{{{
runAll.sh
}}}

It will read the configuration file in {{{config2}}}. To configure it to run on a dataset called "mydataset", use

{{{
NAME="mydataset"
}}}

== First time Taxonomy Learning ==

The taxonomy learning is separate from Drupal and is written in Java. For details on how to download and run it, please refer to: http://organik.opendfki.de/wiki/AdministratorTutorial/TaxonomyLearning

It analyses the text files inside the data-file-folder {{{./data/mydataset}}} (alternatively, the text can come from the nodes in your Drupal installation). The algorithm does the following:

 * read all files in the data-file-folder and extract relevant terms using !eu.organik.ontolearn.TermExtractor
 * match the terms with the public linked open data base DBPedia using !eu.organik.ontolearn.DBpediaTagger
 * rank and filter the terms according to relevancy
 * call !eu.organik.ontolearn.DBpediaLookUp to create a '''SKOS output file'''

The result is a [wiki:SKOS] thesaurus which contains interesting terms that can now be used as tags. Take the resulting SKOS thesaurus from {{{results/mydataset/taxonomy_skos.rdf}}} and upload it to OrganiK using Drupal/TaxonomyImport. After uploading, the terms are available in the system.

== How it works ==

'''TODO: this is outdated. We do more. Gunnar Grimnes will update this page according to the documentation in our recent paper and give some links to the source.'''

Before going into details: the term extraction used by this component is described at ContentAnalyserImplementation#TermExtractor.

As input for the bootstrapping process, all content existing in OrganiK (or imported into OrganiK) is taken; the input text is either read from files in a folder or extracted from Drupal nodes, and HTML tags etc. are stripped. The text sources are analysed with natural language processing techniques and post-processed to create a taxonomy. Overall, taxonomy learning includes the following steps: ''term extraction'', ''finding synonyms'', ''identification of concepts'', and placing concepts into a ''hierarchy''.

In the first step, all sentences in the text are parsed to identify noun phrases ("NP"). [http://opennlp.sf.net OpenNLP] is used to generate parse trees of the sentences. In the next step, for each noun phrase, matching DBPedia resources are identified. Here, DBPedia is used as ''background knowledge'' to evaluate the usefulness of found strings in a taxonomy. The string representation of a noun phrase is matched with labels from DBPedia; partial matches are included. The assumption behind this approach is that noun phrases mentioned on DBPedia are known to a broad audience and can serve as taxonomy terms for the SME in question. For example, the noun phrase ''Julius Caesar's Conquests'' matches both ''Julius Caesar'' and ''Military career of Julius Caesar'' on DBPedia. Obviously, this fuzzy matching may find too many candidates in the background knowledge. The list of candidates is therefore ranked by what we refer to as the ''DBPedia index''.
Candidate terms are extracted using a machine-learning based noun-phrase chunker: the ''noun chunks'' found in the text are the candidate terms. They are counted, and only the most frequent are kept. We use [http://opennlp.sf.net OpenNLP] for this and have trained our own noun-chunking model for German (source:trunk/OrganikOntologyLearning/src/eu/organik/ontolearn/NPExtractor.java). The same method is used for the term-suggestion servlet described at ContentAnalyserImplementation#TermExtractor.

The candidate terms are then matched against the public linked open data project DBPedia (source:trunk/OrganikOntologyLearning/src/eu/organik/ontolearn/dbpedia/DBpediaTagger.java). The rank for resources found on DBPedia is a weighted combination of a total of three scores:

 * the percentile rank of the term (based on its frequency)
 * the inverse length: 1/(number of words in the term)
 * the DBPedia match index (how well the term matches a DBPedia label)

The DBPedia match index is defined as:

{{{
#!latex
$\displaystyle \mbox{DBPedia index} = \frac{\mbox{number of words in common with the matching DBPedia label}}{\mbox{number of words in the concept}}$
}}}

Only terms with a score higher than a threshold K are chosen as concepts to be included in the taxonomy. The result of the ranking step is a set of concepts from DBPedia that possibly match keywords found in the text. After ranking, terms are ''positioned in a hierarchy'' based on narrower and broader relations in DBPedia.
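As an illustration, the DBPedia index and the combined rank described above could be computed roughly as follows. This is a hedged sketch: the function names, the word-set comparison, and the equal default weights are assumptions for illustration, not the actual !eu.organik.ontolearn implementation.

```python
def dbpedia_index(phrase: str, label: str) -> float:
    """Fraction of the phrase's words that also occur in the DBPedia label.

    Hypothetical helper: the real matcher may tokenise and normalise
    differently.
    """
    phrase_words = set(phrase.lower().split())
    label_words = set(label.lower().split())
    common = phrase_words & label_words
    return len(common) / len(phrase_words)


def rank(term: str, label: str, percentile: float,
         w_freq: float = 1.0, w_len: float = 1.0, w_match: float = 1.0) -> float:
    """Weighted combination of the three scores named above:
    percentile rank, inverse length, and DBPedia match index.
    The weights are illustrative assumptions."""
    inverse_length = 1.0 / len(term.split())
    match = dbpedia_index(term, label)
    return w_freq * percentile + w_len * inverse_length + w_match * match
```

For example, a two-word term that exactly matches its DBPedia label gets a match index of 1.0 and an inverse length of 0.5; terms whose combined score falls below the threshold K would then be dropped.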
These existing SKOS relations (broader and narrower are properties defined in the SKOS vocabulary and express taxonomic structure) already form a hierarchy in DBPedia, which can be adopted for the SME case (source:trunk/OrganikOntologyLearning/src/eu/organik/ontolearn/dbpedia/DBpediaTaxonomyBuilder.java). All broader and narrower relations {{{ r }}} of found DBPedia resources are collected and kept in a buffer. When a relation {{{ r }}} connects two noun phrases found in the overall SME corpus, {{{ r }}} is a good candidate for a hierarchy and is adopted with a high rank. Indirect relations can also be used, when two hierarchical relations {{{ r1, r2 }}} contain two noun phrases from the corpus and an additional new resource {{{ n }}}. The new resource {{{ n }}} can be either a broader topic of both phrases, or an intermediate topic in a larger hierarchy. In this way, ''implicit terms'' are added, i.e. terms that are connected to two or more of our terms in DBPedia but were not found in the original texts. For example, when both the resources ''Julius Caesar'' and ''Mark Antony'' are mentioned, their common broader resource ''Ancient Roman generals'' will be included in the resulting taxonomy.

Terms with a score lower than the threshold K are discarded before this step; the threshold can be set with the command-line option {{{--taxonomyRankingThreshold}}}.
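The relation-adoption step above can be sketched as follows. This is an illustrative Python sketch, not the actual !DBpediaTaxonomyBuilder code; the {{{broader}}} mapping and the return shape are assumptions.

```python
def build_hierarchy(corpus_terms, broader):
    """Adopt DBPedia broader/narrower relations for the SME taxonomy.

    corpus_terms: noun phrases found in the SME corpus.
    broader: hypothetical mapping from a resource to the set of its
             broader resources in DBPedia.
    Returns (relations, implicit): adopted (narrower, broader) pairs and
    the implicit intermediate terms that were not in the original texts.
    """
    corpus = set(corpus_terms)
    relations = set()
    implicit = set()

    # Direct case: a relation r connects two noun phrases from the corpus.
    for child, parents in broader.items():
        for parent in parents:
            if child in corpus and parent in corpus:
                relations.add((child, parent))

    # Indirect case: a new resource n is broader than two corpus terms
    # (or an intermediate topic between them in a larger hierarchy).
    for a in corpus:
        for b in corpus:
            if a < b:
                shared = (broader.get(a, set()) & broader.get(b, set())) - corpus
                for n in shared:
                    implicit.add(n)
                    relations.add((a, n))
                    relations.add((b, n))

    return relations, implicit
```

With a corpus containing ''Julius Caesar'' and ''Mark Antony'' and a {{{broader}}} map linking both to ''Ancient Roman generals'', the sketch adopts that common broader resource as an implicit term, matching the example above.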
Next, some amount of ''spreading activation'' is performed; this helps focus on parts of the found taxonomy that are well connected. The degree of spreading activation can be adjusted with the options {{{spreadingChildFlow}}}, {{{spreadingParentFlow}}}, {{{spreadingRelatedFlow}}}, and {{{spreadingIterations}}}.

Finally, a ''SKOS'' file is output with the top-ranked terms; the number of terms to output can be configured with the {{{--taxonomyNoTerms}}} option.

For a more technical description of the process, see our recent paper at i-Semantics'10: Gunnar Aastrand Grimnes, Remzi Celebi and Leo Sauermann, ''Using Linked Open Data to bootstrap corporate Knowledge Management in the OrganiK Project''.

== Sourcecode ==
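The spreading-activation step described in ''How it works'' can be sketched as follows. This is a minimal hypothetical illustration; the parameter names mirror the {{{spreading*}}} options above, but the actual implementation may differ.

```python
def spread(scores, children, parents, related,
           child_flow=0.2, parent_flow=0.2, related_flow=0.1, iterations=2):
    """Spread activation over the learned taxonomy graph.

    scores: term -> current relevance score.
    children/parents/related: term -> list of neighbouring terms
    (hypothetical adjacency structure).
    Each iteration, every term passes a fraction of its score to its
    neighbours, so well-connected parts of the taxonomy are boosted.
    """
    for _ in range(iterations):
        new = dict(scores)
        for term, score in scores.items():
            for c in children.get(term, []):
                new[c] = new.get(c, 0.0) + child_flow * score
            for p in parents.get(term, []):
                new[p] = new.get(p, 0.0) + parent_flow * score
            for r in related.get(term, []):
                new[r] = new.get(r, 0.0) + related_flow * score
        scores = new
    return scores
```

Raising the flow parameters or the iteration count concentrates more of the final score on densely connected regions, which is the stated purpose of this step.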