Taxonomy Learning

Taking existing documents as input, the taxonomy learning component analyzes the text inside the documents and compares it with existing taxonomies or existing data. It suggests new terms for the SME's taxonomy managed in the SemanticApi. The relations between terms (broader/narrower terms, i.e. hierarchical relations, and related terms), as well as possible alternative spellings, can be extracted from existing data sources, for example DBpedia and other linked open data.

This is one of the OrganikComponents.

Concept extraction

  • Do POS tagging and NP chunking for the documents
    • Using OpenNLP for English, and CRFTagger/Chunker for German
  • Count all NPs and find matches with labels in DBpedia; for each substring match, compute three scores:
    • "Concept frequency" (f): how many sentences in our corpus contain the concept. This must be scaled somehow, for instance with
    • Inverse length (l): 1 / (number of words in the concept)
    • DBpedia match index (w): (number of words in the concept) / (number of words in the DBpedia article)
  • For each word, only include the 10 highest-ranked Wikipedia articles. This can be done with the script source:trunk/OrganikOntologyLearning/results/
  • Rank the rest by some function; source:trunk/OrganikOntologyLearning/results/ will scale the cf. (The original ranking formula was rendered with the TracMath macro, which failed; the equation is not preserved here.)

Making a hierarchy

We would like to structure our tags in a tree according to part-of relations. We will probably use several types of evidence for this, and perhaps end up with several trees (i.e. a forest), which we combine by attaching them all to an "everything" tag at the top.

  • For each of the top N concepts, fetch the Wikipedia HTML page, the DBpedia HTML page, the DBpedia N3 representation, or any other available format
  • Create a distance matrix using the Normalised Compression Distance (NCD)
  • Use the distance matrix to build a hierarchical tree; the relationships in this tree are one part of the evidence we use to create partOf relations
  • Look up what DBpedia says about the topics we have found: are there any properties relating them? If so, can we classify these properties as partOf/related? All the YAGO properties should be in DBpedia, and they should include all of WordNet, so there should be enough properties, but they may not relate the things we have found. This is the second part of the evidence
  • Combine the evidence somehow...
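The NCD step above can be sketched like this, using zlib as the compressor (any real compressor works; the choice of zlib here is just an assumption for illustration):

```python
import zlib
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def distance_matrix(texts):
    """Pairwise NCD matrix for a list of strings (e.g. fetched article texts)."""
    data = [t.encode("utf-8") for t in texts]
    n = len(data)
    m = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        m[i][j] = m[j][i] = ncd(data[i], data[j])
    return m
```

The resulting matrix can then be fed to any off-the-shelf hierarchical clustering routine to obtain the tree used as evidence for partOf relations.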

Making relations

We would also like to relate the tags further with "related to" relations. As input we use the relations from DBpedia and, again, the NCD distance matrix: things that are close are related.

  • If two things have a DBpedia relation and a similarity above a threshold A, add a related link.
  • If there is no DBpedia relation but the similarity is above a threshold B, add a related link.
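The two rules above can be sketched as a single decision function. The default threshold values below are hypothetical placeholders; the document leaves A and B as free parameters to be tuned.

```python
def add_related_link(has_dbpedia_relation: bool, similarity: float,
                     threshold_a: float = 0.5, threshold_b: float = 0.8) -> bool:
    """Decide whether to add a "related to" link between two tags.

    similarity: a closeness score, e.g. 1 - NCD, so higher means closer.
    threshold_a, threshold_b: hypothetical defaults; B is stricter than A
    because without a DBpedia relation the similarity is the only evidence.
    """
    if has_dbpedia_relation:
        return similarity > threshold_a
    return similarity > threshold_b
```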



Last modified on 07/21/10 13:06:42