Taking existing documents as input, the taxonomy learning component analyzes the text inside the documents and compares it with existing taxonomies or existing data. It suggests new terms for the SME's taxonomy managed in the SemanticApi. The relations between terms (broader/narrower terms, hierarchic relations, related terms), and possible alternative spellings, can be extracted from existing data sources. For example, DBPedia and linked open data can be used.
This component is one of the OrganikComponents.
- Do POS tagging and NP chunking for the documents.
  - We use OpenNLP for English and CRFTagger/Chunker for German.
- Count all NPs and find matches with labels in DBpedia. For each substring match, compute three scores:
  - "Concept Frequency" (f): how many sentences in our corpus contain the concept. This must be scaled somehow, for instance with the percentile rank (http://en.wikipedia.org/wiki/Percentile_rank).
  - Inverse length (l): 1 / (number of words in concept)
  - DBpedia match index (w): (number of words in concept) / (number of words in DBpedia article)
- For each word, include only the 10 highest-ranked Wikipedia articles. This can be done with this script: source:trunk/OrganikOntologyLearning/results/filter.py
- Rank the rest by some function, source:trunk/OrganikOntologyLearning/results/rank.py will scale the cf, and do:
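The three scores from the list above can be sketched in Python as follows. This is a minimal sketch and not the contents of filter.py or rank.py: all function names are hypothetical, matching is plain case-insensitive substring search, and "number of words in dbpedia article" is assumed to mean the number of words in the article label. The scores are computed separately; how they are combined into a final rank is left open here.

```python
def concept_frequency(sentences, concepts):
    """Count, for each concept, the sentences that contain it.

    Assumption: case-insensitive substring matching, so "web" also
    matches inside "website".
    """
    counts = {concept: 0 for concept in concepts}
    for sentence in sentences:
        lowered = sentence.lower()
        for concept in concepts:
            if concept.lower() in lowered:
                counts[concept] += 1
    return counts


def percentile_ranks(counts):
    """Scale raw counts to 0..100: (below + 0.5 * equal) / N * 100."""
    values = list(counts.values())
    n = len(values)
    return {
        key: 100.0 * (sum(v < c for v in values)
                      + 0.5 * sum(v == c for v in values)) / n
        for key, c in counts.items()
    }


def inverse_length(concept):
    """Inverse length (l): 1 / number of words in the concept."""
    return 1.0 / len(concept.split())


def match_index(concept, label):
    """Match index (w): words in concept / words in the matched label.

    Assumption: the matched DBpedia article is represented by its label.
    """
    return len(concept.split()) / len(label.split())
```

For example, `match_index("web", "World Wide Web")` gives one third, so a short NP that only partially covers a long label is scored lower than an exact match.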
Making a hierarchy
We would like to structure our tags in a tree, according to partOf relations. I think we will use different types of evidence for this, and perhaps end up with several trees (i.e. a forest), which we combine by attaching them all to an "everything" tag at the top.
- For each of the top N things, fetch the Wikipedia HTML page, the DBpedia HTML page, the DBpedia N3, or any other available format.
- Create a distance matrix using Normalised Compression Distance: http://complearn.org/ncd.html
- Use the distance matrix to create a hierarchical tree. The relationships in this tree are one part of the evidence we use to create partOf relations.
- Look up what DBpedia says about the topics we have found: are there any properties relating them? If so, can we classify these properties as partOf/related? All the YAGO properties should be in DBpedia, and they should include all of WordNet (?), i.e. there should be enough properties, but they may not relate the things we have found. This is the second part of the evidence.
- Combine the evidence somehow...
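The distance-matrix and tree steps could look roughly like the sketch below. It rests on several assumptions: zlib stands in for the compressor (the page links to complearn, which is not used here), average-linkage agglomerative clustering is one possible way to turn the distances into a tree, and the function names are hypothetical. It expects at least one document.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(docs):
    """Pairwise NCD for a dict {name: bytes}, keyed by frozenset({a, b})."""
    return {
        frozenset((a, b)): ncd(docs[a], docs[b])
        for a, b in combinations(docs, 2)
    }


def build_tree(docs):
    """Average-linkage agglomerative clustering; returns nested tuples of names."""
    dist = distance_matrix(docs)
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in docs]

    def linkage(m1, m2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in m1 for b in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Merge the two closest clusters into a new subtree.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

The resulting tree only says which pages compress well together. It does not say which of two siblings is the part and which is the whole, so it can only be one piece of the partOf evidence.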
We would also like to relate the tags further by "related to" relations. As input, we again use the relations from DBpedia and the NCD distance matrix: things that are close are related.
- If two things have a DBpedia relation and a similarity over threshold A, add a related link.
- If there is no DBpedia relation but the similarity is over threshold B, add a related link.
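The two rules above can be sketched as a small function. This is an illustration under assumptions: the function and parameter names are hypothetical, pairs are represented as frozensets, and similarity is assumed to be some score where higher means closer (for instance 1 minus the NCD). The values of thresholds A and B are not fixed on this page.

```python
def related_links(similarity, dbpedia_related, threshold_a, threshold_b):
    """Return the set of pairs that should get a "related to" link.

    similarity:      dict {frozenset({x, y}): score}, higher means closer
    dbpedia_related: set of frozenset pairs that have a DBpedia relation
    threshold_a:     threshold for pairs with a DBpedia relation
    threshold_b:     threshold for pairs without one
    """
    links = set()
    for pair, score in similarity.items():
        threshold = threshold_a if pair in dbpedia_related else threshold_b
        if score > threshold:
            links.add(pair)
    return links
```

Presumably A would be chosen lower than B, since a DBpedia relation is additional evidence, but that choice is an assumption and would need tuning on real data.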