| 1 | = Content Analyser Implementation / Probabilistic Topic Analyser = |
| 2 | |
| 3 | This method of content analysis applies statistical methods to text in order to reveal how words are linked to topics and to documents. |
| 4 | |
| 5 | |
| 6 | |
| 7 | Two assumptions are made beforehand. First, the “Bag-of-Words” assumption implies that a text is represented as an unordered collection of words. This means that grammar and even word order are disregarded while processing text. Second, it is assumed that latent topics exist between documents and words: a document is a mixture of topics and a topic is a mixture of words. We subsequently use probability distributions to model these mixtures. |
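The bag-of-words assumption can be illustrated with a minimal Java sketch. The class `BagOfWords` and its `count` method are hypothetical names introduced only for this example, and the simple tokenisation (lower-casing, splitting on anything that is not a letter or digit) is an assumption, not the preprocessing the module actually applies.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the module described here.
final class BagOfWords {
    // Lower-cases the text, splits it on runs of non-letter, non-digit characters,
    // and counts how often each token occurs. Word order is discarded.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = count("the dog chased the cat");
        Map<String, Integer> second = count("the cat chased the dog");
        System.out.println(first);                // {cat=1, chased=1, dog=1, the=2}
        System.out.println(first.equals(second)); // true
    }
}
```

Both sentences map to the same bag even though they mean different things, which is exactly the information the assumption gives up.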
| 8 | |
| 9 | |
| 10 | |
| 11 | [[Image(topics.png)]] |
| 12 | Figure: Illustration of a generative model using latent topics |
| 13 | |
| 14 | |
| 15 | Latent topic analysis is a common theme in a number of methods, such as latent semantic analysis (LSA), probabilistic LSA, Latent Dirichlet Allocation (LDA) and Correlated Topic Models. We chose LDA as one of the most effective and promising of these methods (Blei D.M., Ng A.Y. and Jordan M.I., 2003). LDA places a multivariate Dirichlet prior on the distribution of words within each topic and on the distribution of topics within each document. It uses a fixed number of topics and is parameterized by two hyperparameters, a and b. Figures 3.19(a) and 3.19(b) illustrate the plate notation of LDA and the Dirichlet distribution. In the illustration, the elevated parts of the distribution correspond to the topics, i.e. the parameters of the multivariate prior distribution. In our application, Gibbs sampling (Geman S. and Geman D., 1984), a special case of the Metropolis-Hastings algorithm, is used to generate a sequence of samples from the joint probability distribution that we need to compute. After a burn-in period, the algorithm converges to this joint probability distribution. |
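To make the sampling step more concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Java. It is not the module's code, which relies on MALLET; the class `LdaGibbsSketch`, its fields and the toy corpus are hypothetical. The sketch also assumes symmetric priors, with a acting on the document-topic mixtures and b on the topic-word mixtures, which the text above does not specify.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative collapsed Gibbs sampler for LDA. All names are hypothetical;
// this is not the module's MALLET-based implementation.
final class LdaGibbsSketch {
    final int[][] docs;       // docs[d][i] = word id of token i in document d
    final int numTopics;
    final int vocabSize;
    final double a;           // assumed role: symmetric prior on document-topic mixtures
    final double b;           // assumed role: symmetric prior on topic-word mixtures
    final int[][] z;          // z[d][i] = topic currently assigned to token i of document d
    final int[][] docTopic;   // docTopic[d][k] = tokens of document d assigned to topic k
    final int[][] topicWord;  // topicWord[k][w] = tokens of word w assigned to topic k
    final int[] topicTotal;   // topicTotal[k] = all tokens assigned to topic k
    private final Random rng;

    LdaGibbsSketch(int[][] docs, int vocabSize, int numTopics, double a, double b, long seed) {
        this.docs = docs;
        this.vocabSize = vocabSize;
        this.numTopics = numTopics;
        this.a = a;
        this.b = b;
        this.z = new int[docs.length][];
        this.docTopic = new int[docs.length][numTopics];
        this.topicWord = new int[numTopics][vocabSize];
        this.topicTotal = new int[numTopics];
        this.rng = new Random(seed);
        // Start from a random topic assignment for every token.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                update(d, docs[d][i], z[d][i], 1);
            }
        }
    }

    // Adds delta to every count affected by assigning word w in document d to topic k.
    private void update(int d, int w, int k, int delta) {
        docTopic[d][k] += delta;
        topicWord[k][w] += delta;
        topicTotal[k] += delta;
    }

    // One Gibbs sweep: resample the topic of every token given all other assignments.
    void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i];
                update(d, w, z[d][i], -1); // take the token out of the counts
                double sum = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = (docTopic[d][k] + a) * (topicWord[k][w] + b)
                            / (topicTotal[k] + vocabSize * b);
                    sum += p[k];
                }
                // Draw a topic with probability proportional to p[k].
                double u = rng.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[d][i] = k;
                update(d, w, k, 1);
            }
        }
    }

    public static void main(String[] args) {
        int[][] docs = { {0, 1, 0, 1, 2}, {3, 4, 3, 4}, {0, 1, 3, 4} };
        LdaGibbsSketch lda = new LdaGibbsSketch(docs, 5, 2, 0.5, 0.1, 42L);
        for (int s = 0; s < 200; s++) {
            lda.sweep(); // the early sweeps act as the burn-in period
        }
        System.out.println(Arrays.deepToString(lda.topicWord));
    }
}
```

The sketch keeps only count matrices and integrates out the topic and document mixtures, which is one common way to apply Gibbs sampling to LDA; a production sampler would also discard the burn-in sweeps and typically average over several later samples.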
| 16 | |
| 17 | |
| 18 | |
| 19 | After running the algorithm on the OrganiK resources corpora, this module can provide estimates of word-topic and topic-document distributions. These distributions are subsequently used for a number of applications that provide intelligent assistance to users. |
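As a hedged illustration of how such estimates are commonly obtained from a Gibbs sample, the usual point estimates under symmetric priors are shown below. The notation is an assumption made for this example: n_{k,w} is the number of tokens of word w assigned to topic k, n_k the total number of tokens assigned to topic k, n_{d,k} the number of tokens in document d assigned to topic k, n_d the length of document d, V the vocabulary size and K the number of topics. The hyperparameter a is taken to act on the topic-document side and b on the word-topic side.

```latex
\hat{\phi}_{k,w} = \frac{n_{k,w} + b}{n_{k} + V\,b},
\qquad
\hat{\theta}_{d,k} = \frac{n_{d,k} + a}{n_{d} + K\,a}
```

Here \hat{\phi}_{k,w} estimates the probability of word w under topic k (the word-topic distribution) and \hat{\theta}_{d,k} the proportion of topic k in document d (the topic-document distribution). Each set of estimates sums to one over w and over k respectively.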
| 20 | |
| 21 | |
| 22 | This module is implemented in Java and uses the MALLET natural language processing framework. |