The parameters α and β are computed at the corpus level. In addition to these parameters, implementing an LDA model requires carefully choosing the number of topics (K) that the model will identify. This choice is crucial, as it strongly affects the interpretability of the topics the model generates. To determine the optimal number of topics, we search over a set of candidate values for K (50, 100, 180, 200, 300, 400, and 500).
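This search can be sketched as a simple loop over the candidate values, training one model per K and keeping the K that scores best. The names `train_model` and `coherence_of` below are illustrative placeholders standing in for the actual LDA trainer and coherence metric, not names from our implementation:

```python
# Hypothetical sketch of the search over candidate values of K.
# `train_model` and `coherence_of` are illustrative placeholders for
# the actual LDA trainer and coherence metric.
CANDIDATE_K = [50, 100, 180, 200, 300, 400, 500]

def select_num_topics(train_model, coherence_of, candidates=CANDIDATE_K):
    # Train one model per candidate K and score its topics.
    scores = {k: coherence_of(train_model(k)) for k in candidates}
    # Keep the K whose model achieves the highest coherence.
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The dictionary of per-K scores is returned alongside the winner so that the full coherence curve can be plotted, as in Figure 1.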
In our implementation, we select the number of topics that maximises topic coherence, a measure of semantic consistency that plays a key role in this selection process. Intuitively, the higher the probability that a topic's top-ranked words co-occur within the same document, the more semantically consistent, and thus more interpretable, the topics identified by the model.
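To make the co-occurrence intuition concrete, the following is a minimal pure-Python sketch of a UMass-style coherence score computed directly from document co-occurrence counts. The function names, pair ordering, and the +1 smoothing constant are illustrative; coherence metrics in libraries such as gensim differ in detail:

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    # UMass-style score: average log conditional probability that
    # pairs of a topic's top-ranked words co-occur in a document.
    doc_sets = [set(d) for d in documents]

    def df(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    pairs = list(combinations(top_words, 2))
    # The +1 smoothing avoids log(0) for pairs that never co-occur.
    return sum(math.log((df(wi, wj) + 1) / df(wi))
               for wi, wj in pairs) / len(pairs)
```

A topic whose top-ranked words frequently appear together scores higher (closer to zero) than one whose words never co-occur, matching the intuition described above.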
On the following page, Figure 1 shows the coherence score obtained for each candidate value of K. Coherence is maximised when the model identifies 200 topics.