In this notebook we're going to extend our previous topic modeling approaches in order to model hierarchical topics.

Hierarchical Latent Dirichlet Allocation (hLDA)

This technique was presented in the NIPS 2003 paper "Hierarchical Topic Models and the Nested Chinese Restaurant Process" by David Blei et al., available at: https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf.

A quick Google search yields at least two implementations:

We'll use the second one, since it comes with a Jupyter notebook showing an example of how to use the library. First, we'll install it.

%load_ext autoreload
%autoreload 2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

from fastprogress.fastprogress import progress_bar

from risotto.lda import tokenizer
from risotto.references import load_papers_from_metadata_file, paper_as_markdown
from risotto.lda import process_papers_file_contents
from risotto.lda import topic_descriptors
from risotto.sampler import HierarchicalLDA

import pickle, random
from pathlib import Path
from collections import defaultdict
from scipy.sparse import vstack

Now we'll proceed to load the CORD-19 dataset papers.

CORD19_DATASET_FOLDER = Path("./datasets/CORD-19-research-challenge")
papers, _ = load_papers_from_metadata_file(CORD19_DATASET_FOLDER)

We've loaded the papers into memory. Now we'll process them to produce a text string with the contents of each paper.

docs = process_papers_file_contents(papers)

The last preprocessing step is to vectorize each paper. We'll represent them with scikit-learn's CountVectorizer. We deliberately avoid weighted representations such as tf-idf because LDA operates on raw token counts and takes care of document frequency normalization itself.
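As a toy illustration of the representation we're after (the example documents here are invented for this snippet, not taken from the dataset):

toy_docs = ["viral infection of the lungs", "viral replication in host cells"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)
print(toy_vectorizer.get_feature_names())  # vocabulary learned from the toy corpus
print(toy_counts.toarray())                # raw integer counts per document, as LDA expects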

def get_hlda_corpus(docs, max_vocab_size=2**13):
    count_vectorizer = CountVectorizer(
        tokenizer=tokenizer,
        lowercase=True,
        max_features=max_vocab_size,
    )
    
    count_vectorizer.fit(docs)
    
    vocab = count_vectorizer.vocabulary_
    docs_tokenized = []
    docs_tokens_idxs = []
    
    for doc in progress_bar(docs):
        tokens = [token.lower() for token in tokenizer(doc)]
        idxs = []
        for token in tokens:
            if token in vocab:
                idxs.append(vocab[token])
        docs_tokenized.append(tokens)
        docs_tokens_idxs.append(idxs)
        
    vocab_list = count_vectorizer.get_feature_names()
    
    return docs_tokenized, docs_tokens_idxs, vocab_list
docs_tokenized, docs_tokens_idxs, vocab = get_hlda_corpus(docs)

docs_tokenized is a list of lists with the tokenization of each paper. docs_tokens_idxs is a list of lists with the vocabulary indices of each paper's tokens. Finally, vocab is the list of vocabulary terms.
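A quick sanity check of these structures (illustrative only; the actual output depends on the fitted vocabulary):

first_doc_idxs = docs_tokens_idxs[0][:5]
print(docs_tokenized[0][:5])                   # first tokens of the first paper
print(first_doc_idxs)                          # vocabulary indices of its first in-vocabulary tokens
print([vocab[idx] for idx in first_doc_idxs])  # map the indices back to tokens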

hlda = HierarchicalLDA(
    corpus=random.sample(docs_tokens_idxs, int(len(docs_tokens_idxs) * 0.1)),
    vocab=vocab,
    alpha=10,
    gamma=1,
    eta=0.1,
    seed=0,
    verbose=True,
    num_levels=3,
)
hlda.estimate(
    num_samples=500,
    display_topics=50,
    n_words=5,
    with_weights=False,
)

Sampling 10% of the papers results in a sub-dataset of about 3,888 papers. The number of topics at each level is determined by the nested Chinese restaurant process and can be influenced by tweaking the alpha and gamma hyperparameters. Training the model on this 10% sample took about an hour per 50 iterations.
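For intuition, in the standard Chinese restaurant process form (our reading, not necessarily this implementation's exact parameterization), a document reaching a node of the tree where $n$ documents have already passed follows an existing branch $k$ or opens a new one with probabilities

$$P(\text{existing branch } k) = \frac{n_k}{n + \gamma}, \qquad P(\text{new branch}) = \frac{\gamma}{n + \gamma},$$

where $n_k$ is the number of documents that already chose branch $k$. Larger values of gamma therefore make new branches more likely, producing wider trees with more topics per level.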

To avoid spending time retraining the model, we'll dump it to disk so it can be loaded in later experiments.

with open("hlda.pkl", "wb") as dump_file:
    pickle.dump(hlda, dump_file)

Now, we'll load the dumped model.

with open("hlda.pkl", "rb") as dump_file:
    hlda = pickle.load(dump_file)

Manual Hierarchical LDA

In this section we'll attempt to build a hierarchical topic model manually. Essentially, we'll first model topics with standard LDA, using the same number of topics found by hLDA at level=1. Afterwards, for each group of documents assigned to a first-level topic, we'll run a new LDA topic modeling step.

def get_lda_corpus(docs, max_vocab_size=2**13):
    count_vectorizer = CountVectorizer(
        tokenizer=tokenizer,
        lowercase=True,
        max_features=max_vocab_size,
    )
    vectorized_docs = count_vectorizer.fit_transform(docs)
    return vectorized_docs, count_vectorizer


def fit_lda_model(docs, **kwargs):
    vectorized_docs, count_vectorizer = get_lda_corpus(docs)
    lda = LatentDirichletAllocation(**kwargs)
    lda = lda.fit(vectorized_docs)
    return lda, vectorized_docs, count_vectorizer

lda, vectorized_docs, count_vectorizer = fit_lda_model(
    docs,
    n_components=8,
    verbose=2,
    n_jobs=4,
)

The following cell prints the most relevant tokens of each modeled topic.

topic_descriptors(lda, count_vectorizer, 5)
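As a reference, here is a minimal sketch of how such descriptors can be extracted from a fitted scikit-learn LDA model (an assumption about what topic_descriptors does, not its actual implementation):

import numpy as np

def sketch_topic_descriptors(lda_model, vectorizer, n_words):
    # For each topic, report the n_words tokens with the largest weights in lda_model.components_.
    feature_names = np.array(vectorizer.get_feature_names())
    for topic_idx, weights in enumerate(lda_model.components_):
        top_tokens = feature_names[weights.argsort()[::-1][:n_words]]
        print(f"Topic #{topic_idx}: {', '.join(top_tokens)}")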

Now we'll group the papers by the modeled topic they most likely belong to.

def group_docs_by_topics(model, vectorized_docs):
    # Assign each document to its most probable topic under the fitted model.
    docs_classified = model.transform(vectorized_docs)
    docs_topics = docs_classified.argmax(1)
    clustered_docs = defaultdict(list)

    for vectorized_doc, topic_idx in zip(vectorized_docs, docs_topics):
        clustered_docs[topic_idx].append(vectorized_doc)

    # Stack each topic's document rows into a single sparse matrix.
    stacked_clustered_docs = {}
    for topic_idx, docs_list in clustered_docs.items():
        stacked_clustered_docs[topic_idx] = vstack(docs_list)

    return stacked_clustered_docs
grouped_docs = group_docs_by_topics(lda, vectorized_docs)
grouped_docs
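Each entry maps a topic index to a sparse matrix stacking the documents assigned to that topic; the group sizes can be inspected as follows:

for topic_idx, group in sorted(grouped_docs.items()):
    print(f"Topic #{topic_idx}: {group.shape[0]} documents, {group.shape[1]} vocabulary terms")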

Then we'll run a second LDA pass on each group of papers.

models = {}
for topic_idx, group_docs in progress_bar(grouped_docs.items()):
    print(f"Topic ID #{topic_idx}; documents = {group_docs.shape[0]}")
    
    models[topic_idx] = LatentDirichletAllocation(
        n_components=4,
        verbose=0,
        n_jobs=4,
    )
    models[topic_idx] = models[topic_idx].fit(group_docs)
    
    topic_descriptors(models[topic_idx], count_vectorizer, 5)
    print("\n", end="")