Here we compute representations of the papers based on their text content. From these representations, topics are then modeled using the LDA method. Finally, the most relevant papers for each topic are determined using each paper's PageRank score.

Loading the paper dataset and re-generating the graph of papers and the corresponding PageRank scores.

cord19_dataset_folder = "./datasets/CORD-19-research-challenge"
papers, _ = load_papers_from_metadata_file(cord19_dataset_folder)
G = build_papers_reference_graph(papers)
pageranks = nx.pagerank(G)

Paper representations

In order to build a representation for each paper, the spaCy library together with its scispaCy biomedical models will be used.

The language model en_core_sci_sm will be used, which has been trained on a corpus of biomedical text with a vocabulary of more than 100,000 words. Should a model with a larger vocabulary be needed, other models are available.

Loading the biomedical language pipeline.
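
A minimal sketch of this step, assuming the en_core_sci_sm package from scispaCy is already installed:

# Assumption: the scispaCy biomedical model package is installed; load it into
# the `nlp` pipeline that the following cells rely on.
import spacy

nlp = spacy.load("en_core_sci_sm")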

# Select a paper to showcase spacy's features
sample_paper = list(pageranks.keys())[0]
sample_text = "\n".join([ paragraph["text"] for paragraph in sample_paper._file_contents["body_text"]])
sample_text

doc = nlp(sample_text, disable=["tagger", "parser", "ner"])
doc[17].lemma_

The document tokenized by the spaCy pipeline is displayed. An interesting aspect of using spaCy with the pretrained language model is that it automatically computes document and token representation vectors. Finding out which language model architecture is used to compute those vectors remains a pending task.
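
For illustration, those vectors can be inspected directly; their dimensionality depends on the loaded model:

# Illustrative peek: the document vector (by default the average of its token
# vectors) and the vector of a single token.
doc.vector.shape, doc[17].vector.shape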

A relevant aspect that influences downstream tasks is the number of out-of-vocabulary (OOV) tokens. The following cell performs a quick inspection of a sample paper, iterating over its tokens to detect and count the OOV ones.

num_oov = 0
for token in doc:
    if token.is_oov and token.string != "\n":
        if token.string.endswith("virus"):
            print(token, "not found")
        num_oov += 1
    else:
        if token.string.endswith("virus"):
            print(token, "found")
print(f'Number of out-of-vocabulary tokens: {num_oov} ({100 * num_oov / len(doc):.2f}%).')

Note that relevant tokens, such as coronavirus, are included in the language model's vocabulary.

Testing the mechanisms used to remove stop words, punctuation, and spaces, and to extract each token's lemma.

tokens = {token for token in doc}
no_stop_word_tokens = {token for token in doc if not (token.is_stop or token.is_punct or token.is_space)}
len(tokens), len(no_stop_word_tokens)
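
The lemma extraction itself would look along these lines; a sketch combining the filters above with each token's lemma_ attribute:

# Sketch: lowercased lemmas of the tokens that survive the filters above.
lemmas = [token.lemma_.lower() for token in doc
          if not (token.is_stop or token.is_punct or token.is_space)]
lemmas[:10]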

Latent Dirichlet Allocation (LDA)

The following cells perform topic modeling experiments using the LDA technique, relying on the scikit-learn implementation of this model.

First, let's process the text of all documents.

process_papers_file_contents[source]

process_papers_file_contents(papers)
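
A possible implementation is sketched below; it mirrors how sample_text was built above and is not necessarily the exact code behind the [source] link:

# Sketch (assumption): join the body-text paragraphs of each paper into a
# single string per document.
def process_papers_file_contents(papers):
    return ["\n".join(paragraph["text"] for paragraph in paper._file_contents["body_text"])
            for paper in papers]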

docs = process_papers_file_contents(list(pageranks.keys()))

Peeking at the first five processed documents.

print('\n=====\n'.join(docs[:5]))

Vectors storing the token occurrence count will be used as document representations. tf-idf vectors are purposefully not used because the document frequency normalization is already carried out by the LDA technique.

tokenizer[source]

tokenizer(sentence)
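
A plausible sketch of this function, reusing the nlp pipeline and the filters tested above (not necessarily the exact code behind the [source] link):

# Sketch (assumption): tokenize with the biomedical pipeline and keep the
# lowercased lemmas of non-stop-word, non-punctuation, non-space tokens.
def tokenizer(sentence):
    doc = nlp(sentence, disable=["tagger", "parser", "ner"])
    return [token.lemma_.lower() for token in doc
            if not (token.is_stop or token.is_punct or token.is_space)]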

count_vectorizer = CountVectorizer(tokenizer=tokenizer, lowercase=True)
vectorized_docs = count_vectorizer.fit_transform(docs)

A sparse matrix is built, with one row per document and one column per token.

vectorized_docs.shape
len(count_vectorizer.vocabulary_)
lda = LatentDirichletAllocation(n_components=10, verbose=2, n_jobs=-1)
lda = lda.fit(vectorized_docs)
lda

The execution of the following cells will display the most relevant tokens for each identified topic.

topic_descriptors[source]

topic_descriptors(topic_model, vectorizer, num_words)
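
A possible sketch of this function based on the fitted model's components_ matrix (not necessarily the exact code behind the [source] link):

# Sketch (assumption): for each topic, return the num_words tokens with the
# highest weights in the topic-word matrix.
def topic_descriptors(topic_model, vectorizer, num_words):
    feature_names = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn
    return {topic_id: [feature_names[i] for i in weights.argsort()[::-1][:num_words]]
            for topic_id, weights in enumerate(topic_model.components_)}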

descriptors = topic_descriptors(lda, count_vectorizer, 10)
for topic_id in descriptors:
    print(f'Topic {topic_id}:', ', '.join(descriptors[topic_id]))

The dataset papers will now be classified into the previously modeled topics.

docs_classified = lda.transform(vectorized_docs)
docs_classified[:5]

Finally, the top-5 PageRank-sorted papers belonging to each topic are displayed.

docs_topics = docs_classified.argmax(1)
topic_papers = defaultdict(list)
all_papers = list(pageranks.keys())
for idx, topic_id in enumerate(docs_topics):
    topic_papers[topic_id].append(all_papers[idx])
    
for topic_id, papers in sorted(topic_papers.items(), key=lambda t: t[0]):
    print(f"Topic ID {topic_id}")
    sorted_papers = sorted(papers, reverse=True, key=lambda p: pageranks[p])
    for paper in sorted_papers[:5]:
        paper_as_markdown(paper)