Loading the paper dataset and re-generating the paper reference graph and the corresponding PageRank scores.
cord19_dataset_folder = "./datasets/CORD-19-research-challenge"
# Project helpers: load the paper metadata and build the reference graph between papers
papers, _ = load_papers_from_metadata_file(cord19_dataset_folder)
G = build_papers_reference_graph(papers)
pageranks = nx.pagerank(G)
Paper representations
In order to build a representation for each paper, the following libraries will be used:
- spaCy: https://spacy.io/
- scispaCy: https://allenai.github.io/scispacy/
The language model named en_core_sci_sm will be used, which has been trained on a corpus of biomedical text and has a vocabulary of more than 100,000 words.
If a model with a larger vocabulary is needed, other scispaCy models are available.
Loading the biomedical language pipeline.
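The loading cell itself is not shown above; a minimal sketch, assuming the en_core_sci_sm model package is already installed (e.g. from the scispaCy pip wheels), would be:
import spacy

# Load the scispaCy biomedical pipeline; the model is distributed as an
# installable package, so spacy.load can resolve it by name.
nlp = spacy.load("en_core_sci_sm")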
# Select a paper to showcase spacy's features
sample_paper = list(pageranks.keys())[0]
sample_text = "\n".join(paragraph["text"] for paragraph in sample_paper._file_contents["body_text"])
sample_text
doc = nlp(sample_text, disable=["tagger", "parser", "ner"])
doc[17].lemma_
The document tokenized by the spaCy pipeline is displayed.
An interesting aspect of using spaCy with the pretrained language model is that it automatically computes document and token representation vectors.
It remains a pending task to find out which architecture is used to compute those vectors.
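As a quick illustration (not part of the original notebook), the shapes of these representations can be inspected directly; the exact dimensionality depends on the loaded model:
# Document-level and token-level representation vectors computed by the pipeline.
print(doc.vector.shape)
print(doc[17].text, doc[17].vector.shape)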
A relevant aspect that influences downstream tasks is the number of out-of-vocabulary (OOV) tokens. The following cell makes a quick inspection of a sample paper, counting the number of OOV tokens. Next, the tokens are iterated over to detect them.
num_oov = 0
for token in doc:
    if token.is_oov and token.string != "\n":
        if token.string.endswith("virus"):
            print(token, "not found")
        num_oov += 1
    else:
        if token.string.endswith("virus"):
            print(token, "found")
print(f'Number of out of vocabulary tokens: {num_oov} ({100 * num_oov / len(doc)}%).')
Note that relevant tokens, such as coronavirus, are included in the language model vocabulary.
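This can also be double-checked against the model vocabulary itself; the snippet below is an illustrative check that mirrors the is_oov test above (the second, made-up token is only there for contrast):
# Look up lexemes directly in the vocabulary; is_oov flags unseen words.
for word in ["coronavirus", "qwertyvirus"]:
    print(word, "OOV" if nlp.vocab[word].is_oov else "in vocabulary")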
Testing the mechanisms used to remove stopwords, punctuation, and spaces, and to extract each token's lemma.
tokens = {token for token in doc}
no_stop_word_tokens = {token for token in doc if not (token.is_stop or token.is_punct or token.is_space)}
len(tokens), len(no_stop_word_tokens)
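The lemma extraction mentioned above is not shown in that cell; a small illustrative peek at the lemmas of the filtered tokens:
# Lemmas of the tokens that survive the stopword/punctuation/space filter.
filtered_lemmas = [token.lemma_ for token in doc
                   if not (token.is_stop or token.is_punct or token.is_space)]
print(filtered_lemmas[:20])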
First, let's process all document texts.
docs = process_papers_file_contents(list(pageranks.keys()))
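process_papers_file_contents is a project helper whose implementation is not shown here; based on how the sample text was built earlier, a rough sketch of the kind of processing it presumably performs (one text per paper, built from the body_text paragraphs) could look like:
def process_papers_file_contents_sketch(papers):
    # Hypothetical stand-in for the project helper: join the body_text
    # paragraphs of each paper into a single string, as in the sample cell above.
    return [
        "\n".join(paragraph["text"] for paragraph in paper._file_contents["body_text"])
        for paper in papers
    ]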
Peeking at the first 5 processed papers.
print('\n=====\n'.join(docs[:5]))
Vectors storing token occurrence counts will be used as document representations. tf-idf vectors are deliberately not used because document-frequency normalization is already carried out by the LDA technique.
count_vectorizer = CountVectorizer(tokenizer=tokenizer, lowercase=True)
vectorized_docs = count_vectorizer.fit_transform(docs)
A sparse matrix is built, with one row per document and one column per vocabulary token.
vectorized_docs.shape
len(count_vectorizer.vocabulary_)
lda = LatentDirichletAllocation(n_components=10, verbose=2, n_jobs=-1)
lda = lda.fit(vectorized_docs)
lda
Executing the following cell displays the most relevant tokens for each identified topic.
descriptors = topic_descriptors(lda, count_vectorizer, 10)
for topic_id in descriptors:
    print(f'Topic {topic_id}:', ', '.join(descriptors[topic_id]))
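topic_descriptors is another project helper; a sketch of how the top tokens per topic could be recovered from the fitted scikit-learn objects (the function name and exact signature here are assumptions):
import numpy as np

def topic_descriptors_sketch(lda_model, vectorizer, n_tokens):
    # For each topic, return the n_tokens vocabulary entries with the highest
    # weight in lda_model.components_ (use get_feature_names_out on newer scikit-learn).
    vocabulary = np.array(vectorizer.get_feature_names())
    return {
        topic_id: list(vocabulary[component.argsort()[::-1][:n_tokens]])
        for topic_id, component in enumerate(lda_model.components_)
    }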
The dataset papers will now be classified into the topics modelled above.
docs_classified = lda.transform(vectorized_docs)
docs_classified[:5]
Finally, the top-5 PageRank-sorted papers belonging to each topic are displayed.
docs_topics = docs_classified.argmax(1)
topic_papers = defaultdict(list)
all_papers = list(pageranks.keys())
for idx, topic_id in enumerate(docs_topics):
    topic_papers[topic_id].append(all_papers[idx])
for topic_id, papers in sorted(topic_papers.items(), key=lambda t: t[0]):
    print(f"Topic ID {topic_id}")
    sorted_papers = sorted(papers, reverse=True, key=lambda p: pageranks[p])
    for paper in sorted_papers[:5]:
        paper_as_markdown(paper)