%env CUDA_VISIBLE_DEVICES=6
env: CUDA_VISIBLE_DEVICES=6
 

Natural Language Inference (NLI) method

In this approach we use a BART classifier (Lewis et al., 2019) pre-trained on the Multi-Genre NLI (MultiNLI, Williams et al., 2018) corpus as the base model.

Given research interests expressed in natural language, we pose the problem of retrieving relevant research from the CORD-19 dataset (Wang et al., 2020) as a zero-shot topic classification task (Yin et al., 2019). Using the Natural Language Inference framework, we assess each paper's relevance by giving the model the paper's title and abstract as the premise and a research interest as the hypothesis.

Finally, we use the model's entailment inference values as proxy relevance scores for each paper.
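
As a minimal sketch of how entailment outputs can be turned into relevance scores, the snippet below assumes the classifier returns three logits per (premise, hypothesis) pair, ordered as contradiction, neutral, entailment. That is the label order published for facebook/bart-large-mnli; other checkpoints may order labels differently. It drops the neutral logit and takes a softmax over the remaining two, following the recipe described by Davison (2020), then scales the result to 0-100. Whether build_entailments_artifact computes its scores exactly this way is an assumption; the function name entailment_score and the example logits are hypothetical.

```python
import math

# Assumed label order (facebook/bart-large-mnli): 0 = contradiction,
# 1 = neutral, 2 = entailment. Check the model config before reusing this.
CONTRADICTION, NEUTRAL, ENTAILMENT = 0, 1, 2


def entailment_score(logits):
    """Map the three NLI logits of one (premise, hypothesis) pair to a 0-100 score.

    The neutral logit is ignored; a two-way softmax over contradiction and
    entailment gives the probability that the hypothesis holds.
    """
    c = logits[CONTRADICTION]
    e = logits[ENTAILMENT]
    m = max(c, e)  # subtract the max for numerical stability
    exp_c = math.exp(c - m)
    exp_e = math.exp(e - m)
    return 100.0 * exp_e / (exp_c + exp_e)


# Hypothetical logits, for illustration only.
example_logits = [-1.0, 0.3, 2.0]
score = entailment_score(example_logits)
print(f"{score:.2f}")
```

Dropping the neutral class means that each research interest is scored independently of any other, so several interests can be evaluated against the same paper without their scores competing.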

get_nli_model

get_nli_model(name='facebook/bart-large-mnli')

from risotto.artifacts import load_papers_artifact

try:
    papers = load_papers_artifact()
    model, tokenizer = get_nli_model(name="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
except FileNotFoundError:
    print('Data artifacts not ready.')
Data artifacts not ready.

build_tokenized_papers_artifact

build_tokenized_papers_artifact(papers, tokenizer, should_dump=True, dump_path=None, batch_size=128)

load_tokenized_papers_artifact

load_tokenized_papers_artifact(artifacts_path)

tokenized_papers = build_tokenized_papers_artifact(
    papers=papers,
    tokenizer=tokenizer,
    dump_path="artifacts/nli_bert_artifacts.hdf",
)
tokenized_papers.head()
100.00% [604/604 12:49<00:00]
ug7v899j    [101, 6612, 2838, 1997, 3226, 1011, 10003, 202...
02tnwd4m    [101, 9152, 12412, 15772, 1024, 1037, 4013, 10...
ejv2xln0    [101, 14175, 18908, 4630, 5250, 1011, 1040, 19...
2b73a28n    [101, 2535, 1997, 2203, 14573, 18809, 1011, 10...
9785vg6d    [101, 4962, 3670, 1999, 4958, 8939, 24587, 444...
Name: tokenized_papers, dtype: object
tokenized_papers = load_tokenized_papers_artifact(
    "artifacts/nli_bert_artifacts.hdf")
tokenized_papers.head()
ug7v899j    [0, 20868, 1575, 9, 2040, 12, 32012, 1308, 438...
02tnwd4m    [0, 19272, 4063, 30629, 35, 10, 1759, 12, 3382...
ejv2xln0    [0, 6544, 24905, 927, 8276, 12, 495, 8, 34049,...
2b73a28n    [0, 21888, 9, 253, 15244, 2614, 12, 134, 11, 1...
9785vg6d    [0, 13120, 8151, 11, 22201, 44828, 4590, 11, 1...
Name: tokenized_papers, dtype: object
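
The progress counters in this notebook (604 steps for tokenization, 302 for entailment) are consistent with splitting the 77304 papers reported in the entailments output into batches of 128 and 256 respectively. The sketch below shows that arithmetic and a generic batching helper. The names iter_batches and num_batches are hypothetical and are not taken from the risotto code base.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(n_items, batch_size):
    """Number of batches needed to cover `n_items`; the last one may be partial."""
    return math.ceil(n_items / batch_size)


n_papers = 77304  # length of the entailments series shown in this notebook
print(num_batches(n_papers, 128))  # batch size passed by default to tokenization
print(num_batches(n_papers, 256))  # batch size passed to build_entailments_artifact
```

In both cases the final batch is partial (8 papers short of a full batch), which is why the helper slices rather than assuming an exact multiple.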

build_entailments_artifact

build_entailments_artifact(tokenized_papers, query_tokenized, batch_size=64, device='cuda', should_dump=True, dump_path=None)

load_entailments_artifact

load_entailments_artifact(artifacts_path)

query_tokenized = tokenizer.encode(
    "This paper is about vaccines and therapeutics.")
build_entailments_artifact(batch_size=256,
                           tokenized_papers=tokenized_papers,
                           query_tokenized=query_tokenized,
                           dump_path="artifacts/nli_bert_artifacts.hdf")
100.00% [302/302 18:46<00:00]
ug7v899j    34.904873
02tnwd4m     3.709159
ejv2xln0    83.782410
2b73a28n     0.407605
9785vg6d    83.543762
              ...    
2upc2spn    72.468544
48kealmj    21.194389
7goz1agp    90.725761
twp49jg3    78.526169
wtoj53xy    10.045611
Name: entailments, Length: 77304, dtype: float64
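
Since the entailment values act as proxy relevance scores, a natural next step is to rank papers by them. The sketch below does this in plain Python over the ten scores visible in the truncated output above; the helper top_k and its threshold parameter are hypothetical, and the dictionary stands in for the full series.

```python
# Scores copied from the truncated output above (10 of the 77304 papers).
entailments = {
    "ug7v899j": 34.904873,
    "02tnwd4m": 3.709159,
    "ejv2xln0": 83.782410,
    "2b73a28n": 0.407605,
    "9785vg6d": 83.543762,
    "2upc2spn": 72.468544,
    "48kealmj": 21.194389,
    "7goz1agp": 90.725761,
    "twp49jg3": 78.526169,
    "wtoj53xy": 10.045611,
}


def top_k(scores, k=3, threshold=0.0):
    """Return up to `k` (paper_id, score) pairs with score >= threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(paper_id, s) for paper_id, s in ranked if s >= threshold][:k]


for paper_id, s in top_k(entailments, k=3):
    print(paper_id, round(s, 2))
```

On the pandas Series returned by the notebook, `entailments.nlargest(3)` should give the same ranking without converting to a dictionary.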

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165

Davison, J. (2020). Zero-Shot Learning in Modern NLP. https://joeddav.github.io/blog/2020/05/29/ZSL.html

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. http://arxiv.org/abs/1910.13461

Reimers, N., & Gurevych, I. (2020). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3982–3992. https://doi.org/10.18653/v1/d19-1410

Veeranna, S. P., Nam, J., Mencía, E. L., & Fürnkranz, J. (2016). Using semantic similarity for multi-label zero-shot classification of text documents. ESANN 2016 - 24th European Symposium on Artificial Neural Networks, April, 423–428.

Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R., Liu, Z., Merrill, W., Mooney, P., Murdick, D., Rishi, D., Sheehan, J., Shen, Z., Stilson, B., Wade, A. D., Wang, K., Wilhelm, C., … Kohlmeier, S. (2020). CORD-19: The Covid-19 Open Research Dataset. https://arxiv.org/abs/2004.10706

Williams, A., Nangia, N., & Bowman, S. R. (2018). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1112–1122. http://aclweb.org/anthology/N18-1101

Yin, W., Hay, J., & Roth, D. (2019). Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3914–3923. https://doi.org/10.18653/v1/d19-1404