To help the medical community in the current time-critical race to find a cure for the virus, we propose a machine learning-based system that combines state-of-the-art natural language processing (NLP) question answering (QA) techniques with summarization to mine the available scientific literature. Our system is built on the COVID-19 Open Research Dataset (CORD-19), a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. We have uploaded our code to a Kaggle notebook as part of our submission to the CORD-19 challenge, and we have also submitted an arXiv paper describing our system. The system is intended for expert research questions on the topic. It is not to be used by the general public for diagnostic purposes.

Our system consists of three different modules:

  1. Document Retrieval
    • Query Paraphrasing: This sub-module converts a long or complicated user query into several shorter, simpler questions for search.
    • Search Engine: We use Anserini with Lucene to retrieve related publications from the candidate pool with high coverage.
  2. Question Answering
    • Question Answering (QA): This sub-module looks for and integrates evidence from one or more paragraphs. We leverage an ensemble of two neural QA models that are pre-trained on SQuAD-style QA datasets. Here we treat the QA module as a supporting-fact selector that provides relevant snippets from the retrieved documents.
    • Answer Re-ranking & Highlight Generation: We re-rank the retrieved results by a word-matching score based on part-of-speech tagging, combined with the QA system's confidence score. We also highlight the answer span to make the QA results easier to read.
  3. Multi-document Abstractive Summarization
    • Abstractive Summarization: Another output of our system is an abstractive summary that synthesizes the answer from multiple retrieved snippets. This step generates short, fluent summaries from the top relevant results. Using a neural summarizer, we condense long paragraphs to improve the readability of the results and give the user a quick overview of the relevant snippets.
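
The re-ranking and highlighting step in module 2 can be illustrated with a minimal Python sketch. It is not the code from our notebook: the names `Snippet`, `match_score`, `rerank`, and `highlight`, the weight `alpha`, and the linear combination of the two scores are all assumptions made for illustration. In particular, keyword selection here uses a small stopword list as a stand-in for the part-of-speech tagging described above.

```python
import re
from dataclasses import dataclass

# Assumption: a tiny stopword list stands in for POS-based keyword selection.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "is", "are",
    "what", "which", "how", "do", "does", "and", "or", "with", "by",
}


@dataclass
class Snippet:
    text: str             # paragraph returned by the retrieval module
    answer: str           # answer span predicted by the QA module
    qa_confidence: float  # assumed to lie in [0, 1]


def keywords(text: str) -> set:
    """Lowercase the text, tokenize it, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def match_score(query: str, snippet_text: str) -> float:
    """Fraction of query keywords that also occur in the snippet."""
    query_terms = keywords(query)
    if not query_terms:
        return 0.0
    return len(query_terms & keywords(snippet_text)) / len(query_terms)


def rerank(query: str, snippets: list, alpha: float = 0.5) -> list:
    """Sort snippets by a weighted sum of match score and QA confidence."""
    def score(s: Snippet) -> float:
        return alpha * match_score(query, s.text) + (1 - alpha) * s.qa_confidence
    return sorted(snippets, key=score, reverse=True)


def highlight(snippet: Snippet, mark: str = "**") -> str:
    """Wrap the first occurrence of the answer span in marker strings."""
    start = snippet.text.find(snippet.answer)
    if not snippet.answer or start < 0:
        return snippet.text  # nothing to highlight
    end = start + len(snippet.answer)
    return snippet.text[:start] + mark + snippet.answer + mark + snippet.text[end:]


query = "What is the incubation period of COVID-19?"
candidates = [
    Snippet("Masks reduce transmission in hospitals.", "Masks", 0.9),
    Snippet("The incubation period of COVID-19 is about five days.",
            "about five days", 0.6),
]
ranked = rerank(query, candidates)
print(highlight(ranked[0]))
```

In this toy example the second candidate is ranked first even though its QA confidence is lower, because it matches every query keyword. The value of `alpha` controls that trade-off and would need tuning on real data.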