Dmitriy Katz-Rogozhnikov, Baruch Schieber, et al.
Algorithmica
Querying biomedical documents from large databases such as PubMed is traditionally keyword-based and usually returns large volumes of documents that lack specificity. Further filtering with natural language processing (NLP) techniques commonly runs into a bottleneck: the need for a large amount of labeled data to train a machine learning model. To overcome this limitation, we are constructing an NLP pipeline that automatically labels relevant published abstracts, without fitting to any hand-labeled training data, with the goal of identifying the most promising non-cancer generic drugs to repurpose for the treatment of cancer. This work aims to programmatically classify each article in a large set as either relevant or non-relevant, where a study is relevant if it has evaluated the efficacy of non-cancer generic drugs in cancer patient populations. We use Snorkel, a Python-based weak supervision modeling library, which allows domain expertise to be encoded as heuristic rules. With a robust set of rules, promising classification accuracy can be achieved cheaply on a large set of documents, making this work easily applicable to other domains.
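As a rough illustration of how heuristic rules can label abstracts without hand-labeled training data, the following minimal sketch uses plain Python rather than Snorkel itself. The keyword lists, the rule names, and the majority-vote combiner are all assumptions made for this example; they are not the rules used in this work, and Snorkel fits a label model over rule outputs rather than taking a simple majority vote.

```python
from collections import Counter

# Label values; ABSTAIN means a rule has no opinion on the document.
RELEVANT, NOT_RELEVANT, ABSTAIN = 1, 0, -1

# Hypothetical keyword lists, chosen only for illustration.
CANCER_TERMS = ("cancer", "tumor", "carcinoma", "leukemia")
GENERIC_DRUGS = ("metformin", "aspirin", "propranolol")
PRECLINICAL_TERMS = ("in vitro", "cell line", "mouse model")


def lf_drug_and_cancer(text):
    """Vote RELEVANT if a generic drug and a cancer term both appear."""
    t = text.lower()
    has_drug = any(d in t for d in GENERIC_DRUGS)
    has_cancer = any(c in t for c in CANCER_TERMS)
    return RELEVANT if has_drug and has_cancer else ABSTAIN


def lf_mentions_patients(text):
    """Vote RELEVANT if the abstract mentions patients."""
    return RELEVANT if "patients" in text.lower() else ABSTAIN


def lf_preclinical(text):
    """Vote NOT_RELEVANT if the study looks preclinical."""
    t = text.lower()
    return NOT_RELEVANT if any(p in t for p in PRECLINICAL_TERMS) else ABSTAIN


def lf_no_cancer_term(text):
    """Vote NOT_RELEVANT if no cancer term appears at all."""
    t = text.lower()
    return NOT_RELEVANT if not any(c in t for c in CANCER_TERMS) else ABSTAIN


LABELING_FUNCTIONS = [
    lf_drug_and_cancer,
    lf_mentions_patients,
    lf_preclinical,
    lf_no_cancer_term,
]


def weak_label(text):
    """Combine rule votes by simple majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]
```

In this sketch, an abstract on which the rules conflict evenly is left unlabeled rather than forced into a class, which mirrors the general idea that weak labels are noisy and need not cover every document.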
Shivashankar Subramanian, Ioana Baldini, et al.
IAAI 2020
Ioana Baldini, Perry Cheng, et al.
Onward! 2017
Karthikeyan Natesan Ramamurthy, Dennis Wei, et al.
Big Data 2017