Towards Evidence-Based Discovery


Vast quantities of electronic information provide a unique opportunity for scientists to identify candidate solutions for grand challenges: scientists, policy makers, and students have never had access to more electronic information than they do today. The goal of this research is to develop new text mining methods that are consistent with the manual processes that experts currently use to resolve contradictory and redundant evidence. Both discovery and synthesis are difficult activities even for people, so a socio-technical strategy will be required to achieve this goal.

Key outcomes from this study will be:

  • A longitudinal study of manual discovery and synthesis behaviors of a diverse network of faculty, policy makers, and students from UNC and the Research Triangle Park.
  • Advances in natural language processing methods that automatically identify concepts and relationships, detect entailment and paraphrasing, and generate multi-document summaries.
  • A collection of gold standards that reflect diverse and realistic information needs that will drive further research in natural language processing.
  • Increased understanding of the degree to which text mining methods assist in discovery and synthesis activities through a series of qualitative and quantitative user studies.
  • A set of "next generation" scientists who are well prepared to explore complex research questions that span disciplines.
  • Increased awareness and support for the "human side of discovery" through courses and workshops.


Blake, C. (In press, 2011) Text Mining Chapter in the Annual Review of Information Science and Technology, Volume 45
This paper reviews current strategies used to identify patterns from text.


Blake, C., Zheng, W., Painter, K., Weyerhaeuser, W. (2010) The Role of Semantics in Recognizing Textual Entailment, Text Analysis Conference (TAC) 2010 Recognizing Textual Entailment Track (RTE-6)
Our goal this year was to explore the degree to which semantics alone can accurately determine entailment. This paper reports the knowledge bases considered and selected for person names, locations, and organizations, and the results of the system when used on the Recognizing Textual Entailment (RTE) Track of the Text Analysis Conference (TAC).


Zheng, W. & Blake, C. (2010) Bootstrapping location relations from text, American Society for Information Science and Technology Annual Meeting, Oct, Pittsburgh, PA.
Ontologies play a critical role in information organization and can be used for a range of applications from information retrieval to knowledge discovery. However, manual ontology construction is extremely labor intensive. This paper describes a bootstrapping algorithm that, when provided with a seed term, automatically induces relations from text. We describe a series of experiments that explore the role of sentence syntax during the bootstrapping process and demonstrate the feasibility of this approach by identifying a primitive instance-level relation, the location relation, which is of interest because locations are described in multiple genres, such as news, novels, and scientific articles. Our results suggest that syntax plays a critical role in identifying location relations.
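
To make the idea of bootstrapping from a seed more concrete, here is a minimal sketch. It is not the algorithm from the paper: it uses a generic surface-pattern bootstrap over plain strings, whereas the paper studies sentence syntax. The function name bootstrap_locations, the use of a seed (entity, location) pair rather than a single seed term, and the example sentences are all assumptions made for illustration.

```python
import re

def bootstrap_locations(sentences, seed_pairs, rounds=2):
    """Grow a set of (entity, location) pairs from a small seed set.

    Hypothetical illustration only: each round (1) learns the text that
    connects a known pair inside a sentence, then (2) reuses that
    connecting text to find new capitalized pairs.
    """
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(rounds):
        # Step 1: learn connecting strings from sentences that mention a known pair.
        for sentence in sentences:
            for entity, location in pairs:
                found = re.search(
                    re.escape(entity) + r"(.+?)" + re.escape(location), sentence
                )
                if found:
                    patterns.add(found.group(1))
        # Step 2: apply every learned connecting string to harvest new pairs.
        for sentence in sentences:
            for pattern in patterns:
                found = re.search(
                    r"([A-Z]\w+)" + re.escape(pattern) + r"([A-Z]\w+)", sentence
                )
                if found:
                    pairs.add((found.group(1), found.group(2)))
    return pairs, patterns

sentences = [
    "Pittsburgh is located in Pennsylvania.",
    "Prague is located in Czechia.",
    "Sydney lies in Australia.",
]
pairs, patterns = bootstrap_locations(sentences, {("Pittsburgh", "Pennsylvania")})
print(sorted(pairs))
print(patterns)
```

In this toy run the single learned string " is located in " recovers the Prague pair but misses "Sydney lies in Australia", which hints at why features richer than surface strings, such as the syntax explored in the paper, may matter.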

Blake, C. (2010) Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles, Journal of Biomedical Informatics, 43(2):173-189.
Massive increases in electronically available text have spurred a variety of natural language processing methods to automatically identify relationships from text; however, existing annotated collections comprise only bioinformatics (gene-protein) or clinical informatics (treatment-disease) relationships. This paper introduces the Claim Framework, which reflects how authors across the biomedical spectrum communicate findings in empirical studies. The Framework captures different levels of evidence by differentiating between explicit and implicit claims, and by capturing under-specified claims such as correlations, comparisons, and observations. The results from 29 full-text articles show that authors report fewer than 7.84% of scientific claims in an abstract, thus revealing the urgent need for text mining systems to consider the full text of an article rather than just the abstract. The results also show that authors typically report explicit claims (77.12%) rather than observations (9.23%), correlations (5.39%), comparisons (5.11%), or implicit claims (2.7%). Informed by the initial manual annotations, we introduce an automated approach that uses syntax and semantics to identify explicit claims automatically, and we measure the degree to which each feature contributes to the overall precision and recall. Results show that a combination of semantics and syntax is required to achieve the best system performance.


Blake, C. (2007) The Role of Sentence Structure in Recognizing Textual Entailment. Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, p101-6, Prague, Czech Republic.
Recent research suggests that sentence structure can improve the accuracy of recognizing textual entailments and paraphrasing. Although background knowledge such as gazetteers, WordNet and custom built knowledge bases are also likely to improve performance, our goal in this paper is to characterize the syntactic features alone that aid in accurate entailment prediction. We describe candidate features, the role of machine learning, and two final decision rules. These rules resulted in an accuracy of 60.50 and 65.87% and average precision of 58.97 and 60.96% in RTE3Test and suggest that sentence structure alone can improve entailment accuracy by 9.25 to 14.62% over the baseline majority class.

Blake, C. & Pratt, W. (2006). Collaborative Information Synthesis I: A Model of information behaviors of scientists in Medicine and Public Health. Journal of the American Society for Information Science and Technology, 57(13):1740-9. JASIST Best Paper Award
Scientists engage in the discovery process more than any other user population, yet their day-to-day activities are often elusive. One activity that consumes much of a scientist's time is developing models that balance contradictory and redundant evidence. Driven by our desire to understand the information behaviors of this important user group, and the behaviors of scientific discovery in general, we conducted an observational study of academic research scientists as they resolved different experimental results reported in the biomedical literature. This article is the first of two that report our findings. In this article, we introduce the Collaborative Information Synthesis (CIS) model that reflects the salient information behaviors that we observed. The CIS model emerges from a rich collection of qualitative data including interviews, electronic recordings of meetings, meeting minutes, e-mail communications, and extraction worksheets. Our findings suggest that scientists provide two information constructs: a hypothesis projection and context information. They also engage in four critical tasks: retrieval, extraction, verification, and analysis. The findings also suggest that science is not an individual but rather a collaborative activity and that scientists use the results of one analysis to inform new analyses. In Part 2, we compare and contrast existing information and cognitive models that have inadvertently reported synthesis, and then provide five recommendations that will enable designers to build information systems that support the important synthesis activity.

Blake, C. & Pratt, W. (2006). Collaborative Information Synthesis II: Recommendations for information systems that support synthesis activities. Journal of the American Society for Information Science and Technology. 57(14):1888-95. JASIST Best Paper Award
As the quantity of information continues to exceed our human processing capacity, information systems must support users as they face the daunting task of synthesizing information. One activity that consumes much of a scientist's time is developing models that balance contradictory and redundant evidence. Driven by our desire to understand the information behaviors of this important user group, and the behaviors of scientific discovery in general, we conducted an observational study of academic research scientists as they resolved different experimental results reported in the biomedical literature. This article is Part 2 of two articles that report our findings. In Part 1 (Blake & Pratt, 2006), we introduced the Collaborative Information Synthesis (CIS) model, which captures the salient information behaviors that we observed. In this article, we review existing cognitive and information seeking models that have inadvertently reported synthesis behavior and provide five recommendations for systems designers to build information systems that support synthesis activities.

Blake, C. (2006). A Comparison of Document, Sentence and Term Event Spaces, In Proceedings of the Joint 21st International Conference on Computational Linguistics (COLING) and the 44th Annual Meeting of the Association for Computational Linguistics (ACL), p601-8, Sydney, Australia. (Paper acceptance rate 22.3%)
The trend in information retrieval systems is from document to sub-document retrieval, such as sentences in a summarization system and words or phrases in a question-answering system. Despite this trend, systems continue to model language at a document level using the inverse document frequency (IDF). In this paper, we compare and contrast IDF with inverse sentence frequency (ISF) and inverse term frequency (ITF). A direct comparison reveals that all language models are highly correlated; however, the average ISF and ITF values are 5.5 and 10.4 higher than IDF. All language models appeared to follow a power law distribution with a slope coefficient of 1.6 for documents and 1.7 for sentences and terms. We conclude with an analysis of IDF stability with respect to random, journal, and section partitions of the 100,830 full-text scientific articles in our experimental corpus.
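
The three event spaces can be illustrated with a small sketch. It assumes the common log(N / n) form of an inverse frequency, where N is the number of events (documents, sentences, or term occurrences) and n is the number of those events that involve the term. The exact formulas, tokenization, and sentence splitting used in the paper are not given here, so the helper names and the two toy documents below are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased alphanumeric tokens; a deliberately simple tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

def split_sentences(document):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def inverse_frequency(units):
    """log(N / n_t), where N is the number of units (documents or
    sentences) and n_t is the number of units containing term t."""
    containing = Counter()
    for tokens in units:
        containing.update(set(tokens))
    return {t: math.log(len(units) / n) for t, n in containing.items()}

def inverse_term_frequency(documents):
    """Assumed form: log(total term occurrences / occurrences of term t)."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    total = sum(counts.values())
    return {t: math.log(total / c) for t, c in counts.items()}

docs = [
    "Text mining finds patterns. Patterns support discovery.",
    "Discovery needs evidence. Evidence comes from text.",
]
idf = inverse_frequency([tokenize(d) for d in docs])
isf = inverse_frequency([tokenize(s) for d in docs for s in split_sentences(d)])
itf = inverse_term_frequency(docs)
print(idf["text"], isf["text"], itf["text"])
```

Because a collection contains more sentences than documents, and more term occurrences than sentences, the same term tends to receive a larger value in the finer-grained event spaces, which is the direction of the difference the abstract reports.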

Blake, C., Kampov, J., Orphanides, A., West, D., & Lown, C. (2007) UNC-CH at DUC 2007: Query Expansion, Lexical Simplification, and Sentence Selection strategies for Multi-Document Summarization, Presentation at Document Understanding Conference (DUC) 2007, Rochester, NY.
This paper describes the approach used in the UNC-CH system to generate a topic-focused summary of information reported in multiple news articles. We explored query expansion, lexical simplification and sentence simplification. Results suggest that cluster membership plays an important role in improving summarization performance, while query expansion does not. The UNC-CH system performed well in both automated and manual evaluations, achieving the 12th highest ROUGE-2 score and a score greater than or equal to the average system responsiveness score for 30 of the 45 DUC 2007 topics.

Blake, C. and Rendall, M. (2006) Scientific Discovery: A View from the Trenches, In Ljupco Todorovski, Nada Lavrac, Klaus P. Jantke (Eds.): Lecture Notes in Computer Science, Discovery Science, 9th International Conference, p41-52, Barcelona, Spain. (Long paper acceptance rate 27%)
One of the primary goals in discovery science is to understand the human scientific reasoning processes. Despite sporadic success of automated discovery systems, few studies have systematically explored the socio-technical environments in which a discovery tool will ultimately be embedded. Modeling day-to-day activities of experienced scientists as they develop and verify hypotheses provides both a glimpse into the human cognitive processes surrounding discovery and a deeper understanding of the characteristics that are required for a discovery system to be successful. In this paper, we describe a study of experienced faculty in chemistry and chemical engineering as they engage in what Kuhn would call normal science, focusing in particular on how these scientists characterize discovery, how they arrive at their research question, and the processes they use to transform an initial idea into a subsequent publication. We discuss gaps between current definitions used in discovery science, and examples of system design improvements that would better support the information environment and activities in normal science. 


Blake, C. (2010) Beyond Genes, Proteins, and Abstracts: Identifying Scientific Claims from Full-Text Biomedical Articles. Presented at GSLIS Research Showcase, April, 2010, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

Blake, C. (2010) The Claim Framework. Presented at the e-Research Roundtable, March, 2010, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign


Zheng, W. & Blake, C. (2010) Automatic extraction of location relations from text. Displayed at iConference 2010, February, 2010, Champaign, IL.