Collaboration Capacity: A Framework for Measuring Data-Intensive Biomedical Research (2020-2022)

This project aims to develop a collaboration capacity framework and to evaluate the collaboration capacity of science teams at the macro, meso, and micro levels, using GenBank metadata and related data sources. The framework defines scientific and technical (S&T) human capital, cyberinfrastructure, and science policy as the enablers of collaboration capacity. Their impact on collaboration capacity can be measured with data production and data-to-knowledge metrics, such as team size and the ratio of data to publications. As the primary data source for this project, GenBank metadata offers longitudinal coverage (1984-2018) and traces the full research lifecycle from data production to publication to patent application, creating an unprecedented opportunity to study the biomedical research enterprise.
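As a rough illustration of the two metrics named above, the sketch below computes team size and a data-to-publication ratio for a handful of made-up team records. The record layout, the field names (`members`, `submissions`, `publications`), and the values are assumptions for illustration only; they are not the project's actual schema or data.

```python
# Minimal sketch of two data-to-knowledge metrics, under assumed field names.
# All records below are hypothetical examples, not GenBank data.

def team_size(record):
    """Number of distinct people credited on a team's records."""
    return len(set(record["members"]))

def data_to_publication_ratio(record):
    """Sequence submissions per publication; None when there are no publications."""
    pubs = record["publications"]
    return record["submissions"] / pubs if pubs else None

teams = [
    {"team": "A", "members": ["Li", "Chen", "Okafor"], "submissions": 120, "publications": 4},
    {"team": "B", "members": ["Diaz", "Evans"], "submissions": 30, "publications": 0},
]

# One summary row per team: (team label, team size, data-to-publication ratio).
summary = [(t["team"], team_size(t), data_to_publication_ratio(t)) for t in teams]
for row in summary:
    print(row)
```

Returning `None` for teams without publications keeps "no publications yet" distinct from a ratio of zero, which matters when submissions precede papers by several years.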

This project will design and build analysis-ready datasets from GenBank metadata and combine them with statistics from NSF and NIH. The datasets will be used to develop computational models and test hypotheses about the correlation between collaboration capacity, team size, and the connectedness of nodes, as well as the properties of disruptive nodes and their impact on productivity and innovation. The project will also bring in events in science policy (e.g., mandates on data sharing), public health (e.g., outbreaks and prevalent chronic diseases), and funding to triangulate with the datasets and analyze collaboration capacity and its policy implications. The data source and theoretical approach compensate for the limitations of the publication-centric data sources used in past research on collaboration networks. Because the primary data source comes from basic biomedical research, this study sits at the cutting edge and can offer more holistic insights into the impact of federal investment and policy on collaboration capacity. Our future research will use this rich longitudinal data collection to mine collaboration in the data production and data-to-knowledge lifecycle more deeply, particularly in relation to the specific genes, diseases, and treatments that are key to basic and clinical biomedical research.

Structural Shift in Collaboration Networks 1992-2018 (Credit: Jeff Hemsley)

Cyberinfrastructure-Enabled Collaboration Networks (2016-2019)

Cyberinfrastructure enables collaborative research and significantly affects scientific capacity and knowledge diffusion. In response to the growing need to quantitatively evaluate the outcomes and impact of federal investment in research, this project deploys new data, tools, metrics, and methods for assessing the impact of cyberinfrastructures and the data services built on them. This research helps researchers and policy makers understand how cyberinfrastructure affects the collaboration dynamics and network structures of researchers. The project provides datasets organized by longitudinal, thematic, topical, geographical, institutional, and author dimensions, which researchers, policy makers, and students can access and use to explore data-intensive research on the science of science and innovation policy.

Metadata from GenBank, patent data from the U.S. Patent and Trademark Office, and funding data from NIH ExPORT are analyzed with descriptive statistics and models from Complex Network Analysis. The project examines not only the topological properties of the data submission and publication networks, but also the temporal ordering of collaborative relationships and the overlap between the sequence submission and publication networks. By slicing, plotting, and visualizing the data, the project develops sampling strategies and algorithms suited to exploring collaboration networks more deeply, both structurally and temporally. Algorithms for community detection, machine learning, and visualization serve as the primary computational methods in this research. Data products to be shared with research communities include 1) discovery lifecycle datasets containing sequence submissions, publications, and patents, as well as the links between them, and 2) funding factor datasets containing links between U.S. federal funding data and the discovery lifecycle datasets.
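One simple way to quantify the overlap between the sequence submission and publication networks is a Jaccard index over their collaboration edges. The sketch below is an illustration under stated assumptions, not the project's actual method: it treats each record as a list of author names, links every pair of co-listed authors, and uses made-up names. The helper names `coauthor_edges` and `jaccard` are hypothetical.

```python
from itertools import combinations

def coauthor_edges(records):
    """Undirected edge set: one edge per pair of authors sharing a record."""
    edges = set()
    for authors in records:
        # Sort so each pair has one canonical orientation, e.g. ("Chen", "Li").
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges

def jaccard(a, b):
    """Shared edges divided by all edges seen in either network."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical author lists for submissions and publications.
submissions = [["Li", "Chen", "Okafor"], ["Chen", "Diaz"]]
publications = [["Li", "Chen"], ["Diaz", "Evans"]]

sub_edges = coauthor_edges(submissions)
pub_edges = coauthor_edges(publications)
overlap = jaccard(sub_edges, pub_edges)
print(len(sub_edges), len(pub_edges), overlap)
```

A low overlap would suggest that the people who produce data together are often not the same pairs who publish together. Real data would also need author name disambiguation before edges are built, which this sketch skips.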

FUNDED BY: NIH/NIGMS
