Dissecting reactive astrocyte responses: lineage tracing and morphology-based clustering

Brain damage triggers diverse cellular and molecular events, with astrocytes playing a crucial role in activating local neuroprotective and reparative signaling within damaged neuronal circuits. Here, we inves...

Neuronal repair after spinal cord injury by in vivo astrocyte reprogramming mediated by the overexpression of NeuroD1 and Neurogenin-2

As a common disabling disease, irreversible neuronal death due to spinal cord injury (SCI) is the root cause of functional impairment; however, the capacity for neuronal regeneration in the developing spinal c...

PvMYB60 gene, a candidate for drought tolerance improvement in common bean in a climate change context

Common bean (Phaseolus vulgaris) is one of the main nutritional resources in the world, and a low environmental impact source of protein. However, the majority of its cultivation areas are affected by drought and...

Enhancing adipose tissue functionality in obesity: senotherapeutics, autophagy and cellular senescence as a target

Obesity, a global health crisis, disrupts multiple systemic processes, contributing to a cascade of metabolic dysfunctions by promoting the pathological expansion of visceral adipose tissue (VAT). This expansi...

Effects of a supplemented diet containing 7 probiotic strains (Honeybeeotic) on honeybee physiology and immune response: analysis of hemolymph cytology, phenoloxidase activity, and gut microbiome

In this study, a probiotic mixture (Honeybeeotic) consisting of seven bacterial strains isolated from a unique population of honeybees (Apis mellifera ligustica) was used. That honeybee population was located in ...

Uncovering the role of the subcommissural organ in early brain development through transcriptomic analysis

The significant role of embryonic cerebrospinal fluid (eCSF) in the initial stages of brain development has been thoroughly studied. This fluid contains crucial molecules for proper brain development such as m...

A preclinical mice model of multiple sclerosis based on the toxin-induced double-site demyelination of callosal and cerebellar fibers

Multiple sclerosis (MS) is an irreversible progressive CNS pathology characterized by the loss of myelin (i.e. demyelination). The lack of myelin is followed by a progressive neurodegeneration triggering sympt...

Renoprotective effect of a novel combination of 6-gingerol and metformin in high-fat diet/streptozotocin-induced diabetic nephropathy in rats via targeting miRNA-146a, miRNA-223, TLR4/TRAF6/NLRP3 inflammasome pathway and HIF-1α

MiRNA-146a and miRNA-223 are key epigenetic regulators of toll-like receptor 4 (TLR4)/tumor necrosis factor-receptor-associated factor 6 (TRAF6)/NOD-like receptor family pyrin domain-containing 3 (NLRP3) infla...

Unveiling a novel memory center in human brain: neurochemical identification of the nucleus incertus, a key pontine locus implicated in stress and neuropathology

The nucleus incertus (NI) was originally described by Streeter in 1903, as a midline region in the floor of the fourth ventricle of the human brain with an ‘unknown’ function. More than a century later, the neuro...

Chrysin-loaded PEGylated liposomes protect against alloxan-induced diabetic neuropathy in rats: the interplay between endoplasmic reticulum stress and autophagy

Diabetic neuropathy (DN) is recognized as a significant complication arising from diabetes mellitus (DM). Pathogenesis of DN is accelerated by endoplasmic reticulum (ER) stress, which inhibits autophagy and co...

Lead and calcium crosstalk tempted acrosome damage and hyperpolarization of spermatozoa: signaling and ultra-structural evidences

Exposure of humans and animals to heavy metals is increasing day-by-day; thus, lead even today remains of significant public health concern. According to CDC, blood lead reference value (BLRV) ranges from 3.5 ...

Molecular hydrogen promotes retinal vascular regeneration and attenuates neovascularization and neuroglial dysfunction in oxygen-induced retinopathy mice

Retinopathy of Prematurity (ROP) is a proliferative retinal vascular disease occurring in the retina of premature infants and is the main cause of childhood blindness. Nowadays anti-VEGF and retinal photocoagu...

Retraction Note: Tridax procumbens flavonoids promote osteoblast differentiation and bone formation

Exercise reduces physical alterations in a rat model of fetal alcohol spectrum disorders.

Prenatal alcohol exposure (PAE) has serious physical consequences for children such as behavioral disabilities, growth disorders, neuromuscular problems, impaired motor coordination, and decreased muscle tone....

Loss of protein tyrosine phosphatase receptor delta PTPRD increases the number of cortical neurons, impairs synaptic function and induces autistic-like behaviors in adult mice

The brain cortex is responsible for many higher-level cognitive functions. Disruptions during cortical development have long-lasting consequences on brain function and are associated with the etiology of brain...

Inhibition of astroglial hemichannels prevents synaptic transmission decline during spreading depression

Spreading depression (SD) is an intriguing phenomenon characterized by massive slow brain depolarizations that affect neurons and glial cells. This phenomenon is repetitive and produces a metabolic overload th...

Correction: Conformational characterization of the mammalian-expressed SARS-CoV-2 recombinant receptor binding domain, a COVID-19 vaccine

The original article was published in Biological Research 2023 56:22

The current insights of mitochondrial hormesis in the occurrence and treatment of bone and cartilage degeneration

It is widely acknowledged that aging, mitochondrial dysfunction, and cellular phenotypic abnormalities are intricately associated with the degeneration of bone and cartilage. Consequently, gaining a comprehens...

The crucial role of HFM1 in regulating FUS ubiquitination and localization for oocyte meiosis prophase I progression in mice

Helicase for meiosis 1 (HFM1), a putative DNA helicase expressed in germ-line cells, has been reported to be closely associated with premature ovarian insufficiency (POI). However, the underlying molecular mec...

Distinct properties of putative trophoblast stem cells established from somatic cell nuclear-transferred pig blastocysts

Genetically modified pigs are considered ideal models for studying human diseases and potential sources for xenotransplantation research. However, the somatic cell nuclear transfer (SCNT) technique utilized to...

Electroacupuncture attenuates neuropathic pain via suppressing BIP-IRE-1α-mediated endoplasmic reticulum stress in the anterior cingulate cortex

Studies have suggested that endoplasmic reticulum stress (ERS) is involved in neurological dysfunction and that electroacupuncture (EA) attenuates neuropathic pain (NP) via undefined pathways. However, the rol...

Effect of Cannabis sativa L. extracts, phytocannabinoids and their acetylated derivates on the SHSY-5Y neuroblastoma cells’ viability and caspases 3/7 activation

There is a need for novel treatments for neuroblastoma, despite the emergence of new biological and immune treatments, since refractory pediatric neuroblastoma is still a medical challenge. Phytocannabinoids ...

The hepatoprotective effect of 4-phenyltetrahydroquinolines on carbon tetrachloride induced hepatotoxicity in rats through autophagy inhibition

The liver serves as a metabolic hub within the human body, playing a crucial role in various essential functions, such as detoxification, nutrient metabolism, and hormone regulation. Therefore, protecting the ...

Connexin channels and hemichannels are modulated differently by charge reversal at residues forming the intracellular pocket

Members of the β-subfamily of connexins contain an intracellular pocket surrounded by amino acid residues from the four transmembrane helices. The presence of this pocket has not previously been investigated i...

IDH1 mutation produces R-2-hydroxyglutarate (R-2HG) and induces mir-182-5p expression to regulate cell cycle and tumor formation in glioma

Mutations in isocitrate dehydrogenase 1 and 2 (IDH1 and IDH2) are present in most gliomas. IDH1 mutation is an important prognostic marker in glioma. However, its regulatory mechanism in glioma remains incomplet...

Therapeutic potential of oleic acid supplementation in myotonic dystrophy muscle cell models

We recently reported that upregulation of Musashi 2 (MSI2) protein in the rare neuromuscular disease myotonic dystrophy type 1 contributes to the hyperactivation of the muscle catabolic processes autophagy and...

Dorsal root ganglion-derived exosomes deteriorate neuropathic pain by activating microglia via the microRNA-16-5p/HECTD1/HSP90 axis

The activated microglia have been reported as pillar factors in neuropathic pain (NP) pathology, but the molecules driving pain-inducible microglial activation require further exploration. In this study, we in...

MicroRNA-721 regulates gluconeogenesis via KDM2A-mediated epigenetic modulation in diet-induced insulin resistance in C57BL/6J mice

Aberrant gluconeogenesis is considered among primary drivers of hyperglycemia under insulin resistant conditions, with multiple studies pointing towards epigenetic dysregulation. Here we examine the role of mi...

Combined transcriptomics and proteomics unveil the impact of vitamin C in modulating specific protein abundance in the mouse liver

Vitamin C (ascorbate) is a water-soluble antioxidant and an important cofactor for various biosynthetic and regulatory enzymes. Mice can synthesize vitamin C thanks to the key enzyme gulonolactone oxidase (Gul...

Novel role of LLGL2 silencing in autophagy: reversing epithelial-mesenchymal transition in prostate cancer

Prostate cancer (PCa) is a major urological disease that is associated with significant morbidity and mortality in men. LLGL2 is the mammalian homolog of Lgl. It acts as a tumor suppressor in breast and hepati...

Rapid development and mass production of SARS-CoV-2 neutralizing chicken egg yolk antibodies with protective efficacy in hamsters

Despite the record speed of developing vaccines and therapeutics against the SARS-CoV-2 virus, it is not a given that such success can be secured in future pandemics. In addition, COVID-19 vaccination and appl...

High-fat diet, microbiome-gut-brain axis signaling, and anxiety-like behavior in male rats

Obesity, associated with the intake of a high-fat diet (HFD), and anxiety are common among those living in modern urban societies. Recent studies suggest a role of microbiome-gut-brain axis signaling, includin...

General regulatory factors exert differential effects on nucleosome sliding activity of the ISW1a complex

Chromatin dynamics is deeply involved in processes that require access to DNA, such as transcriptional regulation. Among the factors involved in chromatin dynamics at gene regulatory regions are general regula...

Establishment of primary prostate epithelial and tumorigenic cell lines using a non-viral immortalization approach

Research on prostate cancer is mostly performed using cell lines derived from metastatic disease, not reflecting stages of tumor initiation or early progression. Establishment of cancer cell lines derived from...

The effect of diabetes mellitus on differentiation of mesenchymal stem cells into insulin-producing cells

Diabetes mellitus (DM) is a global epidemic with increasing incidences. DM is a metabolic disease associated with chronic hyperglycemia. Aside from conventional treatments, there is no clinically approved cure...

Control of astrocytic Ca2+ signaling by nitric oxide-dependent S-nitrosylation of Ca2+ homeostasis modulator 1 channels

Astrocytes Ca2+ signaling play a central role in the modulation of neuronal function. Activation of metabotropic glutamate receptors (mGluR) by glutamate released during an increase in synaptic activity triggers ...

Increased levels and activation of the IL-17 receptor in microglia contribute to enhanced neuroinflammation in cerebellum of hyperammonemic rats

Patients with liver cirrhosis may show minimal hepatic encephalopathy (MHE) with mild cognitive impairment and motor incoordination. Rats with chronic hyperammonemia reproduce these alterations. Motor incoordi...

Identification and expression analysis of two steamer-like retrotransposons in the Chilean blue mussel (Mytilus chilensis)

Disseminated neoplasia (DN) is a proliferative cell disorder of the circulatory system of bivalve mollusks. The disease is transmitted between individuals and can also be induced by external chemical agents su...

Noncoding RNAs in skeletal development and disorders

Protein-encoding genes only constitute less than 2% of total human genomic sequences, and 98% of genetic information was previously referred to as “junk DNA”. Meanwhile, non-coding RNAs (ncRNAs) consist of app...

Cx43 hemichannels and panx1 channels contribute to ethanol-induced astrocyte dysfunction and damage

Alcohol, a widely abused drug, significantly diminishes life quality, causing chronic diseases and psychiatric issues, with severe health, societal, and economic repercussions. Previously, we demonstrated that...

Galectins in epithelial-mesenchymal transition: roles and mechanisms contributing to tissue repair, fibrosis and cancer metastasis

Galectins are soluble glycan-binding proteins that interact with a wide range of glycoproteins and glycolipids and modulate a broad spectrum of physiological and pathological processes. The expression and subc...

Glutaminolysis regulates endometrial fibrosis in intrauterine adhesion via modulating mitochondrial function

Endometrial fibrosis, a significant characteristic of intrauterine adhesion (IUA), is caused by the excessive differentiation and activation of endometrial stromal cells (ESCs). Glutaminolysis is the metabolic...

The long-chain flavodoxin FldX1 improves the biodegradation of 4-hydroxyphenylacetate and 3-hydroxyphenylacetate and counteracts the oxidative stress associated to aromatic catabolism in Paraburkholderia xenovorans

Bacterial aromatic degradation may cause oxidative stress. The long-chain flavodoxin FldX1 of Paraburkholderia xenovorans LB400 counteracts reactive oxygen species (ROS). The aim of this study was to evaluate the...

MicroRNA-148b secreted by bovine oviductal extracellular vesicles enhance embryo quality through BPM/TGF-beta pathway

Extracellular vesicles (EVs) and their cargoes, including MicroRNAs (miRNAs) play a crucial role in cell-to-cell communication. We previously demonstrated the upregulation of bta-mir-148b in EVs from oviductal...

YME1L-mediated mitophagy protects renal tubular cells against cellular senescence under diabetic conditions

The senescence of renal tubular epithelial cells (RTECs) is crucial in the progression of diabetic kidney disease (DKD). Accumulating evidence suggests a close association between insufficient mitophagy and RT...

Effects of latroeggtoxin-VI on dopamine and α-synuclein in PC12 cells and the implications for Parkinson’s disease

Parkinson’s disease (PD) is characterized by death of dopaminergic neurons leading to dopamine deficiency, excessive α-synuclein facilitating Lewy body formation, etc. Latroeggtoxin-VI (LETX-VI), a proteinaceo...

Glial-restricted progenitor cells: a cure for diseased brain?

The central nervous system (CNS) is home to neuronal and glial cells. Traditionally, glia was disregarded as just the structural support across the brain and spinal cord, in striking contrast to neurons, alway...

Carbapenem-resistant hypervirulent ST23 Klebsiella pneumoniae with a highly transmissible dual-carbapenemase plasmid in Chile

The convergence of hypervirulence and carbapenem resistance in the bacterial pathogen Klebsiella pneumoniae represents a critical global health concern. Hypervirulent K. pneumoniae (hvKp) strains, frequently from...

Endometrial mesenchymal stromal/stem cells improve regeneration of injured endometrium in mice

The monthly regeneration of human endometrial tissue is maintained by the presence of human endometrial mesenchymal stromal/stem cells (eMSC), a cell population co-expressing the perivascular markers CD140b an...

Embryo development is impaired by sperm mitochondrial-derived ROS

Basal energetic metabolism in sperm, particularly oxidative phosphorylation, is known to condition not only their oocyte fertilising ability, but also the subsequent embryo development. While the molecular pat...

  • ISSN: 0717-6287 (electronic)

Biological Research

Diatom genome sizes predict abundance

August 8, 2024

Body size is a fundamental predictor of organismal abundance, and larger-bodied organisms predominate in colder areas ("Bergmann's rule"). A study of diatoms by Wade Roberts, Adam Siepielski and Andrew Alverson reveals that in these unicellular organisms, genome size, rather than cell size, is a strong predictor of species abundance in the polar oceans.

Image credit: Matthew Ashworth and Andrew Alverson

PLOS Biologue

Community blog for PLOS Biology, PLOS Genetics and PLOS Computational Biology.

Methods and Resources

CellTracksColab for cell tracking

Exploring large amounts of cell tracking data remains a challenge. Estibaliz Gómez-de-Mariscal, Hanna Grobe, Joanna Pylvänäinen, Laura Xénard, Guillaume Jacquemet and colleagues present CellTracksColab, a platform that provides a transformative solution for cell tracking analysis, combining cutting-edge computational methods with a user-friendly interface.

Image credit: pbio.3002740

Recently Published Articles

  • GRK2 kinases in the primary cilium initiate SMOOTHENED-PKA signaling in the Hedgehog cascade
  • Toxoplasma gondii rhoptry discharge factor 3 is essential for invasion and microtubule-associated vesicle biogenesis
  • Comprehensive blueprint of Salmonella genomic plasticity identifies hotspots for pathogenicity genes

Current Issue July 2024

Research Article

Salmonella pathogenicity gene hotspots

Effective management of Salmonella infections requires understanding its dynamic evolution. Simran Krishnakant Kushwaha, Franklin Nobrega and co-workers show how specific genomic regions influence the distribution of pathogenicity factors in Salmonella, highlighting the potential for targeted infection control strategies.

Image credit: pbio.3002746

Salmonella exploits host polyamines

Bacterial pathogens often exploit host factors to enhance their infectivity. Tsuyoshi Miki, Tohru Minamino, Yun-Gi Kim and co-authors show that Salmonella Typhimurium boosts host polyamine production, which is crucial for the expression and needle assembly of its type 3 secretion system.

Image credit: pbio.3002731

Update Article

Cholesterol and cholecystokinin receptors

A previous PLOS Biology study used the cryo-EM structure of the cholecystokinin type 1 receptor (CCK1R) to reveal insights into G protein selectivity. This Update Article by Kaleeckal Harikumar, Peishen Zhao, Brian Cary, Denise Wootten, Patrick Sexton, Laurence Miller and co-workers provides a structural and biophysical characterization of the effects of cholesterol on ligand binding and G protein coupling at the receptor.

Image credit: pbio.3002673

FURNA: functional annotations of RNA structures

There is an increasing number of experimentally determined 3D RNA structures, but the majority lack functional annotation. To address this gap, Chengxin Zhang and Lydia Freddolino provide a database of 3D RNA structures with comprehensive, high-quality functional annotations to enable discovery of RNA functions from structural and sequence information.

Image credit: pbio.3002476

Alternative start sites in Cryptococcus

Alternative transcription start site (altTSS) usage is one of the major means of gene regulation in animals but is unknown in non-yeast fungi. Thi Tuong Vi Dang, Guilhem Janbon and co-workers reveal widespread altTSS in Cryptococcus that alters gene expression and protein targeting, regulated by a single transcription factor, Tur1, in response to environmental cues.

Image credit: pbio.3002724

Ancestral immunity

Aude Bernheim, Jean Cury and Enzo Poirier introduce the concept of ancestral immunity; the set of immune modules conserved between prokaryotes and eukaryotes, discussing the topology of ancestral immunity and an evolutionary scenario for its existence.

Image credit: pbio.3002717

The new science of sleep

Omer Sharon, Eti Ben Simon, Matthew Walker and co-authors highlight eight of the most exciting new discoveries within sleep science, discussing how these have expanded our understanding of sleep's function at the cellular, organismal, and societal levels.

Image credit: pbio.3002684

Unsolved Mystery

The mysteries of mitochondrial shape

Mitochondria come in many shapes and sizes. Noga Preminger and Maya Schuldiner explore the diverse processes that influence mitochondrial shape and network formation, highlighting gaps in our understanding of mitochondrial architecture.

Image credit: pbio.3002671

Aligning data with decisions

The planetary outlook for biodiversity is dire. Leah Gerber and Gwenllian Iacona introduce a new Collection of articles that discuss the data we have and the data we need for more effective conservation policies.

Image credit: Leah Gerber

Decision making for conservation and biodiversity

Translating conservation and biodiversity research from the field into the real world is a complex problem. This collection discusses issues around economics, policy, and how to do research that answers questions that decision makers have.

Symbiosis across the tree of life

Symbiosis research has become a holistic and pervasive field with a mature theoretical basis. This collection showcases symbiotic relationships across the tree of life, exploring their evolutionary basis and underlying mechanisms.

PLOS Biology 20th Anniversary

PLOS Biology is 20 and we are celebrating with a collection that contains articles that look back at landmark studies that we published, others that look past and future, and others discussing how publishing and open science have evolved and what is to come.

Engineering plants for a changing climate

This collection explores engineering strategies to help us adapt plants to a changing climate, including breeding techniques, genome engineering, synthetic biology and microbiome engineering.

Going for green

The green collection explores biological solutions that could be applied to reduce CO2 emissions, get rid of non-degradable plastics, produce food in a sustainable manner or generate energy.

European congress of immunology 2024

September 1 - 4

Meet Associate Editor Melissa Vazquez Hernandez

Wellcome: Organoids: advances and applications

September 9 - 11

Meet Senior Editor Ines Alvarez-Garcia

A million shades of green: Understanding and harnessing plant metabolic diversity

September 9 - 10

Meet Associate Editor Suzanne de Bruijn

J Biol Res (Thessalon), v.22(1); 2015 Dec

Data integration in biological research: an overview

Vasileios Lapatas

Department of Informatics, Ionian University, 7 Tsirigoti Square, Corfu, 49100 Greece

Michalis Stefanidakis

Rafael C. Jimenez

ELIXIR, Wellcome Trust Genome Campus, Hinxton, CB10 1SD UK

Allegra Via

Biocomputing Group, Sapienza University, Piazzale Aldo Moro 5, Rome, 00185 Italy

Maria Victoria Schneider

361° Division, The Genome Analysis Centre, Norwich Research Park, Norwich, NR4 7UH UK

Data sharing, integration and annotation are essential to ensure the reproducibility of the analysis and interpretation of experimental findings. Often these activities are perceived as a role that bioinformaticians and computer scientists have to take on with little or no input from experimental biologists. On the contrary, biological researchers, being the producers and often the end users of such data, have a big role in enabling biological data integration. The quality and usefulness of data integration depend on the existence and adoption of standards, shared formats, and mechanisms that are suitable for biological researchers to submit and annotate the data, so that the data can be easily searched, conveniently linked and consequently used for further biological analysis and discovery. Here, we provide background on what data integration is from a computational science point of view, how it has been applied to biological research, which key aspects contributed to its success, and future directions.

Introduction

Data-driven biological research has made data integration strategies crucial for advancement and discovery in a plethora of fields (e.g. genomics, proteomics, metabolomics, environmental sciences and clinical research, to name a few) [1–6]. Technically, solutions for data integration have been developed and applied in both the corporate and academic sectors. When it comes to biological research, people consider different interpretations and levels of data integration [7–14], ranging from genomic data to protein-protein interactions.

Together with data production, there is no doubt that data management, storage and consequently retrieval, analysis and interpretation are at the core of any biological research project. Moreover, access to the actual data sets used in a particular study is often crucial for the reproducibility and expansion of that study, hence the emphasis in recent years on Open Science and its associated initiatives [15–21]. Notably, in biological research, the difficulties associated with data integration have only grown with the advent of high-throughput technologies [3, 22, 23]. Anyone working with Next Generation Sequencing (NGS) faces challenges arising from several aspects of this type of data, one of the major ones being the volume of the data [24, 25].

Here, we refer to data integration as the computational solution allowing users, from end users (GUI) to power users (API), to fetch data from different sources, combine, manipulate and re-analyse them, as well as to create new datasets and share these again with the scientific community.

With this definition in mind, it is clear that data integration solutions are imperative for the advancement of research in biological sciences, as are the mechanisms that make such processes traceable, shareable and hence “integrable” [26–28]. Here, we provide an overview of the strategies most commonly adopted by the biological research community, current challenges and future directions.

Key concepts and terminology

Data integration should not just rely on software engineers and computational scientists, but needs to be driven by the actual users whose communities need to define, adopt and use standards, ontologies and annotation best practice. Therefore, it is particularly important for the biological research community to get acquainted with the conceptual basis of data integration, its limitations, challenges and actual terminology.

To familiarise readers from the experimental biology community, in Table 1 we present key concepts, definitions and terms used by bioinformaticians and computer scientists.

Table 1. Terminology

| Term | Definition |
| --- | --- |
| Schema | A structured and “queryable” way of storing data |
| Database | A single schema or a collection of schemata |
| Sources | A number of databases that contain data. Data that reside in each source can either duplicate and/or complement data from other sources |
| Data Integration | The process of combining data that reside in different sources, to provide users with a unified view of such data |
| Data Standards | Agreements on representation, format, and definition for common data |
| Data Formats | A structured way to represent data and metadata in a file |
| Data Warehousing | Model for integrating data where the data from different sources reside on a central repository (aka data warehouse) |
| Federated Databases | Model for integrating data where the data reside on the original sources and users are provided with a unified view of the data based on mapping mechanisms of the information |
| Linked Data | The network of interlinked data that is available on the web. It is used to automatically share semantically rich information and represents the biggest attempt to convert significant amounts of human knowledge across all fields into a computer-readable format |
| Ontology | A structured way of describing data, often presented in a computer-readable format. In bioinformatics, ontologies are sets of unambiguous, universally agreed terms used to describe biological phenomena and “entities”, their properties and their relationships |
| Controlled Vocabulary | A collection of terms for describing a certain domain of interest |
| Unique Identifier | A unique representation for a biological entity (molecule, organism, ontology term, etc.). Usually an alphanumeric string that is used to refer to this entity and distinguishes it from others (much like an ID or passport number in humans) |
| Metadata | Data describing data, i.e., additional information (e.g., a comment, explanation, attributes, etc.) for a specific biological entity or process. As an example, in the context of an ontology, this is used to specify significant properties of the ontology |
| Annotation | The process of attaching relevant information (metadata) to a raw biological entity |
| Automatic Annotation | Annotation done by computer software (often by transferring information from one source to another). This is a way of producing a large amount of metadata |
| Manual Annotation | As opposed to automatic annotation, annotation done by an actual individual |
| GUI | Graphical User Interface. The way a user interacts with a computer by using graphical icons and visual indicators such as buttons, forms, etc. In the scope of this paper we use the term GUI to refer to interfaces that allow biologists to search/read/edit integrated biological data |
| API | Application Programming Interface. Set of tools and protocols that a power user can use in order to automatically gain access to functionality and/or data that have been developed/gathered by another individual/organisation |
| UX | User eXperience. The process of improving user satisfaction by focusing on the usability of a given product |
| Visualisation Tools | Applications that help biologists view the data in a more human-friendly way, such as 3D or graph representations of the data (e.g., Cytoscape for visualising complex networks) |

In computational sciences, the theoretical frameworks for data integration have been classified into two major categories, namely “eager” and “lazy” [ 29 , 30 ]. The two approaches differ in the way the data are integrated. In the eager approach (warehousing), the data are copied into a global schema and stored in a central data warehouse, whereas in the lazy approach the data reside in distributed sources and are integrated on demand, based on a global schema used to map the data between sources.

Each of the two main categories of data integration has to deal with its own challenges in order to provide the user with a unified view of the data. In the eager approach, researchers face the challenge of keeping data updated and consistent, and of protecting the global schema from corrupted data [ 31 , 32 ]. In the lazy approach, data are queried at the sources, and the scientific community is trying to find ways of improving the query answering process [ 33 – 38 ] and source completeness [ 36 , 37 , 39 , 40 ]. Which approach should be used, and when, depends on the amount of data, who owns them and the existing infrastructure.
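
The difference between the two approaches, and the update problem that the eager approach has to manage, can be illustrated with a minimal Python sketch. The two sources, their field names and the mapping functions are all hypothetical and only stand in for real databases and a real global schema.

```python
# Two toy sources with different local schemas (hypothetical data).
SOURCE_A = [{"acc": "P1", "protein_name": "alpha"}]
SOURCE_B = [{"id": "P2", "label": "beta"}]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Mappings from each local schema to the global schema (accession, name).
MAPPINGS = {
    "A": lambda r: {"accession": r["acc"], "name": r["protein_name"]},
    "B": lambda r: {"accession": r["id"], "name": r["label"]},
}


def build_warehouse():
    """Eager: copy every record into one central store, up front."""
    return [MAPPINGS[name](rec) for name, recs in SOURCES.items() for rec in recs]


def lazy_query(accession):
    """Lazy: leave data at the sources and map records only at query time."""
    for name, recs in SOURCES.items():
        for rec in recs:
            mapped = MAPPINGS[name](rec)
            if mapped["accession"] == accession:
                return mapped
    return None


warehouse = build_warehouse()                    # snapshot taken now
SOURCE_B.append({"id": "P3", "label": "gamma"})  # a source is updated later

eager_hit = next((r for r in warehouse if r["accession"] == "P3"), None)
lazy_hit = lazy_query("P3")
print(eager_hit)  # None: the warehouse copy is stale until it is rebuilt
print(lazy_hit)   # {'accession': 'P3', 'name': 'gamma'}
```

In this toy setting the lazy query always sees current data but pays the mapping cost on every request, whereas the warehouse answers from a local copy that must be refreshed.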

In biology we see a diversity of implementations of these two approaches, used at a variety of levels and in forms such as data centralisation, federated databases [ 41 , 42 ] and linked data [ 43 ]. Figure 1 shows the most common schemata used to integrate data in biology.

Fig. 1 Data integration methodologies. This figure illustrates six major types of data integration methodologies in biology

UniProt [ 44 ] and GenBank [ 45 ] are examples of centralised resources (Fig. 1, Data Centralisation), whereas Pathway Commons [ 46 ] collects pathways from different databases and stores them in a shared repository that can be used to query and analyse pathway information (Fig. 1, Data Warehousing). Datasets can also be integrated by in-house workflows that access distributed databases and download data to a local repository (Fig. 1, Dataset Integration). ExPASy [ 47 ] is the SIB Bioinformatics Resource Portal through which the user can access databases and tools in different areas of life science (Fig. 1, Hyperlinks). Database links are crucial for interoperability and several efforts have been made in this context [ 48 ]. Regarding the federated database model (Fig. 1, Federated Databases), the Distributed Annotation System (DAS) [ 49 ] is a valuable example. DAS is a client-server system used to integrate annotation data on biological sequences residing on multiple distant servers and to display them in a single view. In this case, a translation layer is needed to achieve data integration among heterogeneous databases. There are various ways to do this, but in general the layer transforms the data from each database into a common format so that a mapping service can interpret them in the same way. As for linked data integration (Fig. 1, Linked Data), the services offered are graphical user interfaces (GUIs) that provide the user with hyperlinks connecting related data from multiple data providers in a large network of Linked Data. Bio2RDF [ 43 ] is an example of such an integration system.

Data integration in biological research faces challenges associated with a variety of factors, such as the adoption of standards and the ease of conversion between data/file formats [ 2 ].

Figure 2 illustrates a simplified schematic view of the current state of the components of biological research data integration. Various attempts to integrate the data rely on translation layers that, by applying agreed standards, transform the data into a unified format in order to integrate them. In other words, different formats for the same type of data (e.g. NGS) need to be “translated” into a unified format by applying shared rules. On top of the integration layer, there are various GUIs that make it possible to utilise (download, analyse, represent, etc.) the integrated data. Furthermore, a myriad of resources and visualisation tools have been generated that fail to comply with standards and/or are not compatible with each other [ 50 ]. On the other hand, controlled vocabularies and ontologies to ease data integration are available for an increasing number of biological domains. Some of them can be found at the websites of the OBO (Open Biological and Biomedical Ontologies) Foundry [ 51 ], the NCBO (National Center for Biomedical Ontology) BioPortal [ 52 ], and the OLS (Ontology Lookup Service). One successful example is the XML-based proteomic standards defined by the HUPO-PSI (Human Proteome Organisation-Proteomics Standards Initiative) consortium (see Table 2). The rest of the paper discusses key aspects of standards (ontologies, data formats, identifiers, reporting guidelines, consortiums and standards initiatives), followed by a section on visualisation.

Fig. 2 Current state. This figure illustrates a simplified view of the current state of biological data and tools

Table 2 List of data standards initiatives

| Acronym | Name | Goal | URL | PMID |
| --- | --- | --- | --- | --- |
| OBO | The Open Biological and Biomedical Ontologies | Establish a set of principles for ontology development to create a suite of orthogonal interoperable reference ontologies in the biomedical domain | | 17989687 |
| CDISC | Clinical Data Interchange Standards Consortium | Establish standards to support the acquisition, exchange, submission and archive of clinical research data and metadata | | 23833735 |
| HUPO-PSI | Human Proteome Organisation-Proteomics Standards Initiative | Define community standards for data representation in proteomics to facilitate data comparison, exchange and verification | | 16901219 |
| GAGH | Global Alliance for Genomics and Health | Create interoperable approaches to catalyze projects that will help unlock the great potential of genomic data | | 24896853 |
| COMBINE | Computational Modeling in Biology | Coordinate the development of the various community standards and formats for computational models | | 25759811 |
| MSI | Metabolomics Standards Initiative | Define community-agreed reporting standards, which provide a clear description of the biological system studied and all components of metabolomics studies | | 17687353 |
| RDA | Research Data Alliance | Build the social and technical bridges that enable open sharing of data across multiple scientific disciplines | | |

As mentioned above, one of the most important factors for the biological field to thrive is the standardisation of data. Computational science encountered a similar problem with the web, specifically with the way browsers parse web pages. This was solved by agreeing on W3C standards [ 53 ]: all browsers are effectively forced to comply, because those that do not deliver a poor user experience and risk losing market share.

In biology there are many different ways of representing similar data, which makes the data harder to integrate and process into unified views. Gene naming is an example of poor uniformity in data representation. Although full guidelines for adopting gene nomenclature standards were issued in 1979 (see [ 54 ]), an assortment of alternate names is still in use across the scientific literature and databases, posing a challenge to data sharing. When it comes to biological research, it is crucial to create (where none exist), adopt and implement standards. Without these it is (nearly) impossible to achieve data integration [ 55 , 56 ].

So what do we mean by standards? A standard can be defined as an agreed, compliant term or structure used to represent a biological entity. Entities are all types of units of biological information. For example, we use T, G, A and C as a standard way to refer to the nucleotides that make up DNA, and usually a single letter for each amino acid (aa); consequently, a string of letters represents a DNA or protein sequence. However, a protein might be known in the scientific literature, and referred to by researchers, by a variety of names, synonyms and abbreviations.

So, which standards exist, who defines them and how do they work? Many standards initiatives and efforts seem to exist, sometimes redundant and often not driven by the end-user communities. It is out of the scope of this paper (and probably a never-ending exercise) to review all of them; they proliferate, but not necessarily in harmonising ways. A snapshot of the variety of standards for metadata can be found at the DCC website [ 57 ] and BioSharing [ 58 ], as an example of the point we are making. Table 2 reports a list of standards initiatives in the omics field along with their primary goal, URL and key reference.

Standards facilitate data re-use. They make data sharing easier, saving the overheads and time otherwise lost in loading and converting data and in getting systems to work properly with them. They help overcome interoperability difficulties across different data formats, architectures and naming conventions and, at the infrastructure level, enable access systems to work together [ 59 – 62 ]. The absence of standards means a substantial loss of productivity and less data available to researchers [ 63 ].

Figure 3 illustrates a schematic view of an ideal state of the components of biological research data integration. This figure emphasises the importance of standards, which are the base of all the upper layers of the infrastructure. Without solid foundations, it is very difficult to build and maintain robust tools for the layers above. The arrows indicate that the data can be used across all layers, in both directions. For example, in an ideal state, all biological data would be integrated from various databases across the world and biologists would be able to use a GUI to locate the entity of interest. They could then use a visualisation tool to obtain a better representation of the entity by using the same data previously identified through the GUI (such as a unique identifier). Furthermore, the biologist would be in a position to annotate or edit the data directly from the visualisation tool, which in turn would commit the changes to the integrated service and from there all the way down the pyramid, until the data in the underlying database are edited and annotated.

Fig. 3 Ideal state. This figure illustrates a simplified view of an ideal state of biological data and tools

Standards are therefore key to the data sharing process since they describe the norms which should be adopted to facilitate interchange and inter-working of information, processes, objects and software. Thus data resources play a major role not just in data management, integration, access, and preservation, but also for providing adequate support to research communities.

Ontologies have been proliferating in biological research, and their importance has been underlined several times [ 64 – 67 ], also in the specific context of data integration [ 68 ]. In order to bring some coordination and consolidation to the proliferation of ontologies across the biological and biomedical research fields, the Open Biological and Biomedical Ontologies (OBO) initiative was set up. OBO is a collaborative experiment involving developers of science-based ontologies who are establishing a set of principles for ontology development with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain. Biological researchers can get involved and provide feedback by joining the discussion fora that OBO provides. Currently there are ten OBO Foundry ontologies and more than 120 candidate ontologies or other ontologies of interest [ 51 ].

These efforts need the direct involvement of biologists themselves in adopting and implementing such ontologies, ensuring that these are known and disseminated across communities. Other important initiatives are the NCBO (National Center for Biomedical Ontology) BioPortal [ 69 , 70 ] and the OLS (Ontology Lookup Service) [ 71 ].

With a set of unique, common, compliant standards in place, it will be possible to create tools that integrate the data on the web using an existing infrastructure such as linked data. This will enable querying multiple sources without having to re-invent integration techniques for each source. One of the efforts currently attempting this is Bio2RDF [ 43 ], a major effort to integrate biological data using the linked data infrastructure. So far there are no tools that can utilise these data directly; they are mainly accessible via complex queries or low-level GUIs.
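
The idea behind linked data can be sketched in a few lines of Python: each source publishes (subject, predicate, object) statements, and a shared identifier for the subject is what allows statements from independent sources to be combined and queried together. This is a simplified in-memory illustration, not how Bio2RDF is implemented; every URI and every statement below is made up for the example.

```python
# Each source publishes (subject, predicate, object) triples; the shared
# subject URI is what links them. All URIs below are hypothetical.
PROTEIN = "http://example.org/protein/P01308"
VOCAB = "http://example.org/vocab/"

source_sequences = [
    (PROTEIN, VOCAB + "name", "Insulin"),
    (PROTEIN, VOCAB + "organism", "Homo sapiens"),
]
source_pathways = [
    (PROTEIN, VOCAB + "participatesIn", "http://example.org/pathway/toy-pathway"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Integration is just the union of the statements from both sources.
graph = source_sequences + source_pathways
facts = match(graph, s=PROTEIN)
for subject, predicate, obj in facts:
    print(predicate, "->", obj)
```

Because both sources agree on the identifier of the protein, no source-specific integration code is needed: a single pattern query runs over the merged statements.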

Data formats are the concrete way we structure and represent biological information in a file. They are particularly relevant to those who deal with large amounts of information, such as that generated by high-throughput experiments. Indeed, a scientist interested in a single gene or a few genes at a time may extract information about them by manually “parsing” the literature or free-text (i.e. non-formatted) documents. The need for storing biological data in formatted files arose from the need to analyse them with computers. The amounts of genomics and proteomics data, which cannot be manually analysed element by element, are increasing exponentially, and the adoption of commonly agreed formats to represent them in computer-readable files is nowadays of utmost importance. Historically, the scarcity of well-structured data standards and schemas caused many different formats to flourish, even for representing the same type of data, although the adoption of standard file formats is essential to data exchange and integration. Amusingly, the Roslin Bioinformatics Law’s First Law states: “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats” [ 72 ].

For the benefit of data integration, though, it would be ideal to have well-structured data in a few basic formats that are easily computer readable and therefore easily integrated. In the specific case of NGS data, the lag between the emergence of high-throughput screening technologies and the scientific community settling on a standard format means that time and effort are spent on converting raw files across multiple sequencing platforms to make them compatible [ 73 ]. Currently, in NGS there are no real “standards” that people adhere to, but rather a set of commonly used formats (FASTA/Q, SAM, VCF, GFF/GTF, etc.). There are descriptor standards like MIGS [ 74 ], but these might not be generally adopted. More generally, an exhaustive “atlas” of the formats used in bioinformatics cannot be found on the Internet today. One partial list is available at http://genome.ucsc.edu/FAQ/FAQformat.html and the description of many formats can be found in the online forum BioStar [ 75 ].

A good format needs to take into account the data themselves (for example the DNA sequence of a gene) and the so-called metadata, i.e. additional information describing the data (e.g. gene name, taxonomy information, cross-references to other resources, etc.), and has to adopt strategies (“tricks”) to make metadata unequivocally distinguishable from data by a computer program. This goal is achieved in different ways by different bioinformatics resources, resulting in the large number of formats we observe today. However, despite the large variety of computer-readable formats, we found that the most commonly used ones can be ascribed to four main classes: 1) tables, 2) FASTA-like, 3) GenBank-like and 4) tag-structured. Table 3 reports examples for each of these classes.

Table 3 Most commonly used data formats in bioinformatics

| Data format class | General data-interchange formats | Nucleotide sequence data | Protein sequence data | Structural data | Sequence alignment | Other data types (PPI, etc.) |
| --- | --- | --- | --- | --- | --- | --- |
| Tables | CSV, TSV | BED; GFF | GFF, Uniprot-GFF | PSF(D), MMCIF(D) | SAM(D) | |
| FASTA-like | | FASTA; FASTQ | FASTA, PIR | | SAM(M) | Wig |
| GenBank-like | | GenBank; EMBL | Uniprot-TEXT | PDB, PSF(M), MMCIF(D) | CLUSTAL, MSF, PHYLIP(D) | |
| Tag-structured | HTML; XML; JSON | SBOL-XML | Uniprot-XML; Uniprot-RDF/XML | | | PSI MI-XML; PSI-PAR |

D = data; M = metadata. Formats appearing in more than one class are a mixture of classes

In table formats, data are organised in a table in which the columns are separated by tabs, commas, pipes, etc., depending on the source generating the file. FASTA-like files use, for each data record, one or more “definition” or “declaration” lines, which contain metadata or specify the content of the following lines. Definition/declaration lines usually start with a special character or keyword in the first position of the line (a “>” in FASTA files or a “@” in FASTQ or SAM files) and are followed by lines containing the data themselves (Fig. 4). In some cases, declaration lines may be interspersed with data lines. This format is mostly used for sequence data. In the GenBank-like format, each line starts with an identifier that specifies the content of the line (Fig. 5). Tag-structured formatting uses “tags” (“<”, “>”, “{”, “}”, etc.) to make data and metadata recognisable with high specificity (Fig. 6). Tag-structured text files, especially XML and JSON, are increasingly employed as data interchange formats between different programming languages.
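
As an illustration of how declaration lines separate metadata from data in a FASTA-like file, the following Python sketch reads FASTQ records. It assumes the common layout of exactly four lines per record (“@” line, sequence, “+” line, quality string) with no line wrapping; the function name and the example reads are invented for the illustration.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        if len(record) != 4:
            raise ValueError(f"incomplete record starting at line {i + 1}")
        header, seq, plus, qual = record
        # The two declaration lines are recognised by their first character.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


example = """@read1
GATTACA
+
IIIIHHH
@read2
ACGT
+
!!!!
"""
records = list(parse_fastq(example.splitlines()))
print(records)
```

Records are read by position rather than by scanning for “@”, because a quality string may itself begin with that character.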

Fig. 4 Selected parts of a FASTQ file. In this format declaration lines start with two different characters (“@” and “+”) corresponding to different data types (the raw sequence and the sequence quality values, respectively)

Fig. 5 Selected parts of the GenBank entry DQ408531. The complete entry can be found at http://www.ncbi.nlm.nih.gov/nuccore/DQ408531

Fig. 6 Selected parts of the UniProt entry P01308 in XML format. The complete entry can be found at http://www.uniprot.org/uniprot/P01308.xml

There are also examples of data files using different representations for data and metadata. This means that two or more format classes may be used in the same data file. An example is represented by SAM files, which contain both GenBank-like lines (for the metadata) and table columns (for the data), as shown in Fig. 7.
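
A reader for such a mixed file has to switch between the two representations line by line. The sketch below, written under simplifying assumptions, treats lines starting with “@” as header (metadata) lines made of tab-separated TAG:VALUE pairs and all other lines as tab-separated alignment rows, of which only the eleven mandatory columns are kept. The function name and the example content are invented.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam(lines):
    """Split SAM text into header records (metadata) and alignment rows (data)."""
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            record_type, *fields = line[1:].split("\t")
            if record_type == "CO":
                # Comment lines carry free text rather than TAG:VALUE pairs.
                tags = {"text": "\t".join(fields)}
            else:
                tags = dict(field.split(":", 1) for field in fields)
            header.append((record_type, tags))
        else:
            values = line.split("\t")
            # Optional fields after the eleventh column are ignored here.
            alignments.append(dict(zip(SAM_COLUMNS, values[:11])))
    return header, alignments


example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIHHH\n"
)
header, alignments = parse_sam(example.splitlines())
print(header)
print(alignments[0]["RNAME"], alignments[0]["POS"])
```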

Fig. 7 Selected parts of a SAM file

Should any of these four data representation classes be preferred over the others? Although we observe an increasing use of XML, and some authors propose adopting XML for biological data interchange between databases and other sources of data [ 76 ], we believe that there is no ultimate answer. Some text formats better suit specific kinds of data and specific computational requirements and purposes. For example, it is difficult to imagine how macromolecule X-ray or NMR coordinates and the related annotation, currently stored in PDB files, could fit into the FASTA-like format. On the other hand, if one has to parse big sequence files, the FASTA format, with its single-line annotation, yields smaller files than other formats and allows them to be parsed with just a few lines of code. Note that some formats (e.g. SAM) can be compressed into a binary version (BAM) for intensive data processing.
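
To make the point about parsing FASTA with a few lines of code concrete, here is a minimal Python sketch. It assumes that each record starts with a single “>” definition line and that the sequence may be wrapped over several lines; the function name and the example sequences are invented.

```python
def parse_fasta(lines):
    """Yield (definition_line, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = ">seq1 first example\nGATT\nACA\n>seq2\nACGT\n"
records = list(parse_fasta(example.splitlines()))
print(records)
```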

Therefore, we believe that the solution is not to urge scientists to conform to a unique “optimal” format but rather to identify a few operational formats and make database and tool developers aware of the importance of sticking to them.

For integration purposes, the scientific community of database and tool developers has begun to adopt some good practices in data file formatting. One example is the FGED Society ( http://fged.org/ ), formed at a meeting on Microarray Gene Expression Databases (EBI, Hinxton, 1999) with the goal, among others, of facilitating the adoption of standards for DNA microarrays and gene expression data representation. We believe, however, that further efforts should be made to achieve a more robust and systematic policy in all the areas where data sharing is essential for using these data to make new discoveries and advance science.

The community of scientists concerned with data sharing and integration, including us, should make the effort of 1) compiling a complete and structured (i.e. organised by data type and purpose) list of the currently available formats with their descriptions and 2) developing guidelines and recommendations for the adoption of standards in file formatting, also discussing which data types fit into each text format and the related performance implications. This list and the guidelines, which might be integrated into a resource such as BioSharing, should encourage database and tool developers to present information in a way that a computer program can parse, suggest that they avoid inventing new computer-readable formats and instead comply with one of the existing ones, and recommend that they only accept new data, for storage purposes, that meet certain formatting criteria. Such guidelines should be ambitious and forward-looking enough to also advise scientists in both academia and industry to keep data representation in mind when developing high-throughput technologies and their information services.

The development of converters that translate formats into a unified form should be promoted as well. This would make it possible to combine data across all formats. A rather isolated example of data format translation is the PRIDE Converter [ 77 ], which makes it easy to translate a large variety of input formats into the single XML [ 76 , 78 ] format used for proteomic data submission to the PRIDE repository [ 79 ]. The PRIDE Converter was designed to be suitable for both small and large data submissions and has a GUI that is intuitive even for wet-lab scientists without a strong bioinformatics background or informatics support. Format translation faces problems especially with poorly structured data, which cannot be translated properly into a computer-readable format and therefore require human inspection to verify the correctness of the transformation. In the case of NGS data, we rely on tools for conversion between next generation sequencing data formats, such as NGS-FC ( http://sourceforge.net/projects/ngsformaterconv/ ), to ensure that each tool in a workflow can work with the right format.
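
A format converter can be very small when both formats are well structured. The following Python sketch converts four-line FASTQ records into FASTA; it is a toy illustration and is unrelated to the PRIDE Converter or NGS-FC, and the function name is invented. It also shows that conversion can be lossy: the quality values have no place in FASTA and are dropped.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, discarding quality values."""
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"record at line {i + 1} does not start with '@'")
        out.append(">" + header[1:])  # '@' declaration becomes '>' definition
        out.append(seq)
    return "".join(line + "\n" for line in out)


fasta = fastq_to_fasta("@read1\nGATTACA\n+\nIIIIHHH\n@read2\nACGT\n+\n!!!!\n")
print(fasta, end="")
```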

Identifiers

An identifier is a unique representation of a given data entry [ 80 , 81 ]. For example, the Universal Protein Resource (UniProt) uses a “unique identifier” to refer to a protein entity; this identifier cannot be used for anything else, which ensures that there is no redundancy and that one agreed, unique term unequivocally identifies a given protein [ 82 ].

In biological research a variety of data repositories exist, and each of them uses its own implementation for generating unique identifiers. As an example, for the same protein, UniProt uses the identifier Q9Y6N8, whereas Ensembl [ 83 ] refers to it as ENSP00000264463 and RefSeq [ 84 ] as NP_006718.2. If all researchers could use a single unique identifier to refer to a given protein across their publications and work, data integration would be a step ahead of its current state.

An effort to help with the discoverability of identifiers, and to assist researchers with knowledge on how to query data across databases, has been made by identifiers.org [ 85 ]. This is a registry that facilitates the discovery of resources in the life sciences and makes it possible to decouple the identification of records from the physical locations on the web where they can be retrieved.
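
The decoupling of identification from location can be sketched as a small lookup: a record is named by a stable “prefix:accession” string, and a registry decides, at resolution time, which physical location to use. The registry content and the URL templates below are hypothetical and do not reproduce the actual identifiers.org registry or its interface.

```python
# Hypothetical registry: one namespace prefix, several physical locations.
REGISTRY = {
    "uniprot": [
        "https://primary.example.org/protein/{accession}",
        "https://mirror.example.org/entry/{accession}",
    ],
}


def resolve(compact_id, mirror=0):
    """Turn a 'prefix:accession' identifier into a URL for one location."""
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not accession:
        raise ValueError(f"expected 'prefix:accession', got {compact_id!r}")
    if prefix not in REGISTRY:
        raise KeyError(f"unknown prefix {prefix!r}")
    return REGISTRY[prefix][mirror].format(accession=accession)


primary_url = resolve("uniprot:Q9Y6N8")
mirror_url = resolve("uniprot:Q9Y6N8", mirror=1)
print(primary_url)
print(mirror_url)
```

A publication that cites only the stable identifier keeps working if a provider moves, since only the registry entry has to change.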

Many biological concepts are described in several databases using different identifiers. To facilitate discoverability and integration, databases cross-reference their data entries with external entries using identifiers. This enables users to find a data entry, such as a protein in UniProt, then find the same biological concept described in other databases (e.g. RefSeq) and gather more relevant data about the same entry. Several initiatives like PICR [ 86 ] or the “DAVID ID conversion tool” [ 87 ] provide mappings between such identifiers. It would be beneficial if such services were integrated into the major bioinformatics databases.
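
At its core, an identifier mapping service is a cross-reference table that can be entered from any database. The Python sketch below uses the three identifiers quoted earlier in this section for the same protein; the table layout and function names are assumptions made for the illustration and do not describe how PICR or DAVID work internally.

```python
# One row per biological entity; the identifiers are the ones quoted in the text.
CROSS_REFERENCES = [
    {"uniprot": "Q9Y6N8", "ensembl": "ENSP00000264463", "refseq": "NP_006718.2"},
]


def build_index(rows):
    """Index every (database, identifier) pair so lookups work from any side."""
    index = {}
    for row in rows:
        for database, identifier in row.items():
            index[(database, identifier)] = row
    return index


def convert(index, identifier, source, target):
    """Return the target-database identifier, or None if no mapping is known."""
    row = index.get((source, identifier))
    return None if row is None else row.get(target)


INDEX = build_index(CROSS_REFERENCES)
print(convert(INDEX, "Q9Y6N8", "uniprot", "refseq"))
print(convert(INDEX, "ENSP00000264463", "ensembl", "uniprot"))
```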

Some efforts involving distributed resources, like IMEx [ 88 ], are very well organised: although the independent databases that are part of the consortium, such as IntAct [ 81 ], MINT [ 89 ] and DIP [ 90 ], use their own identifiers, all their entries are assigned a unique IMEx identifier issued by a central authority. The IMEx identifier is assigned to a single biological entity with the purpose of being reused across databases/systems and always linking to the same entity regardless of the system. The IMEx Central repository coordinates the curation effort, assigns identifiers and facilitates the exchange of completed records on molecular interaction data between the IMEx Consortium partners.

Approaches like these can increase the discoverability and shareability of data and even enable publications and scientific studies to use a single identifier to refer to a given entity, which their audience could then easily trace and study further. With an infrastructure like this in place, it would be possible to require researchers to provide, in their research papers, the unique identifier of the biological entity they are studying. This is already happening for nucleotide sequence data, where researchers have to submit newly sequenced entities to one of the three major sequence databases [ 91 ] and refer to them in the paper. Most other data types can be used in publications without such a requirement. The same applies to entire datasets.

Reporting guidelines

Huge steps forward have been made through the creation and adoption of clear recommended guidelines for depositing and disseminating data and datasets [ 92 – 95 ]. Such guidelines are often the result of extensive discussions (years of discussions on some occasions) in a field where data sharing efforts have been maturing. The specifications of several standards in life science include documentation and examples of how to use them, but many initiatives additionally include guidelines that establish what minimum or recommended information should be provided when describing data. Minimum information guidelines have been very popular as a way to ensure that data can be easily interpreted and that results derived from their analysis can be independently verified. These guidelines tend to concentrate on defining the content and structure of the necessary information rather than the technical format for capturing it. A key landmark in the development of minimum information guidelines in this area is the “Minimum Information about a Biomedical or Biological Investigation” (MIBBI) [ 93 ].

It is crucial to have a place where such efforts are listed and shared in order to avoid redundancy. As an example of reporting guidelines, we mention here the efforts made in the area of protein-protein interactions, where we currently see two reporting guidelines: MIMIx [ 96 ] and IMEx [ 88 ]. A key project contributing in this area, where one can both look for and add “reporting guidelines”, is the Registry of guidelines at biosharing.org [ 58 , 97 ].

As we have seen, there are different formats when it comes to data files, and these will always evolve according to the needs of the communities as well as the nature of the data and associated technologies. For example, a format may contain 20 fields, of which one researcher has information for only a subset while another opts to prioritise a different set. Having a minimum agreed set of fields that everyone reports using standards is therefore crucial for data integration and reusability. Other fields might be crucial and informative only to a specific set of users; these can be adopted at the “recommended” level. For example, one protein-protein interaction database may want to capture domain-specific information about interactions, whereas another is not interested in this aspect. There may also be optional fields for those who want to annotate and further enrich the data record with metadata. Doing this in a standard manner again allows future reuse and expansion, so that others can adopt, exchange and integrate data based on this level of information.
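
The three levels described above (minimum, recommended, optional) can be expressed as a simple check on a data record. The field names in this Python sketch are hypothetical and are not taken from MIMIx, IMEx or any other published guideline.

```python
# Hypothetical field sets for an interaction record.
REQUIRED = {"interactor_a", "interactor_b", "detection_method"}
RECOMMENDED = {"binding_domain"}


def check_record(record):
    """Report which required/recommended fields are missing from a record."""
    present = {key for key, value in record.items() if value not in (None, "")}
    return {
        "missing_required": sorted(REQUIRED - present),
        "missing_recommended": sorted(RECOMMENDED - present),
        "optional": sorted(present - REQUIRED - RECOMMENDED),
    }


record = {
    "interactor_a": "protein-1",
    "interactor_b": "protein-2",
    "detection_method": "",          # empty values count as missing
    "curator_note": "toy example",   # neither required nor recommended
}
report = check_record(record)
print(report)
```

A repository could reject records with missing required fields while only warning about missing recommended ones, which keeps the minimum set small enough for everyone to comply with.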

Consortiums and standards initiatives

There are several initiatives coordinating the development of community standards to facilitate data comparison, exchange and verification in bioinformatics. Some of these are community initiatives or consortia, like COMBINE [ 98 ], PSI [ 99 ], GAGH [ 100 ], INSDC [ 101 ], proteomeXchange [ 102 ], IMEx [ 88 ] and BioPax [ 103 ], involved in the development of standards in one specific biological domain. Other community initiatives, like RDA, are more generic, with potential applications in different scientific domains.

Some strategic efforts supported by major service providers and national governments, like ELIXIR [ 104 ], BBMRI [ 105 ] and BD2K [ 106 ], are also involved in the development of standards in the life sciences. Projects supported by specific grants, like BioMedBridges [ 107 ] and BioSHaRE [ 108 ], also contribute to this cause, but their duration is normally bound to the duration of the grant. All these initiatives play a major role in achieving consensus and agreements, which facilitate the development and adoption of standards.

In biological research, molecular biology has been the leading field in terms of such efforts and the associated bioinformatics applications. One can only imagine the work yet to be done in fields such as ecology, biodiversity and marine biology, which can learn from the existing efforts and initiatives described here. Examples of large-scale efforts that need to talk to each other, and ideally apply best practice when creating an infrastructure that fosters data integration, are LifeWatch [ 109 ] and ISBE [ 110 ].

Visualisation

There is a variety of visualisation tools, but often each tool requires a different file format and the task of feeding back the discovered data is not trivial [ 111 , 112 ]. The field of visualisation has its own challenges given the increasing quantity of data, the integration of heterogeneous data and the need for tools that can represent multiple aspects of the data (e.g. multiple connections between nodes with diverse biological meanings [ 113 , 114 ]). There is an ever-proliferating myriad of visualisation and analysis tools, each providing specific features that address different aspects (e.g. genome browsers [ 115 – 119 ]). In 2008 Pavlopoulos et al. published a wish list for the visualisation of biological data, which still remains valid [ 120 ].

Data integration principles are fundamental to providing tools that are user friendly and that allow the end users (biologists) to focus their efforts on the actual study of the data, instead of being lost in the process of looking for the data they need by querying multiple databases that appear to return inconsistent results. The field of systems biology in itself has brought substantial advances in visualisation, since the ability to analyse and interpret interactions, networks and pathways often relies on the ability to visualise them accurately [ 120 ].

Overcoming some of the challenges associated with visualisation relies on better adoption of standards and on improvements in annotation and metadata. This is clearly a two-directional effort. From the bottom up, data and datasets are annotated and stored following a common set of standards, which extends to the data formats. From the top down, standards are set and compatible formats and output files are adopted, allowing results to be compared and integrated [ 121 – 123 ].
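The bottom-up part of this effort can be sketched as a simple check that a dataset's metadata carries every field a shared standard requires. The field names in REQUIRED_FIELDS and the missing_metadata helper are assumptions made for illustration; they do not correspond to any specific published standard.

```python
# Hypothetical minimal metadata standard; the field names are
# assumptions for illustration, not a real community standard.
REQUIRED_FIELDS = {"organism", "data_type", "file_format", "ontology_terms"}


def missing_metadata(record):
    """Return the required fields that are absent or empty, sorted by name."""
    return sorted(
        field for field in REQUIRED_FIELDS
        if not record.get(field)
    )


# An example record with one empty field and one missing field.
dataset = {
    "organism": "Homo sapiens",
    "data_type": "expression",
    "file_format": "",
}

print(missing_metadata(dataset))
```

Running such a check at deposition time is one way a repository could flag incomplete annotation before the data reach downstream visualisation or integration tools.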

Historically, many domains within biology have relied on visualisation as a way to represent biological information, thus creating what are now considered standards in their domains. Plenty of examples can be found in the areas of phylogenetics [ 124 ] and pathways [ 125 , 126 ]. The advent of next-generation sequencing made genomics a domain where significant effort has been put into developing new visualisation techniques to represent sequences, alignments, expression patterns and ultimately entire genomes [ 127 – 130 ]. However, biological researchers might not be aware of the range of visualisation techniques available, or of which visual representation is the most appropriate [ 131 , 132 ].

An increased dialogue between the computational scientists who create and develop such tools and the end users (the biologists) would be beneficial for the entire community, and we hope this paper is one step towards such an outcome. Efforts in this direction are also under way. One example is the BiVi initiative ( http://bivi.co/ ), which is addressing several challenges in the realm of visualisation as well as trying to reduce the gap between biologists, computational scientists and developers of bioinformatics tools. BiVi has grouped many of the most notable visualisation tools produced by biologists and developers across seven domains (though some of the tools cover more than one of these) and provides information on their provenance and current status, together with links to their websites ( http://bivi.co/visualisations ). Other community efforts in this area are VizBI ( http://vizbi.org/ ), SciVis ( http://scivis.itn.liu.se/ ) and CoVis ( http://www.iwr.uni-heidelberg.de/groups/CoVis/ ).

It would be impossible for us to list the plethora of visualisation tools developed and used in biological research. Instead, Table 4 gives an overview of some of the most common visualisation tools in the area of “Interaction Network Visualisation”, to illustrate the variety and types of resources available for one area.

Common visualisation tools in the area of “Interaction Network Visualisation”

Name of resource | What it does
BicOverlapper | Visualisation of biclusters combined with profile plots and heat maps
BiGGEsTS | Heat map-based bicluster visualisation
Brain Explorer | Visualisation of 3D transcription data in the central nervous system
Data Matrix Viewer | Simple profile plot visualisation; supports Gaggle
EXPANDER | Heat maps, scatter plots and profile plots of cluster averages
GENESIS | Analysis suite; offers several interactive visualisations
geWorkbench | Modular suite; heat maps, dendrograms, profile and scatter plots
Hierarchical Clustering Explorer | Linked heat map, profile and scatter plots; systematic exploration
Java TreeView | Linked heat maps, karyoscopes, sequence alignments, scatter plots
Mayday | Modular suite; many linked visualisations; enhanced heat map [ 113 ]
MultiExperiment Viewer | Analysis suite; heat maps, dendrograms, profile and scatter plots
PointCloudXplore | Visualisation of 3D transcription data in Drosophila embryos
TimeSearcher | Exploration and analysis of time series; advanced profile plots
R/BioConductor Geneplotter | Karyoscope-style plots and other visualisations
GenePattern | Modular analysis platform; several visualisation modules available
Cytoscape | Open source software platform for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles and other state data

There are also well-known and widely adopted analysis suites that provide visualisation tools as part of their repertoire of resources, such as Galaxy [ 133 ], Cytoscape [ 134 , 135 ], Ondex [ 136 ], iPlant Collaborative [ 137 ] and Bioconductor [ 138 ]. Other important efforts derive from initiatives that are working towards unlocking the actual visualisations, in other words going from the visualisation back to the data and datasets. This is important not only for reproducibility but also to allow access to the data and their integration with other data and datasets. A very interesting resource is Utopia Docs [ 139 , 140 ], a free PDF reader that connects the static content of scientific articles to the dynamic world of online content. This resource allows the user to interact directly with curated database entries, manipulate molecular structures, edit sequence and alignment data, and even plot and export tabular data. Another, quite different, but relevant initiative in the world of visualisation is BIOJS, which aims to provide an open-source library of JavaScript components to visualise biological data. The BIOJS vision is that every online biological dataset in the world should be visualised with BIOJS tools ( http://biojs.net/ ) [ 141 , 142 ].

Data heterogeneity is one of the biggest challenges in biological data integration. It could be addressed by standardising the data structures in use. Biologists should become more involved with the aspects described here and work with bioinformaticians and computational scientists to achieve uniformity of their data. With this issue resolved, the integration of biological data will greatly boost biological research and the field will gain a more robust structure: computational scientists will be responsible for maintaining and improving the data infrastructure; bioinformaticians will be able to build upon this infrastructure; and biologists will be able to do research with advanced tools without the overhead of becoming acquainted with the complexities of database management and programming tools.
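A minimal sketch of what standardising data structures can mean in practice is shown below: records from two sources that name the same information differently are mapped onto one shared schema. The source names, field names and the harmonise helper are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical per-source mappings from local field names to a
# shared schema. All names here are illustrative assumptions.
FIELD_MAPS = {
    "source_a": {"gene": "gene_symbol", "tax": "organism"},
    "source_b": {"symbol": "gene_symbol", "species": "organism"},
}


def harmonise(record, source):
    """Rename a record's fields to the shared schema.

    Fields with no mapping for the given source are dropped.
    """
    mapping = FIELD_MAPS[source]
    return {
        mapping[key]: value
        for key, value in record.items()
        if key in mapping
    }


records = [
    ({"gene": "TP53", "tax": "Homo sapiens"}, "source_a"),
    ({"symbol": "TP53", "species": "Homo sapiens", "note": "free text"}, "source_b"),
]

unified = [harmonise(record, source) for record, source in records]
for entry in unified:
    print(entry)
```

Once both records share the same keys they can be compared or merged directly, which is the kind of uniformity the paragraph above argues for. A real pipeline would also need to reconcile identifiers and controlled vocabularies, which this sketch does not attempt.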

Acknowledgements

We would like to thank The Genome Analysis Centre (TGAC, Norwich, UK) and the Biotechnology and Biological Sciences Research Council (BBSRC, UK). AV acknowledges the King Abdullah University of Science and Technology (KAUST) Award No. KUK-I1-012-43 for funding support.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

VL: worked on most of the writing, the literature review and all illustrations, and contributed to the design of this paper. MS: edited the paper and provided suggestions. RCJ: contributed to the specific aspects related to existing data integration methodologies and key references. AV: contributed by writing some specific sections, bringing the perspective of the biology readership and editing the manuscript. MVS: worked on the design of the manuscript and some of the writing. All authors read and approved the final manuscript.

Contributor Information

Vasileios Lapatas, Email: moc.liamg@103raip .

Michalis Stefanidakis, Email: rg.oinoi@lartsim .

Rafael C. Jimenez, Email: [email protected] .

Allegra Via, Email: [email protected] .

Maria Victoria Schneider, Email: [email protected] .

  • Review Article
  • Published: 13 September 2021

A guide to machine learning for biologists

  • Joe G. Greener (ORCID: orcid.org/0000-0002-5154-1929),
  • Shaun M. Kandathil (ORCID: orcid.org/0000-0002-2671-2140),
  • Lewis Moffat &
  • David T. Jones (ORCID: orcid.org/0000-0001-8626-3765)

Nature Reviews Molecular Cell Biology volume 23, pages 40–55 (2022)


  • Bioinformatics
  • Computational biology and bioinformatics

The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.



List, M., Ebert, P. & Albrecht, F. Ten simple rules for developing usable software in computational biology. PLoS Comput. Biol. 13 , e1005265 (2017).

Sonnenburg, S. Ã., Braun, M. L., Ong, C. S. & Bengio, S. The need for open source software in machine learning. J. Mach. Learn. Res. 8 , 2443–2466 (2007).

Download references

Acknowledgements

The authors thank members of the UCL Bioinformatics Group for valuable discussions and comments. This work was supported by the European Research Council Advanced Grant ProCovar (project ID 695558).

Author information

These authors contributed equally: Joe G. Greener, Shaun M. Kandathil.

Authors and Affiliations

Department of Computer Science, University College London, London, UK

Joe G. Greener, Shaun M. Kandathil, Lewis Moffat & David T. Jones


Contributions

All authors researched data for the article, contributed substantially to discussion of the content, wrote the article and reviewed the manuscript before submission.

Corresponding author

Correspondence to David T. Jones.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Peer review information.

Nature Reviews Molecular Cell Biology thanks S. Draghici who co-reviewed with T. Nguyen; B. Chain; S. Haider; F. Mahmood; and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Caret: https://topepo.github.io/caret

Colaboratory: https://research.google.com/colaboratory

Graph Nets: https://github.com/deepmind/graph_nets

MLJ: https://alan-turing-institute.github.io/MLJ.jl/stable

PyTorch: https://pytorch.org

PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest

scikit-learn: https://scikit-learn.org/stable

Tensorflow: https://www.tensorflow.org

Glossary

Deep learning: Machine learning methods based on neural networks. The adjective ‘deep’ refers to the use of many hidden layers in the network: two hidden layers as a minimum, but usually many more than that. Deep learning is a subset of machine learning, and hence of artificial intelligence more broadly.

Artificial neural network: A collection of connected nodes loosely representing neuron connectivity in a biological brain. Each node is part of a layer and represents a number calculated from the previous layer. The connections, or edges, allow a signal to flow from the input layer to the output layer via hidden layers.
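The layer-by-layer flow described above can be sketched in a few lines. This is an illustration only: the function name `layer_forward`, the tanh activation and the example weights are assumptions made for the sketch and do not come from the article.

```python
import math

def layer_forward(inputs, weights, biases):
    """Compute one layer: each node is a weighted sum of the previous
    layer plus a bias, passed through tanh (an assumed activation)."""
    return [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
input_layer = [1.0, -1.0]
hidden_layer = layer_forward(input_layer, [[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])
output_layer = layer_forward(hidden_layer, [[1.0, 1.0]], [0.0])
print(hidden_layer, output_layer)
```

In a trained network the weights and biases would be learned from data rather than written by hand.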

Ground truth: The true value that the output of a machine learning model is compared with to train the model and test performance. These data usually come from experimental data (for example, accessibility of a region of DNA to transcription factors) or expert human annotation (for example, labelling a medical image as healthy or pathological).

Encoding: Any scheme for numerically representing (often categorical) data in a form suitable for use in a machine learning model. An encoding can be a fixed numerical representation (for example, one-hot or continuous encoding) or can be defined using parameters that are trained along with the rest of a model.

One-hot encoding: An encoding scheme that represents a fixed set of n categorical inputs using n unique n-dimensional vectors, each with one element set to 1 and the rest set to 0. For example, the set of three letters (A, B, C) could be represented by the three vectors [1,0,0], [0,1,0] and [0,0,1], respectively.
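A minimal sketch of this scheme, reproducing the three-letter example; the function name `one_hot_encode` is hypothetical and chosen only for illustration.

```python
def one_hot_encode(categories):
    """Map each of n categories to an n-dimensional vector with a single 1."""
    n = len(categories)
    return {
        category: [1 if position == index else 0 for position in range(n)]
        for index, category in enumerate(categories)
    }

encoding = one_hot_encode(["A", "B", "C"])
for letter, vector in encoding.items():
    print(letter, vector)
```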

Mean squared error: A loss function that calculates the average squared difference between the predicted values and the ground truth. This function heavily penalizes outliers because it increases rapidly as the difference between a predicted value and the ground truth grows.
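The sensitivity to outliers can be seen in a small sketch. The function name and the example numbers are illustrative assumptions, not values from the article.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and ground truth."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [1.0, 2.0, 3.0]
# Three small errors of 0.5 each contribute 0.25 apiece.
small_errors = mean_squared_error([1.5, 2.5, 3.5], targets)
# A single error of 3.0 contributes 9.0 on its own.
one_outlier = mean_squared_error([1.0, 2.0, 6.0], targets)
print(small_errors, one_outlier)
```

Although the total absolute error is only twice as large in the second case (3.0 versus 1.5), the loss is twelve times larger.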

Binary cross entropy: The most common loss function for training a binary classifier; that is, for tasks aimed at answering a question with only two choices (such as cancer versus non-cancer). It is sometimes called ‘log loss’.
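As a hedged sketch of how the log loss is usually computed for predicted probabilities and 0/1 labels: the function name, the clipping constant `eps` and the example values are assumptions for illustration.

```python
import math

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all data points."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0]  # for example, 1 = cancer, 0 = non-cancer
confident_correct = binary_cross_entropy([0.9, 0.1], labels)
confident_wrong = binary_cross_entropy([0.1, 0.9], labels)
print(confident_correct, confident_wrong)
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded.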

Linear regression: A model that assumes that the output can be calculated from a linear combination of inputs; that is, each input feature is multiplied by a single parameter and these values are added. It is easy to interpret how these models make their predictions.
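The prediction step of such a model can be sketched as follows. The function name, the bias term and the parameter values are hypothetical; fitting the parameters to data is not shown.

```python
def linear_predict(features, weights, bias=0.0):
    """Multiply each input feature by its parameter and add the results."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical parameters: one weight per input feature plus an offset.
prediction = linear_predict([2.0, 3.0], weights=[0.5, -1.0], bias=1.0)
print(prediction)
```

Each weight can be read directly as the change in output per unit change in its feature, which is what makes the model easy to interpret.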

Kernel functions: Transformations applied to each data point to map the original points into a space in which they become separable with respect to their class.
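A toy illustration of this idea, using an explicit transformation rather than any particular library's kernel implementation: the one-dimensional points below cannot be split by a single threshold, but after the assumed mapping x → (x, x²) a threshold on the second coordinate separates the classes. All names and values are hypothetical.

```python
def feature_map(x):
    """Map a 1D point into 2D by adding its square as a second coordinate."""
    return (x, x * x)

points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, 0, 0, 1]  # class 1 lies on both sides of class 0

mapped = [feature_map(x) for x in points]
# In the mapped space, one threshold on the second coordinate separates the classes.
predicted = [1 if squared > 1.0 else 0 for _, squared in mapped]
print(mapped, predicted)
```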

Non-linear regression: A model where the output is calculated from a non-linear combination of inputs; that is, the input features can be combined during prediction using operations such as multiplication. These models can describe more complex phenomena than linear regression.

k-nearest neighbours: A classification approach where a data point is classified on the basis of the known (ground truth) classes of the k most similar points in the training set using a majority voting rule. k is a parameter that can be tuned. The approach can also be used for regression by averaging the property value over the k nearest neighbours.
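Both the voting and the averaging variants can be sketched briefly. Euclidean distance is assumed as the similarity measure, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def k_nearest(query, points, targets, k):
    """Return the targets of the k training points closest to the query."""
    ranked = sorted(zip(points, targets), key=lambda pair: math.dist(query, pair[0]))
    return [target for _, target in ranked[:k]]

def knn_classify(query, points, labels, k=3):
    """Majority vote over the classes of the k nearest neighbours."""
    return Counter(k_nearest(query, points, labels, k)).most_common(1)[0][0]

def knn_regress(query, points, values, k=3):
    """Average the property value over the k nearest neighbours."""
    return sum(k_nearest(query, points, values, k)) / k

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["healthy", "healthy", "healthy", "pathological", "pathological", "pathological"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_classify((0.5, 0.5), points, labels, k=3)
predicted_value = knn_regress((5.5, 5.5), points, values, k=3)
print(predicted_class, predicted_value)
```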

Regularization: Restricting the values of parameters to prevent the model from overfitting to the training data. For example, penalizing high parameter values in regression models reduces the flexibility of the model and can stop it fitting to noise in the training data.
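One common form of such a penalty is a squared (L2) penalty on the parameters. The sketch below assumes a one-feature regression without an intercept, for which the penalized least-squares slope has a simple closed form; the function name and data are illustrative.

```python
def fit_slope(xs, ys, penalty_strength=0.0):
    """Slope w minimizing sum((y - w*x)**2) + penalty_strength * w**2."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + penalty_strength
    return numerator / denominator

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalized = fit_slope(xs, ys)                       # 28 / 14
penalized = fit_slope(xs, ys, penalty_strength=14.0)  # 28 / (14 + 14)
print(unpenalized, penalized)
```

Increasing the penalty strength pulls the fitted slope toward zero, trading some fit to the training data for a less flexible model.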

Cloud computing: On-demand computing services, including processing power and data storage, typically available via the Internet. A pay-as-you-go model is usually used. Use of cloud computing minimizes up-front IT infrastructure costs.

Hidden Markov model: A statistical model that can be used to describe the evolution of observable events that depend on factors that are not directly observable. It has various uses in biology, including representing protein sequence families.

Saliency map: In the context of machine learning, an image generated to show which pixels in an input image contribute to the prediction made by a model. It is useful in interpreting models.

Autodifferentiation: A set of techniques to automatically calculate the gradient of a function in a computer program. It is used to train neural networks, where it is called ‘backpropagation’.
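To make the idea of computing a derivative automatically concrete, here is a minimal sketch using dual numbers. Note that this is forward-mode differentiation, chosen for brevity; backpropagation in neural networks is the reverse-mode counterpart. The class `Dual` and the helper `derivative` are hypothetical names and support only addition and multiplication.

```python
class Dual:
    """A value paired with its derivative with respect to the input."""

    def __init__(self, value, derivative=0.0):
        self.value = value
        self.derivative = derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.derivative + other.derivative)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u*v' + u'*v
        return Dual(
            self.value * other.value,
            self.value * other.derivative + self.derivative * other.value,
        )

    __rmul__ = __mul__

def derivative(function, x):
    """Evaluate d(function)/dx at x by seeding the input derivative with 1."""
    return function(Dual(x, 1.0)).derivative

def f(x):
    return 3 * x * x + 2 * x + 1  # analytic derivative: 6x + 2

slope = derivative(f, 2.0)
print(slope)
```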

Gradients: The rate of change of one property as another property changes. In neural networks, the set of gradients of the loss function with respect to the neural network parameters, computed via a process known as backpropagation, is used to adjust the parameters and thus train the model.
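How a gradient is used to adjust a parameter can be sketched with plain gradient descent on a single parameter. The quadratic loss, the learning rate and the step count are assumptions chosen for illustration; here the gradient is written by hand rather than obtained by backpropagation.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def loss_gradient(w):
    """Rate of change of the loss with respect to the parameter w."""
    return 2.0 * (w - 3.0)

w = 0.0              # initial parameter value
learning_rate = 0.1  # step size (assumed)
for _ in range(100):
    # Move the parameter a small step against the gradient.
    w -= learning_rate * loss_gradient(w)

print(w, loss(w))
```

Each step shrinks the distance to the minimum by a constant factor, so the parameter converges toward 3.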


About this article

Cite this article.

Greener, J.G., Kandathil, S.M., Moffat, L. et al. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 23, 40–55 (2022). https://doi.org/10.1038/s41580-021-00407-0


Accepted: 23 July 2021

Published: 13 September 2021

Issue Date: January 2022

DOI: https://doi.org/10.1038/s41580-021-00407-0
