
Foundation models

Modelling single cells to perform multiple tasks.

4 datasets · 8 methods · 2 control methods · 2 metrics


Recent developments in deep learning have led to the creation of several ‘foundation models’ for single-cell data. These are large models that have been trained on data from millions of cells and aim to fully capture the variability in the single-cell landscape. Typically, they use a transformer architecture (Szałata et al. 2024) and undergo self-supervised pre-training by masking parts of the input data. Trained foundation models can then be applied to a variety of downstream tasks, either by directly feeding new data into the model or by fine-tuning it to better fit a new dataset or to produce a specific output. The general nature of single-cell foundation models and the large amount of data they have been trained on make them potentially powerful tools for single-cell analysis, but their performance is yet to be fully established.

Open Problems builds on existing evaluations (Boiarsky et al. 2023; Liu et al. 2024) of foundation models by incorporating them into our continuous benchmarking framework.

This overview combines results from the following benchmarks for individual tasks:

  • Label projection
  • Batch Integration

This benchmark is a work in progress. If you are interested in evaluating foundation models for single-cell data, please fill in the form below to get in touch.

Foundation models contact form

Interpretation

The foundation models task combines results for multiple analysis tasks and therefore should be interpreted differently. We treat each task as a metric that represents the overall performance of a foundation model on that type of analysis. For more information about how foundation models perform at specific aspects of each task, refer to that task's results page. As well as comparing foundation models to each other, we provide additional context by including representative standard methods for each task, allowing us to see how foundation models compare to more established methods.

The high hardware and computational requirements of some foundation models present additional challenges for benchmarking and make it difficult to obtain results in some cases.

Summary

function aggregate_scores(obj) {
  // Mean of scaled scores, treating missing/NaN scores as 0
  // and clamping everything else to [0, 1]
  return d3.mean(obj.map(val => {
    if (val.score === undefined || isNaN(val.score)) return 0;
    return Math.min(1, Math.max(0, val.score));
  }));
}
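As a standalone illustration (not part of the notebook, with a plain arithmetic mean standing in for `d3.mean` and an illustrative camelCase name), the aggregation counts a missing or NaN score as 0 and clamps everything else to [0, 1] before averaging:

```javascript
// Standalone sketch of the score aggregation above.
function aggregateScores(rows) {
  const clamped = rows.map(r => {
    if (r.score === undefined || isNaN(r.score)) return 0; // failed run → 0
    return Math.min(1, Math.max(0, r.score));              // clamp to [0, 1]
  });
  return clamped.reduce((a, b) => a + b, 0) / clamped.length;
}

// A score above 1 is clamped to 1 and a NaN counts as 0:
const meanScore = aggregateScores([{score: 0.5}, {score: 1.2}, {score: NaN}]);
// meanScore → 0.5
```

This means a method that fails on a dataset is penalised (its score becomes 0) rather than being excluded from the mean.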

function transpose_list_of_objects(list) {
  // Convert a row-wise array of objects into a column-wise object of arrays
  return Object.fromEntries(Object.keys(list[0]).map(key => [key, list.map(d => d[key])]))
}
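funkyheatmapjs consumes column-wise data, which is what this helper produces. A standalone toy example (illustrative camelCase name, not a notebook cell):

```javascript
// Row-wise array of objects → column-wise object of arrays.
function transposeListOfObjects(list) {
  return Object.fromEntries(
    Object.keys(list[0]).map(key => [key, list.map(d => d[key])])
  );
}

const columns = transposeListOfObjects([
  {method_id: "scgpt", score: 0.8},
  {method_id: "uce", score: 0.6}
]);
// columns → {method_id: ["scgpt", "uce"], score: [0.8, 0.6]}
```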

function label_time(time) {
  if (time < 1e-5) return "0s";
  if (time < 1) return "<1s";
  if (time < 60) return `${Math.floor(time)}s`;
  if (time < 3600) return `${Math.floor(time / 60)}m`;
  if (time < 3600 * 24) return `${Math.floor(time / 3600)}h`;
  if (time < 3600 * 24 * 7) return `${Math.floor(time / 3600 / 24)}d`;
  return ">7d"; // NaN (missing values) fails every comparison above and lands here
}
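The duration cutoffs can be checked directly; the function is copied here (camelCase, standalone) so the example runs on its own:

```javascript
// Copy of label_time above: format a duration in seconds as a compact label.
function labelTime(time) {
  if (time < 1e-5) return "0s";
  if (time < 1) return "<1s";
  if (time < 60) return `${Math.floor(time)}s`;
  if (time < 3600) return `${Math.floor(time / 60)}m`;
  if (time < 3600 * 24) return `${Math.floor(time / 3600)}h`;
  if (time < 3600 * 24 * 7) return `${Math.floor(time / 3600 / 24)}d`;
  return ">7d"; // NaN (missing) also lands here
}

// labelTime(0.5) → "<1s", labelTime(90) → "1m", labelTime(7200) → "2h"
```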

function label_memory(x_mb, include_mb = true) {
  if (!include_mb && x_mb < 1e3) return "<1G";
  if (x_mb < 1) return "<1M";
  if (x_mb < 1e3) return `${Math.round(x_mb)}M`;
  if (x_mb < 1e6) return `${Math.round(x_mb / 1e3)}G`;
  if (x_mb < 1e9) return `${Math.round(x_mb / 1e6)}T`;
  return ">1P";
}
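Likewise for memory sizes, which are reported in megabytes; copied here (camelCase, standalone) so the boundary cases can be checked:

```javascript
// Copy of label_memory above: format a size in MB as a compact label.
function labelMemory(xMb, includeMb = true) {
  if (!includeMb && xMb < 1e3) return "<1G";
  if (xMb < 1) return "<1M";
  if (xMb < 1e3) return `${Math.round(xMb)}M`;
  if (xMb < 1e6) return `${Math.round(xMb / 1e3)}G`;
  if (xMb < 1e9) return `${Math.round(xMb / 1e6)}T`;
  return ">1P";
}

// labelMemory(512) → "512M", labelMemory(2048) → "2G",
// labelMemory(512, false) → "<1G"
```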

function mean_na_rm(x) {
  return d3.mean(x.filter(d => !isNaN(d)));
}
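`mean_na_rm` mirrors R's `mean(x, na.rm = TRUE)`: NaN entries are dropped, not counted as 0. A d3-free equivalent (illustrative name):

```javascript
// Mean of the non-NaN values only.
function meanNaRm(x) {
  const valid = x.filter(d => !isNaN(d));
  return valid.reduce((a, b) => a + b, 0) / valid.length;
}

// meanNaRm([1, 2, NaN, 3]) → 2; note the contrast with the score
// aggregation above, which counts a NaN as 0
```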
poss_dataset_ids = dataset_info
  .map(d => d.dataset_id)
  .filter(d => results.map(r => r.dataset_id).includes(d))
poss_method_ids = method_info
  .map(d => d.method_id)
  .filter(d => results.map(r => r.method_id).includes(d))
poss_metric_ids = metric_info
  .map(d => d.metric_id)
  .filter(d => results.map(r => Object.keys(r.scaled_scores)).flat().includes(d))
has_resources = results[0].hasOwnProperty("resources")
has_exit_codes = results[0].hasOwnProperty("exit_codes")

results_long = results.flatMap(d => {
  return Object.entries(d.scaled_scores).map(([metric_id, value]) =>
    ({
      method_id: d.method_id,
      dataset_id: d.dataset_id,
      metric_id: metric_id,
      score: value
    })
  )
}).filter(d => method_ids.includes(d.method_id) && metric_ids.includes(d.metric_id) && dataset_ids.includes(d.dataset_id))
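The reshaping above turns one object per (method, dataset) run, with a nested `scaled_scores` map, into one row per (method, dataset, metric). A standalone sketch with toy data (the notebook cell additionally filters by the ids selected in the checkboxes):

```javascript
// Toy results in the notebook's nested shape.
const toyResults = [
  {method_id: "scgpt", dataset_id: "gtex", scaled_scores: {acc: 0.8, f1: 0.7}},
  {method_id: "uce", dataset_id: "gtex", scaled_scores: {acc: 0.6, f1: 0.5}}
];

// Flatten: one long-format row per metric within each result object.
const long = toyResults.flatMap(d =>
  Object.entries(d.scaled_scores).map(([metric_id, score]) => ({
    method_id: d.method_id,
    dataset_id: d.dataset_id,
    metric_id,
    score
  }))
);
// long.length → 4; long[0] → {method_id: "scgpt", dataset_id: "gtex",
//                             metric_id: "acc", score: 0.8}
```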

overall = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => ({method_id, mean_score: aggregate_scores(values)}))

per_dataset = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => {
    const datasets = d3.groups(values, d => d.dataset_id)
      .map(([dataset_id, values]) => ({["dataset_" + dataset_id]: aggregate_scores(values)}))
      .reduce((a, b) => ({...a, ...b}), {})
    return {method_id, ...datasets}
  })

per_metric = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => {
    const metrics = d3.groups(values, d => d.metric_id)
      .map(([metric_id, values]) => ({["metric_" + metric_id]: aggregate_scores(values)}))
      .reduce((a, b) => ({...a, ...b}), {})
    return {method_id, ...metrics}
  })
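The `overall`, `per_dataset` and `per_metric` cells all follow the same pattern: group the long rows by method with `d3.groups`, then aggregate each subgroup. A d3-free sketch of that pattern (`groupBy` is an illustrative stand-in for `d3.groups`, not a notebook function):

```javascript
// Minimal stand-in for d3.groups: returns [key, rows] pairs in insertion order.
function groupBy(rows, key) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row[key])) groups.set(row[key], []);
    groups.get(row[key]).push(row);
  }
  return [...groups.entries()];
}

const longRows = [
  {method_id: "scgpt", score: 0.2},
  {method_id: "scgpt", score: 0.4},
  {method_id: "uce", score: 0.9}
];

// One summary row per method, with the mean score of its subgroup.
const perMethod = groupBy(longRows, "method_id").map(([method_id, values]) => ({
  method_id,
  mean_score: values.reduce((a, d) => a + d.score, 0) / values.length
}));
```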

results_resources = {
  let results_resources = null

  if (has_resources) {
    results_resources = results.flatMap(d => {
      return ({
        method_id: d.method_id,
        dataset_id: d.dataset_id,
        ...d.resources
      })
    }).filter(d => method_ids.includes(d.method_id) && dataset_ids.includes(d.dataset_id))
  }

  return results_resources
}

resources = {
  let resources = null

  if (has_resources) {
    resources = d3.groups(results_resources, d => d.method_id)
      .map(([method_id, values]) => {
        const mean_peak_memory_mb = mean_na_rm(values.map(d => d.peak_memory_mb))
        const mean_disk_read_mb = mean_na_rm(values.map(d => d.disk_read_mb))
        const mean_disk_write_mb = mean_na_rm(values.map(d => d.disk_write_mb))
        const mean_duration_sec = mean_na_rm(values.map(d => d.duration_sec))

        return ({
          method_id,
          mean_cpu_pct: mean_na_rm(values.map(d => d.cpu_pct)),
          mean_peak_memory_mb,
          mean_peak_memory_log: -Math.log10(mean_peak_memory_mb),
          mean_peak_memory_str: " " + label_memory(mean_peak_memory_mb) + " ",
          mean_disk_read_mb: mean_na_rm(values.map(d => d.disk_read_mb)),
          mean_disk_read_log: -Math.log10(mean_disk_read_mb),
          mean_disk_read_str: " " + label_memory(mean_disk_read_mb) + " ",
          mean_disk_write_mb: mean_na_rm(values.map(d => d.disk_write_mb)),
          mean_disk_write_log: -Math.log10(mean_disk_write_mb),
          mean_disk_write_str: " " + label_memory(mean_disk_write_mb) + " ",
          mean_duration_sec,
          mean_duration_log: -Math.log10(mean_duration_sec),
          mean_duration_str: " " + label_time(mean_duration_sec) + " "
        })
      })
  }

  return resources
}

exit_codes = {
  let exit_codes = null

  if (has_exit_codes) {
    exit_codes = results.flatMap(d => {
      return ({
        method_id: d.method_id,
        dataset_id: d.dataset_id,
        exit_codes: Object.values(d.exit_codes)
      })
    }).filter(d => method_ids.includes(d.method_id) && dataset_ids.includes(d.dataset_id))
  } else {
    exit_codes = results_resources.flatMap(d => {
      let exit_code = d.exit_code
      if (exit_code === undefined) {
        // If there is no exit code, assume the method ran successfully
        exit_code = 0
      }

      return ({
        method_id: d.method_id,
        dataset_id: d.dataset_id,
        exit_codes: [exit_code]
      })
    }).filter(d => method_ids.includes(d.method_id) && dataset_ids.includes(d.dataset_id))
  }

  return exit_codes
}

error_reasons = d3.groups(exit_codes, d => d.method_id)
  .map(([method_id, values]) => {
    const all_codes = values.flatMap(d => d.exit_codes)

    if (all_codes.length === 0) {
      return {method_id, error_reason: []}
    }

    const error_pct_oom = d3.mean(all_codes, d => d === 137)
    const error_pct_timeout = d3.mean(all_codes, d => d === 143)
    const error_pct_na = d3.mean(all_codes, d => d === 99)
    const error_pct_error = d3.mean(all_codes, d => d > 0) - error_pct_oom - error_pct_timeout - error_pct_na
    const error_pct_unknown = d3.mean(all_codes, d => d < 0)
    const error_pct_ok = d3.mean(all_codes, d => d === 0)
    return ({
      method_id,
      error_reason: [
        error_pct_oom,
        error_pct_timeout,
        error_pct_error,
        error_pct_unknown,
        error_pct_na,
        error_pct_ok
      ],
    })
  })
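Each fraction above is the mean of an indicator: `d3.mean(all_codes, d => d === 137)` coerces the booleans to 0/1, giving the share of runs with that exit code. A d3-free sketch, using the exit-code meanings from this cell (137 = memory limit, 143 = time limit, 99 = not applicable, 0 = success):

```javascript
// Share of runs matching a predicate, i.e. the mean of a 0/1 indicator.
const exitCodes = [0, 0, 137, 143, 1];
const frac = pred => exitCodes.filter(pred).length / exitCodes.length;

const reason = {
  oom: frac(c => c === 137),     // memory limit exceeded
  timeout: frac(c => c === 143), // time limit exceeded
  ok: frac(c => c === 0)         // no error
};
// reason → {oom: 0.2, timeout: 0.2, ok: 0.4}
```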

summary_all = method_info
  .filter(d => show_con || !d.is_baseline)
  .filter(d => method_ids.includes(d.method_id))
  .map(method => {
    const method_id = method.method_id
    const method_name = method.method_name
    const mean_score = overall.find(d => d.method_id === method_id).mean_score
    const datasets = per_dataset.find(d => d.method_id === method_id)
    const metrics = per_metric.find(d => d.method_id === method_id)
    const error_reasons_ = error_reasons.find(d => d.method_id === method_id)

    let summary = {
      method_id,
      method_name,
      mean_score,
      ...datasets,
      ...metrics,
      ...error_reasons_
    }

    if (has_resources) {
      const resources_ = resources.find(d => d.method_id === method_id)
      summary = {...summary, ...resources_}
    }
    return summary
  })
  .sort((a, b) => b.mean_score - a.mean_score)

// make sure the first entry contains all columns
column_info = {
  let column_info = [
    {
      id: "method_name",
      name: "Name",
      label: null,
      group: "method",
      geom: "text",
      palette: null
    },
    {
      id: "mean_score",
      name: "Score",
      group: "overall",
      geom: "bar",
      palette: "overall"
    },
    {
      id: "error_reason",
      name: "Error reason",
      group: "overall",
      geom: "pie",
      palette: "error_reason"
    },
    ...dataset_info
      .filter(d => dataset_ids.includes(d.dataset_id))
      .map(
        d => ({
          id: "dataset_" + d.dataset_id,
          name: d.dataset_name,
          group: "dataset",
          geom: "funkyrect",
          palette: "dataset"
        })
      )
      .sort((a, b) => a.name.localeCompare(b.name)),
    ...metric_info
      .filter(d => metric_ids.includes(d.metric_id))
      .map(
        d => ({
          id: "metric_" + d.metric_id,
          name: d.metric_name,
          group: "metric",
          geom: "funkyrect",
          palette: "metric"
        })
      )
      .sort((a, b) => a.name.localeCompare(b.name)),
  ]

  if (has_resources) {
    column_info.push(
      {
        id: "mean_cpu_pct",
        name: "%CPU",
        group: "resources",
        geom: "funkyrect",
        palette: "resources"
      },
      {
        id: "mean_peak_memory_log",
        name: "Peak memory",
        label: "mean_peak_memory_str",
        group: "resources",
        geom: "rect",
        palette: "resources"
      },
      {
        id: "mean_disk_read_log",
        name: "Disk read",
        label: "mean_disk_read_str",
        group: "resources",
        geom: "rect",
        palette: "resources"
      },
      {
        id: "mean_disk_write_log",
        name: "Disk write",
        label: "mean_disk_write_str",
        group: "resources",
        geom: "rect",
        palette: "resources"
      },
      {
        id: "mean_duration_log",
        name: "Duration",
        label: "mean_duration_str",
        group: "resources",
        geom: "rect",
        palette: "resources"
      }
    )
  }

  column_info = column_info.map(d => {
    if (d.id === "method_name") {
      return {...d, options: {width: 15, hjust: 0}}
    } else if (d.id === "is_baseline") {
      return {...d, options: {width: 1}}
    } else if (d.geom === "bar") {
      return {...d, options: {width: 4}}
    } else {
      return d
    }
  })

  return column_info
}

column_groups = {
  let column_groups = [
    {
      group: "method",
      palette: null,
      level1: ""
    },
    {
      group: "overall",
      palette: "overall",
      level1: "Overall"
    },
    {
      group: "error_reason",
      palette: "error_reason",
      level1: "Error reason"
    },
    {
      group: "dataset",
      palette: "dataset",
      level1: dataset_info.length >= 3 ? "Datasets" : ""
    },
    {
      group: "metric",
      palette: "metric",
      level1: metric_info.length >= 3 ? "Metrics" : ""
    }
  ]

  if (has_resources) {
    column_groups.push(
      {group: "resources", palette: "resources", level1: "Resources"}
    )
  }

  return column_groups
}

palettes = [
  {
    overall: "Greys",
    dataset: "Blues",
    metric: "Reds",
    resources: "YlOrBr",
    error_reason: {
      colors: ["#8DD3C7", "#FFFFB3", "#BEBADA", "#fdb462", "#999999", "#FFFFFF"],
      names: [
        "Memory limit exceeded",
        "Time limit exceeded",
        "Execution error",
        "Unknown error",
        "Not applicable",
        "No error"
      ]
    }
  }
][0]
funkyheatmap(
    transpose_list_of_objects(summary_all),
    transpose_list_of_objects(column_info),
    [],
    transpose_list_of_objects(column_groups),
    [],
    palettes,
    {
        fontSize: 14,
        rowHeight: 26,
        rootStyle: 'max-width: none',
        colorByRank: color_by_rank,
        theme: {
            oddRowBackground: 'var(--bs-body-bg)',
            evenRowBackground: 'var(--bs-button-hover)',
            textColor: 'var(--bs-body-color)',
            strokeColor: 'var(--bs-body-color)',
            headerColor: 'var(--bs-body-color)',
            hoverColor: 'var(--bs-body-color)'
        }
    },
    scale_column
);
Figure 1: Overview of the results per method. This figure shows the mean of the scaled scores (group Overall), the mean scores per dataset (group Dataset) and the mean scores per metric (group Metric).
Display settings
viewof color_by_rank = Inputs.toggle({label: "Color by rank:", value: true})
viewof scale_column = Inputs.toggle({label: "Minmax column:", value: false})
viewof show_con = Inputs.toggle({label: "Show control methods:", value: true})
Filter datasets
viewof dataset_ids = Inputs.checkbox(
  dataset_info.filter(d => poss_dataset_ids.includes(d.dataset_id)),
  {
    keyof: d => d.dataset_name,
    valueof: d => d.dataset_id,
    value: dataset_info.map(d => d.dataset_id),
    label: "Datasets:"
  }
)
Filter methods
viewof method_ids = Inputs.checkbox(
  method_info.filter(d => poss_method_ids.includes(d.method_id)),
  {
    keyof: d => d.method_name,
    valueof: d => d.method_id,
    value: method_info.map(d => d.method_id),
    label: "Methods:"
  }
)
Filter metrics
viewof metric_ids = Inputs.checkbox(
  metric_info.filter(d => poss_metric_ids.includes(d.metric_id)),
  {
    keyof: d => d.metric_name,
    valueof: d => d.metric_id,
    value: metric_info.map(d => d.metric_id),
    label: "Metrics:"
  }
)
funkyheatmap = (await require('d3@7').then(d3 => {
  window.d3 = d3;
  window._ = _;
  return import('https://unpkg.com/funkyheatmapjs@0.2.5');
})).default;

Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.

Dataset info


GTEX v9

Source dataset · Data source · 23-01-2025 · 196.56 MiB

Single-nucleus cross-tissue molecular reference maps to decipher disease gene function (Eraslan et al. 2022).

Understanding the function of genes and their regulation in tissue homeostasis and disease requires knowing the cellular context in which genes are expressed in tissues across the body. Single cell genomics allows the generation of detailed cellular atlases in human tissues, but most efforts are focused on single tissue types. Here, we establish a framework for profiling multiple tissues across the human body at single-cell resolution using single nucleus RNA-Seq (snRNA-seq), and apply it to 8 diverse, archived, frozen tissue types (three donors per tissue). We apply four snRNA-seq methods to each of 25 samples from 16 donors, generating a cross-tissue atlas of 209,126 nuclei profiles, and benchmark them vs. scRNA-seq of comparable fresh tissues. We use a conditional variational autoencoder (cVAE) to integrate an atlas across tissues, donors, and laboratory methods. We highlight shared and tissue-specific features of tissue-resident immune cells, identifying tissue-restricted and non-restricted resident myeloid populations. These include a cross-tissue conserved dichotomy between LYVE1- and HLA class II-expressing macrophages, and the broad presence of LAM-like macrophages across healthy tissues that is also observed in disease. For rare, monogenic muscle diseases, we identify cell types that likely underlie the neuromuscular, metabolic, and immune components of these diseases, and biological processes involved in their pathology. For common complex diseases and traits analyzed by GWAS, we identify the cell types and gene modules that potentially underlie disease mechanisms. The experimental and analytical frameworks we describe will enable the generation of large-scale studies of how cellular and molecular processes vary across individuals and populations.

Tabula Sapiens

Source dataset · Data source · 23-01-2025 · 23.61 MiB

A multiple-organ, single-cell transcriptomic atlas of humans (Jones et al. 2022).

Tabula Sapiens is a benchmark, first-draft human cell atlas of nearly 500,000 cells from 24 organs of 15 normal human subjects. This work is the product of the Tabula Sapiens Consortium. Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects and allows detailed analysis and comparison of cell types that are shared between tissues. Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments.

Immune Cell Atlas

Source dataset · Data source · 23-01-2025 · 117.72 MiB

Cross-tissue immune cell analysis reveals tissue-specific features in humans (Domínguez Conde et al. 2022).

Despite their crucial role in health and disease, our knowledge of immune cells within human tissues remains limited. We surveyed the immune compartment of 16 tissues from 12 adult donors by single-cell RNA sequencing and VDJ sequencing generating a dataset of ~360,000 cells. To systematically resolve immune cell heterogeneity across tissues, we developed CellTypist, a machine learning tool for rapid and precise cell type annotation. Using this approach, combined with detailed curation, we determined the tissue distribution of finely phenotyped immune cell types, revealing hitherto unappreciated tissue-specific features and clonal architecture of T and B cells. Our multitissue approach lays the foundation for identifying highly resolved immune cell types by leveraging a common reference dataset, tissue-integrated expression analysis, and antigen receptor sequencing.

Diabetic Kidney Disease

Source dataset · Data source · 23-01-2025 · 151.83 MiB

Multimodal single cell sequencing implicates chromatin accessibility and genetic background in diabetic kidney disease progression (Wilson et al. 2022).

Multimodal single cell sequencing is a powerful tool for interrogating cell-specific changes in transcription and chromatin accessibility. We performed single nucleus RNA (snRNA-seq) and assay for transposase accessible chromatin sequencing (snATAC-seq) on human kidney cortex from donors with and without diabetic kidney disease (DKD) to identify altered signaling pathways and transcription factors associated with DKD. Both snRNA-seq and snATAC-seq had an increased proportion of VCAM1+ injured proximal tubule cells (PT_VCAM1) in DKD samples. PT_VCAM1 has a pro-inflammatory expression signature and transcription factor motif enrichment implicated NFkB signaling. We used stratified linkage disequilibrium score regression to partition heritability of kidney-function-related traits using publicly-available GWAS summary statistics. Cell-specific PT_VCAM1 peaks were enriched for heritability of chronic kidney disease (CKD), suggesting that genetic background may regulate chromatin accessibility and DKD progression. snATAC-seq found cell-specific differentially accessible regions (DAR) throughout the nephron that change accessibility in DKD and these regions were enriched for glucocorticoid receptor (GR) motifs. Changes in chromatin accessibility were associated with decreased expression of insulin receptor, increased gluconeogenesis, and decreased expression of the GR cytosolic chaperone, FKBP5, in the diabetic proximal tubule. Cleavage under targets and release using nuclease (CUT&RUN) profiling of GR binding in bulk kidney cortex and an in vitro model of the proximal tubule (RPTEC) showed that DAR co-localize with GR binding sites. CRISPRi silencing of GR response elements (GRE) in the FKBP5 gene body reduced FKBP5 expression in RPTEC, suggesting that reduced FKBP5 chromatin accessibility in DKD may alter cellular response to GR. 
We developed an open-source tool for single cell allele specific analysis (SALSA) to model the effect of genetic background on gene expression. Heterozygous germline single nucleotide variants (SNV) in proximal tubule ATAC peaks were associated with allele-specific chromatin accessibility and differential expression of target genes within cis-coaccessibility networks. Partitioned heritability of proximal tubule ATAC peaks with a predicted allele-specific effect was enriched for eGFR, suggesting that genetic background may modify DKD progression in a cell-specific manner.

Method info


Best standard method

Documentation

Baseline best standard method

A combined ‘best’ standard method constructed by combining scores from the best overall standard method from each task.

The selected methods are:

  • Label projection: scANVI+scArches
  • Batch Integration: scANVI

Median standard method

Documentation

Baseline median standard method

A combined ‘median’ standard method constructed by combining scores from the standard methods with the median overall score on each task.

The selected methods are:

  • Label projection: Multilayer perceptron
  • Batch Integration: pyliger

Geneformer

Documentation · Repository

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-aware predictions in settings with limited data in network biology (Theodoris et al. 2023; Chen et al. 2024).

Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology.

scGPT (fine-tuned)

Documentation · Repository

A fine-tuned version of the scGPT foundation model (Cui et al. 2024).

scGPT is a foundation model for single-cell biology based on a generative pre-trained transformer and trained on a repository of over 33 million cells. Here we fine-tune a pre-trained model for each task.

scGPT (zero shot)

Documentation · Repository

A zero-shot version of the scGPT foundation model (Cui et al. 2024).

scGPT is a foundation model for single-cell biology based on a generative pre-trained transformer and trained on a repository of over 33 million cells. Here we perform zero-shot inference using a pre-trained model.

SCimilarity

Documentation · Repository

SCimilarity provides a unifying representation of single cell expression profiles (Heimberg et al. 2023).

SCimilarity is a unifying representation of single cell expression profiles that quantifies similarity between expression states and generalizes to represent new studies without additional training.

scPRINT

Documentation · Repository

scPRINT is a large transformer model built for the inference of gene networks (Kalfon et al. 2024).

scPRINT is a large transformer model built for the inference of gene networks (connections between genes explaining the cell’s expression profile) from scRNAseq data. It uses novel encoding and decoding of the cell expression profile and new pre-training methodologies to learn a cell model. scPRINT can be used to perform the following analyses:

  • expression denoising: increase the resolution of your scRNAseq data
  • cell embedding: generate a low-dimensional representation of your dataset
  • label prediction: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
  • gene network inference: generate a gene network from any cell or cell cluster in your scRNAseq dataset

UCE

Documentation · Repository

UCE offers a unified biological latent space that can represent any cell (Rosen et al. 2023).

Universal Cell Embedding (UCE) is a single-cell foundation model that offers a unified biological latent space that can represent any cell, regardless of tissue or species.

Control method info


Positive control

Documentation

Baseline positive control

Baseline positive control constructed by taking the highest mean score by a control method on each dataset for each task.

Negative control

Documentation

Baseline negative control

Baseline negative control constructed by taking the lowest mean score by a control method on each dataset for each task.

Metric info


Label projection

Source code

Automated cell type annotation from rich, labeled reference data.

A major challenge for integrating single cell datasets is creating matching cell type annotations for each cell. One of the most common strategies for annotating cell types is referred to as “cluster-then-annotate” whereby cells are aggregated into clusters based on feature similarity and then manually characterized based on differential gene expression or previously identified marker genes. Recently, methods have emerged to build on this strategy and annotate cells using known marker genes. However, these strategies pose a difficulty for integrating atlas-scale datasets as the particular annotations may not match.

To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then project labels from the reference to the new dataset.

Here, we compare methods for annotation based on a reference dataset. The datasets consist of two or more samples of single cell profiles that have been manually annotated with matching labels. These datasets are then split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.

Batch Integration

Source code

Remove unwanted batch effects from scRNA-seq data while retaining biologically meaningful variation.

As single-cell technologies advance, single-cell datasets are growing both in size and complexity. Especially in consortia such as the Human Cell Atlas, individual studies combine data from multiple labs, each sequencing multiple individuals possibly with different technologies. This gives rise to complex batch effects in the data that must be computationally removed to perform a joint analysis. These batch integration methods must remove the batch effect while not removing relevant biological information. Currently, over 200 tools exist that aim to remove batch effects from scRNA-seq datasets (Zappia, Phipson, and Oshlack 2018). These methods balance the removal of batch effects with the conservation of nuanced biological information in different ways. This abundance of tools has complicated batch integration method choice, leading to several benchmarks on this topic (Luecken et al. 2021; Tran et al. 2020; Chazarra-Gil et al. 2021; Mereu et al. 2020). Yet, benchmarks use different metrics, method implementations and datasets. Here we build a living benchmarking task for batch integration methods with the vision of improving the consistency of method evaluation.

In this task we evaluate batch integration methods on their ability to remove batch effects in the data while conserving variation attributed to biological effects. As input, methods require either normalised or unnormalised data with multiple batches and consistent cell type labels. The batch-integrated output can be a feature matrix, a low-dimensional embedding and/or a neighbourhood graph. The respective batch-integrated representation is then evaluated using sets of metrics that capture how well batch effects are removed and whether biological variance is conserved. We have based this particular task on the latest and most extensive benchmark of single-cell data integration methods.

Authors

  • Robrecht Cannoodt (author, maintainer)

References

Boiarsky, Rebecca, Nalini Singh, Alejandro Buendia, Gad Getz, and David Sontag. 2023. “A deep dive into single-cell RNA sequencing foundation models.” bioRxiv, 2023.10.19.563100. https://doi.org/10.1101/2023.10.19.563100.
Chazarra-Gil, Ruben, Stijn van Dongen, Vladimir Yu Kiselev, and Martin Hemberg. 2021. “Flexible Comparison of Batch Correction Methods for Single-Cell RNA-Seq Using BatchBench.” Nucleic Acids Research 49 (7): e42–42. https://doi.org/10.1093/nar/gkab004.
Chen, Han, Madhavan S Venkatesh, Javier Gomez Ortega, Siddharth V Mahesh, Tarak N Nandi, Ravi K Madduri, Karin Pelka, and Christina V Theodoris. 2024. “Quantized Multi-Task Learning for Context-Specific Representations of Gene Network Dynamics,” August. https://doi.org/10.1101/2024.08.16.608180.
Cui, Haotian, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. 2024. “scGPT: Toward Building a Foundation Model for Single-Cell Multi-Omics Using Generative AI.” Nature Methods 21 (8): 1470–80. https://doi.org/10.1038/s41592-024-02201-0.
Domínguez Conde, C., C. Xu, L. B. Jarvis, D. B. Rainbow, S. B. Wells, T. Gomes, S. K. Howlett, et al. 2022. “Cross-Tissue Immune Cell Analysis Reveals Tissue-Specific Features in Humans.” Science 376 (6594). https://doi.org/10.1126/science.abl5197.
Eraslan, Gökcen, Eugene Drokhlyansky, Shankara Anand, Evgenij Fiskin, Ayshwarya Subramanian, Michal Slyper, Jiali Wang, et al. 2022. “Single-Nucleus Cross-Tissue Molecular Reference Maps Toward Understanding Disease Gene Function.” Science 376 (6594). https://doi.org/10.1126/science.abl4290.
Heimberg, Graham, Tony Kuo, Daryle DePianto, Tobias Heigl, Nathaniel Diamant, Omar Salem, Gabriele Scalia, et al. 2023. “Scalable Querying of Human Cell Atlases via a Foundational Model Reveals Commonalities Across Fibrosis-Associated Macrophages,” July. https://doi.org/10.1101/2023.07.18.549537.
Jones, Robert C., Jim Karkanias, Mark A. Krasnow, Angela Oliveira Pisco, Stephen R. Quake, Julia Salzman, Nir Yosef, et al. 2022. “The Tabula Sapiens: A Multiple-Organ, Single-Cell Transcriptomic Atlas of Humans.” Science 376 (6594). https://doi.org/10.1126/science.abl4896.
Kalfon, Jérémie, Jules Samaran, Gabriel Peyré, and Laura Cantini. 2024. “scPRINT: Pre-Training on 50 Million Cells Allows Robust Gene Network Predictions,” July. https://doi.org/10.1101/2024.07.29.605556.
Liu, Tianyu, Kexing Li, Yuge Wang, Hongyu Li, and Hongyu Zhao. 2024. “Evaluating the utilities of foundation models in single-cell data analysis.” bioRxiv.org: The Preprint Server for Biology, 2023.09.08.555192. https://doi.org/10.1101/2023.09.08.555192.
Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2021. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.
Mereu, Elisabetta, Atefeh Lafzi, Catia Moutinho, Christoph Ziegenhain, Davis J McCarthy, Adrian Alvarez-Varela, Eduard Batlle, et al. 2020. “Benchmarking Single-Cell RNA-Sequencing Protocols for Cell Atlas Projects.” Nature Biotechnology 38 (6): 747–55. https://doi.org/10.1038/s41587-020-0469-4.
Rosen, Yanay, Yusuf Roohani, Ayush Agrawal, Leon Samotorcan, Tabula Sapiens Consortium, Stephen R. Quake, and Jure Leskovec. 2023. “Universal Cell Embeddings: A Foundation Model for Cell Biology,” November. https://doi.org/10.1101/2023.11.28.568918.
Szałata, Artur, Karin Hrovatin, Sören Becker, Alejandro Tejada-Lapuerta, Haotian Cui, Bo Wang, and Fabian J Theis. 2024. “Transformers in single-cell omics: a review and new perspectives.” Nature Methods 21 (8): 1430–43. https://doi.org/10.1038/s41592-024-02353-z.
Theodoris, Christina V., Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, et al. 2023. “Transfer Learning Enables Predictions in Network Biology.” Nature 618 (7965): 616–24. https://doi.org/10.1038/s41586-023-06139-9.
Tran, Hoa Thi Nhu, Kok Siong Ang, Marion Chevrier, Xiaomeng Zhang, Nicole Yee Shin Lee, Michelle Goh, and Jinmiao Chen. 2020. “A Benchmark of Batch-Effect Correction Methods for Single-Cell RNA Sequencing Data.” Genome Biology 21 (1). https://doi.org/10.1186/s13059-019-1850-9.
Wilson, Parker C., Yoshiharu Muto, Haojia Wu, Anil Karihaloo, Sushrut S. Waikar, and Benjamin D. Humphreys. 2022. “Multimodal Single Cell Sequencing Implicates Chromatin Accessibility and Genetic Background in Diabetic Kidney Disease Progression.” Nature Communications 13 (1). https://doi.org/10.1038/s41467-022-32972-z.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the Single-Cell RNA-Seq Analysis Landscape with the scRNA-Tools Database.” Edited by Dina Schneidman. PLOS Computational Biology 14 (6): e1006245. https://doi.org/10.1371/journal.pcbi.1006245.

© Open Problems 2023 with all data licensed under CC-BY.

 