Quickstart annbatch#

This notebook will walk you through the following steps:

  1. How to convert an existing collection of anndata files into a shuffled, zarr-based collection of anndata datasets

  2. How to load the converted collection using annbatch

  3. How to extend an existing collection with new anndata datasets

# !pip install "annbatch[zarrs,torch]"
# Download two example datasets from CELLxGENE
!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad
!wget https://datasets.cellxgene.cziscience.com/f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad


IMPORTANT: Configure zarrs

This step is required both for converting existing anndata files into a performant, shuffled collection of datasets and for efficient mini-batch loading.

import zarr

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

import warnings

# Suppress zarr vlen-utf8 codec warnings
warnings.filterwarnings(
    "ignore",
    message="The codec `vlen-utf8` is currently not part in the Zarr format 3 specification.*",
    category=UserWarning,
    module="zarr.codecs.vlen_utf8",
)

Converting existing anndata files into a shuffled collection#

The conversion code will take care of the following things:

  • Align (outer join) the gene spaces across all datasets listed in adata_paths

    • The gene spaces are outer-joined based on the gene names provided in the var_names field of the individual AnnData objects.

    • If you want to subset to a specific gene space, you can provide a list of gene names via the var_subset parameter.

  • Shuffle the cells across all datasets (this works on larger than memory datasets as well).

    • This is important for block-wise shuffling during data loading.

  • Shuffle the input files across multiple output datasets:

    • The size of each individual output dataset can be controlled via the n_obs_per_dataset parameter.

    • We recommend choosing a dataset size that comfortably fits into system memory.
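The outer join of gene spaces described above can be sketched with a toy example. This is plain Python for illustration only: the gene names and counts are made up, and annbatch's actual (lazy, out-of-core) implementation works differently.

```python
# Toy illustration of the outer join across gene spaces (not annbatch's
# actual implementation; gene names and counts are made up).
ds1 = {"var_names": ["GeneA", "GeneB"], "counts": {"GeneA": 5, "GeneB": 2}}
ds2 = {"var_names": ["GeneB", "GeneC"], "counts": {"GeneB": 7, "GeneC": 1}}

# The joined gene space is the union of all var_names, in a stable order.
joined = sorted(set(ds1["var_names"]) | set(ds2["var_names"]))

# Re-index each dataset onto the joined space; absent genes become 0.
aligned = [[ds["counts"].get(g, 0) for g in joined] for ds in (ds1, ds2)]
print(joined)   # ['GeneA', 'GeneB', 'GeneC']
print(aligned)  # [[5, 2, 0], [0, 7, 1]]
```

The same idea applies per cell: after the join, every dataset exposes the full union of genes, with zeros where a gene was not measured.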

You can apply custom data transformations to each input h5ad file by supplying a load_adata function to DatasetCollection.add_adatas.


import anndata as ad
from annbatch import DatasetCollection

# Write out only the shared columns - otherwise DatasetCollection will warn, for good
# reason, about missing columns: mismatched columns can lead to unexpected data and missing values.
shared_columns = ad.experimental.read_lazy("866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad").obs.columns.intersection(
    ad.experimental.read_lazy("f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad").obs.columns
)


# For CELLxGENE data, the raw counts can either be found under .raw.X or under .X (if .raw is not supplied).
# To have a store that only contains raw counts, we can write the following load_adata function
def read_lazy_x_and_obs_only(path) -> ad.AnnData:
    """Custom load function to only load raw counts from CxG data."""
    # IMPORTANT: Large data should always be loaded lazily to reduce the memory footprint
    adata_ = ad.experimental.read_lazy(path)
    if adata_.raw is not None:
        x = adata_.raw.X
        var = adata_.raw.var
    else:
        x = adata_.X
        var = adata_.var

    return ad.AnnData(
        X=x,
        obs=adata_.obs.to_memory()[shared_columns],
        var=var.to_memory(),
    )


collection = DatasetCollection(zarr.open("annbatch_collection", mode="w"))
collection.add_adatas(
    # List all the h5ad files you want to include in the collection
    adata_paths=["866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad", "f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad"],
    shuffle=True,  # Whether to pre-shuffle the cells of the collection
    n_obs_per_dataset=2_097_152,  # Number of cells per dataset shard, this number is much higher than available in these datasets but is generally a good target
    var_subset=None,  # Optionally subset the collection to a specific gene space
    load_adata=read_lazy_x_and_obs_only,
)


Data loading example#

Now we create our data loader with the desired arguments.

WARNING: Without the load_adata argument to use_collection, the entire obs will be loaded and yielded, which degrades performance. It is highly advised to supply this argument.

import anndata as ad

from annbatch import Loader


def _load_adata(g: zarr.Group) -> ad.AnnData:
    return ad.AnnData(
        X=ad.io.sparse_dataset(g["X"]),
        obs=ad.experimental.read_lazy(g).obs[["cell_type"]].to_memory(),
    )


ds = Loader(
    batch_size=4096,  # Total number of obs per yielded batch
    chunk_size=256,  # Number of obs to load from disk contiguously - default settings should work well
    preload_nchunks=32,  # Number of chunks to preload + shuffle - default settings should work well
    # If True, preloaded chunks are moved to GPU memory via `cupy`, which can put more pressure on GPU memory but will accelerate loading ~20%
    preload_to_gpu=False,
    to_torch=True,
)

# Add in the shuffled data that should be used for training.
ds.use_collection(collection, load_adata=_load_adata)

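To build intuition for how chunk_size and preload_nchunks interact: the loader reads chunk_size contiguous observations at a time, buffers preload_nchunks such chunks, and shuffles within that buffer before yielding batches. The following is a toy sketch of that idea in plain Python; blockwise_shuffle is a hypothetical helper for illustration, not part of the annbatch API.

```python
import random


def blockwise_shuffle(n_obs, chunk_size, preload_nchunks, seed=0):
    """Toy sketch: yield row indices in block-wise shuffled order."""
    rng = random.Random(seed)
    # Split the rows into contiguous on-disk chunks of chunk_size.
    chunks = [
        list(range(i, min(i + chunk_size, n_obs)))
        for i in range(0, n_obs, chunk_size)
    ]
    rng.shuffle(chunks)  # randomize the order in which chunks are read
    # Buffer preload_nchunks chunks at a time and shuffle within the buffer.
    for i in range(0, len(chunks), preload_nchunks):
        buffer = [idx for chunk in chunks[i : i + preload_nchunks] for idx in chunk]
        rng.shuffle(buffer)
        yield from buffer


# Every observation appears exactly once, in a locally shuffled order.
order = list(blockwise_shuffle(n_obs=16, chunk_size=4, preload_nchunks=2))
```

With chunk_size=256 and preload_nchunks=32 as above, each shuffle buffer holds 8,192 observations, from which 4,096-observation batches are drawn.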

IMPORTANT:

  • The Loader yields batches of sparse tensors.

  • The conversion to dense tensors should be done on the GPU, as shown in the example below.

    • First call .cuda() and then .to_dense()

    • E.g. x = x.cuda().to_dense()

    • This is significantly faster than doing the dense conversion on the CPU.

# Iterate over dataloader
import tqdm

for batch in tqdm.tqdm(ds):
    x, obs = batch["X"], batch["obs"]["cell_type"]
    # Important: Convert to dense on GPU
    x = x.cuda().to_dense()
    # Feed data into your model
    ...


Optional: Extend an existing collection with a new dataset#

You might want to extend an existing pre-shuffled collection with a new dataset. This can be done using the add_adatas method again.

This function will take care of shuffling the new dataset into the existing collection without having to re-shuffle the entire collection.

collection.add_adatas(
    adata_paths=[
        "866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad",
    ],
    load_adata=read_lazy_x_and_obs_only,
)
