Quickstart annbatch#
This notebook will walk you through the following steps:

- How to convert an existing collection of anndata files into a shuffled, zarr-based collection of anndata datasets
- How to load the converted collection using annbatch
- How to extend an existing collection with new anndata datasets
# !pip install "annbatch[zarrs,torch]"
# Download two example datasets from CELLxGENE
!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad
!wget https://datasets.cellxgene.cziscience.com/f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad
IMPORTANT: Configure zarrs
This step is required both for converting existing anndata files into a performant, shuffled collection of datasets and for mini-batch loading.
import zarr
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
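# Optional sanity check (assumes donfig-style `get`, which zarr's config object provides):
# reading the setting back should print "zarrs.ZarrsCodecPipeline"
print(zarr.config.get("codec_pipeline.path"))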
import warnings
# Suppress zarr vlen-utf8 codec warnings
warnings.filterwarnings(
"ignore",
message="The codec `vlen-utf8` is currently not part in the Zarr format 3 specification.*",
category=UserWarning,
module="zarr.codecs.vlen_utf8",
)
Converting existing anndata files into a shuffled collection#
The conversion code will take care of the following things:

- Align (outer join) the gene spaces across all datasets listed in adata_paths (see the toy sketch after this list)
  - The gene spaces are outer-joined based on the gene names provided in the var_names field of the individual AnnData objects.
  - If you want to subset to a specific gene space, you can provide a list of gene names via the var_subset parameter.
- Shuffle the cells across all datasets (this works on larger-than-memory datasets as well).
  - This is important for block-wise shuffling during data loading.
- Shuffle the input files across multiple output datasets:
  - The size of each individual output dataset can be controlled via the n_obs_per_dataset parameter.
  - We recommend choosing a dataset size that comfortably fits into system memory.
- You can apply custom data transformations to each input h5ad file by supplying a load_adata function to DatasetCollection.add_adatas
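To build intuition for the outer join, here is a toy sketch using plain anndata.concat. This is illustrative only; annbatch performs the equivalent alignment internally during conversion.

import anndata as ad
import numpy as np
import pandas as pd
from scipy import sparse

# Two toy datasets with partially overlapping gene spaces
adata_a = ad.AnnData(
    X=sparse.csr_matrix(np.array([[1.0, 2.0]])),
    var=pd.DataFrame(index=["gene_a", "gene_b"]),
)
adata_b = ad.AnnData(
    X=sparse.csr_matrix(np.array([[3.0, 4.0]])),
    var=pd.DataFrame(index=["gene_b", "gene_c"]),
)

# Outer join on var_names: the result covers the union of genes;
# genes absent from a dataset are zero-filled for sparse X
joined = ad.concat([adata_a, adata_b], join="outer")
print(joined.var_names.tolist())  # ['gene_a', 'gene_b', 'gene_c']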
import anndata as ad
from annbatch import DatasetCollection
# Let's write out only the shared columns - otherwise DatasetCollection will warn about
# the columns missing from one of the files, for good reason: mismatched columns can lead
# to unexpected data and missing values.
shared_columns = ad.experimental.read_lazy("866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad").obs.columns.intersection(
ad.experimental.read_lazy("f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad").obs.columns
)
# For CELLxGENE data, the raw counts can be found either under .raw.X or under .X (if .raw is not supplied).
# To have a store that only contains raw counts, we can write the following load_adata function
def read_lazy_x_and_obs_only(path) -> ad.AnnData:
"""Custom load function to only load raw counts from CxG data."""
# IMPORTANT: Large data should always be loaded lazily to reduce the memory footprint
adata_ = ad.experimental.read_lazy(path)
if adata_.raw is not None:
x = adata_.raw.X
var = adata_.raw.var
else:
x = adata_.X
var = adata_.var
return ad.AnnData(
X=x,
obs=adata_.obs.to_memory()[shared_columns],
var=var.to_memory(),
)
collection = DatasetCollection(zarr.open("annbatch_collection", mode="w"))
collection.add_adatas(
# List all the h5ad files you want to include in the collection
adata_paths=["866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad", "f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad"],
    shuffle=True,  # Whether to pre-shuffle the cells of the collection
    # Number of cells per dataset shard. This is far larger than these example datasets,
    # but is generally a good target size.
    n_obs_per_dataset=2_097_152,
var_subset=None, # Optionally subset the collection to a specific gene space
load_adata=read_lazy_x_and_obs_only,
)
Data loading example#
Now we create our data loader with the desired arguments.
WARNING: Without the load_adata argument to use_collection, the entire obs will be loaded and yielded with every batch, degrading performance. It is highly advisable to use this argument.
import anndata as ad
from annbatch import Loader
def _load_adata(g: zarr.Group) -> ad.AnnData:
    return ad.AnnData(
        X=ad.io.sparse_dataset(g["X"]),
        obs=ad.experimental.read_lazy(g).obs[["cell_type"]].to_memory(),
    )
ds = Loader(
batch_size=4096, # Total number of obs per yielded batch
chunk_size=256, # Number of obs to load from disk contiguously - default settings should work well
preload_nchunks=32, # Number of chunks to preload + shuffle - default settings should work well
# If True, preloaded chunks are moved to GPU memory via `cupy`, which can put more pressure on GPU memory but will accelerate loading ~20%
preload_to_gpu=False,
to_torch=True,
)
# Add in the shuffled data that should be used for training.
ds.use_collection(collection, load_adata=_load_adata)
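As the parameter comments above suggest, the loader preloads a window of chunks and shuffles within it, so the effective shuffle buffer is chunk_size * preload_nchunks observations. A quick back-of-the-envelope check (a sketch, assuming in-buffer shuffling works this way):

batch_size = 4096
chunk_size = 256
preload_nchunks = 32

# Observations held in memory and shuffled together at any one time
buffer_obs = chunk_size * preload_nchunks
print(buffer_obs)  # 8192

# Mini-batches yielded per preloaded buffer
print(buffer_obs // batch_size)  # 2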
IMPORTANT:

- The Loader yields batches of sparse tensors.
- The conversion to dense tensors should be done on the GPU, as shown in the example below.
  - First call .cuda() and then .to_dense(), e.g. x = x.cuda().to_dense()
  - This is significantly faster than doing the dense conversion on the CPU.
# Iterate over dataloader
import tqdm
for batch in tqdm.tqdm(ds):
x, obs = batch["X"], batch["obs"]["cell_type"]
# Important: Convert to dense on GPU
x = x.cuda().to_dense()
# Feed data into your model
...
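If you are iterating on a machine without a GPU, the same loop works with a CPU-side densification (slower). A minimal sketch that guards on CUDA availability:

import torch

for batch in tqdm.tqdm(ds):
    x = batch["X"]
    # Prefer GPU densification when available; fall back to CPU otherwise
    x = x.cuda().to_dense() if torch.cuda.is_available() else x.to_dense()
    ...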
Optional: Extend an existing collection with a new dataset#
You might want to extend an existing pre-shuffled collection with a new dataset.
This can be done by calling the add_adatas method again.
This function will take care of shuffling the new dataset into the existing collection without having to re-shuffle the entire collection.
collection.add_adatas(
adata_paths=[
"866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad",
],
load_adata=read_lazy_x_and_obs_only,
)
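To sanity-check the result, you can inspect the shards on disk. This sketch assumes the on-disk layout mirrors what _load_adata consumes above (one zarr sub-group per shard, each holding a sparse "X"):

import zarr
import anndata as ad

store = zarr.open("annbatch_collection", mode="r")
total_obs = 0
for name, grp in store.groups():
    # Read the shard's obs count from its sparse X dataset
    n_obs = ad.io.sparse_dataset(grp["X"]).shape[0]
    print(name, n_obs)
    total_obs += n_obs
print("total obs:", total_obs)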