annbatch.DatasetCollection#

class annbatch.DatasetCollection(group, *, mode='a', is_collection_h5ad=False)#

A preshuffled collection object including functionality for creating, adding to, and loading collections shuffled by annbatch.

Attributes table#

is_empty

Whether or not there is an existing store at the group location.

Methods table#

add_adatas(adata_paths, *[, load_adata, ...])

Take AnnData paths and create, or add to, an on-disk set of AnnData datasets with a uniform var space at the desired path (with n_obs_per_dataset rows per dataset when running for the first time).

Attributes#

DatasetCollection.is_empty#

Whether or not there is an existing store at the group location.

Methods#

DatasetCollection.add_adatas(adata_paths, *, load_adata=<function _default_load_adata>, var_subset=None, zarr_sparse_chunk_size=32768, zarr_sparse_shard_size=134217728, zarr_dense_chunk_size=1024, zarr_dense_shard_size=4194304, zarr_compressor=(BloscCodec(cname='lz4', clevel=3, shuffle='shuffle'),), h5ad_compressor='gzip', n_obs_per_dataset=2097152, shuffle_chunk_size=1000, shuffle=True)#

Take AnnData paths and create, or add to, an on-disk set of AnnData datasets with a uniform var space at the desired path (with n_obs_per_dataset rows per dataset when running for the first time).

The set of AnnData datasets is collectively referred to as a “collection”, where each dataset is named dataset_i.{zarr,h5ad}. The main purpose of this function is to create shuffled, sharded zarr datasets, which is its default behavior, but it can also write h5ad datasets as well as unshuffled datasets. The var space is outer-joined on the initial call; datasets added on subsequent calls are subsetted to that var space, although this behavior can be controlled via var_subset. A key src_path is added to obs to indicate which file each individual row came from. We highly recommend making your indexes unique across files; this function will call AnnData.obs_names_make_unique.

Memory usage should be controlled by n_obs_per_dataset + shuffle_chunk_size, as that many rows will be read into memory before writing to disk. After the dataset completes, a marker is added to the group’s attrs to note that this dataset has been shuffled by annbatch; this marker is not a stable API and is currently for internal purposes only.
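As a rough illustration of the layout described above, the defaults imply how many dataset_i files a collection contains. The total row count below is hypothetical, and the ceiling division is our assumption about how rows split across datasets of n_obs_per_dataset rows each:

```python
# Hypothetical illustration: how many dataset_i.{zarr,h5ad} files a collection
# would contain, assuming rows are split into datasets of n_obs_per_dataset
# rows each (default size from the signature above).
import math

n_obs_total = 5_000_000        # assumed total rows across all input AnnData files
n_obs_per_dataset = 2_097_152  # default: 2**21 rows per dataset

n_datasets = math.ceil(n_obs_total / n_obs_per_dataset)
print(n_datasets)  # -> 3, i.e. dataset_0, dataset_1, dataset_2
```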

Parameters:
adata_paths Iterable[Group | Group | PathLike[str] | str]

Paths to the AnnData files used to create the zarr store.

load_adata Callable[[Group | Group | PathLike[str] | str], AnnData] (default: <function _default_load_adata>)

Function to customize (lazy-)loading the individual input AnnData files. By default, anndata.experimental.read_lazy() is used, with categoricals/nullables read into memory and (-1) chunks for obs. If you only need a subset of the input AnnData files’ elems (e.g., only X and certain obs columns), you can provide a custom function here to speed up loading and harmonize your data. Beware that concatenating nullables/categoricals from anndata.experimental.backed.Dataset2D obs (i.e., what happens internally in this function if len(adata_paths) > 1) is very time consuming; consider loading these into memory if you use this argument.

var_subset Iterable[str] | None (default: None)

Subset of gene names to include in the store. If None, all genes are included. Genes are subsetted based on the var_names attribute of the concatenated AnnData object.

zarr_sparse_chunk_size int (default: 32768)

Size of the chunks to use for the indices and data of a sparse matrix in the zarr store.

zarr_sparse_shard_size int (default: 134217728)

Size of the shards to use for the indices and data of a sparse matrix in the zarr store.

zarr_dense_chunk_size int (default: 1024)

Number of observations per dense zarr chunk, i.e., chunking is only done along the first axis of the array.

zarr_dense_shard_size int (default: 4194304)

Number of observations per dense zarr shard, i.e., sharding is only done along the first axis of the array.
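A quick check that the default chunk and shard sizes nest evenly, which zarr v3 sharding requires (a shard must hold an integer number of chunks); the values below are the defaults quoted above:

```python
# Defaults from the signature above; each shard must contain a whole number
# of chunks, which the defaults satisfy for both sparse and dense arrays.
sparse_chunk, sparse_shard = 32_768, 134_217_728
dense_chunk, dense_shard = 1_024, 4_194_304

sparse_chunks_per_shard = sparse_shard // sparse_chunk
dense_chunks_per_shard = dense_shard // dense_chunk
print(sparse_chunks_per_shard, sparse_shard % sparse_chunk)  # 4096 0
print(dense_chunks_per_shard, dense_shard % dense_chunk)     # 4096 0
```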

zarr_compressor Iterable[BytesBytesCodec] (default: (BloscCodec(cname='lz4', clevel=3, shuffle='shuffle'),))

Compressors to use to compress the data in the zarr store.

h5ad_compressor Optional[Literal['gzip', 'lzf']] (default: 'gzip')

Compressor to use to compress the data in the h5ad store. See anndata.write_h5ad.

n_obs_per_dataset int (default: 2097152)

Number of observations to load into memory at once for shuffling / pre-processing. The higher this number, the more memory is used, but the better the shuffling. This corresponds to the size of the shards created. Only applicable when adding datasets for the first time, otherwise ignored.

shuffle bool (default: True)

Whether to shuffle the data before writing it to the store. Ignored once the store is non-empty.

shuffle_chunk_size int (default: 1000)

How many contiguous rows to load into memory at once for shuffling. n_obs_per_dataset // shuffle_chunk_size slices of size shuffle_chunk_size will be loaded.
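With the defaults, this buffering works out as follows. This is a sketch, assuming the intended slice count is n_obs_per_dataset // shuffle_chunk_size (the original formula as printed would integer-divide to zero):

```python
# Sketch of the shuffle buffering implied by the defaults: contiguous slices
# of shuffle_chunk_size rows are gathered until n_obs_per_dataset rows are
# in memory, then shuffled and written out.
n_obs_per_dataset = 2_097_152  # default
shuffle_chunk_size = 1_000     # default

n_slices = n_obs_per_dataset // shuffle_chunk_size
print(n_slices)  # -> 2097 contiguous slices buffered per dataset
```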

Return type:

Self

Examples

>>> import anndata as ad
>>> from annbatch import DatasetCollection
>>> # Create a custom load function to only keep `.X`, `.obs`, and `.var` in the output store
>>> def read_lazy_x_and_obs_only(path):
...     adata = ad.experimental.read_lazy(path)
...     return ad.AnnData(
...         X=adata.X,
...         obs=adata.obs.to_memory(),
...         var=adata.var.to_memory(),
...     )
>>> datasets = [
...     "path/to/first_adata.h5ad",
...     "path/to/second_adata.h5ad",
...     "path/to/third_adata.h5ad",
... ]
>>> DatasetCollection("path/to/output/zarr_store.zarr").add_adatas(
...     datasets,
...     load_adata=read_lazy_x_and_obs_only,
... )