annbatch.DatasetCollection#
- class annbatch.DatasetCollection(group, *, mode='a', is_collection_h5ad=False)#
A pre-shuffled collection object, providing functionality for creating, adding to, and loading collections shuffled by annbatch.
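A minimal construction sketch (the store path is hypothetical; `mode` and `is_collection_h5ad` are simply the keyword defaults from the signature above):

>>> from annbatch import DatasetCollection
>>> collection = DatasetCollection(
...     "path/to/output/zarr_store.zarr", mode="a", is_collection_h5ad=False
... )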
Attributes table#

| Attribute | Description |
| --- | --- |
| `is_empty` | Whether or not there is an existing store at the group location. |

Methods table#

| Method | Description |
| --- | --- |
| `add_adatas` | Take AnnData paths and create or add to an on-disk set of AnnData datasets with uniform var spaces at the desired path. |
Attributes#
- DatasetCollection.is_empty#
Whether or not there is an existing store at the group location.
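A minimal usage sketch of this flag (the store path is hypothetical):

>>> from annbatch import DatasetCollection
>>> collection = DatasetCollection("path/to/output/zarr_store.zarr")
>>> if collection.is_empty:
...     print("no collection at this location yet; add_adatas will create one")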
Methods#
- DatasetCollection.add_adatas(adata_paths, *, load_adata=<function _default_load_adata>, var_subset=None, zarr_sparse_chunk_size=32768, zarr_sparse_shard_size=134217728, zarr_dense_chunk_size=1024, zarr_dense_shard_size=4194304, zarr_compressor=(BloscCodec(typesize=1, cname='lz4', clevel=3, shuffle='shuffle', blocksize=0),), h5ad_compressor='gzip', n_obs_per_dataset=2097152, shuffle_chunk_size=1000, shuffle=True)#
Take AnnData paths and create or add to an on-disk set of AnnData datasets with uniform var spaces at the desired path (with `n_obs_per_dataset` rows per dataset if running for the first time). The set of AnnData datasets is collectively referred to as a "collection", where each dataset is called `dataset_i.{zarr,h5ad}`. The main purpose of this function is to create shuffled sharded zarr datasets, which is its default behavior. However, it can also output h5ad datasets as well as unshuffled datasets. The var space is by default outer-joined initially; subsequently added datasets (i.e., on later calls to this function) are subset to it, but this behavior can be controlled by `var_subset`. A key `src_path` is added to `obs` to indicate which file each individual row came from. We highly recommend making your indexes unique across files; this function will call `AnnData.obs_names_make_unique`. Memory usage should be controlled via `n_obs_per_dataset` + `shuffle_chunk_size`, as that many rows will be read into memory before being written to disk. After the dataset completes, a marker is added to the group's `attrs` to note that the dataset has been shuffled by annbatch; this marker is not a stable API and is only for internal purposes at the moment.
- Parameters:
- adata_paths
`Iterable[zarr.Group | h5py.Group | PathLike[str] | str]` Paths to the AnnData files used to create the zarr store.
- load_adata
`Callable[[zarr.Group | h5py.Group | PathLike[str] | str], AnnData]` (default: `<function _default_load_adata>`) Function to customize (lazy-)loading the individual input AnnData files. By default, `anndata.experimental.read_lazy()` is used, with categoricals/nullables read into memory and `(-1)` chunks for `obs`. If you only need a subset of the input AnnData files' elems (e.g., only `X` and certain `obs` columns), you can provide a custom function here to speed up loading and harmonize your data. Beware that concatenating nullables/categoricals from `anndata.experimental.backed.Dataset2D` `obs` (i.e., what happens internally in this function if `len(adata_paths) > 1`) is very time consuming; consider loading these into memory if you use this argument.
- var_subset
`Iterable[str] | None` (default: `None`) Subset of gene names to include in the store. If `None`, all genes are included. Genes are subset based on the `var_names` attribute of the concatenated AnnData object.
- zarr_sparse_chunk_size
`int` (default: `32768`) Size of the chunks to use for the `indices` and `data` of a sparse matrix in the zarr store.
- zarr_sparse_shard_size
`int` (default: `134217728`) Size of the shards to use for the `indices` and `data` of a sparse matrix in the zarr store.
- zarr_dense_chunk_size
`int` (default: `1024`) Number of observations per dense zarr chunk, i.e., chunking is only done along the first axis of the array.
- zarr_dense_shard_size
`int` (default: `4194304`) Number of observations per dense zarr shard, i.e., sharding is only done along the first axis of the array.
- zarr_compressor
`Iterable[BytesBytesCodec]` (default: `(BloscCodec(typesize=1, cname='lz4', clevel=3, shuffle='shuffle', blocksize=0),)`) Compressors to use to compress the data in the zarr store.
- h5ad_compressor
`Optional[Literal['gzip', 'lzf']]` (default: `'gzip'`) Compressor to use to compress the data in the h5ad store. See `anndata.write_h5ad`.
- n_obs_per_dataset
`int` (default: `2097152`) Number of observations to load into memory at once for shuffling / pre-processing. The higher this number, the more memory is used, but the better the shuffling. This corresponds to the size of the shards created. Only applicable when adding datasets for the first time; otherwise ignored.
- shuffle
`bool` (default: `True`) Whether to shuffle the data before writing it to the store. Ignored once the store is non-empty.
- shuffle_chunk_size
`int` (default: `1000`) How many contiguous rows to load into memory at once before shuffling. `(n_obs_per_dataset // shuffle_chunk_size)` slices of size `shuffle_chunk_size` will be loaded (see the sketch after this parameter list).
- Return type:
Self
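As a quick illustration of the `n_obs_per_dataset` / `shuffle_chunk_size` relationship described above, here is the arithmetic with the signature's defaults (plain Python, not annbatch API):

>>> n_obs_per_dataset = 2_097_152
>>> shuffle_chunk_size = 1_000
>>> n_obs_per_dataset // shuffle_chunk_size  # contiguous slices loaded per dataset
2097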
Examples

>>> import anndata as ad
>>> from annbatch import DatasetCollection
>>> # create a custom load function to only keep `.X`, `.obs` and `.var` in the output store
>>> def read_lazy_x_and_obs_only(path):
...     adata = ad.experimental.read_lazy(path)
...     return ad.AnnData(
...         X=adata.X,
...         obs=adata.obs.to_memory(),
...         var=adata.var.to_memory(),
...     )
>>> datasets = [
...     "path/to/first_adata.h5ad",
...     "path/to/second_adata.h5ad",
...     "path/to/third_adata.h5ad",
... ]
>>> DatasetCollection("path/to/output/zarr_store.zarr").add_adatas(
...     datasets,
...     load_adata=read_lazy_x_and_obs_only,
... )
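A possible follow-up, not part of the documented example: read back one written dataset to inspect the `src_path` provenance column described above. This sketch assumes the default layout in which datasets are written as `dataset_i.zarr` inside the collection group.

>>> import anndata as ad
>>> adata = ad.read_zarr("path/to/output/zarr_store.zarr/dataset_0.zarr")
>>> adata.obs["src_path"].unique()  # which input file each (shuffled) row came from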