annbatch.Loader#

class annbatch.Loader(*, batch_sampler=None, chunk_size=None, preload_nchunks=None, shuffle=None, return_index=False, batch_size=None, preload_to_gpu=False, drop_last=None, to_torch=False, concat_strategy=None, rng=None)#

A loader for on-disk anndata stores.

This loader batches together slice requests to the underlying stores to achieve higher performance. The custom code for this will eventually be upstreamed into anndata and will then no longer rely on private zarr APIs. The loader is agnostic to the on-disk chunking/sharding, but it may be advisable to align it with the in-memory chunk_size for dense data.

The dataset class on its own is quite performant for "chunked loading," i.e., chunk_size > 1. When chunk_size == 1, wrap the dataset object in a torch.utils.data.DataLoader; in that case, be sure to use spawn multiprocessing in the wrapping loader.
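A hedged sketch of that wrapping, assuming annbatch and torch are installed (the DataLoader arguments shown are illustrative choices, not prescribed by annbatch; my_anndata is a placeholder for your own data):

```python
def make_wrapped_loader(my_anndata, num_workers=4):
    # Sketch only: wraps a chunk_size == 1 Loader in a torch DataLoader
    # with spawn multiprocessing, as recommended above.
    from annbatch import Loader
    from torch.utils.data import DataLoader

    ds = Loader(chunk_size=1).add_anndata(my_anndata)
    return DataLoader(
        ds,
        batch_size=64,                    # illustrative choice
        num_workers=num_workers,
        multiprocessing_context="spawn",  # spawn, per the note above
    )
```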

If preload_to_gpu is True and to_torch is False, the yielded type is a cupy matrix. If to_torch is True, the yielded type is a torch.Tensor. If both preload_to_gpu and to_torch are False, the yielded type is the CPU class for the given data type.

When providing a custom batch_sampler, the chunk_size, preload_nchunks, batch_size, shuffle, drop_last, and rng arguments must not be set (they are controlled by the batch_sampler instead). When these arguments are provided without a batch_sampler, they are used to construct an annbatch.ChunkSampler.
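The rules above can be restated as a small decision function; this is an illustrative summary of the documented behavior, not library code:

```python
def yielded_type(preload_to_gpu: bool, to_torch: bool) -> str:
    """Restate the documented return-type rules (illustration only)."""
    if to_torch:
        return "torch.Tensor"
    if preload_to_gpu:
        return "cupy matrix"
    return "CPU class for the given data type"
```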

Parameters:
batch_sampler Sampler | None (default: None)

If not provided, a default annbatch.ChunkSampler is constructed using the defaults listed below.

chunk_size int | None (default: None)

The obs size (i.e., axis 0) of contiguous array data to fetch. Mutually exclusive with batch_sampler. Defaults to 512.

preload_nchunks int | None (default: None)

The number of chunks of contiguous array data to fetch. Mutually exclusive with batch_sampler. Defaults to 32.
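Taken together with chunk_size, this determines how many observations are held in memory per fetch. A back-of-the-envelope calculation under the defaults (an interpretation of the two parameters above, not library code):

```python
chunk_size = 512       # default contiguous obs per chunk
preload_nchunks = 32   # default chunks fetched per preload
rows_preloaded = chunk_size * preload_nchunks
print(rows_preloaded)  # observations held in memory per preload
```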

shuffle bool | None (default: None)

Whether or not to shuffle the data. Mutually exclusive with batch_sampler. Defaults to False.

batch_size int | None (default: None)

Batch size to yield from the dataset. Mutually exclusive with batch_sampler. Defaults to 1.

drop_last bool | None (default: None)

Set to True to drop the last incomplete batch when the dataset size is not divisible by the batch size. If False, the last batch will be smaller in that case. Leave as False when using in conjunction with a torch.utils.data.DataLoader. Mutually exclusive with batch_sampler. Defaults to False.
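The resulting batch count is simple arithmetic; a small illustration of the semantics described above (not library code):

```python
def n_batches(n_obs: int, batch_size: int, drop_last: bool) -> int:
    """Batches yielded under the drop_last semantics described above."""
    full, rem = divmod(n_obs, batch_size)
    return full if drop_last or rem == 0 else full + 1

# e.g. 10 obs with batch_size=4 yields batches of 4, 4, 2 -- or 4, 4 if dropped
```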

rng Generator | None (default: None)

Random number generator for shuffling. Mutually exclusive with batch_sampler. Defaults to np.random.default_rng() if not provided.
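Passing a seeded generator makes the shuffle order reproducible across runs, since identically seeded generators produce identical permutations; a quick NumPy demonstration of that property:

```python
import numpy as np

# Identically seeded generators shuffle identically, so a seeded rng
# gives the loader a reproducible shuffle order.
order_a = np.random.default_rng(0).permutation(8)
order_b = np.random.default_rng(0).permutation(8)
assert (order_a == order_b).all()  # same seed, same order
```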

return_index bool (default: False)

Whether or not to yield the index on each iteration.

preload_to_gpu bool (default: False)

Whether or not to use cupy for non-IO array operations, like vstack and indexing, once the data is in memory internally. This option entails greater GPU memory usage but is faster, at least for sparse operations. torch.vstack() does not support CSR sparse matrices, hence the current use of cupy internally (which also means torch is an optional dependency). Setting this to False is advisable when using the torch.utils.data.DataLoader wrapper, or potentially with dense data, due to memory pressure. For top performance, use this in conjunction with to_torch, followed by torch.Tensor.to_dense() if you wish to densify. cupy.cuda.MemoryPool.free_all_blocks() (i.e., the method of the pool returned by cupy.get_default_memory_pool()) is called aggressively to keep memory usage low. If you are using your own memory pool or allocator, you may have to free blocks yourself.
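If you do manage memory yourself, the freeing described above uses exactly the cupy calls the text names; a minimal sketch, assuming cupy is installed:

```python
def free_default_pool_blocks():
    # Release cached blocks from cupy's default memory pool, mirroring
    # what the loader does internally per the description above.
    import cupy

    cupy.get_default_memory_pool().free_all_blocks()
```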

to_torch bool (default: False)

Whether to return torch.Tensor as the output. Data transfer should be zero-copy regardless of source, and transfer to CUDA, when applicable, is non-blocking. Defaults to True if torch is installed.

concat_strategy None | Literal['concat-shuffle', 'shuffle-concat'] (default: None)

The strategy for how in-memory, preloaded data should be concatenated and yielded. With concat-shuffle, preloaded data is concatenated and then subsetted/shuffled (higher memory usage, but faster, at least for sparse data). With shuffle-concat, preloaded data is first shuffled/subsetted chunk-by-chunk and then concatenated (lower memory usage, potentially faster for dense data). The default is chosen automatically: concat-shuffle if the data added to the loader is sparse, otherwise shuffle-concat.
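A toy NumPy illustration of the two strategies (an interpretation of the description above, not the library's implementation): both yield the same set of rows and differ only in when the full concatenation is materialized and in the resulting row order.

```python
import numpy as np

# Three preloaded chunks of two rows each, plus a global shuffle order.
chunks = [np.arange(4).reshape(2, 2) + 10 * i for i in range(3)]
perm = np.random.default_rng(0).permutation(6)

# concat-shuffle: materialize one big buffer, then index it.
concat_shuffle = np.vstack(chunks)[perm]

# shuffle-concat: pick each chunk's shuffled rows chunk-by-chunk, then
# concatenate (smaller intermediate buffers; row order differs).
shuffle_concat = np.vstack([
    chunk[[p - off for p in perm if off <= p < off + 2]]
    for off, chunk in zip((0, 2, 4), chunks)
])

# Both strategies yield the same set of rows.
assert sorted(map(tuple, concat_shuffle)) == sorted(map(tuple, shuffle_concat))
```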

Examples

>>> from annbatch import Loader
>>> ds = Loader(
...     batch_size=4096,
...     chunk_size=32,
...     preload_nchunks=512,
... ).add_anndata(my_anndata)
>>> for batch in ds:
...     # optionally convert to dense
...     # batch = batch.to_dense()
...     do_fit(batch)

Attributes table#

batch_sampler

The sampler used to generate batches.

dataset_type

The type of on-disk data used in this loader.

n_obs

The total number of observations in this instance, i.e., the sum of the first-axis sizes of all added datasets.

n_var

The total number of variables in this instance, i.e., the size of the second axis, which must be the same across all datasets.

Methods table#

add_anndata(adata)

Append an anndata to this dataset.

add_anndatas(adatas)

Append anndatas to this dataset.

add_dataset(dataset[, obs])

Append a dataset to this dataset.

add_datasets(datasets[, obs])

Append datasets to this dataset.

use_collection(collection, *[, load_adata])

Load from an existing annbatch.DatasetCollection.

Attributes#

Loader.batch_sampler#

The sampler used to generate batches.

Returns:

The sampler.

Loader.dataset_type#

The type of on-disk data used in this loader.

Returns:

The type used.

Loader.n_obs#

The total number of observations in this instance, i.e., the sum of the first-axis sizes of all added datasets.

Returns:

The number of observations.

Loader.n_var#

The total number of variables in this instance, i.e., the size of the second axis, which must be the same across all datasets.

Returns:

The number of variables.

Methods#

Loader.add_anndata(adata)#

Append an anndata to this dataset.

Parameters:
adata AnnData

An anndata.AnnData object, with zarr.Array or anndata.abc.CSRDataset as the data matrix in X, and obs containing annotations to yield in a pandas.DataFrame.

Return type:

Self

Loader.add_anndatas(adatas)#

Append anndatas to this dataset.

Parameters:
adatas list[AnnData]

List of anndata.AnnData objects, with zarr.Array or anndata.abc.CSRDataset as the data matrix in X, and obs containing annotations to yield in a pandas.DataFrame.

Return type:

Self

Loader.add_dataset(dataset, obs=None)#

Append a dataset to this dataset.

Parameters:
dataset CSRDataset | Array

A zarr.Array or anndata.abc.CSRDataset object, generally from anndata.AnnData.X.

obs DataFrame | None (default: None)

DataFrame obs, generally from anndata.AnnData.obs.

Return type:

Self

Loader.add_datasets(datasets, obs=None)#

Append datasets to this dataset.

Parameters:
datasets list[CSRDataset | Array]

List of zarr.Array or anndata.abc.CSRDataset objects, generally from anndata.AnnData.X. They must all be of the same type and match that of any already added datasets.

obs list[DataFrame] | None (default: None)

List of DataFrame obs, generally from anndata.AnnData.obs.

Return type:

Self

Loader.use_collection(collection, *, load_adata=<function load_x_and_obs>)#

Load from an existing annbatch.DatasetCollection.

This function can only be called once. If you want to manually add more data, use Loader.add_anndatas() or open an issue.

Parameters:
collection DatasetCollection

The collection whose on-disk datasets should be used in this loader.

load_adata Callable[[Group], AnnData] (default: <function load_x_and_obs>)

A custom load function; recall that whatever is found in X and obs will be yielded in batches. The default is to load X and all of obs. This default can degrade performance if you do not need every column in obs, in which case providing a custom load_adata is recommended.
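As an illustration, a custom load function might read X lazily and keep only one obs column. This is a hypothetical sketch: the anndata.io helpers used, the sparse-X assumption, and the "cell_type" column name are all assumptions, not part of annbatch's API:

```python
def load_x_and_cell_type(group):
    # Hypothetical load_adata callable; whatever is placed in X and obs
    # here is what the loader will yield in batches.
    import anndata
    import anndata.io

    X = anndata.io.sparse_dataset(group["X"])  # lazy CSR dataset (assumes sparse X)
    obs = anndata.io.read_elem(group["obs"])[["cell_type"]]  # one column only
    return anndata.AnnData(X=X, obs=obs)  # assumes AnnData accepts the lazy X
```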

Return type:

Self