annbatch.Loader#
- class annbatch.Loader(*, batch_sampler=None, chunk_size=None, preload_nchunks=None, shuffle=None, return_index=False, batch_size=None, preload_to_gpu=False, drop_last=None, to_torch=False, concat_strategy=None, rng=None)#
A loader for on-disk anndata data stores.
This loader batches together slice requests to the underlying stores to achieve higher performance. The custom code for this task will eventually be upstreamed into anndata and will then no longer rely on private zarr APIs. The loader is agnostic to the on-disk chunking/sharding, but for dense data it may be advisable to align the on-disk chunking with the in-memory chunk size.
The dataset class on its own is quite performant for “chunked loading”, i.e., chunk_size > 1. When chunk_size == 1, a torch.utils.data.DataLoader should wrap the dataset object; in this case, be sure to use spawn multiprocessing in the wrapping loader.
If preload_to_gpu is True and to_torch is False, the yielded type is a cupy matrix. If to_torch is True, the yielded type is a torch.Tensor. If both preload_to_gpu and to_torch are False, the return type is the CPU class for the given data type.
When providing a custom sampler, chunk_size, preload_nchunks, batch_size, shuffle, drop_last, and rng must not be set (they are controlled by the batch_sampler instead). When these arguments are provided and no batch_sampler is given, they are used to construct an annbatch.ChunkSampler.
- Parameters:
  - batch_sampler
    Sampler | None (default: None)
    If not provided, a default annbatch.ChunkSampler will be used with the defaults below.
  - chunk_size
    int | None (default: None)
    The obs size (i.e., axis 0) of contiguous array data to fetch. Mutually exclusive with batch_sampler. Defaults to 512.
  - preload_nchunks
    int | None (default: None)
    The number of chunks of contiguous array data to fetch. Mutually exclusive with batch_sampler. Defaults to 32.
  - shuffle
    bool | None (default: None)
    Whether or not to shuffle the data. Mutually exclusive with batch_sampler. Defaults to False.
  - batch_size
    int | None (default: None)
    Batch size to yield from the dataset. Mutually exclusive with batch_sampler. Defaults to 1.
  - drop_last
    bool | None (default: None)
    Set to True to drop the last incomplete batch if the dataset size is not divisible by the batch size. If False and the dataset size is not divisible by the batch size, the last batch will be smaller. Leave as False when using in conjunction with a torch.utils.data.DataLoader. Mutually exclusive with batch_sampler. Defaults to False.
  - rng
    Generator | None (default: None)
    Random number generator for shuffling. Mutually exclusive with batch_sampler. Defaults to np.random.default_rng() if not provided.
  - return_index
    bool (default: False)
    Whether or not to yield the index on each iteration.
  - preload_to_gpu
    bool (default: False)
    Whether or not to use cupy for non-IO array operations like vstack and indexing once the data is in memory internally. This option entails greater GPU memory usage, but is faster, at least for sparse operations. torch.vstack() does not support CSR sparse matrices, hence the current use of cupy internally (which also means torch is an optional dependency). Setting this to False is advisable when using the torch.utils.data.DataLoader wrapper, or potentially with dense data, due to memory pressure. For top performance, this should be used in conjunction with to_torch, followed by torch.Tensor.to_dense() if you wish to densify. cupy.cuda.MemoryPool.free_all_blocks() (i.e., the method of the pool from cupy.get_default_memory_pool()) is called aggressively to keep memory usage low. If you are using your own memory pool or allocator, you may have to free blocks on your own.
  - to_torch
    bool (default: False)
    Whether to return torch.Tensor as the output. Data transfer should be zero-copy independent of source, and transfer to CUDA, when applicable, is non-blocking. Defaults to True if torch is installed.
  - concat_strategy
    None | Literal['concat-shuffle', 'shuffle-concat'] (default: None)
    The strategy for how in-memory, preloaded data should be concatenated and yielded. With concat-shuffle, preloaded data is concatenated and then subsetted/shuffled (higher memory usage, but faster, at least for sparse data). With shuffle-concat, preloaded data is first shuffled/subsetted chunk-by-chunk and then concatenated (lower memory usage, potentially faster for dense data). The default is chosen automatically: concat-shuffle if the data added to the loader is sparse, and otherwise shuffle-concat.
Examples
>>> from annbatch import Loader
>>> ds = Loader(
...     batch_size=4096,
...     chunk_size=32,
...     preload_nchunks=512,
... ).add_anndata(my_anndata)
>>> for batch in ds:
...     # optionally convert to dense
...     # batch = batch.to_dense()
...     do_fit(batch)
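The preload_to_gpu and to_torch options described above can be combined for GPU training. A minimal sketch extending the example (my_anndata and do_fit are placeholders, as above):

>>> ds = Loader(
...     batch_size=4096,
...     chunk_size=32,
...     preload_nchunks=512,
...     preload_to_gpu=True,  # keep non-IO ops (vstack, indexing) on the GPU via cupy
...     to_torch=True,        # yield torch.Tensor batches
... ).add_anndata(my_anndata)
>>> for batch in ds:
...     batch = batch.to_dense()  # densify only if your model needs dense input
...     do_fit(batch)

When chunk_size == 1, wrap the loader in a torch.utils.data.DataLoader with spawn multiprocessing, as noted above. A sketch of that pattern, assuming the wrapping DataLoader performs the batching and collation:

>>> import torch
>>> ds = Loader(chunk_size=1, to_torch=True).add_anndata(my_anndata)
>>> dl = torch.utils.data.DataLoader(
...     ds,
...     batch_size=64,                    # batching is handled by the DataLoader here (assumption)
...     num_workers=4,
...     multiprocessing_context="spawn",  # spawn, per the note above
... )
>>> for batch in dl:
...     do_fit(batch)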
Attributes table#
| batch_sampler | The sampler used to generate batches. |
| dataset_type | The type of on-disk data used in this loader. |
| n_obs | The total number of observations in this instance, i.e., the sum of the first-axis sizes of all added datasets. |
| n_var | The total number of variables in this instance, i.e., the size of the second axis, which is the same across all added datasets. |
Methods table#
| add_anndata(adata) | Append an anndata to this dataset. |
| add_anndatas(adatas) | Append anndatas to this dataset. |
| add_dataset(dataset, obs=None) | Append a dataset to this dataset. |
| add_datasets(datasets, obs=None) | Append datasets to this dataset. |
| use_collection(collection, *, load_adata=...) | Load from an existing annbatch.DatasetCollection. |
Attributes#
- Loader.batch_sampler#
The sampler used to generate batches.
- Returns:
The sampler.
- Loader.dataset_type#
The type of on-disk data used in this loader.
- Returns:
The type used.
- Loader.n_obs#
The total number of observations in this instance, i.e., the sum of the first-axis sizes of all added datasets.
- Returns:
The number of observations.
- Loader.n_var#
The total number of variables in this instance, i.e., the size of the second axis, which is the same across all added datasets.
- Returns:
The number of variables.
Methods#
- Loader.add_anndata(adata)#
Append an anndata to this dataset.
- Parameters:
  - adata
    AnnData
    An anndata.AnnData object, with zarr.Array or anndata.abc.CSRDataset as the data matrix in X, and obs containing annotations to yield in a pandas.DataFrame.
- Return type:
Self
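Because the method returns the loader itself, several stores can be appended by chaining calls. A small sketch, where adata_a and adata_b are assumed to be anndata.AnnData objects whose X is a zarr.Array or anndata.abc.CSRDataset:

>>> from annbatch import Loader
>>> loader = (
...     Loader(batch_size=2048)
...     .add_anndata(adata_a)  # placeholder AnnData with zarr-backed X
...     .add_anndata(adata_b)  # a second store appended to the same loader
... )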
- Loader.add_anndatas(adatas)#
Append anndatas to this dataset.
- Parameters:
  - adatas
    list[AnnData]
    List of anndata.AnnData objects, with zarr.Array or anndata.abc.CSRDataset as the data matrix in X, and obs containing annotations to yield in a pandas.DataFrame.
- Return type:
Self
- Loader.add_dataset(dataset, obs=None)#
Append a dataset to this dataset.
- Parameters:
  - dataset
    CSRDataset | Array
    A zarr.Array or anndata.abc.CSRDataset object, generally from anndata.AnnData.X.
  - obs
    DataFrame | None (default: None)
    The obs DataFrame, generally from anndata.AnnData.obs.
- Return type:
Self
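For example, a dense zarr.Array and its obs annotations can be pulled out of an existing zarr store and appended directly. A sketch, assuming a hypothetical store layout with "X" and "obs" groups, and an anndata version where read_elem is available under anndata.io (older versions expose it under anndata.experimental):

>>> import zarr
>>> import anndata as ad
>>> from annbatch import Loader
>>> g = zarr.open_group("data/my_store.zarr", mode="r")  # hypothetical path
>>> loader = Loader().add_dataset(
...     g["X"],                          # a dense zarr.Array, left on disk
...     obs=ad.io.read_elem(g["obs"]),   # obs loaded into a pandas.DataFrame
... )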
- Loader.add_datasets(datasets, obs=None)#
Append datasets to this dataset.
- Parameters:
  - datasets
    list[CSRDataset | Array]
    List of zarr.Array or anndata.abc.CSRDataset objects, generally from anndata.AnnData.X. They must all be of the same type, and match the type of any already added datasets.
  - obs
    list[DataFrame] | None (default: None)
    List of obs DataFrames, generally from anndata.AnnData.obs.
- Return type:
Self
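Several stores can be added in one call. A sketch for sparse data, assuming hypothetical store paths and an anndata version where sparse_dataset and read_elem are available under anndata.io:

>>> import zarr
>>> import anndata as ad
>>> from annbatch import Loader
>>> paths = ["data/store_a.zarr", "data/store_b.zarr"]   # hypothetical paths
>>> groups = [zarr.open_group(p, mode="r") for p in paths]
>>> loader = Loader().add_datasets(
...     [ad.io.sparse_dataset(g["X"]) for g in groups],  # all CSRDataset, i.e. the same type
...     obs=[ad.io.read_elem(g["obs"]) for g in groups],
... )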
- Loader.use_collection(collection, *, load_adata=<function load_x_and_obs>)#
Load from an existing annbatch.DatasetCollection.
This function can only be called once. If you want to manually add more data, use Loader.add_anndatas() or open an issue.
- Parameters:
  - collection
    DatasetCollection
    The collection whose on-disk datasets should be used in this loader.
  - load_adata
    Callable[[Group], AnnData] (default: <function load_x_and_obs>)
    A custom load function. Recall that whatever is found in X and obs will be yielded in batches. The default is to load X and all of obs. This default behavior can degrade performance if you don't need all columns in obs, in which case it is recommended to pass a custom load_adata that loads only what you need.
- Return type:
Self
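If you only need a subset of obs, a custom load_adata can wrap the default loader and drop the unused columns. A sketch, assuming the default load_x_and_obs helper is importable from annbatch (check your installed version for its actual location) and a hypothetical "cell_type" column:

>>> from annbatch import Loader, load_x_and_obs  # import location of load_x_and_obs is an assumption
>>> def load_minimal(group):
...     adata = load_x_and_obs(group)             # default behavior: backed X plus full obs
...     adata.obs = adata.obs[["cell_type"]]      # keep only the column you actually need (hypothetical name)
...     return adata
...
>>> loader = Loader(batch_size=4096).use_collection(
...     my_collection,                            # an existing annbatch.DatasetCollection (placeholder)
...     load_adata=load_minimal,
... )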