hydrax.DataGroup#
- final class hydrax.DataGroup(batch_size: int, data: Iterable[D], *, loader_arrays: Mapping[str, tuple[tuple[int, ...], numpy.typing.DTypeLike, jax.typing.DTypeLike]] | None = None, cache_arrays: Mapping[str, tuple[tuple[int, ...], numpy.typing.DTypeLike, jax.typing.DTypeLike]] | None = None, cache_location: str | PathLike | None = None, cache_readonly: bool = False, seed: int = 0, shuffle_first: bool = False, shuffle_later: str = 'default')#
Bases: Sequence[D], Generic[D]
Represents a group of data which share the same descriptor, batch size, and array shapes.
Caution
Don’t derive from DataGroup, and do not modify your dataset after placing it in a DataGroup.
- Parameters:
batch_size – The batch size for loading and processing the data.
data – A list of all data descriptors for this group. Descriptors are passed to the loader to identify the data item to load. Any finite sequence-like object or iterator is acceptable. The elements must be pickleable, as they are sent directly to loader processes.
loader_arrays – Shape and datatype definitions for all arrays in a loader batch. Do not include the leading batch dimension, since that is specified by batch_size. These arrays will be presented to the loader for zero-copy initialization.
cache_arrays – Shape and datatype definitions for all arrays in a cached batch. Do not include the leading batch dimension, since that is specified by batch_size. Arrays of this shape will be retrieved from and stored to disk. If not specified, loader_arrays is used and data is automatically cached. Otherwise, the data to cache must be provided via Batch.cache().
cache_location – The location of the cache on disk. If the path does not exist, it is created unless cache_readonly is specified.
cache_readonly – If True, the cache is readonly and will not be created if it does not exist, nor will it be populated if batches are missing. The default is False to allow creation and population.
seed – Integer seed to use for shuffling and batch seeding. The default is 0.
shuffle_first – If True, the first epoch is shuffled itemwise. Otherwise, the first epoch proceeds in the order specified in the dataset, which is the default.
shuffle_later – A string indicating the shuffling and seeding mode for epochs after the first. The default is "default", see below.
"repeat" - do not shuffle or reseed batches
"reseed" - reseed batches but do not shuffle
"itemwise" - reseed batches and shuffle items between them
"batchwise" - shuffle the order of batches, but do not reseed or change their contents
"default" - "batchwise" if cache_location is specified, otherwise "itemwise"
If you do not specify loader_arrays, any batches which cannot be loaded from the cache will be dropped. As such, both cache_arrays and cache_location are required, and cache_readonly is implied.
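For orientation, a minimal construction sketch follows. Everything in it (the descriptor dicts, the array names "image" and "label", their shapes and dtypes) is hypothetical; only the parameter names and the (shape, numpy dtype, jax dtype) tuple form come from the signature above.

import numpy as np
import jax.numpy as jnp
from hydrax import DataGroup

# Hypothetical descriptors: one dict per item, passed through to the loader unchanged.
dataset = [{"path": f"img_{i:05d}.png", "label": i % 10} for i in range(10_000)]

group = DataGroup(
    64,                        # batch_size
    dataset,
    loader_arrays={
        # name: (shape without the batch dimension, numpy dtype, jax dtype)
        "image": ((256, 256, 3), np.uint8, jnp.float32),
        "label": ((), np.int32, jnp.int32),
    },
    seed=0,
    shuffle_first=True,        # shuffle even the first epoch
    shuffle_later="itemwise",  # reshuffle items and reseed batches afterwards
)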
Warning
If your dataset is an iterable and not otherwise indexable, it will be materialized by the DataGroup. If you have hundreds of thousands of items, consider using the hydrax.pandas adapter module.
A cache cannot be reused if the batch_size, data, cache_arrays, seed, or shuffling parameters have changed. Only the length of the data is verified, not its ordering or contents. If either of those change, the result of using the cache is undefined. You may change the name and shape of the cache arrays as long as their position, NumPy dtype, and total size remain the same.
- property batch_count: int#
The total number of batches this DataGroup can produce per epoch.
- property batch_size: int#
The batch size configured for the DataGroup.
- property cache_layouts: mappingproxy[str, tuple[tuple[int, ...], dtype, dtype, int, int]] | None#
The array layouts for all arrays in the cache. Includes the leading batch dimension.
The mapping has the form:
{ 'array_name': ((batch_size, dim_1, ...), numpy_dtype, jax_dtype, offset, count), ... }
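As a sketch of how this mapping might be inspected, reusing the hypothetical group from the construction example above (the property is None unless cache_arrays and cache_location were supplied):

layouts = group.cache_layouts
if layouts is not None:
    for name, (shape, np_dtype, jax_dtype, offset, count) in layouts.items():
        # shape includes the leading batch dimension; offset and count locate
        # the array within a cached batch, per the tuple form shown above.
        print(name, shape, np_dtype, jax_dtype, offset, count)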
- property cache_size: int | None#
The size, in bytes, of a single cached batch, or None if cache_arrays was not specified.
- count(value) → integer -- return number of occurrences of value#
- index(value[, start[, stop]]) → integer -- return first index of value.#
Raises ValueError if the value is not present.
Supporting start and stop arguments is optional, but recommended.
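Since a DataGroup is a Sequence of its descriptors, the standard read-only sequence operations apply. A brief sketch, again reusing the hypothetical group from the construction example:

print(len(group))        # number of data items in the group
desc = group[0]          # a descriptor from the group
i = group.index(desc)    # position of that descriptor; ValueError if absent
n = group.count(desc)    # number of occurrences of that descriptor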
- property loader_layouts: mappingproxy[str, tuple[tuple[int, ...], dtype, dtype, int, int]] | None#
The array layouts for all arrays provided to the loader. Includes the leading batch dimension.
The mapping has the form:
{ 'array_name': ((batch_size, dim_1, ...), numpy_dtype, jax_dtype, offset, count), ... }
- property loader_size: int | None#
The size, in bytes, of a single loaded batch, or None if loader_arrays was not specified.