hydrax.DataGroup#

final class hydrax.DataGroup(batch_size: int, data: Iterable[D], *, loader_arrays: Mapping[str, tuple[tuple[int, ...], dtype[Any] | None | type[Any] | _SupportsDType[dtype[Any]] | str | tuple[Any, int] | tuple[Any, SupportsIndex | Sequence[SupportsIndex]] | list[Any] | _DTypeDict | tuple[Any, Any], str | type[Any] | dtype | SupportsDType]] | None = None, cache_arrays: Mapping[str, tuple[tuple[int, ...], dtype[Any] | None | type[Any] | _SupportsDType[dtype[Any]] | str | tuple[Any, int] | tuple[Any, SupportsIndex | Sequence[SupportsIndex]] | list[Any] | _DTypeDict | tuple[Any, Any], str | type[Any] | dtype | SupportsDType]] | None = None, cache_location: str | PathLike | None = None, cache_readonly: bool = False, seed: int = 0, shuffle_first: bool = False, shuffle_later: str = 'default')#

Bases: Sequence[D], Generic[D]

Represents a group of data which share the same descriptor, batch size, and array shapes.

Caution

Do not derive from DataGroup, and do not modify your dataset after placing it in a DataGroup.

Parameters:
  • batch_size – The batch size for loading and processing the data.

  • data – A list of all data descriptors for this group. Descriptors are passed to the loader to identify the data item to load. Any finite sequence-like object or iterator is acceptable. The elements must be pickleable, as they are sent directly to loader processes.

  • loader_arrays – Shape and datatype definitions for all arrays in a loader batch. Do not include the leading batch dimension, since that is specified by batch_size. These arrays will be presented to the loader for zero-copy initialization.

  • cache_arrays – Shape and datatype definitions for all arrays in a cached batch. Do not include the leading batch dimension, since that is specified by batch_size. Arrays of this shape will be retrieved from and stored to disk. If not specified, loader_arrays and data are cached automatically. Otherwise, the data to cache must be provided via Batch.cache().

  • cache_location – The location of the cache on disk. If the path does not exist, it is created unless cache_readonly is specified.

  • cache_readonly – If True, the cache is readonly and will not be created if it does not exist, nor will it be populated if batches are missing. The default is False to allow creation and population.

  • seed – Integer seed to use for shuffling and batch seeding. The default is 0.

  • shuffle_first – If True, the first epoch is shuffled itemwise. The default is False, in which case the first epoch proceeds in the order specified by the dataset.

  • shuffle_later

    A string indicating the shuffling and seeding mode for epochs after the first. The default is "default"; see below.

    • "repeat" - do not shuffle or reseed batches

    • "reseed" - reseed batches but do not shuffle

    • "itemwise" - reseed batches and shuffle items between them

    • "batchwise" - shuffle the order of batches, but do not reseed or change their contents

    • "default" - "batchwise" if cache_location is specified, otherwise "itemwise"

If you do not specify loader_arrays, any batches which cannot be loaded from the cache will be dropped. As such, both cache_arrays and cache_location are required, and cache_readonly is implied.

Warning

If your dataset is an iterable and not otherwise indexable, it will be materialized by the DataGroup. If you have hundreds of thousands of items, consider using the hydrax.pandas adapter module.

A cache cannot be reused if the batch_size, data, cache_arrays, seed, or shuffling parameters have changed. Only the length of the data is verified, not its ordering or contents; if the ordering or contents change, the result of using the cache is undefined. You may change the name and shape of the cache arrays as long as their position, NumPy dtype, and total size remain the same.
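
For orientation, the following sketch constructs two hypothetical groups: one backed by a loader, and one that is cache-only as described above. The descriptor contents, array names, and cache path are illustrative assumptions, and the array definitions follow the (shape, NumPy dtype, JAX dtype) form shown in the signature above.

import numpy as np
import jax.numpy as jnp

from hydrax import DataGroup

# Hypothetical descriptors: one dict per data item, passed verbatim to the loader.
descriptors = [{"path": f"img_{i:05d}.png", "label": i % 10} for i in range(50_000)]

# Loader-backed group: array shapes omit the leading batch dimension.
train = DataGroup(
    batch_size=64,
    data=descriptors,
    loader_arrays={
        "image": ((3, 256, 256), np.float32, jnp.float32),
        "label": ((1,), np.int32, jnp.int32),
    },
    shuffle_first=True,        # shuffle itemwise from the first epoch
    shuffle_later="itemwise",  # reseed and reshuffle on later epochs
)

# Cache-only group: no loader_arrays, so cache_arrays and cache_location are
# required, cache_readonly is implied, and uncached batches are dropped.
evaluation = DataGroup(
    batch_size=64,
    data=descriptors,
    cache_arrays={
        "image": ((3, 256, 256), np.float32, jnp.float32),
        "label": ((1,), np.int32, jnp.int32),
    },
    cache_location="/data/eval_cache",
)

print(train.batch_count, train.batch_size)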

property batch_count: int#

The total number of batches this DataGroup can produce per epoch.

property batch_size: int#

The batch size configured for the DataGroup.

property cache_layouts: mappingproxy[str, tuple[tuple[int, ...], dtype, dtype, int, int]] | None#

The array layouts for all arrays in the cache. Includes the leading batch dimension.

The mapping has the form: { 'array_name': ((batch_size, dim_1, ...), numpy_dtype, jax_dtype, offset, count), ... }

property cache_size: int | None#

The size, in bytes, of a single cached batch, or None if cache_arrays was not specified.

count(value) → integer -- return number of occurrences of value#

index(value[, start[, stop]]) → integer -- return first index of value.#

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.
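
Because DataGroup derives from Sequence[D], the inherited sequence protocol operates on the descriptors themselves. A short sketch, continuing the hypothetical train group from above and assuming that the sequence length reflects the number of data items, as the Sequence[D] base implies:

n_items = len(train)            # number of descriptors in the group
first = train[0]                # the first descriptor, as passed in data
position = train.index(first)   # 0, via the inherited Sequence.index
assert train.count(first) >= 1  # inherited Sequence.count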

property loader_layouts: mappingproxy[str, tuple[tuple[int, ...], dtype, dtype, int, int]] | None#

The array layouts for all arrays provided to the loader. Includes the leading batch dimension.

The mapping has the form: { 'array_name': ((batch_size, dim_1, ...), numpy_dtype, jax_dtype, offset, count), ... }

property loader_size: int | None#

The size, in bytes, of a single loaded batch, or None if loader_arrays was not specified.
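
As a final sketch, the layout mappings described above can be inspected directly. This continues the hypothetical train group from earlier and assumes only the documented (shape, numpy_dtype, jax_dtype, offset, count) form:

if train.loader_layouts is not None:
    for name, (shape, np_dtype, jax_dtype, offset, count) in train.loader_layouts.items():
        # shape includes the leading batch dimension
        print(f"{name}: shape={shape} numpy={np_dtype} jax={jax_dtype} offset={offset} count={count}")
    print("bytes per loaded batch:", train.loader_size)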