ray.data.Dataset.to_random_access_dataset

Dataset.to_random_access_dataset(key: str, num_workers: int | None = None) → RandomAccessDataset

Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL).

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the dataset.
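The lookup scheme described above can be illustrated with a minimal, single-process sketch in plain Python. It does not use Ray and is not the library's implementation: the helper names `build_blocks` and `lookup` are hypothetical, and the blocks here are ordinary lists rather than Arrow blocks shared with worker actors. It only shows the idea of sorting by key, splitting into contiguous blocks, and using binary search twice (first to pick the block, then to find the row).

```python
from bisect import bisect_left, bisect_right


def build_blocks(records, key, num_blocks):
    """Sort records by `key` and split them into contiguous sorted blocks.

    Hypothetical helper for illustration only. Returns (blocks, lower_bounds),
    where each block is a (keys, rows) pair and lower_bounds[i] is the
    smallest key stored in block i.
    """
    ordered = sorted(records, key=lambda r: r[key])
    # Ceiling division so that at most `num_blocks` blocks are produced.
    size = max(1, -(-len(ordered) // num_blocks))
    blocks = []
    for start in range(0, len(ordered), size):
        rows = ordered[start:start + size]
        blocks.append(([r[key] for r in rows], rows))
    lower_bounds = [keys[0] for keys, _ in blocks]
    return blocks, lower_bounds


def lookup(blocks, lower_bounds, value):
    """Return the record whose key equals `value`, or None if absent."""
    # First binary search: find the last block whose lower bound <= value.
    block_idx = bisect_right(lower_bounds, value) - 1
    if block_idx < 0:
        return None
    keys, rows = blocks[block_idx]
    # Second binary search: find the row inside that block.
    pos = bisect_left(keys, value)
    if pos < len(keys) and keys[pos] == value:
        return rows[pos]
    return None


records = [{"id": i, "value": i * i} for i in (5, 1, 9, 3, 7)]
blocks, lower_bounds = build_blocks(records, key="id", num_blocks=2)
print(lookup(blocks, lower_bounds, 7))
print(lookup(blocks, lower_bounds, 4))
```

In the distributed setting, the blocks would be served by different worker actors, so the first search also decides which worker receives the query.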

The key must be unique in the dataset. If the dataset contains duplicate keys, a lookup on a duplicated key returns an arbitrary value.

This is only supported for Arrow-format datasets.

Note

This operation triggers execution of the lazy transformations performed on this dataset.

Parameters:
  • key – The key column over which records can be queried.

  • num_workers – The number of actors to use to serve random access queries. Defaults to four times the number of Ray nodes in the cluster. As a rule of thumb, expect each worker to serve roughly 3,000 records per second via get_async() and roughly 10,000 records per second via multiget().
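The rule of thumb for num_workers can be turned into a rough sizing estimate. The sketch below is back-of-the-envelope arithmetic only: the function names and the constants' names are hypothetical, the per-worker rates are the approximate figures quoted above, and real throughput depends on record size, hardware, and cluster load.

```python
import math

# Approximate per-worker rates quoted in the parameter description.
GET_ASYNC_RECORDS_PER_SEC = 3_000
MULTIGET_RECORDS_PER_SEC = 10_000


def default_num_workers(num_nodes: int) -> int:
    """Default described above: four workers per Ray node."""
    return 4 * num_nodes


def workers_for_throughput(target_records_per_sec: float,
                           per_worker_records_per_sec: float) -> int:
    """Estimate how many workers a target lookup rate would need."""
    return max(1, math.ceil(target_records_per_sec / per_worker_records_per_sec))


# Example: a 3-node cluster and a target of 25,000 lookups per second.
print(default_num_workers(3))
print(workers_for_throughput(25_000, GET_ASYNC_RECORDS_PER_SEC))
print(workers_for_throughput(25_000, MULTIGET_RECORDS_PER_SEC))
```

An estimate obtained this way could be passed as num_workers when the default of four workers per node looks too low or too high for the expected query load.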