sklearn Ray Backend API (Experimental)

Warning

Support for running scikit-learn on Ray is an experimental feature, so it may be changed at any time without warning. If you encounter any bugs/shortcomings/incompatibilities, please file an issue on GitHub. Contributions are always welcome!

Ray supports running distributed scikit-learn programs by implementing a Ray backend for joblib using Ray Actors instead of local processes. This makes it easy to scale existing applications that use scikit-learn from a single node to a cluster.

Quickstart

To get started, first install Ray, then use from ray.experimental.joblib import register_ray and run register_ray(). This will register Ray as a joblib backend for scikit-learn to use. Then run your original scikit-learn code inside with joblib.parallel_backend('ray'). This will start a local Ray cluster. See the Run on a Cluster section below for instructions to run on a multi-node Ray cluster instead.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10)

import joblib
from ray.experimental.joblib import register_ray
register_ray()
with joblib.parallel_backend('ray'):
    search.fit(digits.data, digits.target)

Run on a Cluster

This section assumes that you have a running Ray cluster. To start a Ray cluster, please refer to the cluster setup instructions.

To connect a scikit-learn to a running Ray cluster, you have to specify the address of the head node by setting the RAY_ADDRESS environment variable.

You can also start Ray manually by calling ray.init() (with any of its supported configuration options) before calling with joblib.parallel_backend('ray').

Warning

If you do not set the RAY_ADDRESS environment variable and do not provide address in ray.init(address=<address>) then scikit-learn will run on a SINGLE node!