Speed Up Model Training: Distributed XGBoost with Ray

Amir Imani
Artificial Intelligence in Plain English
3 min read · Feb 6, 2023

In a previous post, I showed how parallelizing your training job across all your CPU cores can speed up training. In this post, I compare those results against Ray, a general-purpose distributed execution framework, to scale the required computations with little change to the original code.

From the previous post, I am using the exact tree method with n_jobs=-1, as these were the fastest parameters.

Ray backend for joblib

Ray supports high-level parallelism of scikit-learn models by acting as a backend for joblib. The difference from the previous approach is that Ray uses Ray Actors instead of local processes.

In this experiment, I am using a single-node local Ray cluster without GPUs to run the training job. All I need to do is run my scikit-learn code inside a with joblib.parallel_backend('ray'): block (for the full implementation, check the Ray notebook on my GitHub).

import time
from itertools import repeat

import joblib
import xgboost as xgb
from ray.util.joblib import register_ray
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from tqdm import tqdm

register_ray()  # make the 'ray' backend available to joblib

# n = -1 and t = 'exact' (the fastest settings from the previous post);
# num_exp, exp_time, X_train and y_train are defined earlier in the notebook.
for i in tqdm(repeat(1, num_exp), total=num_exp):
    xgb_model = xgb.XGBClassifier(n_jobs=n,
                                  tree_method=t,
                                  n_estimators=100,
                                  random_state=42)

    text_clf = Pipeline([
        ('vect', CountVectorizer(lowercase=False, ngram_range=(1, 2))),
        ('clf', xgb_model)
    ])

    with joblib.parallel_backend('ray'):
        start = time.time()
        text_clf.fit(X_train, y_train)
        end = time.time()
        exp_time.append(end - start)

Distributed XGBoost using Ray AI Runtime

With Ray, you also have the option to train your model using xgboost_ray directly, or to use the Ray AI Runtime (AIR).

xgboost_ray provides an interface to run XGBoost training and prediction jobs on a Ray cluster and allows you to use distributed data representations.
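As a rough illustration, here is a minimal sketch of what direct xgboost_ray training looks like; the dataset and parameter values are placeholders, not the ones from my experiment:

from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

# Placeholder data; any feature matrix / label pair would do here.
X, y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(X, y)

# Train across 2 Ray actors, each using 2 CPUs.
bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    num_boost_round=100,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),
)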

AIR enables simple scaling of individual workloads, end-to-end workflows, and popular ecosystem frameworks, all in Python.
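With AIR, training goes through the XGBoostTrainer instead. The following is a minimal sketch, assuming Ray 2.x, with a toy in-memory dataset standing in for the real one:

import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Toy dataset; in practice this would be the vectorized text data.
train_ds = ray.data.from_items(
    [{"f0": float(i), "f1": float(i % 7), "label": i % 2} for i in range(1000)]
)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="label",
    params={"objective": "binary:logistic", "tree_method": "approx"},
    datasets={"train": train_ds},
    num_boost_round=100,
)
result = trainer.fit()
print(result.metrics)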

AIR ecosystem (source)

I couldn’t use Ray’s own text vectorization preprocessor at this time (I’m waiting for their response here). As a workaround, I vectorized the text using scikit-learn’s CountVectorizer() first, then passed the result to my model, roughly as sketched below.
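A minimal sketch of that workaround; the densification step and column names are my own illustration, not the exact code from the notebook:

import pandas as pd
import ray
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(lowercase=False, ngram_range=(1, 2))
X_vec = vect.fit_transform(X_train)  # sparse document-term matrix

# Densifying the sparse matrix to build a Ray Dataset is memory-hungry,
# which is what pushes the object store over its limit below.
df = pd.DataFrame(X_vec.toarray(), columns=vect.get_feature_names_out())
df["label"] = list(y_train)
train_ds = ray.data.from_pandas(df)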

Doing so quickly hits the object store memory limit, and objects are spilled to disk. By default, objects are spilled to Ray’s temporary directory in the local filesystem.

Ray takes care of memory management, including handling the spillage.

This makes the process of vectorizing the text and creating the appropriate Ray Dataset rather long (about 12 minutes). The good news is that this is a one-time operation in my code. I tried to increase the object store size, but Ray strongly recommends not doing so on a Mac and keeping the default value of 2 GiB:

ValueError: The configured object store size (7.451GiB) exceeds the optimal size on Mac (2.0GiB). This will harm performance! There is a known issue where Ray's performance degrades with object store size greater than 2.0GB on a Mac. To reduce the object store capacity, specify `object_store_memory` when calling ray.init() or ray start. To ignore this warning, set RAY_ENABLE_MAC_LARGE_OBJECT_STORE=1.
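As the warning notes, the object store capacity is controlled by the object_store_memory argument to ray.init(); a minimal sketch of how that would look (the value here is illustrative):

import ray

# Ray recommends keeping the default ~2 GiB object store on a Mac;
# pass the size in bytes if you do want to override it.
ray.init(object_store_memory=2 * 1024**3)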

Conclusion

  • Using Ray as the joblib backend saves ~7% of training time compared to using all CPU cores via scikit-learn's n_jobs parameter.
  • An interesting observation is that the training times with Ray as the backend are more robust, i.e. there are no outlier times.
  • Avoid small tasks, where a Ray program is actually slower than the sequential one. Every task invocation has non-trivial overhead (e.g., scheduling, inter-process communication, updating the system state), and for tiny tasks this overhead dominates the actual time it takes to execute the task (see the sketch below).
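To make that last point concrete, here is a tiny, self-contained sketch (not from the original experiments) contrasting a trivial task run through Ray against the same work done locally:

import time
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def tiny(x):
    return x + 1  # trivial work: the per-task overhead dominates

start = time.time()
ray.get([tiny.remote(i) for i in range(10_000)])  # pays scheduling/IPC cost per call
print("ray   :", time.time() - start)

start = time.time()
_ = [i + 1 for i in range(10_000)]  # no per-call overhead
print("local :", time.time() - start)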
