Speed Up Model Training: Distributed XGBoost with Ray

Amir Imani
Artificial Intelligence in Plain English
3 min read · Feb 6, 2023

In a previous post, I showed how parallelizing your training job across all your CPU cores can speed up training. In this post, I compare those results against Ray, a general-purpose distributed execution framework, to scale the required computations with little change to the original code.

From the previous post, I am using the exact tree method with n_jobs=-1, as these were the fastest parameters.

Ray backend for joblib

Ray supports high-level parallelism of scikit-learn models by acting as a backend for joblib. The difference from the previous approach is that Ray uses Ray Actors instead of local processes.

In this experiment, I am using a single-node local Ray cluster without GPUs to run the training job. All I need to do is run my scikit-learn code inside a with joblib.parallel_backend('ray'): block (for the full implementation, check the Ray notebook on my GitHub).

import time
from itertools import repeat

import joblib
import xgboost as xgb
from ray.util.joblib import register_ray
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from tqdm import tqdm

register_ray()  # make the 'ray' backend available to joblib

# n = -1 and t = 'exact' (the fastest settings from the previous post);
# num_exp, exp_time, X_train and y_train are defined earlier in the notebook.
for i in tqdm(repeat(1, num_exp), total=num_exp):
    xgb_model = xgb.XGBClassifier(n_jobs=n,
                                  tree_method=t,
                                  n_estimators=100,
                                  random_state=42)

    text_clf = Pipeline([
        ('vect', CountVectorizer(lowercase=False, ngram_range=(1, 2))),
        ('clf', xgb_model)
    ])

    with joblib.parallel_backend('ray'):
        start = time.time()
        text_clf.fit(X_train, y_train)
        end = time.time()
        exp_time.append(end - start)

Distributed XGBoost using Ray AI Runtime

With Ray, you also have the option to train your model using xgboost_ray directly, or to use the Ray AI Runtime (AIR).

xgboost_ray provides an interface to run XGBoost training and prediction jobs on a Ray cluster and allows you to use distributed data representations.
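As a rough illustration, here is a minimal sketch of what direct xgboost_ray training looks like; the dataset and parameter values are placeholders, not the ones from my experiment:

from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

# Placeholder data; any feature matrix / label pair would do here.
X, y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(X, y)

# Train across 2 Ray actors, each using 2 CPUs.
bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    num_boost_round=100,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),
)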

AIR enables simple scaling of individual workloads, end-to-end workflows, and popular ecosystem frameworks, all in Python.
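With AIR, training goes through the XGBoostTrainer instead. The following is a minimal sketch, assuming Ray 2.x, with a toy in-memory dataset standing in for the real one:

import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Toy dataset; in practice this would be the vectorized text data.
train_ds = ray.data.from_items(
    [{"f0": float(i), "f1": float(i % 7), "label": i % 2} for i in range(1000)]
)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="label",
    params={"objective": "binary:logistic", "tree_method": "approx"},
    datasets={"train": train_ds},
    num_boost_round=100,
)
result = trainer.fit()
print(result.metrics)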

AIR ecosystem (source)

I couldn’t use Ray’s own text vectorization preprocessor at this time (I’m waiting for their response here). As a workaround, I vectorized the text using scikit-learn’s CountVectorizer() first, then passed the result to my model, roughly as sketched below.
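A minimal sketch of that workaround; the densification step and column names are my own illustration, not the exact code from the notebook:

import pandas as pd
import ray
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(lowercase=False, ngram_range=(1, 2))
X_vec = vect.fit_transform(X_train)  # sparse document-term matrix

# Densifying the sparse matrix to build a Ray Dataset is memory-hungry,
# which is what pushes the object store over its limit below.
df = pd.DataFrame(X_vec.toarray(), columns=vect.get_feature_names_out())
df["label"] = list(y_train)
train_ds = ray.data.from_pandas(df)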

Doing so quickly hits the object store memory limit, and objects are spilled to disk. By default, objects are spilled to Ray’s temporary directory in the local filesystem.

Ray takes care of memory management, including handling the spillage.

This makes the process of vectorizing the text and creating the appropriate Ray Dataset rather long (about 12 minutes). The good news is that this is a one-time operation in my code. I tried to increase the object store size, but Ray strongly recommends not doing so on a Mac and keeping the default value of 2 GiB:

ValueError: The configured object store size (7.451GiB) exceeds the optimal size on Mac (2.0GiB). This will harm performance! There is a known issue where Ray's performance degrades with object store size greater than 2.0GB on a Mac. To reduce the object store capacity, specify `object_store_memory` when calling ray.init() or ray start. To ignore this warning, set RAY_ENABLE_MAC_LARGE_OBJECT_STORE=1.
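As the warning notes, the object store capacity is controlled by the object_store_memory argument to ray.init(); a minimal sketch of how that would look (the value here is illustrative):

import ray

# Ray recommends keeping the default ~2 GiB object store on a Mac;
# pass the size in bytes if you do want to override it.
ray.init(object_store_memory=2 * 1024**3)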

Conclusion

  • Using Ray as the joblib backend saves ~7% of training time compared to using all CPU cores via scikit-learn's n_jobs parameter.
  • An interesting observation is that the training times with Ray as the backend are more robust, i.e. there are no outlier times.
  • Avoid small tasks, where a Ray program is actually slower than the sequential one. Every task invocation has non-trivial overhead (e.g., scheduling, inter-process communication, updating the system state), and for tiny tasks this overhead dominates the actual time it takes to execute the task (see the sketch below).
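To make that last point concrete, here is a tiny, self-contained sketch (not from the original experiments) contrasting a trivial task run through Ray against the same work done locally:

import time
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def tiny(x):
    return x + 1  # trivial work: the per-task overhead dominates

start = time.time()
ray.get([tiny.remote(i) for i in range(10_000)])  # pays scheduling/IPC cost per call
print("ray   :", time.time() - start)

start = time.time()
_ = [i + 1 for i in range(10_000)]  # no per-call overhead
print("local :", time.time() - start)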
