
Ray XGBoost #55

Open

szilard opened this issue Dec 16, 2021 · 8 comments

szilard commented Dec 16, 2021

https://docs.ray.io/en/latest/xgboost-ray.html


szilard commented Dec 16, 2021

m5.4xlarge, 16 vCPUs (8 physical cores + 8 hyperthreads)

1M rows

integer encoding of the categorical features, for simplicity
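
The intenc files used below are already integer-encoded; for reference, a minimal sketch of how such an encoding can be produced with sklearn's OrdinalEncoder (the toy column names are hypothetical, not the benchmark's):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy frame standing in for raw categorical data
d = pd.DataFrame({"carrier": ["AA", "UA", "AA"],
                  "origin":  ["SFO", "ORD", "SFO"]})

# replace each category with an integer code, column by column
enc = OrdinalEncoder()
d[["carrier", "origin"]] = enc.fit_transform(d[["carrier", "origin"]])
print(d)   # carrier/origin are now 0-based integer codes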


szilard commented Dec 16, 2021

XGBoost Ray setup:

sudo docker run --rm --shm-size=20gb -ti -p 8787:8787 continuumio/anaconda3 /bin/bash

pip3 install -U xgboost_ray

ipython

--shm-size=20gb is needed to avoid this warning:

2021-12-16 12:32:27,069 WARNING services.py:1816 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
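
The Docker flag fixes the root cause (a small /dev/shm); Ray can also be told explicitly how large an object store to reserve when initialized from Python. A sketch, assuming Ray is initialized manually before training (object_store_memory is a standard ray.init argument; the 20 GB figure just mirrors the flag above):

import ray

# reserve ~20 GB for the object store; with a tiny /dev/shm Ray will
# still fall back to /tmp, so --shm-size remains the actual fix
ray.init(object_store_memory=20 * 1024**3)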


szilard commented Dec 16, 2021

plain XGBoost (without Ray):

import pandas as pd
from sklearn import metrics

import xgboost as xgb


# airline delay data, 1M train / 100K test, categoricals already integer-encoded
d_train = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/train-1m-intenc.csv")
d_test = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/test-1m-intenc.csv")

# last column is the 0/1 label, the rest are features
X_train = d_train.iloc[:, :-1].to_numpy()
y_train = d_train.iloc[:, -1:].to_numpy()
X_test = d_test.iloc[:, :-1].to_numpy()
y_test = d_test.iloc[:, -1:].to_numpy()


dxgb_train = xgb.DMatrix(X_train, label=y_train)
dxgb_test = xgb.DMatrix(X_test)

param = {'max_depth': 10, 'eta': 0.1, 'objective': 'binary:logistic', 'tree_method': 'hist'}
%time md = xgb.train(param, dxgb_train, num_boost_round=100)

y_pred = md.predict(dxgb_test)
print(metrics.roc_auc_score(y_test, y_pred))
Wall time: 2.99 s
0.7527781837199401


szilard commented Dec 16, 2021

XGBoost with Ray:

import pandas as pd
from sklearn import metrics

import xgboost as xgb
import xgboost_ray as xgb_ray


d_train = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/train-1m-intenc.csv")
d_test = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/test-1m-intenc.csv")

X_train = d_train.iloc[:, :-1].to_numpy()
y_train = d_train.iloc[:, -1:].to_numpy().ravel()   # RayDMatrix expects a 1-D label array
X_test = d_test.iloc[:, :-1].to_numpy()
y_test = d_test.iloc[:, -1:].to_numpy().ravel()


# the training data is wrapped in a RayDMatrix so it can be sharded across actors
dxgb_ray_train = xgb_ray.RayDMatrix(X_train, y_train)
dxgb_test = xgb.DMatrix(X_test)

param = {'max_depth': 10, 'eta': 0.1, 'objective': 'binary:logistic', 'tree_method': 'hist'}
ray_params = xgb_ray.RayParams(num_actors=2, cpus_per_actor=2)

%time md = xgb_ray.train(param, dxgb_ray_train, ray_params=ray_params, num_boost_round=100)

y_pred = md.predict(dxgb_test)
print(metrics.roc_auc_score(y_test, y_pred))
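
Prediction above goes through a plain xgb.DMatrix on the driver; xgboost_ray also provides a distributed predict that takes a RayDMatrix. A sketch reusing md, X_test and ray_params from above (not how the numbers below were produced):

# score the test set across the Ray actors instead of on the driver
dxgb_ray_test = xgb_ray.RayDMatrix(X_test)
y_pred_dist = xgb_ray.predict(md, dxgb_ray_test, ray_params=ray_params)
print(metrics.roc_auc_score(y_test, y_pred_dist))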



szilard commented Dec 16, 2021

In [17]: %time md = xgb_ray.train(param, dxgb_ray_train, ray_params = ray_params, num_boost_round = 100)
2021-12-16 13:13:57,508 INFO main.py:971 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
2021-12-16 13:13:58,646 INFO main.py:1016 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=1089) [13:13:58] task [xgboost.ray]:140217036009280 got new rank 0
(_RemoteRayXGBoostActor pid=1096) [13:13:58] task [xgboost.ray]:140405006025248 got new rank 1
(_RemoteRayXGBoostActor pid=1089) [13:13:59] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(_RemoteRayXGBoostActor pid=1096) [13:13:59] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
2021-12-16 13:14:11,726 INFO main.py:1495 -- [RayXGBoost] Finished XGBoost training on training data with total N=1,000,000 in 14.51 seconds (13.07 pure XGBoost training time).
CPU times: user 1.18 s, sys: 556 ms, total: 1.74 s
Wall time: 16.9 s


In [18]: y_pred = md.predict(dxgb_test)

In [19]: print(metrics.roc_auc_score(y_test, y_pred))
0.7527781837199401


szilard commented Dec 16, 2021

Changing the number of actors and threads ("CPUs per actor"):

n_actors   n_threads   Time [s]
4          1           18.9
8          1           19.5
16         1           22.9
4          4           15.8
1          16          12.1
2          2           16.9
1          2           17.7
2          1           20.2
no ray     (16)        3.0
1          1           24.6
no ray     1           15.1

("no ray" means plain XGBoost)


szilard commented Dec 16, 2021

10M rows:

d_train = pd.read_csv("https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/train-10m-intenc.csv")
d_test = pd.read_csv("https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/test-10m-intenc.csv")

n_actors   n_threads   Time [s]
16         1           48.8
1          16          37.0
4          4           37.8
no ray     (16)        22.9


szilard commented Dec 16, 2021

Also, RAM usage during training:

plain XGBoost: 2.8 GB
Ray: 10.8 GB
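
One way to measure this (a sketch with psutil, not how the figures above were obtained) is to poll system-wide memory use from a background thread while train() runs and report the peak over the pre-training baseline:

import threading, time
import psutil

peak = {"used_gb": 0.0}
stop = threading.Event()

def poll():
    # track peak system-wide memory use, which also catches the Ray actor processes
    while not stop.is_set():
        peak["used_gb"] = max(peak["used_gb"], psutil.virtual_memory().used / 1024**3)
        time.sleep(0.5)

baseline_gb = psutil.virtual_memory().used / 1024**3
t = threading.Thread(target=poll)
t.start()
xgb_ray.train(param, dxgb_ray_train, ray_params=ray_params, num_boost_round=100)
stop.set()
t.join()
print(round(peak["used_gb"] - baseline_gb, 1), "GB used during training")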
