Snowpark XGBRegressor Ignores Sample Weights, Producing Identical Predictions for Different Models #111

robertlessmore opened this issue Jul 29, 2024 · 1 comment
Labels: bug


  What version of Python are you using?

Python 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]

  What are the component versions in the environment?

  1. What did you do?
    from import XGBRegressor
    from snowflake.snowpark.functions import col, random, sin, when, lit
    from utils import get_session

session = get_session.session()

N = 10**5

df = session.range(1, N).to_df("ind").with_column(
"x_0", ((random() % _ONE_MILLION)/_ONE_MILLION)

df = df

df = df.with_columns(["weights1","weights2","weights3"],[lit(1.0),when(col("ind") < lit(N / 10), 1.0).otherwise(0.0),when(col("ind") > lit(N / 10), 1.0).otherwise(0.0)])

df = df.with_column(
when(col("ind") < lit(N / 10), 1.0).otherwise(0.0) * col("x_0") +
when(col("ind") > lit(N / 10), 1.0).otherwise(0.0) * sin(10*col("x_0"))

parameters = {

model1 = XGBRegressor(
output_cols= ["PREDICTION1"],

model2 = XGBRegressor(
output_cols= ["PREDICTION2"],

model3 = XGBRegressor(
output_cols= ["PREDICTION3"],


models = [model1, model2, model3]
for m in models:

test = session.range(-1, 1,0.01).to_df("X_0").with_column(

for m in models:
test = m.predict(test)

test_snow = test.toPandas()

0 -1.00 0.544021 0.515664 0.515664 0.515664
1 -0.99 0.457536 0.405519 0.405519 0.405519
2 -0.98 0.366479 0.183660 0.183660 0.183660
3 -0.97 0.271761 0.211220 0.211220 0.211220
4 -0.96 0.174327 0.039056 0.039056 0.039056
.. ... ... ... ... ...
195 0.95 -0.075151 0.047328 0.047328 0.047328
196 0.96 -0.174327 -0.060364 -0.060364 -0.060364
197 0.97 -0.271761 0.034832 0.034832 0.034832
198 0.98 -0.366479 -0.278535 -0.278535 -0.278535
199 0.99 -0.457536 -0.390598 -0.390598 -0.390598
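
A minimal sketch of how the identical columns can be confirmed programmatically. It uses only the ten rows printed above, copied by hand into a list. The tuple layout and the helper name `predictions_identical` are my own choices for illustration, not part of Snowpark.

```python
# Rows copied from the printed output above. Each tuple is
# (X_0, third printed column, PREDICTION1, PREDICTION2, PREDICTION3).
rows = [
    (-1.00, 0.544021, 0.515664, 0.515664, 0.515664),
    (-0.99, 0.457536, 0.405519, 0.405519, 0.405519),
    (-0.98, 0.366479, 0.183660, 0.183660, 0.183660),
    (-0.97, 0.271761, 0.211220, 0.211220, 0.211220),
    (-0.96, 0.174327, 0.039056, 0.039056, 0.039056),
    (0.95, -0.075151, 0.047328, 0.047328, 0.047328),
    (0.96, -0.174327, -0.060364, -0.060364, -0.060364),
    (0.97, -0.271761, 0.034832, 0.034832, 0.034832),
    (0.98, -0.366479, -0.278535, -0.278535, -0.278535),
    (0.99, -0.457536, -0.390598, -0.390598, -0.390598),
]


def predictions_identical(rows, tol=0.0):
    """Return True if the three prediction columns agree on every row."""
    return all(
        abs(p1 - p2) <= tol and abs(p1 - p3) <= tol
        for _, _, p1, p2, p3 in rows
    )


# True here means the three models gave the same prediction on every row shown.
identical = predictions_identical(rows)
print(identical)
```

If the sample weights were being applied, I would expect this check to fail for at least some rows.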

  2. What did you expect to see?
    I expected different models to produce different predictions due to the varying sample weights (weights1, weights2, weights3). Specifically:
  • PREDICTION1 should reflect a model trained on the entire dataset equally.
  • PREDICTION2 should reflect a model trained effectively only on the first 10,000 samples (weight 1.0, all others 0.0), which follow a linear pattern.
  • PREDICTION3 should reflect a model trained effectively only on the samples beyond 10,000 (weight 1.0, all others 0.0), which follow a sinusoidal pattern.

However, the Snowflake Snowpark implementation of XGBRegressor seems to ignore the sample weights, resulting in identical predictions for all models. Running a similar experiment directly with the standard xgboost library outside of Snowflake results in distinct linear and sinusoidal predictions for model2 and model3, respectively.
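
For reference, the comparison against standard xgboost can be sketched roughly as follows. This is not the exact script I ran, only an outline under stated assumptions: `make_data` and `fit_and_predict` are hypothetical helper names, `x_0` is drawn uniformly from (-1, 1) to match the test range, and the models use default hyperparameters because the `parameters` dict is not shown in this report. The fitting step needs numpy and xgboost installed locally.

```python
import math
import random


def make_data(n, seed=0):
    """Build x, y and the three weight vectors described in the report.

    Mirrors session.range(1, n): ind runs from 1 to n - 1.
    Assumption: x_0 is uniform on (-1, 1).
    """
    rng = random.Random(seed)
    cutoff = n / 10
    x, y, w1, w2, w3 = [], [], [], [], []
    for ind in range(1, n):
        x0 = rng.uniform(-1.0, 1.0)
        first = 1.0 if ind < cutoff else 0.0  # linear segment
        rest = 1.0 if ind > cutoff else 0.0   # sinusoidal segment
        x.append(x0)
        y.append(first * x0 + rest * math.sin(10 * x0))
        w1.append(1.0)    # weights1: every sample counts equally
        w2.append(first)  # weights2: only the linear segment
        w3.append(rest)   # weights3: only the sinusoidal segment
    return x, y, (w1, w2, w3)


def fit_and_predict(x, y, weights, x_test):
    """Fit one standard XGBRegressor per weight vector and predict on x_test."""
    # Imported lazily so the data helper above works without these packages.
    import numpy as np
    from xgboost import XGBRegressor

    X = np.asarray(x).reshape(-1, 1)
    X_test = np.asarray(x_test).reshape(-1, 1)
    preds = []
    for w in weights:
        model = XGBRegressor()  # default hyperparameters (assumption)
        model.fit(X, np.asarray(y), sample_weight=np.asarray(w))
        preds.append(model.predict(X_test))
    return preds
```

Calling `fit_and_predict(x, y, weights, x_test)` on data from `make_data(10**5)` returns one prediction array per weight vector. With standard xgboost these arrays differ, which is the behavior I expected from the Snowpark version as well.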

Thank you for reporting this issue. I was able to reproduce it on my end using your example, and we will investigate it as a bug.

sfc-gh-afero added the bug label on Jul 31, 2024
