GH-45190: [C++][Compute] Add rank_quantile function #45259

Open
wants to merge 7 commits into
base: main
from

Conversation

pitrou
Member

@pitrou pitrou commented Jan 14, 2025

Rationale for this change

Add a "rank_quantile" function following the Wikipedia definition:
https://en.wikipedia.org/wiki/Percentile_rank
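
For readers unfamiliar with the definition, here is a minimal standalone sketch of the Wikipedia formula, (count of values strictly less + 0.5 * count of equal values) / N, scaled by a factor. It is not the kernel added by this PR: `QuantileRanks` is a hypothetical helper name, and the sketch assumes plain doubles with no nulls and no NaN values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not Arrow's API): percentile rank of every input value,
// computed as (c_less + 0.5 * f_equal) / N * factor.
// Assumes the input contains no NaN values; nulls are not modeled at all.
std::vector<double> QuantileRanks(const std::vector<double>& values,
                                  double factor = 1.0) {
  const std::size_t n = values.size();

  // Sort indices rather than values so results can be written back
  // in the original input order.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

  std::vector<double> out(n);
  std::size_t i = 0;
  while (i < n) {
    // Find the end of the run of values equal to values[order[i]].
    std::size_t j = i + 1;
    while (j < n && values[order[j]] == values[order[i]]) ++j;

    // i is the number of strictly smaller values, (j - i) the tie frequency.
    const double rank =
        (static_cast<double>(i) + 0.5 * static_cast<double>(j - i)) /
        static_cast<double>(n);
    for (std::size_t k = i; k < j; ++k) out[order[k]] = rank * factor;
    i = j;
  }
  assert(out.size() == n);
  return out;
}
```

With the input {1, 2, 2, 3} and the default factor of 1.0, this sketch yields 0.125, 0.5, 0.5, 0.875; with a factor of 100.0 it yields 12.5, 50, 50, 87.5. Tied values share one rank because the formula counts half of the tie group as "below".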

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, an additional compute function.

@pitrou pitrou marked this pull request as ready for review January 14, 2025 17:55
@pitrou
Member Author

pitrou commented Jan 14, 2025

@zanmato1984 @jorisvandenbossche Do you want to take a look at this?

Naming-wise, I hesitated between "rank_percentile" and "percentile_rank". I chose the former for better tab-completion.

@zanmato1984
Contributor

zanmato1984 commented Jan 15, 2025

If the factor is free to specify, is the name rank_quantile more accurate? (BTW I prefer rank_xxx to xxx_rank).

Update 1: Ah I just see factor only affects the output scale, so it doesn't change the sense of "percentile". Please ignore my previous question.

Update 2: Oops, sorry, I think my original question remains: if factor determines "what quantile" the output is, as implied by the comment on factor:

/// Use 1.0 for results in (0, 1), 100.0 for percentages, etc.

, then shall we name it rank_quantile? Or is rank_percentile so much more commonly used that no one will misinterpret it, even though the factor can change the meaning of "percentile"?

@pitrou
Member Author

pitrou commented Jan 15, 2025

rank_quantile sounds ok to me. I find occurrences of "quantile rank" on Google, so it's not unheard of.

@zanmato1984
Contributor

Hi @pitrou, I want to suggest a refinement that simplifies the code structure of your current PR. I think the best way to show it is via actual code changes, so may I open a PR targeting your branch? Thanks.

@pitrou
Member Author

pitrou commented Jan 17, 2025

Yes, you can!

@zanmato1984
Contributor

pitrou#13 filed. @pitrou PTAL.

@pitrou pitrou force-pushed the percentile-rank-kernel branch from af782d4 to 4fd918a Compare January 20, 2025 09:40
@zanmato1984
Contributor

Seems the latest force push doesn't include the changes I've made. Anything wrong? :)

@pitrou
Member Author

pitrou commented Jan 20, 2025

Hmm, that's weird. I might have messed things up, let me take a look...

@pitrou pitrou force-pushed the percentile-rank-kernel branch from 4fd918a to 880b7a5 Compare January 20, 2025 10:28
@pitrou
Member Author

pitrou commented Jan 20, 2025

Ok, this is fixed :)

@pitrou pitrou changed the title GH-45190: [C++][Compute] Add rank_percentile function GH-45190: [C++][Compute] Add rank_quantile function Jan 20, 2025
@pitrou pitrou force-pushed the percentile-rank-kernel branch from 880b7a5 to 5cc6b50 Compare January 20, 2025 15:46
@pitrou pitrou requested review from zanmato1984 and WillAyd January 20, 2025 15:47
@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jan 20, 2025
@pitrou pitrou force-pushed the percentile-rank-kernel branch from 5cc6b50 to 8c59c76 Compare January 20, 2025 15:55
Contributor

@zanmato1984 zanmato1984 left a comment

LGTM with two nits.

cpp/src/arrow/compute/kernels/vector_sort_test.cc (outdated, resolved)
cpp/src/arrow/compute/kernels/vector_sort_test.cc (outdated, resolved)
@pitrou pitrou force-pushed the percentile-rank-kernel branch from 0dc9ebf to 3997d5f Compare January 21, 2025 09:18
@zanmato1984
Contributor

This is good to go at any time.

@pitrou
Member Author

pitrou commented Jan 22, 2025

Yes, I was hoping for opinions on the general idea and API, but perhaps we can simply merge.

@WillAyd
Contributor

WillAyd commented Jan 22, 2025

Sorry I did not comment sooner, but I think the idea and API are great. If anything, it could be simpler to remove the factor and just leave it to the end user to specify that, but I don't think it's a big deal either way.

@zanmato1984
Contributor

You can keep this open as long as you feel necessary for more ideas or comments. I was just saying I have no more to comment :)

static auto kRankQuantileOptionsType = GetFunctionOptionsType<RankQuantileOptions>(
    DataMember("sort_keys", &RankQuantileOptions::sort_keys),
    DataMember("null_placement", &RankQuantileOptions::null_placement),
    DataMember("factor", &RankQuantileOptions::factor));
Contributor

Since we do have time, maybe it's worth removing the factor to simplify? What is the advantage of including it in the QuantileRanker class versus returning something in the range 0..1 and having the user supply the factor if they need it?

I figure the advantage to taking the factor out would be to simplify the implementation and give more room for vectorized multiplication applications
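
To make the alternative concrete, here is a minimal sketch of what applying the factor on the caller's side could look like, assuming the ranking step already returned values in the range 0..1. `ScaleRanks` is a hypothetical name used only for illustration; nothing here is taken from the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: scale unit-interval quantile ranks by a caller-chosen
// factor (for example 100.0 for percentages). The loop is a plain elementwise
// multiplication, the kind of code a compiler can typically auto-vectorize.
std::vector<double> ScaleRanks(const std::vector<double>& unit_ranks,
                               double factor) {
  std::vector<double> scaled(unit_ranks.size());
  for (std::size_t i = 0; i < unit_ranks.size(); ++i) {
    scaled[i] = unit_ranks[i] * factor;
  }
  assert(scaled.size() == unit_ranks.size());
  return scaled;
}
```

Under this design the ranking code only needs to produce one output scale, and the scaling becomes a separate, independently optimizable step.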

Member Author

Yes, we can probably simplify this. Multiplication optimizations would be nice as well, but certainly not critical given that sorting and computing the rankings is bound to be much slower than the final multiplication step.

Member Author

@zanmato1984 @cyb70289 @icexelloss Do you think the factor would be useful to keep?
