Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let FieldDef lazy-calculate and cache hash code #892

Merged
merged 3 commits into from
Nov 3, 2023

Conversation

zhaih
Copy link
Contributor

@zhaih zhaih commented Apr 5, 2023

Sometimes _dataSchema could be very complex and makes computing hash code for this class extremely slow. Since all data fields participated in hash code calculation is final, we should able to calculate it at construction time and avoid paying the cost repeatedly.

@zackthehuman
Copy link
Contributor

Thanks for the contribution, Patrick! Based on the usage of FieldDef I think we should understand the implications and tradeoffs with how we compute the hashcode.

Here are a few things to consider:

  • FieldDef is not final, and can be extended. If we precompute the hashcode by calling hashCode() in the constructor, then we'll do it for all subclasses too, even if that isn't the desired behavior.
  • Caching the hashcode will improve performance as long as the same instance is reused -- do we know if that is the usage pattern? If we're creating new FieldDef or derived class instances frequently, then this caching and eager computation may actually negatively impact performance.
  • There are classes like Parameter which extend FieldDef, and the above points may apply here.

@zhaih
Copy link
Contributor Author

zhaih commented Apr 5, 2023

Hi Zack, thank you for the comment:

  1. For child classes, I think usually people either use super.hashCode() as part of the calculation or just do it from scratch, I don't think either way will be a true blocker
  2. For our use case (and I believe for maybe most use cases using RestLi), the ActionRequestBuilder is the only place we're using it, and the key is get from a ResourceSpecImpl class which stores those FieldsDef as long as the service is running. I can't say that's for every use cases, but I do think the use case should be fairly normal.

Regarding the potential cost to the constructor, it can be solved by lazily initialize the _hashCode field, but maybe need some synchronization to avoid multiple concurrent write (but maybe that's acceptable since the hash code should not change?)

Let me know what you think, thanks!

@zackthehuman
Copy link
Contributor

Let me know what you think, thanks!

I chatted with some folks about this internally and our recommendation is to compute the value lazily so that it is computed the first time hashCode() is called. Your point about synchronization is valid, but I'm convinced that it's not a huge problem since it should compute to the same value (as you mentioned).

How does this sound?

@zhaih
Copy link
Contributor Author

zhaih commented Apr 10, 2023

Sure, I just changed it to compute lazily

@zackthehuman zackthehuman added the performance For identifying performance problems or making improvements label Apr 10, 2023
Copy link
Contributor

@zackthehuman zackthehuman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me. We should look out for any performance regressions due to auto-boxing/unboxing from Integer to int. If that happens, we can revisit this.

Comment on lines +131 to +136
if (_hashCode == null) {
// If this method is called by multiple thread, there might be multiple concurrent write
// here, but since the hashCode should be the same it is tolerable
_hashCode = computeHashCode();
}
return _hashCode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this comment here for my future self and/or other maintainers:

We compute the hash code of DataMap lazily using synchronize(this) and a form of double-checked locking:

public int dataComplexHashCode()
{
  if (_dataComplexHashCode != 0)
  {
    return _dataComplexHashCode;
  }

  synchronized (this)
  {
    if (_dataComplexHashCode == 0)
    {
      _dataComplexHashCode = DataComplexHashCode.nextHashCode();
    }
  }

  return _dataComplexHashCode;
}

This approach avoids auto boxing/unboxing between Integer and int but makes the assumption that the computed hash code will never be 0. Doing something like this is also an option we can take in the future if we want to explore this more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar but more general idea maybe just use another boolean to track whether the hashcode is computed. And better maybe using Atomics instead of synchronized block, like

public int hashCode() {
  if (_computed.get() == false) {
    int hashCode = computeHashCode();
    if (_computed.get() == false) {
      _hashCode.compareAndSwap(INIT_VALUE, hashCode);
      _computed.set(true);
    }
  }
  return _hashCode.get();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW here there's little guarantee the write will ever be visible to other threads. Have you considered making _hashCode at least volatile:

private volatile Integer _hashCode;

It preserves the original desired behavior of not wanting a full lock but at least makes it memory-safe.

Alternatively, why not just compute it at construction and make it final? Presumably instances of FieldDef aren't dynamically generated, they come from parsed .pdl and .pdsc schemas right? You could just eat the cost exactly once and forget about it, especially since it will happen very early on and likely completely out of the query serving codepath

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @PapaCharlie , thanks for chiming in:
For the flushing to memory problem: it's valid but since we'll not set it based on it's previous value, essentially all threads will set it to the same value if they cannot see it. So eventually that's not a safety issue but rather a performance issue (we compute it # of thread times rather than once). I like to compute it in ctor better (and I actually first implemented it in this way) but @zackthehuman is worrying about impacting the performance of the ctor so we changed it to lazy later on.

@zhaih zhaih changed the title Let FieldDef pre-calculate and cache hash code Let FieldDef lazy-calculate and cache hash code Apr 12, 2023
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@zackthehuman zackthehuman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the long wait! Let's goooooo!

@zackthehuman zackthehuman merged commit 64faf7e into linkedin:master Nov 3, 2023
2 checks passed
brycezhongqing pushed a commit that referenced this pull request Nov 13, 2023
Sometimes _dataSchema could be very complex and makes computing hash code for this class extremely slow. Since all data fields participated in hash code calculation is final, we should able to calculate it at construction time and avoid paying the cost repeatedly.

* Update FieldDef lazily calculate and cache hash code

---------

Co-authored-by: Patrick Zhai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance For identifying performance problems or making improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants