galp5 performance mode should be tuned #210

Closed
curiousercreative opened this issue Jan 25, 2021 · 14 comments · Fixed by #212

@curiousercreative
Contributor

curiousercreative commented Jan 25, 2021

Distribution (run cat /etc/os-release):

NAME="Pop!_OS"
VERSION="20.10"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 20.10"
VERSION_ID="20.10"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=groovy
UBUNTU_CODENAME=groovy
LOGO=distributor-logo-pop-os

Issue/Bug Description:
With performance power profile selected on AC power for i7-1165G7:

  • TDP boosts to 44-45W for up to 28s (observed with pcm)
  • Thermal throttling at 88C
  • TDP normalizes to 28W
  • all-core frequency 3.6-3.8GHz w/ adequate cooling

Writing MSRs (93C, pl1 45W 99999) results in:

  • TDP boosts to 44-45W indefinitely (as thermals allow)
  • Thermal throttling at 93C (exceeding 95C seems to trigger some cooldown)
  • all-core frequency 4.0-4.1GHz w/ adequate cooling (meets spec)

Other Notes:

@jackpot51
Member

I believe a TDP of 40W is acceptable for the galp5 on performance mode. The limit you see when on DC is to prevent overdrawing the battery. It is critical that this limit remains in place.

@jackpot51 jackpot51 self-assigned this Jan 25, 2021
@curiousercreative
Contributor Author

curiousercreative commented Jan 25, 2021

@jackpot51 are you referring to this with regard to overdrawing the battery? As an end user increasing that limit, am I risking a shorter battery life (in terms of discharge cycles), or risking catastrophic battery failure or board failure?

As for the 40W TDP, without any modifications the galp5 i7 will boost to 44-45W on its own. The increased thermal throttle point of 93C and a boost time window that lasts as long as possible are what I'd like to see added to performance mode.

@jackpot51
Member

Overdrawing can result in lower battery lifespan or system shutdowns. PL2 can stay at 51W as it currently is, but PL1 should probably not exceed 40W, to avoid thermal runaway.

@curiousercreative
Contributor Author

@jackpot51 thanks for those details. Any opposition to greatly expanding the Tau and increasing the thermal throttle point for performance mode?

@jackpot51
Member

If you adjust PL1, tau doesn't need to be adjusted. PL1 is indefinite.

@jackpot51
Member

Thermal throttle at 93C should be fine in performance mode

@curiousercreative
Contributor Author

Here are the default MSR values on a cold boot, connected to AC power, with the balanced power profile:

# undervolt.py --read
temperature target: -12 (88C)
powerlimit: 51.0W (short: 0.0009765625s - enabled) / 28.0W (long: 28.0s - enabled)

# cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
28000000
# cat /sys/class/powercap/intel-rapl:0/constraint_0_max_power_uw
28000000
# cat /sys/class/powercap/intel-rapl\:0/constraint_0_time_window_us
27983872

# cat /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw
51000000
# cat /sys/class/powercap/intel-rapl:0/constraint_1_max_power_uw
0
# cat /sys/class/powercap/intel-rapl\:0/constraint_1_time_window_us
976
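The raw sysfs values above are in microwatts and microseconds, which makes them awkward to compare with the undervolt.py output. A minimal sketch of a helper that converts them to watts and seconds might look like this. The function name `read_rapl_constraints` is made up for illustration, and it assumes the standard powercap file layout (`constraint_N_name`, `constraint_N_power_limit_uw`, `constraint_N_time_window_us`):

```python
from pathlib import Path


def read_rapl_constraints(zone_dir):
    """Return {constraint_name: (limit_watts, window_seconds)} for one powercap zone.

    Hypothetical helper: it only reads files and does not change any limit.
    """
    zone = Path(zone_dir)
    result = {}
    index = 0
    # Constraints are numbered from 0; stop at the first missing index.
    while (zone / f"constraint_{index}_power_limit_uw").exists():
        name_file = zone / f"constraint_{index}_name"
        if name_file.exists():
            name = name_file.read_text().strip()
        else:
            name = f"constraint_{index}"
        limit_uw = int((zone / f"constraint_{index}_power_limit_uw").read_text())
        window_us = int((zone / f"constraint_{index}_time_window_us").read_text())
        # Convert microwatts to watts and microseconds to seconds.
        result[name] = (limit_uw / 1e6, window_us / 1e6)
        index += 1
    return result
```

Calling `read_rapl_constraints("/sys/class/powercap/intel-rapl:0")` on the machine above should report roughly 28 W over about 28 s for the long-term constraint and 51 W over under a millisecond for the short-term one, matching the undervolt.py line.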

And here's what pcm looks like for the first 28 seconds of a heavy workload:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |  TEMP

   0    0     1.18   0.81   1.46    1.46    4347 K   8137 K    0.44    0.54    0.00    0.00     22
   1    0     1.12   0.77   1.46    1.46    4494 K   8441 K    0.44    0.55    0.00    0.00     22
   2    0     1.12   0.77   1.46    1.46    4507 K   8448 K    0.44    0.53    0.00    0.00     14
   3    0     1.12   0.77   1.46    1.46    4454 K   8344 K    0.44    0.53    0.00    0.00     14
   4    0     1.12   0.77   1.46    1.46    4473 K   8378 K    0.44    0.54    0.00    0.00     20
   5    0     1.12   0.77   1.46    1.46    4467 K   8384 K    0.44    0.53    0.00    0.00     20
   6    0     1.11   0.76   1.46    1.46    4461 K   8418 K    0.44    0.54    0.00    0.00     12
   7    0     1.09   0.75   1.46    1.46    4593 K   8402 K    0.43    0.55    0.00    0.00     12
---------------------------------------------------------------------------------------------------------------
 SKT    0     1.12   0.77   1.46    1.46      35 M     66 M    0.44    0.54    0.00    0.00     12
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     1.12   0.77   1.46    1.46      35 M     66 M    0.44    0.54    0.00    0.00     N/A

 Instructions retired:   25 G ; Active cycles:   32 G ; Time (TSC): 2801 Mticks ; C0 (active,non-halted) core residency: 100.00 %

 C1 core residency: 0.00 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
 C0 package residency: 100.00 %; C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %; C8 package residency: 0.00 %; C9 package residency: 0.00 %; C10 package residency: 0.00 %;
                             ┌────────────────────────────────────────────────────────────────────────────────┐
 Core    C-state distribution│00000000000000000000000000000000000000000000000000000000000000000000000000000000│
                             └────────────────────────────────────────────────────────────────────────────────┘
                             ┌────────────────────────────────────────────────────────────────────────────────┐
 Package C-state distribution│00000000000000000000000000000000000000000000000000000000000000000000000000000000│
                             └────────────────────────────────────────────────────────────────────────────────┘

 PHYSICAL CORE IPC                 : 1.54 => corresponds to 38.49 % utilization for cores in active state
 Instructions per nominal CPU cycle: 2.25 => corresponds to 56.16 % core utilization over time interval
 SMI count: 0
---------------------------------------------------------------------------------------------------------------
MEM (GB)->|  READ |  WRITE |   IO   | CPU energy |
---------------------------------------------------------------------------------------------------------------
 SKT   0     0.00     0.00     0.00      43.42              
---------------------------------------------------------------------------------------------------------------
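For readers less familiar with pcm, the sample above can be turned into clock speed and power figures with a little arithmetic. This is only a sketch: it assumes the sample covers pcm's default one-second interval, that AFREQ is the active-state frequency relative to nominal, and that the "CPU energy" column is in joules.

```python
# Figures copied from the pcm sample above.
tsc_mticks = 2801          # "Time (TSC): 2801 Mticks"
afreq_ratio = 1.46         # AFREQ column, relative to nominal frequency
cpu_energy_joules = 43.42  # "CPU energy" column

# ASSUMPTION: the sample covers a one-second interval (pcm's default).
interval_s = 1.0

# The TSC ticks at the nominal rate, so ticks per second give the nominal clock.
nominal_ghz = tsc_mticks / 1000 / interval_s
# AFREQ scales the nominal clock to the clock while cores are active.
active_ghz = afreq_ratio * nominal_ghz
# Energy over the interval gives average package power.
package_watts = cpu_energy_joules / interval_s

print(f"nominal ~{nominal_ghz:.2f} GHz, active ~{active_ghz:.2f} GHz, "
      f"package ~{package_watts:.1f} W")
```

Under those assumptions the numbers line up with the issue description: an all-core clock of about 4.1GHz while the package draws a little over 43W during the boost window.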

@curiousercreative
Contributor Author

curiousercreative commented Jan 25, 2021

If I increase tau (and nothing else) with the command below, it maintains that high TDP of 40+W:

# echo '999999999' > /sys/class/powercap/intel-rapl:0/constraint_0_time_window_us

@jackpot51
Member

I recommend instead adjusting the pl1 to 40W because it is easier to inspect than a tau change. Also, the default tau is useful for defining a short-term power limit as PL2. With tau set so large, these two limits essentially merge together.
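The point about the two limits merging can be illustrated with a deliberately simplified model. It assumes PL1 is enforced against an exponentially weighted moving average of package power with time constant tau. Real firmware behavior may differ, and the 5W starting power and constant 45W burst are made-up illustration values. `seconds_until_pl1_throttle` is a hypothetical name.

```python
import math


def seconds_until_pl1_throttle(pl1_w, burst_w, start_w, tau_s):
    """Seconds for an EWMA of package power to rise from start_w to pl1_w
    while the package draws a constant burst_w.

    Simplified model: avg(t) = burst_w - (burst_w - start_w) * exp(-t / tau_s).
    """
    if burst_w <= pl1_w:
        return math.inf  # the average can never exceed PL1
    if start_w >= pl1_w:
        return 0.0       # already at or above the limit
    return tau_s * math.log((burst_w - start_w) / (burst_w - pl1_w))


# ASSUMPTION: 5 W average before the load starts, constant 45 W burst.
for tau_s in (27.983872, 999.999999):
    t = seconds_until_pl1_throttle(pl1_w=28, burst_w=45, start_w=5, tau_s=tau_s)
    print(f"tau = {tau_s:.1f} s -> PL1 reached after about {t:.0f} s")
```

In this model the default tau lets a 45W burst run for a few tens of seconds before the average reaches 28W, while a tau of about 1000 seconds stretches that to many minutes, so in practice only PL2 and thermals constrain the package.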

Try creating ModelProfiles similar to the lemp9 at the end of src/daemon/profiles.rs. The balanced pl1 is 28, pl2 is 51, tcc_offset is 12. For performance, pl1 could be 40, pl2 51, and tcc_offset 7. For battery, the values from the lemp9 are reasonable.
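To make the suggested values easier to compare, here is a rough sketch of them as a lookup table. It is not the actual ModelProfiles code in src/daemon/profiles.rs; the class and field names are invented, and the 100C TjMax is inferred from the "temperature target: -12 (88C)" reading earlier in the thread. The battery profile is left out because the lemp9 values are not quoted here.

```python
from dataclasses import dataclass

# ASSUMPTION: TjMax of 100C, inferred from "temperature target: -12 (88C)".
TJ_MAX_C = 100


@dataclass(frozen=True)
class PowerProfile:
    """Hypothetical container for the limits discussed in this thread."""
    pl1_w: int         # long-term power limit, watts
    pl2_w: int         # short-term power limit, watts
    tcc_offset_c: int  # degrees below TjMax at which throttling starts

    @property
    def throttle_temp_c(self):
        return TJ_MAX_C - self.tcc_offset_c


GALP5_PROFILES = {
    "balanced": PowerProfile(pl1_w=28, pl2_w=51, tcc_offset_c=12),
    "performance": PowerProfile(pl1_w=40, pl2_w=51, tcc_offset_c=7),
}

for name, profile in GALP5_PROFILES.items():
    print(f"{name}: PL1 {profile.pl1_w} W, PL2 {profile.pl2_w} W, "
          f"throttle at {profile.throttle_temp_c}C")
```

With that TjMax, a tcc_offset of 12 corresponds to the 88C throttle point seen by default, and 7 corresponds to the 93C requested for performance mode.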

@curiousercreative
Contributor Author

@jackpot51 I can try working on a PR for what you describe, but adjusting only the pl1 seems to result in a TDP of 33-37W (just under thermal limits), whereas increasing only the time window results in 40-45W (at thermal limits). I'm new to all this, but I've also noticed that if I increase the pl1 first, without touching the time window, and then increase the time window, I'm still limited to that 35W TDP. This difference in TDP determines whether one can reach max all-core turbo.

@jackpot51
Member

How are you measuring CPU power usage?
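One tool-independent way to measure package power is to sample the RAPL energy counter twice and divide by the elapsed time. The sketch below assumes the standard powercap files `energy_uj` and `max_energy_range_uj`; the function names are made up, and reading `energy_uj` typically requires root on recent kernels.

```python
import time
from pathlib import Path


def watts_from_counters(start_uj, end_uj, elapsed_s, max_range_uj):
    """Average power in watts between two energy_uj readings."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two readings.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / elapsed_s


def sample_package_watts(zone_dir="/sys/class/powercap/intel-rapl:0",
                         interval_s=1.0):
    """Read the package energy counter twice and return average watts."""
    zone = Path(zone_dir)
    max_range_uj = int((zone / "max_energy_range_uj").read_text())
    start_uj = int((zone / "energy_uj").read_text())
    start_t = time.monotonic()
    time.sleep(interval_s)
    end_uj = int((zone / "energy_uj").read_text())
    # Use the measured elapsed time, since sleep() can overshoot.
    elapsed_s = time.monotonic() - start_t
    return watts_from_counters(start_uj, end_uj, elapsed_s, max_range_uj)
```

Calling `sample_package_watts()` in a loop during a heavy workload gives a power-over-time trace that can be compared against what pcm reports.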

@curiousercreative
Contributor Author

@jackpot51
Member

See what measurements you get from this script: https://github.com/system76/ec/blob/master/power.sh

It creates a CSV file called power.csv that stores a range of power and thermal metrics, and it is what we generally use to evaluate fan curves.
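Once power.csv exists, a quick summary of each metric can help when comparing profiles. The column names are not listed in this thread, so the sketch below does not assume any: it treats the first row as a header (an assumption about the file) and reports the mean and maximum of every column whose values all parse as numbers. `summarize_numeric_columns` is a hypothetical helper.

```python
import csv
from statistics import mean


def summarize_numeric_columns(csv_path):
    """Return {column: (mean, max)} for every fully numeric column.

    ASSUMPTION: the first row of the file is a header row.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    summary = {}
    if not rows:
        return summary
    for column in rows[0].keys():
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            # Skip columns with missing or non-numeric entries.
            continue
        summary[column] = (mean(values), max(values))
    return summary


if __name__ == "__main__":
    import os
    if os.path.exists("power.csv"):
        for column, (avg, peak) in summarize_numeric_columns("power.csv").items():
            print(f"{column}: mean {avg:.2f}, max {peak:.2f}")
```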

@curiousercreative
Contributor Author

@jackpot51 thanks for that tip, helpful to see the readings over time. I'll open a PR for adding the galp5 profile.

2 participants