Ryzen & TSC bugs #16545

Open
pchome opened this issue Jan 11, 2025 · 3 comments

pchome commented Jan 11, 2025

Quick summary

I found new information about Ryzen and TSC.
This continues the discussion started in #16382 and is related to #16240 and #16499.

Details

This is obviously not an RPCS3 bug, but rather a set of bugs in some Zen CPUs that have been known for years. Ryzen users usually report the following distinct problems:

  1. TSC can randomly become unstable after either a cold or a warm reboot. The marker of this problem is a dmesg message like tsc: Marking TSC unstable due to clocksource watchdog. The user can "fix" it by doing the opposite type of reboot, or by adding tsc=reliable, or even hpet=disable tsc=reliable clocksource=tsc nmi_watchdog=0 nowatchdog, to the kernel parameters in the bootloader.
  2. TSC is unstable due to failed synchronization. The marker of this problem is a dmesg message like tsc: Marking TSC unstable due to check_tsc_sync_source failed. This requires a firmware update[1][2].
  3. A different bug in the quick PIT calibration[4].
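The first two problems can be told apart by their dmesg markers. A minimal sketch of that check, assuming the kernel log text has already been captured into a string (the function name and labels are hypothetical, only the two marker strings come from the list above):

```python
# Marker strings quoted from the list above; the labels are made up for this sketch.
MARKERS = {
    "watchdog": "Marking TSC unstable due to clocksource watchdog",
    "sync": "Marking TSC unstable due to check_tsc_sync_source failed",
}


def find_tsc_problems(dmesg_text):
    """Return the labels of every known TSC marker found in the log text."""
    return [label for label, marker in MARKERS.items() if marker in dmesg_text]


if __name__ == "__main__":
    # Hypothetical log excerpt, not taken from a real machine.
    sample = "[    1.2] tsc: Marking TSC unstable due to clocksource watchdog\n"
    print(find_tsc_problems(sample))
```

An empty result only means neither marker was seen, not that TSC is healthy.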

The problem with the second one is that some vendors simply ignore older models or (in the case of Lenovo) provide Linux support only for a limited list of models[3]. So it is unclear whether some models will ever be fixed, while others have already been fixed in newer BIOS updates.

RPCS3 currently has the problematic TSC code disabled, but it might be better to disable it dynamically when the current clocksource is not tsc, or even to add a configuration option. Otherwise this issue can safely be closed as 'Not our bug'.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
hpet
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
hpet acpi_pm

[1] https://bugzilla.kernel.org/show_bug.cgi?id=202525
[2] https://bugzilla.kernel.org/show_bug.cgi?id=216161
[3] https://forums.lenovo.com/topic/findpost/27/5090932/5822354
[4] https://lore.kernel.org/all/[email protected]/

Some general info about zen1 & zen+ CPUs

https://www.agner.org/optimize/blog/read.php?i=838 (context: zen1 & zen+)

The Ryzen is saving power quite aggressively. Unused units are clock gated, and the clock frequency is varying quite dramatically with the workload and the temperature. In my tests, I often saw a clock frequency as low as 8% of the nominal frequency in cases where disk access was the limiting factor, while the clock frequency could be as high as 114% of the nominal frequency after a very long sequence of CPU-intensive code. Such a high frequency cannot be obtained if all eight cores are active because of the increase in temperature.

The varying clock frequency was a big problem for my performance tests because it was impossible to get precise and reproducible measurements of computation times. It helps to warm up the processor with a long sequence of dummy calculations, but the clock counts were still somewhat inaccurate. The Time Stamp Counter (TSC), which is used for measuring the execution time of small pieces of code, is counting at the nominal frequency. The Ryzen processor has another counter called Actual Performance Frequency Clock Counter (APERF) which is similar to the Core Clock Counter in Intel processors. Unfortunately, the APERF counter can only be read in kernel mode, unlike the TSC which is accessible to the test program running in user mode. I had to calculate the actual clock counts in the following way: The TSC and APERF counters are both read in a device driver immediately before and after a run of the test sequence. The ratio between the TSC count and the APERF count obtained in this way is then used as a correction factor which is applied to all TSC counts obtained during the running of the test sequence. This method is awkward, but the results appear to be quite precise, except in the cases where the frequency is varying considerably during the test sequence. My test program is available at www.agner.org/optimize/#testp
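The correction described in the quote above can be sketched as simple arithmetic. This is not Agner Fog's driver code; the function names and counter values are made up, and the ratio is assumed to be taken as APERF delta over TSC delta so that multiplying a TSC count by it gives actual core cycles:

```python
def correction_factor(tsc_start, tsc_end, aperf_start, aperf_end):
    """Ratio of actual core cycles (APERF) to nominal-frequency ticks (TSC)."""
    tsc_delta = tsc_end - tsc_start
    if tsc_delta <= 0:
        raise ValueError("TSC must advance during the measured run")
    return (aperf_end - aperf_start) / tsc_delta


def to_core_cycles(tsc_count, factor):
    """Convert a TSC tick count into an estimate of actual core clock cycles."""
    return tsc_count * factor


# Illustrative numbers only: the core ran at half the nominal frequency.
factor = correction_factor(0, 2_000_000, 0, 1_000_000)
print(to_core_cycles(400, factor))
```

As the quote notes, a single factor per run is only accurate when the frequency does not vary much during the test sequence.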


Not important, but just in case

AMD has a different way of dealing with instruction set extensions than Intel. AMD keeps adding new instructions and remove them again if they fail to gain popularity, while Intel keeps supporting even the most obscure and useless undocumented instructions dating back to the first 8086. AMD introduced the FMA4 and XOP instruction set extensions with Bulldozer, and some not very useful extensions called TBM with Piledriver. Now they are dropping all these again. XOP and TBM are no longer supported in Ryzen. FMA4 is not officially supported on Ryzen, but I found that the FMA4 instructions actually work correctly on Ryzen, even though the CPUID instruction says that FMA4 is not supported.

Detailed results and list of instruction timings are in my manuals: www.agner.org/optimize/#manuals.

see https://en.wikipedia.org/wiki/FMA_instruction_set#FMA4_instruction_set

I can't measure CPU frequencies precisely (possibly due to the lack of CPPC (Collaborative Processor Performance Control) support), but I saw some screenshots from AMD's official Ryzen Master tool for Windows showing that some cores can drop to almost zero MHz.

Well, I can see that some heavy tasks start from a frequency lower than the minimum, while by default the monitoring constantly shows just the minimum frequency when there is no load.

I read the first link only briefly, but I guess it's possible to use one of the known third-party kernel drivers rather than the custom one mentioned in the article.
https://github.com/leogx9r/ryzen_smu (and forks)
https://github.com/ocerman/zenpower (and forks)
https://github.com/FlyGoat/RyzenAdj

Most of them likely require root if they are able to modify system parameters.
This is just more info to add.

Also: https://www.hwinfo.com/forum/threads/effective-clock-vs-instant-discrete-clock.5958/
something about "Effective clock"

I hope this information will help somehow.

@pchome pchome changed the title [Feature request] Ryzen & TSC ™ Ryzen & TSC bugs Jan 21, 2025

pchome commented Jan 21, 2025

I have patched my Linux kernel with the patches available in [1] and [4] (just in case), and now the kernel can set tsc as the current clocksource at boot. TSC even stayed selected after several sleep/wake cycles.

I can confirm that with TSC some old problematic releases (#16382) are more or less fine. Frame times are also more stable with TSC, and CPU usage is sometimes notably lower.

$  dmesg | grep -i -E '(tsc|clocksource)'
...
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2295.445 MHz processor
[    0.072450] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.178123] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.185155] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x21166635d41, max_idle_ns: 440795308349 ns
[    0.001715] TSC direct sync: CPU2 observed -6931769983 warp. Overhead: 0
[    0.001715] TSC direct sync: CPU4 observed -6931769983 warp. Overhead: 0
[    0.001715] TSC direct sync: CPU6 observed -6931770006 warp. Overhead: 0
[    0.001715] TSC direct sync: CPU1 observed -6931770075 warp. Overhead: 0
[    0.001715] TSC direct sync: CPU1 observed 46 warp. Overhead: 184
[    0.001715] TSC direct sync: CPU3 observed -6931770052 warp. Overhead: 0
[    0.001715] TSC direct sync: CPU5 observed -6931770006 warp. Overhead: 0
[    0.001715] TSC direct sync: CPU7 observed -6931770029 warp. Overhead: 0
[    0.310249] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.368200] clocksource: Switched to clocksource tsc-early
[    0.376594] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.425764] tsc: Refined TSC clocksource calibration: 2297.655 MHz
[    1.425788] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x211e8e141e8, max_idle_ns: 440795258344 ns
[    1.425835] clocksource: Switched to clocksource tsc
...

@kd-11 , @elad335
The Details section was updated.


elad335 commented Jan 25, 2025

Please test with #16618; it would be interesting to see whether it actually solves this issue. Otherwise I'll consider filtering the affected CPUs or opt for a more complex workaround.


pchome commented Jan 26, 2025

kernel patch on, current clocksource: tsc

67703b4 ok
67703b4 + #16618 ok

kernel patch off, current clocksource: hpet

67703b4 + #16618 random FPS dips or freezes lasting one or two seconds


For filtering, maybe something like this would work (on Linux):

  1. Read /sys/devices/system/clocksource/clocksource0/current_clocksource and enable the TSC code if it is tsc.
  2. If it is not tsc, read /sys/devices/system/clocksource/clocksource0/available_clocksource and check whether tsc is present in the list.
  3. If tsc is available, notify (log) the user that they could have a better experience with tsc as the default clocksource.
  4. Otherwise, warn (log) the user that something is wrong with their system.
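The steps above can be sketched roughly as follows. This is only an illustration of the decision logic, not RPCS3 code; the function names and the returned labels are hypothetical, and only the two sysfs paths come from the list:

```python
from pathlib import Path

# Linux sysfs directory holding the two files named in the steps above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_sysfs(name, base=CLOCKSOURCE_DIR):
    """Return the stripped file contents, or "" if the file cannot be read."""
    try:
        return (base / name).read_text().strip()
    except OSError:
        return ""


def classify_clocksource(current, available):
    """Map the two sysfs values to one of three outcomes (hypothetical labels)."""
    if current == "tsc":
        return "use_tsc"        # step 1: enable the TSC code path
    if "tsc" in available.split():
        return "suggest_tsc"    # steps 2-3: tsc is offered but not selected
    return "warn_no_tsc"        # step 4: tsc is not offered at all


if __name__ == "__main__":
    current = read_sysfs("current_clocksource")
    available = read_sysfs("available_clocksource")
    print(classify_clocksource(current, available))
```

Splitting the available list on whitespace keeps an entry such as tsc-early from being mistaken for tsc. On a system without these files both reads return an empty string, so the sketch falls through to the warning case.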

FYI
I also googled something about rdtsc: https://github.com/xuwd1/rdtsc-notes#measuring-overhead-on-ryzen-7840h-and-its-weird-10ns-tsc-resolution
I modified their test for a 2.3 GHz frequency and got:

$ ./ryzen7840h 
Out of 16384 samples 16384 was multiple of 23
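The number 23 is what a 10 ns TSC resolution would give at 2.3 GHz (2.3 ticks per ns times 10 ns). A small sketch of that arithmetic, assuming the 10 ns granularity mentioned in the linked notes; the sample readings are made up, and this does not reproduce the original test program, which reads the real counter:

```python
def ticks_per_step(tsc_hz, resolution_ns):
    """Number of TSC ticks covered by one resolution step."""
    return tsc_hz * resolution_ns // 1_000_000_000


def count_multiples(samples, step):
    """Count how many readings are exact multiples of the step size."""
    return sum(1 for s in samples if s % step == 0)


step = ticks_per_step(2_300_000_000, 10)
samples = [step * n for n in (3, 4, 7, 12)]  # made-up readings for illustration
print(f"Out of {len(samples)} samples {count_multiples(samples, step)} were multiples of {step}")
```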

https://github.com/xuwd1/rdtsc-notes#2-pitfall-of-constant-tsc

With a constant tsc, a tsc clock difference reading can always be safely interpreted to real world time difference. For instance, with a constant tsc calibrated to 3.8GHz, if two consecutive reading to the tsc gives a clock difference of 38, it is always true that the time interval between the two measurement was 10ns. However it is worth noting that even with a constant tsc, for an clock-accurate microbench result, it is still needed to fix the clock speed of the CPU to its base clock. This is because if not, with constant tsc the processor clock and the tsc clock is essentially running asynchronously. For instance, if the processor's running at 1900MHz while the tsc has a constant freq of 3800MHz, for some operation that takes 5 cycles to finish, a tsc-timed bench result would tell that it takes 10 clock cycles to complete.
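The two numeric examples in the quote can be reproduced with a couple of one-line conversions. This is just the arithmetic from the quoted text; the function names are hypothetical:

```python
def tsc_ticks_to_ns(ticks, tsc_hz):
    """Wall-clock time in nanoseconds represented by a constant-TSC tick count."""
    return ticks * 1_000_000_000 / tsc_hz


def core_cycles_to_tsc_ticks(core_cycles, core_hz, tsc_hz):
    """TSC ticks that elapse while the core executes the given number of cycles."""
    return core_cycles * tsc_hz / core_hz


# 38 ticks of a 3.8 GHz constant TSC
print(tsc_ticks_to_ns(38, 3_800_000_000))
# a 5-cycle operation on a core running at 1900 MHz, timed by a 3800 MHz TSC
print(core_cycles_to_tsc_ticks(5, 1_900_000_000, 3_800_000_000))
```

The first call gives the 10 ns interval from the quote, and the second shows why the 5-cycle operation appears to take 10 ticks when the core runs at half the TSC frequency.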
