Possible to get hotspot and memory temps for RTX 30 series? #45
Usually what we do is reverse-engineer an app that reports correct values to find out how it works. We did this earlier with another function that changes cooler settings on RTX cards, and this also seems like a problem that requires some reverse engineering. I have tried GPU-Z and HWiNFO on my system and they don't have any memory info available for the RTX 2060 (HWiNFO has a hotspot temp but no memory temp), so I can't personally take a deeper look into this. As for |
OK, take a look at this data:
This is the result that I got along with the supplied If you are interested in seeing what you get on your GPU, check out the |
unless it is a max-avg-min sorta thing. |
@falahati as I've stated elsewhere, I have not checked that on 20xx series as I don't have access to them.
Me neither, I've invented the name (because I got the offset by reverse engineering some software). As for the mask - I think it selects which sensors you request (or their historical points), however I must admit I haven't tried any values except "all ones", and the only masks that I really tried ('cause they worked) are 8- and 10-bit "all ones" (8-bit for when there is no VRAM temperature sensor - Quadro RTX 3000 / GTX 1080 / GTX 1080 Ti; 10-bit for RTX 3090 / RTX 3060). Then you must interpret the temperatures correctly - they're So, taking your output as an example:
I interpret the temperatures there as P.S. TBH, what bugs me most in the structure I had to declare is the non-4-divisible size of the paddings... they look very strange :) |
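A hedged sketch of the mask arithmetic described above - the 8-/10-bit widths and the per-bit sensor meaning are observations from this thread, not documented NvAPI behavior:

```python
# Sketch: "all ones" sensor masks as discussed in the thread.
# Assumption: bit i of the mask requests sensor index i.

def all_ones_mask(num_sensors: int) -> int:
    """Bitmask requesting the first `num_sensors` sensors."""
    return (1 << num_sensors) - 1

def requested_sensors(mask: int) -> list[int]:
    """Indices of the sensors selected by `mask`."""
    return [i for i in range(32) if mask & (1 << i)]

# 8-bit mask (cards without a VRAM sensor, e.g. GTX 1080 per the thread)
print(hex(all_ones_mask(8)))    # 0xff
# 10-bit mask (e.g. RTX 3090 / RTX 3060 per the thread)
print(hex(all_ones_mask(10)))   # 0x3ff
# A sparse mask would, under this assumption, pick individual sensors:
print(requested_sensors(0b1000000001))  # [0, 9]
```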
Another thought on interpreting the temperatures - I've taken a closer look at those on my RTX cards, and I think that maybe the hotspot is Also do note that if I take your outputs where you request
No garbage values in there, they all seem reasonable! Also I must've interpreted your |
I get another value as part of the padding (last byte) which happens when I am mixing and matching the bits, can you test that on your device to see if it makes some sense there? ( |
I'm not sure I follow what you want from me here. That value resembles a bitmask for me (as I cannot interpret it as a temperature value due to it either being 0 or >100 degrees). Whenever I test using bitmask of all ones, second unknown (all 87 bytes) is always 0 to me, and first unknown is all-zero except last byte which is
(maybe those would make sense to you; numbers after |
I've also noted that the last octet of first unknown is changing from measurement to measurement, but is always divisible by |
Well, after trying my latest things on my 1080 / 1080 Ti I no longer think that "bitsize" controls the history... it's probably really different sensors, because my 10xx cards only support |
Thanks @JustAMan for testing on the 1080 and providing additional information. This sure helps to figure out what is going on. Since all these programs seem to be able to show at least one decimal point, I am thinking that we are probably missing something here regarding this. But then, if the high 16 bits are the decimals, why are they sometimes present and sometimes not? I think I have to compare the returned values with a program like HWiNFO or GPU-Z to see if I can find any correlation between the numbers. I also tested the values while pushing the card a little, and it does seem in line with your theory that each number represents a different sensor (if not physically, at least virtually, and for sure has nothing to do with historic data).
This is the other part that needs a little investigation. For me, this number is always zero when requesting a single bit, but if multiple bits are requested, this value is filled. I have to take a deeper look into this; can you please tell me which application you extracted this address from? And if you guys know about more applications that use this new method, please feel free to tell me their names so that I can take a look. Some might have information that can be used. |
Thank you for looking into this so quickly! This enhancement request has been languishing on LibreHardwareMonitor for months; I guess we were barking up the wrong tree. Thanks to @JustAMan for getting this started. I put multiple temp test results for an EVGA 3080 Ti FTW on Google Drive (for readability of this thread). The matching HWiNFO screenshots are probably a second behind, but the mem/hotspot temps roughly match @JustAMan's observations. Run 2: hotspot=temperatures[1], m=3; vram=temperatures[9], m=512. https://drive.google.com/drive/folders/1Ik0mXvknbpIpsuxG-UQfuFTyGxgUD8Gb?usp=sharing I guess I need to look at this more systematically with HWiNFO logging. Why did Nvidia keep this function undocumented? I read somewhere that GamersNexus' Steve Burke asked Nvidia about the temp APIs and they were cagey about it. The EVGA FTW series has 9 extra temp sensors as shown in the last section below, but the masked values in the test don't appear to show any of these extra sensors, keeping things simple. If you want more data for different 30-series cards, we should round up requesters from the original LibreHardwareMonitor issue. Let me know if there is anything else I can do to help. |
I tried matching up hotspot/VRAM temp readings from HWiNFO with JustAMan's Python script and both mostly match. About 10-20% of the time the readings vary by more than 1 degree, which I think has to do with exactly when the reading was taken within that second. The anomalous 6-degree difference was when the GPU was ramping up fast to launch a game; it's entirely possible that within a second a reading can vary. But I guess as an end user I'd be very happy to have these readings, as I'm just trying to make sure I have continuous readings taken via ohmgraphite and set an alert when there are sustained readings over 80C. |
sooooo ... I hope it helps a bit. Please keep in mind that the values are not always precise (because of sync). RTX 3080 Asus TUF, read out with GPU-Z: GPU: 30,4°C GPU: 61,6°C GPU: 56,3°C |
@falahati Okay, after a few more experiments I believe that the structure should actually be slightly different (ref: https://github.com/JustAMan/pynvraw/tree/0.0.2):

typedef struct {
    NvU32 version;
    NvU32 sensor_mask;
    NvU32 unknown[8];
    NvU32 sensors[32];
} NV_GPU_THERMAL_EX;

As you can see, it doesn't have any weird non-4-divisible offsets anymore and makes much more sense overall, IMHO :) And then to get the temperatures in Celsius you just take That said, maybe there are cards which have more than 10 sensors working through this interface (like the one here #45 (comment) with a bunch of extra sensors), but my cards have no more than 10 sensors. Trying all the way from 32 sensors down to 1 doesn't hurt, though 😄 |
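As a rough illustration of this layout, here is a Python ctypes sketch mirroring the struct. The field names come from the thread; the version-word convention, the struct version number, and the 1/256 °C fixed-point reading are assumptions based on common NvAPI practice and this discussion, not official definitions:

```python
import ctypes

NvU32 = ctypes.c_uint32

# Mirror of the reverse-engineered layout posted in the thread.
class NV_GPU_THERMAL_EX(ctypes.Structure):
    _fields_ = [
        ("version", NvU32),
        ("sensor_mask", NvU32),
        ("unknown", NvU32 * 8),
        ("sensors", NvU32 * 32),
    ]

# NvAPI convention: version word = struct size | (version number << 16).
# The version number 2 here is a placeholder assumption.
NV_GPU_THERMAL_EX_VER = ctypes.sizeof(NV_GPU_THERMAL_EX) | (2 << 16)

def raw_to_celsius(raw: int) -> float:
    """Interpret a raw sensor reading as signed 24.8 fixed point
    (an assumption: 1/256-degree steps, as the thread suggests)."""
    signed = ctypes.c_int32(raw).value
    return signed / 256.0

print(ctypes.sizeof(NV_GPU_THERMAL_EX))  # 168 bytes: 4 + 4 + 32 + 128
print(raw_to_celsius(0x2A80))            # 42.5
```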
Thanks so much for looking at this! I have a reference 3090 (PNY) if you need a tester. |
Thank you all, guys. @RaptorTP, your results do in fact match what is discussed here; thanks for the confirmation. I have made the necessary changes to the This is mine on a 2060 (laptop edition): I also renamed the function to |
this is the compiled version if you can't compile the whole project: |
@falahati Thanks so much for the fast turnaround and @JustAMan for figuring this all out. I'm absolutely thrilled by the prospect of having these temps. The temps match in the same way @JustAMan's python code did to HWInfo and you can see the graphs are very tight. You can see the google spreadsheet used to generate the graphs Tomorrow I'll try to see what the other columns match up to in HWInfo. |
@hsterk, the provided charts indicate that we are ok with the numbers, the discrepancies are probably due to rounding (HWInfo rounds the number to one decimal point). |
Tried it on an MSI RTX 3060 that I had access to, but got no memory temp or additional temperatures beyond the 8 I get with the RTX 2060; GPU-Z and HWiNFO show nothing extra either, so the card supposedly doesn't support it and I am useless here. xD |
ping @MaynardMiner for potential testing |
Just a reminder I've offered up my 3090 for testing. I'm trying to not interrupt but I think I may have been lost in the shuffle ;) |
@xcasxcursex Thanks, but can you check and see if HWInfo shows additional sensor information for your card? Apparently, it is only available for some cards. If it did, take a picture next to the sample project I have compiled and put up a few posts above. |
I don't have the second entry with those additional sensors, nor the extra fan (mine's a triple-fan cooler, but they all run at the same speed). Your app seems like it might be doing better than HWiNFO: it's returning 10 distinct temps. Edit: It's pretty clear, running alongside HWiNFO, that index 9 is the memory junction, with indices 0 and 1 being GPU core and hotspot. The other values don't behave like junk data. Option one scrolls endlessly like so:
Here's a little while running unigine heaven maxed to get it hot. After a time, the only non-unique value I saw was [2], that's always identical to [0]. My point here is not that I'm certain it is a duplicate 'filler', but that it suggests all the other values are actual temp sensors and not junk. |
I'll try with a different dataset, but I couldn't get any of the unmatched values (cols 2-8) to match the EVGA FTW extra power or memory sensors (the above chart is still data from the last batch). |
Based on what I saw in GPU-Z, I am going to guess that those additional sensors are i2c-controlled (if GPU-Z can show them). Also, based on common sense, I think we can assume some of the sensors are for the VRM (the power circuitry; there are quite a few power measurements being done for different sources - voltage, current, energy, etc). |
BTW, after looking into the numbers for a few weeks, I believe that my 3060 does not have a memory temperature sensor at all, so sensor #8 is something else - maybe, as I guessed earlier, power management or something related to the backplate.

Ways to tell could be to disassemble a device and run it without cooling while being continuously filmed by a thermal camera, then comparing what the sensors say to what is detected... But that could ruin the card under experiment.
Mon, 27 Sep 2021, 7:18, Tim Sirmovics ***@***.***>:
… Here is a side by side with GPU-Z and HWiNFO64 on a 3080 TI:
[image: image]
<https://user-images.githubusercontent.com/10243286/134844767-a74a319b-0832-4ab6-8c31-01de947ba252.png>
|
Based on @hsterk's analyses and the images shared by other users, there doesn't seem to be a clear relationship for these values, except maybe for the core temp. |
@falahati in the screenshots from myself and other users above, array index [9] correlates exactly with Memory Temperature on the examples with 3080 TI's and 3090's. Do you think this sensor is specific to cards with GDDR6X memory? |
In my efforts to pin down the root cause of a bug, I've noticed the example app above causes it. Link to the relevant issue: LibreHardwareMonitor/LibreHardwareMonitor#598 |
@xcasxcursex |
I don't know anything that specific. This bug has been around since the 30 series has, but never in an open source tool, so it was just "when I use this app, the game stutters". Nvidia have actively ignored it. It's only this morning I've been presented with a) an open source app that triggers the bug and b) one where I can toggle the broken feature on and off. I'm pretty sure this is going to give us the ability to pin it down to something more specific, now. I opened the issue and linked it above but it does look like it might come back to this project, maybe while on its way to nvidia (I'm assuming it is a problem with their API since closed-source projects have the exact same bug, so either it's the API below this, or closed-source projects have borrowed your code without giving credit, and the former seems more likely). I'm just posting here to create a link between them for troubleshooting purposes. |
Any update regarding this issue? |
I think the low-level API is already available on the I have yet to merge it upstream, tho. |
Hello. A single GPU temperature (for my 3080 Ti) is reported by NvAPIWrapper, but RTX cards also have memory and hotspot temperatures.
I see NvAPI_GPU_GetThermalSettings is being used, but NvAPI_GPU_GetAllTempsEx (0x65FE3AAD) appears to have the hotspot and VRAM temperatures. Any plans to add this?
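For context, undocumented NvAPI entry points like this are typically resolved through the driver's nvapi_QueryInterface export by their 32-bit id. A minimal, untested sketch - only the 0x65FE3AAD id comes from this thread; the loader code is an illustrative assumption, not a working implementation:

```python
import ctypes
import sys

# Function id quoted in this issue for NvAPI_GPU_GetAllTempsEx.
NVAPI_GPU_GETALLTEMPSEX_ID = 0x65FE3AAD

def query_nvapi_interface(func_id: int):
    """Resolve an NvAPI function pointer by its 32-bit id.

    Windows-only sketch: nvapi64.dll exports nvapi_QueryInterface,
    which maps an id to the corresponding (often undocumented)
    function pointer.
    """
    if not sys.platform.startswith("win"):
        raise OSError("NvAPI is only available on Windows")
    nvapi = ctypes.WinDLL("nvapi64.dll")
    query = nvapi.nvapi_QueryInterface
    query.restype = ctypes.c_void_p
    query.argtypes = [ctypes.c_uint32]
    return query(func_id)

print(hex(NVAPI_GPU_GETALLTEMPSEX_ID))  # 0x65fe3aad
```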