runtime: revert Windows change to boot-time timers #35482
Comments
Is anybody working on the alternative approach mentioned above? |
No, I am not making this change. @zx2c4 said in #31528 (comment)
CL 191957 fixes this problem. If we revert the CL and replace it with QueryUnbiasedInterruptTime, the problem will reappear. Also using QueryUnbiasedInterruptTime will make nanotime implementation about 2 times slower. See #31528 (comment) for details. Alex |
I can find some time to make the change that Alex originally prototyped if no one else is available. But in the meantime, a patch release of Go has regressed programs running in containers, so should we consider reverting this and then applying a more appropriate change later?
As I understand it, Go on Linux has the same behavior as Go on Windows used to have before this change. WireGuard is specialized software and may have to work around this in both Windows and Linux. I also think the fix in CL 191957 is incomplete--AFAIK, PowerRegisterSuspendResumeNotification only provides notifications for machines transitioning across classic sleep states, not when machines enter connected standby (which is used instead of sleep in some newer devices). In these cases, you will still see a difference between (biased) interrupt time and WaitForSingleObject's relative timers, so WireGuard presumably will still run into problems. The right fix for WireGuard may be to offer a new kind of timer that uses absolute (wall clock) timeouts on Windows, which is affected by changes to system UTC time (NTP or otherwise) but not by sleep states. If that's insufficient, I can help investigate if there are other options that might be appropriate.
If slowing nanotime down from 2ns to 4ns is problematic, we can look at whether the internal definition of QueryUnbiasedInterruptTime is stable enough to inline into Go's runtime (which is what was done for the current version of nanotime: it was apparently cloned from QueryInterruptTime). I'd really like to avoid this, though, because the current definition relies on a private export from ntdll to the kernel that is not part of the external API or ABI. |
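The "absolute (wall clock) timeout" idea mentioned above can be approximated in user code today. The following is only a sketch under stated assumptions, not a proposed API: `waitUntilWall`, `wallRemaining`, and `epoch` are hypothetical names, and the approach simply polls the wall clock. `Round(0)` is used because the `time` package documents it as stripping the monotonic reading, so the comparison is done on wall time only.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a wall-clock instant far in the past; it exists only for the demo.
var epoch = time.Unix(0, 0)

// wallRemaining returns how long it is until the wall clock reaches deadline.
// Round(0) strips monotonic readings, so Sub compares wall-clock values only.
func wallRemaining(deadline time.Time) time.Duration {
	return deadline.Round(0).Sub(time.Now().Round(0))
}

// waitUntilWall (hypothetical helper) blocks until the wall clock reaches
// deadline, re-checking the wall clock at least once per poll interval.
func waitUntilWall(deadline time.Time, poll time.Duration) {
	for {
		remaining := wallRemaining(deadline)
		if remaining <= 0 {
			return
		}
		if remaining > poll {
			remaining = poll
		}
		time.Sleep(remaining)
	}
}

func main() {
	waitUntilWall(epoch, time.Second) // deadline already passed: returns at once

	deadline := time.Now().Add(200 * time.Millisecond)
	waitUntilWall(deadline, 50*time.Millisecond)
	fmt.Println("deadline reached; overshoot:", -wallRemaining(deadline))
}
```

Because each individual sleep is at most `poll` long, a deadline that passed while the machine was suspended should be noticed within roughly one poll interval after resume, whichever clock `time.Sleep` itself follows. The trade-off is the one named above: the deadline moves whenever the system UTC time is adjusted.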
This patch wouldn't have been accepted if the discussion about it had referenced #24595.
Jason has stated that WireGuard requires timer patches for Linux, Darwin, and (prior to 1.13.3) Windows. Reverting this won't harm WG. I contacted him yesterday to point out this issue, and he ack'd, so I imagine he'll respond soon.
I've been advocating for this on #24595 but so far, no traction... |
Quoting @mpx:
I don't think that trading correctness for the sake of backward compatibility is a right choice. I also don't think that adding new API to |
@DmitriyMV, that probably belongs in the thread you quoted; it's off-topic here. |
@DmitriyMV, right now, with this change, we are inconsistent between Linux and Windows, and inconsistent between different Windows devices (connected standby vs. classic sleep). That inconsistency seems like the worst possible situation to be in. |
I have no opinion on this matter. I will let Ian decide what to do here. Alex |
Please close this issue and do not make any such revert. The premises here are flawed.
|
Doing things the right way on Linux and other platforms is a work in progress. Feel happy for the rare case in which the Windows implementation achieves the correct implementation (using BOOTTIME) first. |
This was a breaking change in a bug fix release that introduces inconsistent behavior between operating systems. There is clearly no consensus that this is the right change for Linux or the change would have been made already. It's very strange to me that this is considered an acceptable approach to the evolution of the Go runtime. |
More false claims. A lot to unpack in three sentences. Here we go:
"Would have been made already" is a ridiculous conclusion to jump to. They tried it, but it broke some userland in unexpected ways. The timer maintainers want to do it. It's just a matter of figuring out how.
No, it fixed a regression with Windows timer buckets. Before the fix, Go timers were not reliable following a compatibility-breaking OS change from Microsoft. It also keeps WireGuard viable on Windows. Not taking into account sleep time makes WireGuard and other network protocols impossible to implement. It's possible your branch of Microsoft isn't interested in WireGuard, but I'm told some NDIS people are playing with it.
As mentioned, it's an ongoing work in progress to bring BOOTTIME support to other operating systems.
Clearly "introduces" is the wrong word, since Windows had always been like this, until Windows 8, at which time there were actually two timers semantics being used at the same time, causing problems. The bug fix moved things back to only using one set of timer semantics, fixing the problem. And guess which set of timer semantics it chose in order to fix the problem? The one that had always been used on Windows in Go since the beginning. It didn't introduce a new one, as that could have caused problems. Instead it went back to providing the same timer semantics that Go had originally. |
I'm going to double check my Windows 8 claim--it may have been Windows 7 (which would make sense because that's when QueryUnbiasedInterruptTime was introduced). I'll ask someone down the hall who has worked on the Windows timer infrastructure when I get a chance. In any case, before the change to the Go runtime, Go programs inherited the system timer behavior. As far as I know, there was never a case before Go 1.13.3 that Go attempted to force a BOOTTIME-style timer behavior on Windows (or on Linux). So yes, this change was a breaking change to Go timer semantics on Windows. I'm certainly sympathetic that WireGuard needs a solution here, but there is other Go software out there too. |
Your Windows 8 claim is correct. MSDN confirms this, as does my own testing. From https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject : "Windows XP, Windows Server 2003, Windows Vista, Windows 7, Windows Server 2008 and Windows Server 2008 R2: The dwMilliseconds value does include time spent in low-power states. For example, the timeout does keep counting down while the computer is asleep. Windows 8, Windows Server 2012, Windows 8.1, Windows Server 2012 R2, Windows 10 and Windows Server 2016: The dwMilliseconds value does not include time spent in low-power states. For example, the timeout does not keep counting down while the computer is asleep."
Before 1.13.3, Go relied on BOOTTIME-style timer behavior working, even though it did not on certain newer Windows platforms, and so Go was broken on those platforms and we got bug reports. The Go behavior on Windows has always been to rely on this BOOTTIME-style of behavior, but Windows 8 put us in an inconsistent state. 1.13.3 fixed the inconsistency by reverting to the semantics Windows users of Go have always relied on. What you're suggesting here is introducing totally new semantics that we've never relied on. That sounds like a new proposal, and something I emphatically n'ack. |
I'm writing an app for multiple laptop platforms (Windows, MacOS, Linux) and want the same timer behavior across them. If the default behavior on Windows (or certain versions thereof) differs, I'd expect a switch to throw which aligns them. That means a switch to let Windows 7 work like 8/10 and Unix, or vice versa. My current project could carry a runtime patch for this (I already need 2 stdlib patches on Windows). I agree that Go should provide timers with boot-time semantics, but probably not as an upgrade to time.Timer/Ticker. How would apps that depend on timers reliably evaluate such an upgrade? What is the cost to adapt if the evaluation indicates trouble? Go timers have been "broken" on Windows 8/10 for seven years and I've seen one bug report, filed in 2019; are there others? What did WireGuard do on Windows 8/10 before this patch? How does it handle timers on MacOS/Unix? |
@zx2c4 Could you please elaborate on why you wanted to remove the release-blocker label? We were going over release-blocker issues in a meeting, and because no one knew why it was removed, we thought it was a gopherbot bug and re-added it. We learned in #35755 that you requested it to be removed, but it wasn't visible to us at the time. |
Sleep was broken.
We patch Golang.
Oh, whoops, didn't think that'd be a big deal. The actual release-blocker bug is the docker issue somebody reported earlier -- #35447. Breaking docker seems very bad. This thread here, on the other hand, is some bikeshedding on if we should change Go's behavior from how it was originally designed way back when to something new and different. Not sure why this discussion would need to block a release. |
I think it's disrespectful to brush this issue off as bikeshedding. I don't mind if you clear the release blocking tag (I didn't add it), but discussing the technical merits of a bug fix that you implemented is anything but bikeshedding. |
So we have a likely near-term solution: |
NEWS FLASH NEWS FLASH NEWS FLASH NEWS FLASH NEWS FLASH NEWS FLASH
Some new results just in, which will basically change this entire debate and allow us to entirely defer big invasive changes until 1.15. Check out this discrepancy between Docker and non-Docker during S3 sleep: https://data.zx2c4.com/docker-uses-program-time-windows-dec-2019.mp4
This is running:

    package main

    import _ "unsafe" // required for go:linkname

    // nanotime is the runtime's internal monotonic clock, in nanoseconds.
    //go:linkname nanotime runtime.nanotime
    func nanotime() int64

    func main() {
    	start := nanotime()
    	lastSecond := int64(0)
    	// Busy-loop and print each whole second that elapses on the runtime clock.
    	for {
    		now := nanotime()
    		secondsSinceStart := (now - start) / 1000000000
    		if secondsSinceStart > lastSecond {
    			println(secondsSinceStart)
    			lastSecond = secondsSinceStart
    		}
    	}
    }

What you see in that screencast is that Docker uses "program time", whereas real Windows uses "real time". THIS MEANS THAT THE ORIGINAL SIMPLE COMMIT FIXES THE DOCKER ISSUE. That behavior there makes the entire system consistent. So at this point, I'd strongly recommend merging that commit, closing this issue, and starting a new discussion on "real time" vs "program time" and new APIs for Go 1.15. |
Change https://golang.org/cl/208317 mentions this issue: |
Change https://golang.org/cl/211280 mentions this issue: |
Oh goodness. Just making sure I understand completely, since runtime.nanotime is reading "interrupt time" in your example, "interrupt time" in Docker for Windows is actually "unbiased interrupt time" ("program time") and there's perhaps no monotonic clock that's actually on "real time" in Docker?
Are we fairly certain this is the only cause of error 2? This means running on bare Windows and running on Docker for Windows will behave differently, but 1) maybe there's no way around that, and 2) maybe it doesn't matter so much because people don't tend to run Docker on laptops anyway? Thanks for working hard on the Docker issue. As you pointed out, this makes option 1 viable, where we stay on "real time" for both Sub and Sleep for Windows and try to come up with a more unified, consistent answer for 1.15. I'm okay with that because, if we do change the semantics for 1.15, we just have one big convergence of time behavior in 1.15, rather than changing Windows behavior in 1.14 and then again in 1.15. |
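For readers who want to see which of the two clocks a given system is on without linking against the runtime, here is a rough sketch using only the public `time` package. It compares elapsed time from the monotonic reading with elapsed time from the wall clock. The helper names `elapsedBoth` and `likelySuspended` are made up for illustration, and the heuristic is an assumption: a wall-clock step from NTP or a manual change produces the same gap as a suspend would on a "program time" system.

```go
package main

import (
	"fmt"
	"time"
)

// elapsedBoth reports the time between start and now twice: once using the
// monotonic readings carried by both values, and once using wall clock only.
func elapsedBoth(start, now time.Time) (mono, wall time.Duration) {
	mono = now.Sub(start)                   // uses monotonic readings when both have one
	wall = now.Round(0).Sub(start.Round(0)) // Round(0) strips the monotonic reading
	return mono, wall
}

// likelySuspended (hypothetical heuristic) reports whether the wall clock
// advanced noticeably more than the monotonic clock over the same interval.
func likelySuspended(mono, wall, tolerance time.Duration) bool {
	return wall-mono > tolerance
}

func main() {
	start := time.Now()
	time.Sleep(50 * time.Millisecond) // suspend the machine here to see a gap
	mono, wall := elapsedBoth(start, time.Now())

	fmt.Println("monotonic elapsed:", mono)
	fmt.Println("wall elapsed:     ", wall)
	fmt.Println("likely suspended: ", likelySuspended(mono, wall, time.Second))
}
```

On a system where the monotonic clock keeps counting through sleep ("real time"), the two durations should stay close even across a suspend. On a "program time" system, the wall-clock figure should come out larger by about the time spent asleep.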
Change https://golang.org/cl/211307 mentions this issue: |
Yes, exactly.
Pretty sure. I'm now chasing this around in the kernel in IDA. It looks to me that when it's unable to find the power management node for the powrprof functions, that same absence also will result in the hooks to the timer advancement code not being run.
Right. Glad we're on the same page here. I too would like to see everything unified across platforms, and 1.15 seems like the right time to do that. |
Excellent, now you have a safe patch for the Windows runtime in the Wireguard build, to go with your runtime patches for MacOS #35012 (comment) & Linux #24595 (comment) \o/ |
No, WireGuard for Windows isn't going to be "patching the runtime", and Go shouldn't be introducing regressions either without careful consideration. The Docker bug was a significant regression; that's fixed now. No need to introduce yet-another-one. However, I'm up for revisiting all behavior for 1.15, where we'll have plenty of time to discuss this and prepare, code-wise, for whatever the implications. |
No. It will only wake up the
That last part is correct: the suspend/resume notifier should only need to call
Yes. |
Oh interesting. That's a nice design improvement.
Great. After the Docker-fix CL is committed, I'll send a simplification for 1.14 that runs |
@aclements what was the possible "change again" in 1.15? Addition of NewTimerAt() and friends? A key rationale for changing to program/monotonic in 1.14 is that it's the native model for Win8/10, which has been in use for 7 years. @jstarks noted that it's odd for Go to second-guess that. At this point, most Windows laptops in use are running Win8/10. |
@aclements also pointed out that the patch in question ignores some sleep states, see #35482 (comment) Is that acceptable? Is it fixable? |
I wasn't able to reproduce that claim, actually. That other patch I sent uses a different notification mechanism, though, and I think if we encounter bug reports from users we'll be able to swap out mechanisms. In my testing though, the existing one was fine. |
That's right. Or, at least some sort of OS convergence that we've had more time to think about, whether that's NewTimerAt or something else.
I'm not that concerned with the "native" model of any particular OS, especially when OSes can't agree on what that model should be (e.g., Windows moving to monotonic time, Linux trying [though failing] to move away from monotonic time). I think monotonic time is definitely part of the right answer, but it's also not the whole answer, which is why I'm okay with putting this on hold to minimize design thrashing, and making headway on a whole answer for 1.15. |
OK. One more use case to consider: apps which use TCP to reach other apps on the same laptop, e.g. a "localhost web app" (which is what I'm building). When you suspend such a system, you don't want either side to time-out and drop a connection on resume. I occasionally see failures like this in my app on a Win7 laptop. That hasn't appeared in the modest amount of testing I've done on Win8/10. EDIT: presumably because the browser may get a timeout on Win7, but not Win8/10. |
If the stdlib exported a function that returns "interrupt time" (current source of runtime.nanotime) could that be used to implement @bradfitz suggestion in #35482 (comment) ? |
Issue #31528 was fixed by CL 191957. But after CL 191957 was submitted, Austin suggested an improvement to it. And Jason implemented the improvement in CL 198417. I could have selected CL 191957 to test the issue. Any commit after CL 191957 is good. I chose CL 198417.
I did not test current tip. I think issue #31528 is still fixed on current tip. But issue is broken again on CL 210437.
@aclements I don't have time to invest into this. Whatever you decide, I will be happy.
@zx2c4 thank you for confirming that Docker uses "program time".
Yes. CL 208317 will allow Docker to run. But Docker time behavior is still different from real Windows. And we need to bring them in line somehow.
Sounds reasonable to me. Alex |
Since timing unification is likely in 1.15, could we add a runtime env var in 1.14 to change Windows to program/monotonic timing? That would let Windows devs & users preview the change. That same env var could be used to switch Windows back to real/boot timing in 1.15. We don't need to support two modes indefinitely, but two releases with both options would be helpful. |
I don't think it's clear that program/monotonic time for everything is the obvious path forward for 1.15, so that wouldn't be an effective way to "preview" the change. |
Well runtime.nanotime either uses "interrupt time" or "unbiased interrupt time"; is there another option? I'm only suggesting that Unix-style timing/sleep be accessible in 1.14. |
Change https://golang.org/cl/213198 mentions this issue: |
…eNotification on systems with "program time" timer Systems where PowerRegisterSuspendResumeNotification returns ERROR_FILE_NOT_FOUND are also systems where nanotime() is on "program time" rather than "real time". The chain for this is: powrprof.dll!PowerRegisterSuspendResumeNotification -> umpdc.dll!PdcPortOpen -> ntdll.dll!ZwAlpcConnectPort("\\PdcPort") -> syscall -> ntoskrnl.exe!AlpcpConnectPort Opening \\.\PdcPort fails with STATUS_OBJECT_NAME_NOT_FOUND when pdc.sys hasn't been initialized. Pdc.sys also provides the various hooks for sleep resumption events, which means if it's not loaded, then our "real time" timer is actually on "program time". Finally STATUS_OBJECT_NAME_NOT_FOUND is passed through RtlNtStatusToDosError, which returns ERROR_FILE_NOT_FOUND. Therefore, in the case where the function returns ERROR_FILE_NOT_FOUND, we don't mind, since the timer we're using will correspond fine with the lack of sleep resumption notifications. This applies, for example, to Docker users. Updates #35447 Updates #35482 Fixes #35746 Change-Id: I9e1ce5bbc54b9da55ff7a3918b5da28112647eee Reviewed-on: https://go-review.googlesource.com/c/go/+/211280 Run-TryBot: Jason A. Donenfeld <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]> Reviewed-by: Jason A. Donenfeld <[email protected]>
…eNotification on systems with "program time" timer Systems where PowerRegisterSuspendResumeNotification returns ERROR_FILE_NOT_FOUND are also systems where nanotime() is on "program time" rather than "real time". The chain for this is: powrprof.dll!PowerRegisterSuspendResumeNotification -> umpdc.dll!PdcPortOpen -> ntdll.dll!ZwAlpcConnectPort("\\PdcPort") -> syscall -> ntoskrnl.exe!AlpcpConnectPort Opening \\.\PdcPort fails with STATUS_OBJECT_NAME_NOT_FOUND when pdc.sys hasn't been initialized. Pdc.sys also provides the various hooks for sleep resumption events, which means if it's not loaded, then our "real time" timer is actually on "program time". Finally STATUS_OBJECT_NAME_NOT_FOUND is passed through RtlNtStatusToDosError, which returns ERROR_FILE_NOT_FOUND. Therefore, in the case where the function returns ERROR_FILE_NOT_FOUND, we don't mind, since the timer we're using will correspond fine with the lack of sleep resumption notifications. This applies, for example, to Docker users. Updates #35447 Updates #35482 Fixes #36377 Change-Id: I9e1ce5bbc54b9da55ff7a3918b5da28112647eee Reviewed-on: https://go-review.googlesource.com/c/go/+/208317 Reviewed-by: Jason A. Donenfeld <[email protected]> Reviewed-by: Austin Clements <[email protected]> Run-TryBot: Jason A. Donenfeld <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-on: https://go-review.googlesource.com/c/go/+/213198
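The decision described in the commit message above can be sketched in a few lines of portable Go. This is an illustration only, not the runtime's actual code: `registerFunc` and `initSuspendNotifier` are hypothetical names, and the real call to PowerRegisterSuspendResumeNotification is replaced by an injected function so the sketch runs on any OS. The only fact taken from Win32 is that ERROR_FILE_NOT_FOUND is error code 2.

```go
package main

import "fmt"

// errorFileNotFound is the Win32 ERROR_FILE_NOT_FOUND code.
const errorFileNotFound = 2

// registerFunc is a stand-in (assumption) for the call to
// PowerRegisterSuspendResumeNotification. It returns a Win32 error code,
// with 0 meaning success.
type registerFunc func() uint32

// initSuspendNotifier (hypothetical) reports whether resume notifications
// are active. ERROR_FILE_NOT_FOUND is tolerated: per the commit message, it
// implies the clock is already on "program time", so no notifier is needed.
func initSuspendNotifier(register registerFunc) (active bool, err error) {
	switch code := register(); code {
	case 0:
		return true, nil
	case errorFileNotFound:
		return false, nil
	default:
		return false, fmt.Errorf("suspend/resume registration failed: error %d", code)
	}
}

func main() {
	// Simulate the three outcomes: success, Docker-style failure, other failure.
	for _, code := range []uint32{0, errorFileNotFound, 5} {
		c := code
		active, err := initSuspendNotifier(func() uint32 { return c })
		fmt.Println("code:", c, "active:", active, "err:", err)
	}
}
```

Any other error code is still surfaced rather than ignored, which leaves room for reports of failures with a different cause.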
There's another report of PowerRegisterSuspendResumeNotification failure in docker, #36557 |
An engineering lead on the Windows Base team (kernel, fs, etc) asked us to revert d85072, from #31528, because it changed Windows timers to advance during sleep; everywhere else Go has monotonic timers (see also #24595 #35012).
Quoting @jstarks from #35447 (comment)
@alexbrainman suggested an alternative approach to fixing the reported issue, via QueryUnbiasedInterruptTime() in #31528 (comment). Let's try to adopt that for 1.14.
We should backport that to 1.12 & 1.13, also reverting the commit which landed in 1.13.3, see #34130.
cc @ianlancetaylor @rsc @aclements @zx2c4 @jmontgomery-jc
@gopherbot add OS-Windows
@gopherbot add release-blocker