Potential memory problems on Waldo #42
Comments
Update: We have found two sources of higher-than-expected memory usage by the Bluesky Run Engine: bluesky/bluesky#1513 removes a cache of status tasks that was largely unnecessary and caused every point to add to a collection that was never emptied in the course of a scan. bluesky/bluesky#1543 uses a more memory-efficient representation of grid scans which do not snake their axes.
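As a rough illustration of the idea behind the second fix (not the actual code in bluesky/bluesky#1543), the difference is between materializing every grid point up front and generating points lazily, which is straightforward when no axis snakes:

```python
import itertools

# Materializing the full 11 x 101 x 101 grid holds all 112,211 points in memory at once.
eager_points = [(s, y, x)
                for s in range(11)
                for y in range(101)
                for x in range(101)]

# When no axis snakes, the same trajectory can be produced lazily,
# yielding one point at a time with essentially constant memory overhead.
lazy_points = itertools.product(range(11), range(101), range(101))

for point in lazy_points:
    pass  # consume points one at a time, e.g. to move motors and trigger detectors
```

Snaked axes reverse direction on alternate passes, which is why this lazy form applies directly only to the non-snaked case that the PR targets.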
I am attempting to use `grid_scan_wp` with shape 11, 101, 101 as a test for memory performance (on the test computer).

[figure: re-manager memory]

The re-manager log reported: […]

I will try to perform this same test on bluesky v1.9 to get a sense of how much of the problem has been alleviated.
That is a rate of ~2 MB/min, which is evidently enough to be problematic, but significantly better than before.
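For reference, a crude way to estimate a rate like that is to sample the re-manager process RSS periodically and take the slope; this is only a sketch (psutil and the placeholder PID are assumptions, not anything used in the issue):

```python
import time

import psutil  # assumed to be installed; not part of the queueserver stack


def leak_rate_mb_per_min(pid, samples=10, interval_s=60):
    """Sample a process's RSS and return the average growth in MB per minute."""
    proc = psutil.Process(pid)
    rss_mb = []
    for _ in range(samples):
        rss_mb.append(proc.memory_info().rss / 1e6)
        time.sleep(interval_s)
    elapsed_min = (samples - 1) * interval_s / 60
    return (rss_mb[-1] - rss_mb[0]) / elapsed_min


# Example (1234 is a placeholder PID for the re-manager process):
# print(leak_rate_mb_per_min(1234))
```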
I am curious if […]
I can confirm the recent bluesky fixes have helped memory leakage greatly. Here is the […]
I ran a 112211-point […]
Current suspicions:
Regardless, the next step is to try this out without the live table/BEC enabled. If it still presents as a problem, then probably neither of these is the culprit (meaning back to the drawing board a bit); if it does go away, then we have a short-term workaround and some very specific areas to look into.
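For context, here is a minimal sketch of what running without the live table/BEC might look like in a bluesky startup script; the actual Waldo startup configuration is not shown in this issue, so the setup below is an assumption about its general shape:

```python
from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback

RE = RunEngine({})
bec = BestEffortCallback()

# Option A: never subscribe the callback at all.
# RE.subscribe(bec)

# Option B: keep BEC subscribed but turn off the output that accumulates.
RE.subscribe(bec)
bec.disable_table()   # no LiveTable rows printed per point
bec.disable_plots()   # no live plots/grids
```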
I have a test without BEC loaded running. I will note that, for some technical and some laziness reasons, it also disables the insertion of documents into mongo, so that should be noted as a possible additional cause if the problem is alleviated. This is not as clean an "if I do X then it must be Y" test as I would like, but it is not that far off either.
A test with a script, ophyd.sim hardware, and the built-in bluesky grid scan showed virtually no memory accumulation (a little bit of settling time at the start, but completely stable after that, including with BEC enabled, admittedly with the change to its retention of output lines). So I think that brings me to: it is not likely the RE worker itself that is accumulating memory; I probably need to dig into queueserver.
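A sketch along the lines of that test, using ophyd.sim hardware, the built-in grid_scan, and a coarse RSS log; the scan shape, repetition count, and the psutil-based logging are assumptions here rather than the exact script that was run:

```python
import os

import psutil  # used only for the memory log; assumed to be installed
from bluesky import RunEngine
from bluesky.plans import grid_scan
from ophyd.sim import det, motor1, motor2, motor3

RE = RunEngine({})
proc = psutil.Process(os.getpid())


def log_memory(name, doc):
    # Print RSS at the start and stop of each run to watch for accumulation across runs.
    if name in ("start", "stop"):
        print(f"{name}: RSS = {proc.memory_info().rss / 1e6:.1f} MB")


RE.subscribe(log_memory)

# Repeat a moderately large non-snaked grid scan to expose slow accumulation.
for _ in range(5):
    RE(grid_scan([det],
                 motor1, -1, 1, 11,
                 motor2, -1, 1, 21,
                 motor3, -1, 1, 21,
                 snake_axes=False))
```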
Executive Summary
This is an ongoing investigation into recurring symptoms that seem very similar to #41.
Namely, some scans, particularly large ones (such as 3D scans), fail to complete.
Initial report, symptoms, and troubleshooting
On or about 2022-08-05, reports came in that scans were halting, preventing overnight queues from completing.
The initial scan which failed was itself a 3D scan and was enqueued following a 3D scan.
The system was restarted and the queue set to run.
When the halted state occurred a few hours later, there was no indication of an error in the re-manager logs, but it had been several minutes since the last data point had been printed to the table in the logs.
Some basic troubleshooting was performed:
- `qserver status`: returned expected output, indicating that the queue was running
- `docker stats --no-stream`: nothing unusual in memory usage, but re-manager was spinning at 100% CPU usage
- `qserver re pause immediate`: `Failed to pause the running plan: Timeout occurred`
- `qserver manager stop safe off`: caused Watchdog messages similar to those in #41 (Incident: Memory problems when plotting 3D data)

At this point, the docker containers were restarted and started as expected.
@ddkohler noted:
@emkaufman reported:
When I arrived in the lab on 2022-08-08, I took a quick look and confirmed that the fixes for #41 were in place, so this does not seem to be the same issue (though it is similar).
I noted that the Docker Desktop app indicated that the containers had been built "22 minutes ago" at ~9:50 and inquired whether they had been rebuilt. Nobody had rebuilt them, and they had been running for 15 hours at that point, so it later became evident that the "22 minutes ago" was simply never updated.
On 2022-08-11 @emkaufman updated:
Actions taken:
- While the investigation is ongoing, avoid running the scans which fail on the lab machine as much as possible.
- Construct a test Windows machine from an old office computer that can hopefully reproduce the symptoms we see and provide a way to address these concerns with minimal interruption to the lab machines.
- @pcruzparri is in charge of bringing that machine online and doing initial reproduction testing.