-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
agentctl.sh: Sanity check PID_FILE and remove at stop #278
base: master
Are you sure you want to change the base?
Conversation
PID files are inherently problematic because PIDs are reused. This means that at start up if we find a PID file and there is a running process with that PID we do nothing. Unfortunately it is possible that our process long since died, PIDs have wrapped and are now getting reused and the process we found is an unrelated process. The problem is even more likely at system start time if we find a PID file from the previous system boot. In this case the system is starting up and spawning many new processes and it is likely that the low numbered PID we received on the previous boot will be near other processes starting in parallel. To help solve this problem we can remove stale PID files when we shut down cleanly. Additionally we can perform a basic sanity check if we find a PID file and find the PID running to compare that they at least have the command we expect. This commit performs both of these actions. Additionally PID files are typically written to /var/run or /run which get cleared on reboots. I am not making that change since I do not know if there would be any ramifications to moving the location of the PID file. Signed-off-by: Shawn Bohrer <[email protected]>
;; | ||
esac | ||
if [[ "$PS_STAT" -eq "$PID" ]]; then | ||
CMD=$(ps -p $PID -o 'comm=') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pull request. Can you explain what those 2 lines do (208 and 209)?. I am not a shell expert :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm using ps to find the name of the program for PID $PID which we got from the PID_FILE.
$ cat /var/glu/data/logs/org.linkedin.glu.agent-server.pid
29302
$ ps -p 29302 -o 'comm='
java
I then do a string comparison to see if the command I found is equal to the command I expect to run with $JAVA_CMD. In my case $JAVA_CMD contains the full path to java for example JAVA_CMD=/opt/local/java/bin/java so basename removes all leading directory components.
$ basename /opt/local/java/bin/java
java
If you are curious basename works even with relative paths or if there is no leading directories.
$ basename java
java
$ basename ../bin/java
java
I suspect $(basename "$JAVA_CMD") will always return "java", but maybe someone out there is using something like Android's Dalvik so I figure it is best to find the command from JAVA_CMD.
This is all just a basic sanity check. We could still find a process with the PID we are looking for that happens to be a java process, but is not the glu-agent. However that is much more unlikely than the situation without the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation.
- Is
basename
a standard command on Mac/Linux/Unix? - not sure if that would help but one of the command line argument is this one:
-Dorg.linkedin.app.name=org.linkedin.glu.agent-server
so if there is an easier way to check for this vs simply checking forjava
then you would be more certain that it is the agent...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basename is part of GNU coreutils on Linux and part of BSD general commands on Mac. The mac man page says basename is part of the POSIX.2 standard so I'm pretty sure other Unix variants will have it as well.
We could check the command line arguments like you suggest but the test is more complicated in shell and I don't really think it is necessary. Since I'm removing the PID_FILE on clean shutdowns it is less likely that we'll have a stale PID_FILE. My systems have very few java processes so the odds of having a stale PID_FILE and having the PID wrap and get used by a different java process are very small. Perhaps others run lots of java processes, but I seem to be the first person who has hit (or at least reported/fixed) this issue in the first place which means it was already not that common even before my change. Maybe LinkedIn doesn't reboot or doesn't start glu-agent at startup.
Yes, basename is pretty much available everywhere. I use it all the time. prep would be perfect to find a process id based on that command line argument. However at skyhighnetworks we always have 2 agents running, and I know others also have multiple agents per host: The -f option makes it do a pattern match on the entire command line and it return the process ids, so there is no sed/awk/gymnastic. Unfortunately pgrep and pkill are not always present on less common unix versions, though I only have access to Linux and Mac OS X boxes where it is present. I recommend changing this lines: |
I normally stay away from bashisms but I'll note this is actually a bash script (#!/bin/bash). I'm happy to change to whatever is believed to be the most readable. |
I don't have a preference personally. I will merge the changes with the next release of glu (unscheduled at this time). |
I merged your changes to the current version of glu thinking that it would work. Unfortunately it is not working. This is the output message during the shutdown of the tutorial:
As you can see it says PID file not found. And of course the agent is not shutting down.
I need to back out the merge and retest everything before I can release glu... |
PID files are inherently problematic because PIDs are reused. This
means that at start up if we find a PID file and there is a running
process with that PID we do nothing. Unfortunately it is possible that
our process long since died, PIDs have wrapped and are now getting
reused and the process we found is an unrelated process. The problem is
even more likely at system start time if we find a PID file from the
previous system boot. In this case the system is starting up and
spawning many new processes and it is likely that the low numbered PID
we received on the previous boot will be near other processes starting
in parallel.
To help solve this problem we can remove stale PID files when we shut
down cleanly. Additionally we can perform a basic sanity check if we
find a PID file and find the PID running to compare that they at least
have the command we expect. This commit performs both of these actions.
Additionally PID files are typically written to /var/run or /run
which get cleared on reboots. I am not making that change since I do
not know if there would be any ramifications to moving the location of
the PID file.
Signed-off-by: Shawn Bohrer [email protected]