Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agentctl.sh: Sanity check PID_FILE and remove at stop #278

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sbohrer
Copy link

@sbohrer sbohrer commented Feb 11, 2015

PID files are inherently problematic because PIDs are reused. This
means that at start up if we find a PID file and there is a running
process with that PID we do nothing. Unfortunately it is possible that
our process long since died, PIDs have wrapped and are now getting
reused and the process we found is an unrelated process. The problem is
even more likely at system start time if we find a PID file from the
previous system boot. In this case the system is starting up and
spawning many new processes and it is likely that the low numbered PID
we received on the previous boot will be near other processes starting
in parallel.

To help solve this problem we can remove stale PID files when we shut
down cleanly. Additionally we can perform a basic sanity check if we
find a PID file and find the PID running to compare that they at least
have the command we expect. This commit performs both of these actions.

Additionally PID files are typically written to /var/run or /run
which get cleared on reboots. I am not making that change since I do
not know if there would be any ramifications to moving the location of
the PID file.

Signed-off-by: Shawn Bohrer [email protected]

PID files are inherently problematic because PIDs are reused.  This
means that at start up if we find a PID file and there is a running
process with that PID we do nothing.  Unfortunately it is possible that
our process long since died, PIDs have wrapped and are now getting
reused and the process we found is an unrelated process.  The problem is
even more likely at system start time if we find a PID file from the
previous system boot.  In this case the system is starting up and
spawning many new processes and it is likely that the low numbered PID
we received on the previous boot will be near other processes starting
in parallel.

To help solve this problem we can remove stale PID files when we shut
down cleanly.  Additionally we can perform a basic sanity check if we
find a PID file and find the PID running to compare that they at least
have the command we expect.  This commit performs both of these actions.

Additionally PID files are typically written to /var/run or /run
which get cleared on reboots.  I am not making that change since I do
not know if there would be any ramifications to moving the location of
the PID file.

Signed-off-by: Shawn Bohrer <[email protected]>
;;
esac
if [[ "$PS_STAT" -eq "$PID" ]]; then
CMD=$(ps -p $PID -o 'comm=')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request. Can you explain what those 2 lines do (208 and 209)?. I am not a shell expert :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm using ps to find the name of the program for PID $PID which we got from the PID_FILE.

$ cat /var/glu/data/logs/org.linkedin.glu.agent-server.pid
29302
$ ps -p 29302 -o 'comm='
java

I then do a string comparison to see if the command I found is equal to the command I expect to run with $JAVA_CMD. In my case $JAVA_CMD contains the full path to java for example JAVA_CMD=/opt/local/java/bin/java so basename removes all leading directory components.

$ basename /opt/local/java/bin/java
java

If you are curious basename works even with relative paths or if there is no leading directories.

$ basename java
java
$ basename ../bin/java
java

I suspect $(basename "$JAVA_CMD") will always return "java", but maybe someone out there is using something like Android's Dalvik so I figure it is best to find the command from JAVA_CMD.

This is all just a basic sanity check. We could still find a process with the PID we are looking for that happens to be a java process, but is not the glu-agent. However that is much more unlikely than the situation without the check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation.

  • Is basename a standard command on Mac/Linux/Unix?
  • not sure if that would help but one of the command line argument is this one: -Dorg.linkedin.app.name=org.linkedin.glu.agent-server so if there is an easier way to check for this vs simply checking for java then you would be more certain that it is the agent...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basename is part of GNU coreutils on Linux and part of BSD general commands on Mac. The mac man page says basename is part of the POSIX.2 standard so I'm pretty sure other Unix variants will have it as well.

We could check the command line arguments like you suggest but the test is more complicated in shell and I don't really think it is necessary. Since I'm removing the PID_FILE on clean shutdowns it is less likely that we'll have a stale PID_FILE. My systems have very few java processes so the odds of having a stale PID_FILE and having the PID wrap and get used by a different java process are very small. Perhaps others run lots of java processes, but I seem to be the first person who has hit (or at least reported/fixed) this issue in the first place which means it was already not that common even before my change. Maybe LinkedIn doesn't reboot or doesn't start glu-agent at startup.

@sodul
Copy link
Contributor

sodul commented Feb 12, 2015

Yes, basename is pretty much available everywhere. I use it all the time.
http://ss64.com/bash/basename.html

prep would be perfect to find a process id based on that command line argument. However at skyhighnetworks we always have 2 agents running, and I know others also have multiple agents per host:
$ pgrep -f ' -Dorg.linkedin.app.name=org.linkedin.glu.agent-server'
10054
11944

The -f option makes it do a pattern match on the entire command line and it return the process ids, so there is no sed/awk/gymnastic. Unfortunately pgrep and pkill are not always present on less common unix versions, though I only have access to Linux and Mac OS X boxes where it is present.

I recommend changing this lines:
if [[ "$PS_STAT" -eq "$PID" ]]; then
to
if [ "$PS_STAT" -eq "$PID" ]; then
Since the double square brackets is a bash form. If you want to use the bash form you would rather use the more readable alternatives such as:
if (( PS_STAT == PID )); then
which does a real numeral test and does not require $, or:
if [[ $PS_STAT == $PID ]]; then
which does a text equal (quotes are optional when using double square brackets).

@sbohrer
Copy link
Author

sbohrer commented Feb 12, 2015

Since the double square brackets is a bash form.

I normally stay away from bashisms but I'll note this is actually a bash script (#!/bin/bash). I'm happy to change to whatever is believed to be the most readable.

@ypujante
Copy link
Member

I don't have a preference personally. I will merge the changes with the next release of glu (unscheduled at this time).

@ypujante
Copy link
Member

I merged your changes to the current version of glu thinking that it would work. Unfortunately it is not working. This is the output message during the shutdown of the tutorial:

### Stopping Agent...
Loading config [/Volumes/Vault/deployment/content/glu/org.linkedin.glu.packaging-all-5.5.6/tutorial/distributions/tutorial/agents/org.linkedin.glu.agent-server-agent-1-tutorialZooKeeperCluster-5.5.6/5.5.6/conf/pre_master_conf.sh]...
Loading config [/Volumes/Vault/deployment/content/glu/org.linkedin.glu.packaging-all-5.5.6/tutorial/distributions/tutorial/agents/org.linkedin.glu.agent-server-agent-1-tutorialZooKeeperCluster-5.5.6/5.5.6/conf/master_conf.sh]...
Status: PID file [/Volumes/Vault/deployment/content/glu/org.linkedin.glu.packaging-all-5.5.6/tutorial/distributions/tutorial/agents/org.linkedin.glu.agent-server-agent-1-tutorialZooKeeperCluster-5.5.6/data/logs/org.linkedin.glu.agent-server.pid] not found !!!

As you can see it says PID file not found. And of course the agent is not shutting down.

11234 AgentMain

I need to back out the merge and retest everything before I can release glu...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants