Worker process core dump: Assertion failed: spool->msg_status == MSG_INVALID #676
I work with @nickstenning and wanted to provide additional context on this bug: it looks like this is the same bug as #670.

Doing a bit of debugging, it appears that when a spool is cycled via the `spool_fetch_msg_callback` and `spool_fetch_msg` functions, and someone has passed an invalid message id, the code falls through to the assertion at line 479 (in ff48ec7) with the spool's status set to `MSG_EXPECTED` instead of `MSG_INVALID`.

Since this is an `assert()` (yes, I know we can define `NDEBUG` to compile it out), using the code as-is without defining it means the worker aborts. The simple patch I've applied to our system in production adds handling to convert a `MSG_EXPECTED` status to `MSG_INVALID` at this point. I didn't see a clear mechanism to validate the message ID passed in from the client.

This issue is amplified under load, as there is a higher likelihood that the spool will be run through the offending code. When this happens, all subscribers connected to the particular nginx worker lose their connection / stream and must reconnect to continue receiving.

I'm happy to contribute my patch, but it feels like it might be the wrong approach for the project. |
Reproduction steps: add load to nchan with the rough config Nick provided, then request a non-existent channel with an invalid id in the 'Last-Event-ID' header. I found it was easier to reproduce if I interrupted the connection before the 30-second timeout; if nchan had no load, it was much harder to get a crash. Not providing a Last-Event-ID seems to preclude the assert being hit, and ostensibly fewer workers will amplify it. This is with a built-from-source version as Nick outlined. |
I've recently moved part of our setup to the latest version of nchan and have been looking at this error. I have found a way to reproduce via docker, after a single message.

Dockerfile (with build instructions):
# Original Source https://gist.github.com/hermanbanken/96f0ff298c162a522ddbba44cad31081 (with tweaks)
#################
# Builder Image #
#################
# https://hub.docker.com/_/nginx - stable-alpine-slim
FROM nginx:1.24.0-alpine-slim AS builder
# nginx:alpine contains NGINX_VERSION environment variable which we can re-use, like so:
# ENV NGINX_VERSION 1.24.0
# Our NCHAN version
ENV NCHAN_VERSION 1.3.6
# Download sources
RUN wget "http://nginx.org/download/nginx-${NGINX_VERSION}.tar.gz" -O nginx.tar.gz && \
wget "https://github.com/slact/nchan/archive/v${NCHAN_VERSION}.tar.gz" -O nchan.tar.gz
# For latest build deps, see https://github.com/nginxinc/docker-nginx/blob/master/stable/alpine/Dockerfile
RUN apk add --no-cache --virtual .build-deps \
gcc \
libc-dev \
make \
openssl-dev \
pcre-dev \
zlib-dev \
linux-headers \
libxslt-dev \
gd-dev \
geoip-dev \
libedit-dev \
bash \
alpine-sdk \
findutils
# Reuse same cli arguments as the nginx:alpine image used to build
RUN CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') && \
mkdir -p /usr/src && \
tar -zxC /usr/src -f nginx.tar.gz && \
tar -xzvf "nchan.tar.gz" && \
NCHANDIR="$(pwd)/nchan-${NCHAN_VERSION}" && \
cd /usr/src/nginx-$NGINX_VERSION && \
./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && \
make && make install
################
# Final Image #
################
FROM nginx:1.24.0-alpine-slim
# Extract the dynamic module NCHAN from the builder image
COPY --from=builder /usr/local/nginx/modules/ngx_nchan_module.so /usr/local/nginx/modules/ngx_nchan_module.so
# Remove default nginx configs and add our base config
RUN rm /etc/nginx/conf.d/default.conf
COPY nginx.conf /etc/nginx/nginx.conf
CMD ["nginx", "-g", "daemon off;"]
# Generated from: https://github.com/slact/nchan#conceptual-overview
load_module /usr/local/nginx/modules/ngx_nchan_module.so;
worker_processes 5;
events {
worker_connections 1024;
}
http {
server {
listen 80;
location = /sub {
nchan_subscriber websocket;
nchan_channel_id foobar;
}
location = /pub {
nchan_publisher;
nchan_channel_id foobar;
nchan_message_timeout 5s;
}
}
}

🧠 I think the `nchan_message_timeout 5s` is a factor here.

How to reproduce:
$ docker run -it -v "$(pwd)/nginx.conf:/etc/nginx/nginx.conf" -p 8000:80 nchan:msg-invalid
$ curl -X POST -d "hello" "http://localhost:8000/pub"
$ curl -X GET "http://localhost:8000/pub"
queued messages: 1
last requested: -1 sec. ago
active subscribers: 0
last message id: 1704197750:0

Take note of the `last message id` value.
$ curl --verbose -X GET "http://localhost:8000/pub"
...
< HTTP/1.1 404 Not Found
...
$ npx wscat -c "ws://localhost:8000/sub?last_message_id=1704197750:0"
Connected (press CTRL+C to quit)
> ✅ Get a valid websocket connection. Ctrl+D or Ctrl+C to kill the connection.
$ npx wscat -c "ws://localhost:8000/sub?last_message_id=1704197750:0"
Connected (press CTRL+C to quit)
Disconnected (code: 1006, reason: "")

Nginx log:
Unknowns:
|
Another less consistent way to get the spool error:
$ curl -X POST -d "hello" "http://localhost:8000/pub"
$ npx wscat -c "ws://localhost:8000/sub?last_event_id=$(date +%s):0" -x quit

☝️ Run the above a couple of times, and you get the following log:
|
Thanks for that @b-n , I'll have a fix shortly |
fixed in master (commit 7765ec1). This fix touches a very hot code path, so it took some time to verify. I will tag a version in another week or so. |
I've built a new image from bleeding edge (e.g. d73ab7a), and we're currently using it in production. Good news: it works as it did prior, nothing in particular breaking. I was hoping that this would fix the error we were seeing, however we're still seeing it come in with production workloads:

Assertion failed: spool->msg_status == MSG_INVALID (//nchan-d73ab7a034b8fa9cb18d59bbf473ca65048054e8/src/store/spool.c: spool_fetch_msg: 477)
2024/01/26 11:45:36 [alert] 1#1: worker process 2100 exited on signal 6

It's very possible that my reproduction steps above actually found a different error that you managed to fix, though 🤷. Note: I haven't tried my reproduction steps above on the master changes; I'm essentially just running the master branch in production and looking at logs. I'll see if I can gather any other information and/or trigger points. |
@b-n I found that it didn't require a message on the queue for the requested channel (in fact, I think if the channel has a message there isn't a crash). (We aren't using websocket, but a more boring pub/sub.) The core issue is that someone is requesting an invalid message id, I think on a channel with nothing in it. We have some users that poll ancient channels we will not ever publish to again (outside of a uuid collision), and it seems this is the triggering scenario. I also noticed that when the load on nchan dropped off it was |
Yeah, I've noticed something similar too. We're using this on a homebrewed deployment tool (that we're slowly refactoring) to show statuses of development environments and to stream logs to the UI. Some developers keep the page pinned, so when their machine starts up, it just starts pulling the old messages from the last event time. In this case we're using nchan.js on the FE side.

I even went as far as forcing a page refresh every hour (meta content-reload) and overriding the localstorage value when the page loads. Of course, there are people that have had an inactive page since before I implemented these workarounds, so it still hasn't stopped people requesting old event ids. Last-Event-ID is still handy for us to poll with, since we want the client to reconnect if the connection drops.

I can confirm that everything looks like it's working fine until someone sends a last_event_id request for an event that has expired (at least, that's when I see it happening). I'm not sure if it needs to be a specific event that has already expired, or whether anything in the past prior to an expired event will do. And when the worker does eventually get this old message and dies, it of course takes all other connections with it, which makes the log hard to read, since the order things are reported to nginx is not necessarily the order you'd expect.

Sorry for the wall of text! |
I can confirm the id can be one that was never used / never existed; my local repro steps used both expired and never-existing ids. |
@tempusfrangit could you please check the fix I posted to master? I believe this will fix your issue as well. |
@b-n welp, ok. Think you can get me a trace from a coredump? |
@b-n @tempusfrangit nevermind i figured it out. will post a fix later today. |
alright, fix pushed to master, with tests for both known ways to trigger the bug. Lines 380 to 381 in cc705a9:

> No, it wasn't right. It wasn't right at all. |
Ah nice! I'll give it a roll on Monday and let you know the results 😄 Curious: if we were using the redis-backed version instead of the in-memory one, would we have skipped this issue altogether? (I was tempted to go through the code, but thought it easier to just ask.) |
@b-n we use the redis backed one so you'd still have hit it. |
@b-n Nchan uses the in-memory storage engine as a local cache layer for Redis, so yes this would affect Redis users too. |
@slact Looks like there are some build errors showing up now.

Compile error:

```console
#8 18.79 cc -c -fPIC -pipe -O -W -Wall -Wpointer-arith -Wno-unused-parameter -Werror -g -I src/core -I src/event -I src/event/modules -I src/os/unix -I //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src -I objs -I src/http -I src/http/modules \
#8 18.79   -o objs/addon/subscribers/benchmark.o \
#8 18.79   //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/subscribers/benchmark.c
#8 18.84 cc -c -fPIC -pipe -O -W -Wall -Wpointer-arith -Wno-unused-parameter -Werror -g -I src/core -I src/event -I src/event/modules -I src/os/unix -I //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src -I objs -I src/http -I src/http/modules \
#8 18.84   -o objs/addon/store/spool.o \
#8 18.84   //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c: In function 'spool_fetch_msg_callback':
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:305:33: error: unused variable 'nuspool' [-Werror=unused-variable]
#8 18.88   305 |   subscriber_pool_t *spool, *nuspool;
#8 18.88       |                              ^~~~~~~
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:304:25: error: variable 'prev_status' set but not used [-Werror=unused-but-set-variable]
#8 18.88   304 |   nchan_msg_status_t prev_status;
#8 18.88       |                      ^~~~~~~~~~~
#8 19.01 cc1: all warnings being treated as errors
#8 19.01 make[1]: *** [objs/Makefile:1625: objs/addon/store/spool.o] Error 1
#8 19.01 make[1]: Leaving directory '/usr/src/nginx-1.24.0'
#8 19.01 make: *** [Makefile:10: build] Error 2
#8 ERROR: process "/bin/sh -c CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') mkdir -p /usr/src && tar -zxC /usr/src -f nginx.tar.gz && tar -xzvf \"nchan.tar.gz\" && NCHANDIR=\"$(pwd)/nchan-${NCHAN_VERSION}\" && cd /usr/src/nginx-$NGINX_VERSION && ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && make && make install" did not complete successfully: exit code: 2
------
 > [builder 4/4] RUN CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') mkdir -p /usr/src && tar -zxC /usr/src -f nginx.tar.gz && tar -xzvf "nchan.tar.gz" && NCHANDIR="$(pwd)/nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7" && cd /usr/src/nginx-1.24.0 && ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && make && make install:
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:305:33: error: unused variable 'nuspool' [-Werror=unused-variable]
#8 18.88   305 |   subscriber_pool_t *spool, *nuspool;
#8 18.88       |                              ^~~~~~~
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:304:25: error: variable 'prev_status' set but not used [-Werror=unused-but-set-variable]
#8 18.88   304 |   nchan_msg_status_t prev_status;
#8 18.88       |                      ^~~~~~~~~~~
#8 19.01 cc1: all warnings being treated as errors
#8 19.01 make[1]: *** [objs/Makefile:1625: objs/addon/store/spool.o] Error 1
#8 19.01 make[1]: Leaving directory '/usr/src/nginx-1.24.0'
#8 19.01 make: *** [Makefile:10: build] Error 2
------
Dockerfile:40
--------------------
  39 |     # Reuse same cli arguments as the nginx:alpine image used to build
  40 | >>> RUN CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') \
  41 | >>>     mkdir -p /usr/src && \
  42 | >>>     tar -zxC /usr/src -f nginx.tar.gz && \
  43 | >>>     tar -xzvf "nchan.tar.gz" && \
  44 | >>>     NCHANDIR="$(pwd)/nchan-${NCHAN_VERSION}" && \
  45 | >>>     cd /usr/src/nginx-$NGINX_VERSION && \
  46 | >>>     ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && \
  47 | >>>     make && make install
```

This is a build where |
@slact I finally got around to actually implementing the change. It looks like there are no more MSG_INVALID errors coming through for us. I will keep 👀 on this thread/next release, since our current build targets the commit currently on master. But yeah, good news 😄 |
Hi,
Same error seen below:

$ /usr/local/nginx/sbin/nginx -V
nginx version: nginx/1.27.2
built with OpenSSL 1.0.2k-fips 26 Jan 2017
TLS SNI support enabled
configure arguments: --add-module=/opt/nchan-1.3.7/ --with-http_auth_request_module --with-http_ssl_module

$ cat nginx_error.log
/opt/nchan-1.3.7//src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2024/10/09 17:59:16 [alert] 1980#0: worker process 1986 exited on signal 6 (core dumped)
/opt/nchan-1.3.7//src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2024/10/09 18:30:46 [alert] 1980#0: worker process 1985 exited on signal 6 (core dumped)

$ cat /proc/version
Linux version 4.19.91-009.ali4000.alios7.x86_64 (admin@ae2f75727d32) (gcc version 6.5.1 20181026 (Alibaba 6.5.1-1) (GCC)) #1 SMP Mon Jan 25 10:47:38 CST 2021 |
Hello! We're running nchan in production and encountering this error, which is regularly crashing worker processes and dumping core. We might be able to share a core dump if that's helpful, but as it may contain customer data, that might not be easy unless we can find a way to reproduce the problem in a non-production environment.
The crash is occurring repeatedly for periods of a few hours and then not for hours at a time, which makes me think it may be related to a particular payload or client.
We're running nchan 1.3.6 and nginx 1.21.3 on alpine 3.18.2, i.e. with musl.
Assertion failed: spool->msg_status == MSG_INVALID (/usr/src/nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)
We have a relatively straightforward nchan setup. I'll share some (lightly edited) configs below.
publish.conf
subscribe-http.conf
subscribe-https.conf