sharedata.flush causes coredump #1894

Open
ghost90240 opened this issue Mar 19, 2024 · 2 comments
@ghost90240

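The skynet version is 1.5.0 (2021-11-09).
The problem: after a configuration hot update, a service called sharedata.flush and the process coredumped.
I went through the sharedata-related issues; two of them were fixed after 1.5.0:
#1820: I checked with a tool and none of the keys exceed int32.
#1797: this one should be unrelated, but I still checked how many tasks the crashing service handles. It is on the order of 1,000,000 per day, and the core happened during the hot update on day 11, which is nowhere near enough to wrap around.
I have spent two days on this without any leads (the problem is extremely rare and cannot be reproduced; out of more than a thousand servers that were hot-updated, only one cored).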
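
The wraparound estimate above can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the counter in question is a signed 32-bit integer; the issue does not say which counter #1797 concerns, and `days_until_wrap` and `total_after` are hypothetical helpers, not part of skynet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: whole days a counter can advance at `per_day`
 * increments per day before it would exceed INT32_MAX.
 * The 32-bit width is an assumption made for illustration only. */
int64_t days_until_wrap(int64_t per_day) {
    assert(per_day > 0);
    return (int64_t)INT32_MAX / per_day;
}

/* Total number of increments after `days` days at `per_day` per day. */
int64_t total_after(int64_t per_day, int64_t days) {
    return per_day * days;
}
```

Under that assumption, 11 days at about 1,000,000 tasks per day is about 11 million increments, while the counter would need more than 2,000 days at that rate to wrap.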

(gdb) bt
#0  luaS_remove (L=0x7f68628bf388, ts=0x7f6882f28450) at lstring.c:211
#1  0x0000000000418a35 in freeobj (L=0x7f68628bf388, o=0x7f6882f28450) at lgc.c:795
#2  0x0000000000418c8e in sweep2old (L=0x7f68628bf388, p=0x7f6815d4a340) at lgc.c:1082
#3  0x000000000041a1f0 in atomic2gen (L=0x7f68628bf388, g=0x7f6847b7d4d0) at lgc.c:1294
#4  0x000000000041a5d6 in entergen (L=0x7f68628bf388, g=0x7f6847b7d4d0) at lgc.c:1332
#5  0x000000000041a6d3 in fullgen (L=0x7f68628bf388, isemergency=<value optimized out>) at lgc.c:1375
#6  luaC_fullgc (L=0x7f68628bf388, isemergency=<value optimized out>) at lgc.c:1730
#7  0x0000000000413057 in lua_gc (L=0x7f68628bf388, what=<value optimized out>) at lapi.c:1195
#8  0x00000000004306a7 in luaB_collectgarbage (L=0x7f68628bf388) at lbaselib.c:248
#9  0x000000000041705e in precallC (L=0x7f68628bf388, func=<value optimized out>, nresults=0) at ldo.c:510
#10 luaD_precall (L=0x7f68628bf388, func=<value optimized out>, nresults=0) at ldo.c:576
#11 0x00000000004263ff in luaV_execute (L=<value optimized out>, ci=<value optimized out>) at lvm.c:1684
#12 0x0000000000416e63 in unroll (L=0x7f68628bf388, ud=<value optimized out>) at ldo.c:725
#13 0x0000000000415dec in luaD_rawrunprotected (L=0x7f68628bf388, f=0x417190 <resume>, ud=0x7f68a1635dcc) at ldo.c:144
#14 0x0000000000416c84 in lua_resume (L=0x7f68628bf388, from=<value optimized out>, nargs=3, nresults=0x7f68a1635e2c) at ldo.c:830
#15 0x00007f68a343e455 in lua_resumeX (L=0x7f684754ca68, co_index=1, n=3) at service-src/service_snlua.c:90
#16 auxresume (L=0x7f684754ca68, co_index=1, n=3) at service-src/service_snlua.c:146
#17 timing_resume (L=0x7f684754ca68, co_index=1, n=3) at service-src/service_snlua.c:198
#18 0x00007f68a343e760 in luaB_coresume (L=0x7f684754ca68) at service-src/service_snlua.c:217
#19 0x00000000004175bf in precallC (L=0x7f684754ca68, ci=<value optimized out>, func=<value optimized out>, 
    narg1=<value optimized out>, delta=<value optimized out>) at ldo.c:510
#20 luaD_pretailcall (L=0x7f684754ca68, ci=<value optimized out>, func=<value optimized out>, narg1=<value optimized out>, 
    delta=<value optimized out>) at ldo.c:531
#21 0x0000000000425c33 in luaV_execute (L=<value optimized out>, ci=<value optimized out>) at lvm.c:1709
#22 0x00000000004172e7 in ccall (L=0x7f684754ca68, func=<value optimized out>, nResults=-1) at ldo.c:618
#23 luaD_callnoyield (L=0x7f684754ca68, func=<value optimized out>, nResults=-1) at ldo.c:636
#24 0x0000000000415dec in luaD_rawrunprotected (L=0x7f684754ca68, f=0x4134b0 <f_call>, ud=0x7f68a1636160) at ldo.c:144
#25 0x0000000000416a8f in luaD_pcall (L=0x7f684754ca68, func=<value optimized out>, u=<value optimized out>, old_top=224, 
    ef=<value optimized out>) at ldo.c:934
#26 0x00000000004133c9 in lua_pcallk (L=0x7f684754ca68, nargs=<value optimized out>, nresults=-1, errfunc=<value optimized out>, 
    ctx=<value optimized out>, k=<value optimized out>) at lapi.c:1063
#27 0x000000000042f8ff in luaB_xpcall (L=0x7f684754ca68) at lbaselib.c:494
#28 0x000000000041705e in precallC (L=0x7f684754ca68, func=<value optimized out>, nresults=2) at ldo.c:510
#29 luaD_precall (L=0x7f684754ca68, func=<value optimized out>, nresults=2) at ldo.c:576
#30 0x00000000004263ff in luaV_execute (L=<value optimized out>, ci=<value optimized out>) at lvm.c:1684
#31 0x00000000004172e7 in ccall (L=0x7f684754ca68, func=<value optimized out>, nResults=0) at ldo.c:618
#32 luaD_callnoyield (L=0x7f684754ca68, func=<value optimized out>, nResults=0) at ldo.c:636
#33 0x0000000000415dec in luaD_rawrunprotected (L=0x7f684754ca68, f=0x4134b0 <f_call>, ud=0x7f68a1636490) at ldo.c:144
#34 0x0000000000416a8f in luaD_pcall (L=0x7f684754ca68, func=<value optimized out>, u=<value optimized out>, old_top=48, 
    ef=<value optimized out>) at ldo.c:934
#35 0x00000000004133c9 in lua_pcallk (L=0x7f684754ca68, nargs=<value optimized out>, nresults=0, errfunc=<value optimized out>, 
    ctx=<value optimized out>, k=<value optimized out>) at lapi.c:1063
#36 0x00007f689b61a05d in _cb (context=0x7f683649ed60, ud=<value optimized out>, type=9, session=318, source=3, 
    msg=<value optimized out>, sz=16) at lualib-src/lua-skynet.c:67
#37 0x0000000000409e3d in dispatch_message (ctx=0x7f683649ed60, msg=0x7f68a1636650) at skynet-src/skynet_server.c:286
#38 0x000000000040a6bf in skynet_context_message_dispatch (sm=0x7f68a5452140, q=0x7f68365f8e00, weight=-1)
    at skynet-src/skynet_server.c:414
#39 0x000000000040b53d in thread_worker (p=<value optimized out>) at skynet-src/skynet_start.c:163
#40 0x0000003262807aa1 in start_thread () from /lib64/libpthread.so.0
#41 0x00000032624e8c4d in clone () from /lib64/libc.so.6
@cloudwu
Owner

cloudwu commented Mar 19, 2024

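First, the crash happens during Lua GC, and it looks like the internal state of the Lua VM has been corrupted. Although it was triggered by calling sharedata.flush, all that shows is that sharedata.flush runs a full GC (collectgarbage()) at this line:

https://github.com/cloudwu/skynet/blob/v1.5.0/lualib/skynet/sharedata.lua#L60

I don't think the problem is in sharedata itself.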

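P.S. In any case, there is no reason to keep skynet on an old version unless you are able to maintain it on your own. Lua itself keeps evolving too, so there is likewise no reason to stay on an old Lua version. For example, https://lua.org/bugs.html shows that every minor release fixes a large number of bugs.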

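Judging from the coredump log, the line https://github.com/cloudwu/skynet/blob/v1.5.0/3rd/lua/lstring.c#L211 means that while the GC was freeing a short string, a linked-list pointer in the VM's short-string hash table had been corrupted. The sharedata library is not capable of corrupting it, either.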
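
To illustrate why a damaged chain pointer crashes at this point, here is a minimal, self-contained sketch of removing a node from a chained hash bucket. It is only modelled on the general idea of a string table with per-bucket linked lists and is not copied from lstring.c; `Node`, `Table`, `remove_unchecked`, and `remove_checked` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8

typedef struct Node {
    unsigned hash;
    struct Node *next;          /* next node in the same bucket chain */
} Node;

typedef struct Table {
    Node *bucket[NBUCKETS];
    int nuse;                   /* number of nodes currently stored */
} Table;

void table_init(Table *t) {
    for (int i = 0; i < NBUCKETS; i++) t->bucket[i] = NULL;
    t->nuse = 0;
}

void table_insert(Table *t, Node *n) {
    assert(n != NULL);
    Node **head = &t->bucket[n->hash % NBUCKETS];
    n->next = *head;            /* push at the front of the chain */
    *head = n;
    t->nuse++;
}

/* Unchecked removal: trusts that `n` is reachable from its bucket.
 * If some other code has overwritten a `next` pointer, the loop walks
 * off the chain and dereferences NULL or a garbage address. */
void remove_unchecked(Table *t, Node *n) {
    Node **p = &t->bucket[n->hash % NBUCKETS];
    while (*p != n)
        p = &(*p)->next;
    *p = n->next;
    t->nuse--;
}

/* Checked variant, for illustration: returns 0 on success and -1 if the
 * chain ends before `n` is found. It only catches a chain that was cut
 * short with NULL; a wild pointer would still crash. */
int remove_checked(Table *t, Node *n) {
    Node **p = &t->bucket[n->hash % NBUCKETS];
    while (*p != NULL && *p != n)
        p = &(*p)->next;
    if (*p == NULL) return -1;
    *p = n->next;
    t->nuse--;
    return 0;
}
```

In this sketch the removal code is correct and still crashes once a `next` field has been overwritten by something else, which is why a crash at this kind of loop points at the code that wrote the bad pointer rather than at the caller that triggered the GC.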

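I think what you need to do is audit all of your C code and look for out-of-bounds writes or other memory errors. At the very least, you can start by checking for simple problems such as double frees: https://github.com/cloudwu/skynet/wiki/MemoryHook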
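
As a rough illustration of how an allocation hook can catch out-of-bounds writes, the sketch below wraps malloc with a size header and a trailing canary that is verified before the block is freed. It shows the general technique only and is not skynet's MEMORY_CHECK implementation; `guard_alloc`, `guard_check`, `guard_free`, `Header`, and `CANARY` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CANARY 0xC0FFEE11u      /* arbitrary marker value */

/* Header placed before the user bytes; the union keeps the user
 * pointer aligned for any object type (requires C11 max_align_t). */
typedef union Header {
    size_t size;
    max_align_t align;
} Header;

/* Layout: [Header][user bytes ...][uint32_t canary] */
void *guard_alloc(size_t size) {
    uint32_t canary = CANARY;
    if (size > SIZE_MAX - sizeof(Header) - sizeof(canary)) return NULL;
    unsigned char *raw = malloc(sizeof(Header) + size + sizeof(canary));
    if (raw == NULL) return NULL;
    ((Header *)raw)->size = size;
    /* memcpy because the canary position may be unaligned */
    memcpy(raw + sizeof(Header) + size, &canary, sizeof(canary));
    return raw + sizeof(Header);
}

/* Returns 0 if the trailing canary is intact, -1 if it was overwritten. */
int guard_check(const void *p) {
    assert(p != NULL);
    const unsigned char *user = p;
    size_t size = ((const Header *)(user - sizeof(Header)))->size;
    uint32_t canary;
    memcpy(&canary, user + size, sizeof(canary));
    return canary == CANARY ? 0 : -1;
}

/* Checks the canary, releases the block, and reports the check result. */
int guard_free(void *p) {
    if (p == NULL) return 0;
    int status = guard_check(p);
    free((unsigned char *)p - sizeof(Header));
    return status;
}
```

A check like this only reports an overrun when the block is freed or explicitly checked, and only if the stray write lands on the canary, so it narrows the search without replacing a review of the C code.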

Because the coredump is so rare, you should focus on C code paths that are seldom executed.

@ghost90240
Author

ghost90240 commented Mar 19, 2024

Thanks for the advice, cloudwu.
MEMORY_CHECK is already enabled. I will upgrade to the latest version and keep watching.
