syncd crash in syncd::VendorSai::logSet() during docker startup #21180

Closed
anamehra opened this issue Dec 14, 2024 · 16 comments · Fixed by sonic-net/sonic-sairedis#1505 · May be fixed by sonic-net/sonic-sairedis#1518

Comments

@anamehra
Contributor

Description

We observed this syncd crash once on a single-ASIC system recently, during config reload:

(gdb) bt
#0 0x00007fce07be70ca in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x000055e218d7f837 in syncd::VendorSai::logSet(_sai_api_t, _sai_log_level_t) ()
#2 0x000055e218d53487 in syncd::Syncd::saiLoglevelNotify(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#3 0x000055e218d6d5d2 in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) ()
#4 0x00007fce08ccdbc9 in swss::Logger::linkToDbWithOutput(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#5 0x00007fce08ccdf4a in swss::Logger::linkToDb(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#6 0x000055e218d55422 in syncd::Syncd::setSaiApiLogLevel() ()
#7 0x000055e218d66382 in syncd::Syncd::Syncd(std::shared_ptr<sairedis::SaiInterface>, std::shared_ptr<syncd::CommandLineOptions>, bool) ()
#8 0x000055e218d4f82d in syncd_main(int, char**) ()
#9 0x000055e218d4d98f in main ()
(gdb) thread apply all bt

It looks like the crash happened very early in the bring-up stage. No apparent errors were seen in the syslogs.


@tjchadaga
Contributor

@anamehra - could you please help clarify the image version and platform on which this is seen?

@tjchadaga
Contributor

@anamehra - please also upload the techsupport

tjchadaga added the "Triaged" and "Awaiting Info ⌛" labels on Dec 18, 2024

@sdszhang
Contributor

admin@xxx:~$ show version

SONiC Software Version: SONiC.internal-202405-cisco-111.111616281-31df541974
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: 31df541974
Build date: Fri Jan  3 05:56:33 UTC 2025
Built by: azureuser@490b1164c000000

Platform: x86_64-88_lc0_36fh-r0
HwSKU: Cisco-88-LC0-36FH-O36
ASIC: cisco-8000
ASIC Count: 3
Serial Number: xxx
Model Number: 88-LC0-36FH

@XuChen-MSFT
Contributor

The last sonic-buildimage commit in the affected image is:

79591e1 (2024-10-28 17:52) - Update cisco-8000.ini to ref=202311.1.0.6 (#20639)

@abdosi
Contributor

abdosi commented Jan 22, 2025

@kcudnik: can you help here?

@kcudnik
Contributor

kcudnik commented Jan 22, 2025

As you can see, the crash is in std::map, in the tree insert-and-rebalance step. I'm 99% sure this is a race condition, since that set happens after the swss-common linkToDb call binds the notification, and I just checked that logSet is not protected by a mutex. I will make a PR to fix this.

@kcudnik
Contributor

kcudnik commented Jan 22, 2025

Do you have a consistent repro of this? Can you show the other threads' backtraces from this dump?

@kcudnik
Contributor

kcudnik commented Jan 22, 2025

I see you have "thread apply all bt", but its output is empty?

@anamehra
Contributor Author

anamehra commented Jan 22, 2025

Thanks for looking into this, @kcudnik!

The issue is not easily reproducible and is rarely seen. We know of only a couple of instances, on two different platforms, in the last 3 months.

Here is the 'thread apply all bt' from another instance:

root@svcstr2-8800-lc2-1:/core_crash# gdb /usr/bin/syncd -c syncd.1735976886.88.0.core 
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/syncd...
Reading symbols from /usr/lib/debug/.build-id/00/ac77220eba4b1ed2a99368b3fcd86112fbdef2.debug...
[New LWP 116]
[New LWP 88]
[New LWP 117]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/syncd -u -s -B null -p /usr/share/sonic/hwsku/sai.profile -l'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fdb4dc7b0ca in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
[Current thread is 1 (Thread 0x7fdb43db2700 (LWP 116))]
(gdb) thread apply all bt

Thread 3 (Thread 0x7fdb435b1700 (LWP 117)):
#0  0x00007fdb4da931e1 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fdb4da989c3 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000563f8fa186bd in std::this_thread::sleep_for<long, std::ratio<1l, 1l> > (__rtime=..., __rtime=...) at /usr/include/c++/10/thread:401
#3  syncd::TimerWatchdog::threadFunction (this=0x563f917cc548) at TimerWatchdog.cpp:128
#4  0x00007fdb4dc8ded0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fdb4de52ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fdb4dacca6f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7fdb43e87200 (LWP 88)):
#0  0x00007fdb4de5d08c in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fdb4f4ad1f5 in redisBufferRead () from /usr/lib/x86_64-linux-gnu/libhiredis.so.0.14
#2  0x00007fdb4f4ad660 in redisGetReply () from /usr/lib/x86_64-linux-gnu/libhiredis.so.0.14
#3  0x00007fdb4f3f8de0 in swss::RedisReply::RedisReply(swss::RedisContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#4  0x00007fdb4f3f8e97 in swss::RedisReply::RedisReply(swss::RedisContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#5  0x00007fdb4f405274 in swss::DBConnector::select(swss::DBConnector*) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#6  0x00007fdb4f40bd1f in swss::DBConnector::DBConnector(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, bool, swss::SonicDBKey const&) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7  0x00007fdb4f40bee5 in swss::DBConnector::DBConnector(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, bool) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#8  0x00007fdb4f3f4b08 in swss::Logger::linkToDbWithOutput(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#9  0x00007fdb4f3f587a in swss::Logger::linkToDb(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#10 0x0000563f8f9f8732 in syncd::Syncd::setSaiApiLogLevel (this=0x563f917cc150) at Syncd.cpp:4797
#11 0x0000563f8fa09b34 in syncd::Syncd::Syncd (this=0x563f917cc150, vendorSai=std::shared_ptr<sairedis::SaiInterface> (use count 3, weak count 0) = {...}, cmd=std::shared_ptr<syncd::CommandLineOptions> (use count 3, weak count 0) = {...}, isWarmStart=<optimized out>) at Syncd.cpp:70
#12 0x0000563f8f9f1f1d in __gnu_cxx::new_allocator<syncd::Syncd>::construct<syncd::Syncd, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (this=<optimized out>, __p=0x563f917cc150) at /usr/include/c++/10/new:175
#13 std::allocator_traits<std::allocator<syncd::Syncd> >::construct<syncd::Syncd, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (__p=0x563f917cc150, __a=...) at /usr/include/c++/10/bits/alloc_traits.h:512
#14 std::_Sp_counted_ptr_inplace<syncd::Syncd, std::allocator<syncd::Syncd>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (__a=..., this=0x563f917cc140) at /usr/include/c++/10/bits/shared_ptr_base.h:551
#15 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<syncd::Syncd, std::allocator<syncd::Syncd>, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (__a=..., __p=<synthetic pointer>: <optimized out>, this=<synthetic pointer>) at /usr/include/c++/10/bits/shared_ptr_base.h:682
#16 std::__shared_ptr<syncd::Syncd, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<syncd::Syncd>, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (__tag=..., this=<synthetic pointer>) at /usr/include/c++/10/bits/shared_ptr_base.h:1371
#17 std::shared_ptr<syncd::Syncd>::shared_ptr<std::allocator<syncd::Syncd>, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (__tag=..., this=<synthetic pointer>) at /usr/include/c++/10/bits/shared_ptr.h:408
#18 std::allocate_shared<syncd::Syncd, std::allocator<syncd::Syncd>, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> (__a=...) at /usr/include/c++/10/bits/shared_ptr.h:860
#19 std::make_shared<syncd::Syncd, std::shared_ptr<syncd::VendorSai>&, std::shared_ptr<syncd::CommandLineOptions>&, bool&> () at /usr/include/c++/10/bits/shared_ptr.h:876
#20 syncd_main (argc=argc@entry=8, argv=argv@entry=0x7ffe15bbda98) at syncd_main.cpp:69
#21 0x0000563f8f9efcef in main (argc=8, argv=0x7ffe15bbda98) at main.cpp:9

Thread 1 (Thread 0x7fdb43db2700 (LWP 116)):
#0  0x00007fdb4dc7b0ca in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x0000563f8fa23977 in std::_Rb_tree<_sai_api_t, std::pair<_sai_api_t const, _sai_log_level_t>, std::_Select1st<std::pair<_sai_api_t const, _sai_log_level_t> >, std::less<_sai_api_t>, std::allocator<std::pair<_sai_api_t const, _sai_log_level_t> > >::_M_insert_node (__z=0x7fdb3c004410, __p=<optimized out>, __x=0x0, this=0x563f917cd9a8) at /usr/include/c++/10/bits/stl_tree.h:2367
#2  std::_Rb_tree<_sai_api_t, std::pair<_sai_api_t const, _sai_log_level_t>, std::_Select1st<std::pair<_sai_api_t const, _sai_log_level_t> >, std::less<_sai_api_t>, std::allocator<std::pair<_sai_api_t const, _sai_log_level_t> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<_sai_api_t const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<_sai_api_t const, _sai_log_level_t> >, std::piecewise_construct_t const&, std::tuple<_sai_api_t const&>&&, std::tuple<>&&) (__pos=..., this=0x563f917cd9a8) at /usr/include/c++/10/bits/stl_tree.h:2468
#3  std::map<_sai_api_t, _sai_log_level_t, std::less<_sai_api_t>, std::allocator<std::pair<_sai_api_t const, _sai_log_level_t> > >::operator[] (__k=<synthetic pointer>: SAI_API_MIRROR, this=0x563f917cd9a8) at /usr/include/c++/10/bits/stl_map.h:501
#4  syncd::VendorSai::logSet (this=0x563f917cd6d0, api=SAI_API_MIRROR, log_level=SAI_LOG_LEVEL_NOTICE) at VendorSai.cpp:1684
#5  0x0000563f8f9f5f37 in syncd::Syncd::saiLoglevelNotify (this=0x563f917cc150, strApi="SAI_API_MIRROR", strLogLevel="SAI_LOG_LEVEL_NOTICE") at Syncd.cpp:4765
#6  0x0000563f8fa115f2 in std::__invoke_impl<void, void (syncd::Syncd::*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), syncd::Syncd*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__f=<optimized out>, __t=<optimized out>, __f=<optimized out>, __t=<optimized out>) at /usr/include/c++/10/bits/char_traits.h:329
#7  std::__invoke<void (syncd::Syncd::*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), syncd::Syncd*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__fn=<optimized out>) at /usr/include/c++/10/bits/invoke.h:95
#8  std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::__call<void, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, 0ul, 1ul, 2ul>(std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&>&&, std::_Index_tuple<0ul, 1ul, 2ul>) (__args=..., this=<optimized out>) at /usr/include/c++/10/functional:416
#9  std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::operator()<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) (this=<optimized out>) at /usr/include/c++/10/functional:499
#10 std::__invoke_impl<void, std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__invoke_other, std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#11 std::__invoke_r<void, std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#12 std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::_Bind<void (syncd::Syncd::*(syncd::Syncd*, std::_Placeholder<1>, std::_Placeholder<2>))(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) (__functor=..., __args#0=..., __args#1=...) at /usr/include/c++/10/bits/std_function.h:291
#13 0x00007fdb4f3f3c92 in swss::Logger::settingThread() () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#14 0x00007fdb4dc8ded0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fdb4de52ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007fdb4dacca6f in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) 
 




@kcudnik
Contributor

kcudnik commented Jan 22, 2025

From those threads, it does not look like another thread is actively accessing the std::map; maybe one already did. I don't see any other scenario in which _Rb_tree_insert_and_rebalance would crash. It's this line here:
https://github.com/sonic-net/sonic-sairedis/blob/master/syncd/VendorSai.cpp#L1890
and that's just a simple enum-to-enum map:
https://github.com/sonic-net/sonic-sairedis/blob/master/syncd/VendorSai.h#L234
I will try to reproduce this locally and get back. I'm 99% sure that a mutex around access to that map will solve the issue here.

@kcudnik
Contributor

kcudnik commented Jan 22, 2025

I wrote this:

#include <map>
#include <thread>

std::map<int,int> m;

int i = 0;

void run()
{
    while(1)
    {
        m[i] = i;

        i++;
    }
}

int main()
{

    std::thread t1(run);
    std::thread t2(run);

    t1.join();
    t2.join();

} // g++ a.cpp -O2 -g -lpthread

It crashed immediately:

(gdb) thread apply all bt

Thread 3 (Thread 0x7f704f0ee700 (LWP 2193)):
#0  _int_malloc (av=av@entry=0x7f7040000020, bytes=bytes@entry=40) at malloc.c:4116
#1  0x00007f704f2dd299 in __GI___libc_malloc (bytes=40) at malloc.c:3066
#2  0x00007f704f4fab29 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x000055acfb7e4451 in __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<int const, int> > >::allocate (this=0x55acfb7e7060 <m>, __n=1) at /usr/include/c++/9/ext/new_allocator.h:102
#4  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<int const, int> > > >::allocate (__a=..., __n=1) at /usr/include/c++/9/bits/alloc_traits.h:443
#5  std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_get_node (this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_tree.h:580
#6  std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_create_node<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) (this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_tree.h:630
#7  std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, int> >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) (__pos=..., this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_tree.h:2460
#8  std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >::operator[] (__k=@0x55acfb7e7040: 923, this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_map.h:499
#9  run () at a.cpp:13
#10 0x00007f704f526df4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007f704f63a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x00007f704f362353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f704f0ef740 (LWP 2192)):
#0  __pthread_clockjoin_ex (threadid=140120339441408, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1  0x00007f704f527057 in std::thread::join() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x000055acfb7e424b in main () at a.cpp:25

Thread 1 (Thread 0x7f704e8ed700 (LWP 2194)):
#0  0x00007f704f51250a in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x000055acfb7e4497 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_insert_node (__z=0x7f7048005390, __p=<optimized out>, __x=0x0, this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_tree.h:2359
#2  std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, int> >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) (__pos=..., this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_tree.h:2467
#3  std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >::operator[] (__k=@0x55acfb7e7040: 923, this=0x55acfb7e7060 <m>) at /usr/include/c++/9/bits/stl_map.h:499
#4  run () at a.cpp:13
#5  0x00007f704f526df4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f704f63a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f704f362353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Same crash: _Rb_tree_insert_and_rebalance

In my example, since it's simple, the other thread is in the middle of node allocation. I was expecting to see something similar in the syncd dump, but this is evidently the cause. I will make a fix and post it.
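
For comparison, here is a minimal sketch of the same two-writer loop with the shared map and counter guarded by a `std::mutex`. It is only an illustration of the locking idea, not the sairedis fix: the helper name `runTwoWriters`, the bounded iteration count, and the local `mtx` are assumptions introduced for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical helper for this sketch: runs the reproducer's loop on two
// threads, but with every access to the shared state under one mutex.
std::size_t runTwoWriters(int iterations)
{
    std::map<int, int> m;   // shared map, as in the reproducer
    int i = 0;              // shared counter, as in the reproducer
    std::mutex mtx;         // guards both m and i

    auto run = [&]()
    {
        // Bounded loop (the reproducer loops forever) so the sketch ends.
        for (int n = 0; n < iterations; ++n)
        {
            // Hold the lock across the insert and the increment so the
            // two threads never modify the tree or the counter at once.
            std::lock_guard<std::mutex> lock(mtx);
            m[i] = i;
            i++;
        }
    };

    std::thread t1(run);
    std::thread t2(run);

    t1.join();
    t2.join();

    // With the lock held, no increment is lost.
    assert(i == 2 * iterations);

    // Each iteration inserted a distinct key, so this is 2 * iterations.
    return m.size();
}
```

Calling `runTwoWriters(100000)` from a `main` and building with `g++ -O2 -g -pthread` should complete without the segfault; removing the `lock_guard` line brings back the unsynchronized behavior of the original reproducer.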

@kcudnik
Contributor

kcudnik commented Jan 22, 2025

fix here: sonic-net/sonic-sairedis#1505

@rlhui
Contributor

rlhui commented Jan 22, 2025

@bingwang-ms, this is not specific to t2.

@bingwang-ms
Contributor

@kcudnik This looks like a day-1 issue?

@kcudnik
Contributor

kcudnik commented Jan 22, 2025

> @kcudnik This looks like a day-1 issue?

What does that mean?

@bingwang-ms
Contributor

> > @kcudnik This looks like a day-1 issue?
>
> What does that mean?

I mean this issue has been there since day 1; it's not caused by a recent change. Can we confirm that?

yejianquan added a commit to sonic-net/sonic-sairedis that referenced this issue Feb 6, 2025
* [syncd] Move logSet logGet under mutex to prevent race condition

Fixes: sonic-net/sonic-buildimage#21180

A mutex is added to protect m_logLevelMap when logSet is called from multiple threads.

Co-authored-by: Jianquan Ye <[email protected]>
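
A minimal sketch of the pattern this commit message describes, under stated assumptions. `LogLevelStore`, `Api`, and `LogLevel` are hypothetical stand-ins for `VendorSai`, `sai_api_t`, and `sai_log_level_t`; only the member name `m_logLevelMap` and the method names `logSet` and `logGet` come from the thread. The real signatures, return types, and default level in sonic-sairedis may differ.

```cpp
#include <map>
#include <mutex>

// Hypothetical stand-ins for sai_api_t and sai_log_level_t.
enum class Api { Switch, Port, Mirror };
enum class LogLevel { Debug, Info, Notice, Warn };

class LogLevelStore
{
public:
    // Record the level for one API. Safe to call from several threads.
    void logSet(Api api, LogLevel level)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_logLevelMap[api] = level;
    }

    // Return the stored level, or Notice (an assumed default) if unset.
    // Uses find() so that a read never inserts into the map.
    LogLevel logGet(Api api) const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_logLevelMap.find(api);
        return it == m_logLevelMap.end() ? LogLevel::Notice : it->second;
    }

private:
    mutable std::mutex m_mutex;             // guards m_logLevelMap
    std::map<Api, LogLevel> m_logLevelMap;  // enum-to-enum map
};
```

Taking the same lock in both the setter and the getter matters here: `std::map::operator[]` can insert and rebalance the tree, so even a reader must not run concurrently with it.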