Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.8.0
- None
- 3
- 9223372036854775807
Description
During testing we lost one of the IB leaf switches, which caused all of our Lustre routers to crash with the following error:
2015-12-11T10:53:29.539273-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) ASSERTION( peer->ibp_connecting > 0 || peer->ibp_accepting > 0 || !list_empty(&peer->ibp_conns) ) failed:
2015-12-11T10:53:29.539305-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) LBUG
2015-12-11T10:53:29.539313-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02
2015-12-11T10:53:29.539319-05:00 c0-0c0s2n3 Call Trace:
2015-12-11T10:53:29.539324-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
2015-12-11T10:53:29.539332-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
2015-12-11T10:53:29.539339-05:00 c0-0c0s2n3 [<ffffffffa025bac0>] lbug_with_loc+0x90/0x1d0 [libcfs]
2015-12-11T10:53:29.539348-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
2015-12-11T10:53:29.539358-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
2015-12-11T10:53:29.539364-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
2015-12-11T10:53:29.539369-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
2015-12-11T10:53:29.539375-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
2015-12-11T10:53:29.539380-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
2015-12-11T10:53:29.539385-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
2015-12-11T10:53:29.539389-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
2015-12-11T10:53:29.539394-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
2015-12-11T10:53:29.539406-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
2015-12-11T10:53:29.539411-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
2015-12-11T10:53:29.539416-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
2015-12-11T10:53:29.539421-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
2015-12-11T10:53:29.539427-05:00 c0-0c0s2n3 Kernel panic - not syncing: LBUG
2015-12-11T10:53:29.539432-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
2015-12-11T10:53:29.539437-05:00 c0-0c0s2n3 Call Trace:
2015-12-11T10:53:29.539442-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
2015-12-11T10:53:29.539447-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
2015-12-11T10:53:29.539452-05:00 c0-0c0s2n3 [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
2015-12-11T10:53:29.539457-05:00 c0-0c0s2n3 [<ffffffff810060f5>] show_trace+0x15/0x20
2015-12-11T10:53:29.539462-05:00 c0-0c0s2n3 [<ffffffff8148b31c>] dump_stack+0x79/0x84
2015-12-11T10:53:29.539467-05:00 c0-0c0s2n3 [<ffffffff8148b3bb>] panic+0x94/0x1da
2015-12-11T10:53:29.539473-05:00 c0-0c0s2n3 [<ffffffffa025bbf1>] lbug_with_loc+0x1c1/0x1d0 [libcfs]
2015-12-11T10:53:29.539479-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
2015-12-11T10:53:29.539484-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
2015-12-11T10:53:29.539489-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
2015-12-11T10:53:29.539494-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
2015-12-11T10:53:29.539500-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
2015-12-11T10:53:29.539505-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
2015-12-11T10:53:29.539511-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
2015-12-11T10:53:29.539519-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
2015-12-11T10:53:29.539526-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
2015-12-11T10:53:29.539532-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
2015-12-11T10:53:29.539537-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
2015-12-11T10:53:29.539544-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
2015-12-11T10:53:29.539550-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
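The assertion in the first log line spells out the condition a peer returned by kiblnd_find_peer_locked() is expected to satisfy: it is connecting, or accepting, or has at least one connection on its list. The following is a minimal standalone sketch of that condition, not the ko2iblnd source. The list helpers and every `sketch_` name are hypothetical, and only the three fields named in the assertion are modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct sketch_list {
    struct sketch_list *next;
    struct sketch_list *prev;
};

/* An empty circular list points back at its own head. */
static void sketch_list_init(struct sketch_list *head)
{
    head->next = head;
    head->prev = head;
}

static bool sketch_list_empty(const struct sketch_list *head)
{
    return head->next == head;
}

/* Insert node directly after head. */
static void sketch_list_add(struct sketch_list *head, struct sketch_list *node)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Reduced model of a peer: only the fields the assertion mentions. */
struct sketch_peer {
    int ibp_connecting;           /* connection attempts in progress */
    int ibp_accepting;            /* incoming connections being accepted */
    struct sketch_list ibp_conns; /* established connections */
};

/* Same boolean expression as the failed ASSERTION in the log. */
static bool sketch_peer_invariant_holds(const struct sketch_peer *peer)
{
    return peer->ibp_connecting > 0 ||
           peer->ibp_accepting > 0 ||
           !sketch_list_empty(&peer->ibp_conns);
}
```

In this model, a peer with both counters at zero and an empty connection list is the state that would trip the check. Whether the switch loss left a peer in exactly that state on the routers is not established by the log alone.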
Issue Links
is related to
- LU-7390 Router memory leak if we start a new router on a operationel configuration (Open)
- LU-7646 Infinite CON RACE Condition after rebooting LNet router (Resolved)
- LU-5718 RDMA too fragmented with router (Resolved)
- LU-7210 ASSERTION( peer->ibp_connecting == 0 ) (Resolved)
- LU-3322 ko2iblnd support for different map_on_demand and peer_credits between systems (Resolved)
- LU-7314 In kiblnd_rejected(), NULL pointer 'cp' may be passed to function and can be dereferenced there (Resolved)
- LU-7676 OSS Servers stuck in connecting/disconnect loop (Resolved)