Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.15.1
-
3
Description
May 1 05:09:25 scratchn011 kernel: LustreError: 25042:0:(niobuf.c:772:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
May 1 05:09:25 scratchn011 kernel: LustreError: 25042:0:(niobuf.c:772:ptl_send_rpc()) LBUG
May 1 05:09:25 scratchn011 kernel: IEC: 026000003: LASSERT: { "pid": "25042", "ext_pid": "0", "filename": "niobuf.c", "line": "772", "func_name": "ptl_send_rpc", "assert_info": "( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT__SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: " }
May 1 05:09:25 scratchn011 kernel: IEC: 026000004: LBUG: { "pid": "25042", "ext_pid": "0", "filename": "niobuf.c", "line": "772", "func_name": "ptl_send_rpc" }
May 1 05:09:25 scratchn011 kernel: Pid: 25042, comm: ptlrpcd_06_02 3.10.0-957.1.3957.1.3.x4.4.25.x86_64 #1 SMP Mon Sep 20 16:59:46 PDT 2021
May 1 05:09:25 scratchn011 kernel: Call Trace:
May 1 05:09:25 scratchn011 kernel: [<0>] libcfs_call_trace+0x8e/0xf0 [libcfs]
May 1 05:09:25 scratchn011 kernel: [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
May 1 05:09:25 scratchn011 kernel: [<0>] ptl_send_rpc+0xcfd/0xf10 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpc_check_set.part.25+0x18ec/0x1e50 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpcd+0x4b8/0x560 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] kthread+0xd1/0xe0

crash> obd_import.imp_state,imp_msghdr_flags,imp_connect_data ffff94044a276000
  imp_state = LUSTRE_IMP_CONNECTING
  imp_msghdr_flags = (unknown: 0)
  imp_connect_data = {
    ocd_connect_flags = 2323857477600284832,
  }
crash> p/x 2323857477600284832&0x1000000ULL
$3 = 0x1000000
This is a race between the connect and re-send threads.
769 LASSERT(AT_OFF || imp->imp_state != LUSTRE_IMP_FULL ||
770         (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) ||
771         !(imp->imp_connect_data.ocd_connect_flags &
772           OBD_CONNECT_AT));
The assertion has four checks.
When a connect happens in the middle of evaluating the assertion, the second part of the assertion fails, and this produces a false failure. The simple way to make these checks valid is to evaluate them atomically, under a spin lock. But this is a hot path and a spin lock would affect performance, so I prefer changing the assertion to a warning.
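To make the four checks concrete, here is a minimal user-space sketch in C. It is not the Lustre code or the actual patch: the struct, the enum, the function names, and the value of AT_MSGHDR_FLAG are hypothetical stand-ins, and only the 0x1000000ULL mask and the ocd_connect_flags value are taken from the log and crash output above. The helper evaluates the same four conditions on a single snapshot of the fields and reports a warning instead of aborting.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the Lustre definitions (assumptions). */
enum imp_state { IMP_CONNECTING, IMP_FULL };

#define AT_CONNECT_FLAG 0x1000000ULL /* mask printed in the assertion text */
#define AT_MSGHDR_FLAG  0x1u         /* assumed value, for illustration only */

/* One consistent copy of the fields the assertion reads. */
struct import_snapshot {
	enum imp_state state;
	uint32_t       msghdr_flags;
	uint64_t       connect_flags;
};

/* The four checks from the assertion, applied to a single snapshot. */
static bool at_check_holds(unsigned int at_max, const struct import_snapshot *s)
{
	return at_max == 0 ||
	       s->state != IMP_FULL ||
	       (s->msghdr_flags & AT_MSGHDR_FLAG) ||
	       !(s->connect_flags & AT_CONNECT_FLAG);
}

/* Warning variant: report the inconsistency and let the caller continue. */
static bool warn_unless_at_consistent(unsigned int at_max,
				      const struct import_snapshot *s)
{
	bool ok = at_check_holds(at_max, s);

	if (!ok)
		fprintf(stderr, "warning: AT flags inconsistent with import state\n");
	return ok;
}
```

Fed with the values from the crash output (state CONNECTING, message header flags 0, the AT bit set in ocd_connect_flags), the check holds. That is consistent with the failure coming from fields changing between the individual reads rather than from a genuinely bad state. The same field values with the state still FULL make the check fail.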
Issue Links
- is related to LU-17540 sync and delay before LBUG() calls panic() (Resolved)