[LU-16297] ptl_send_rpc() ASSERTION ( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) Created: 03/Nov/22  Updated: 03/Jan/23  Resolved: 03/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.1
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
May 1 05:09:25 scratchn011 kernel: LustreError: 25042:0:(niobuf.c:772:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
May 1 05:09:25 scratchn011 kernel: LustreError: 25042:0:(niobuf.c:772:ptl_send_rpc()) LBUG
May 1 05:09:25 scratchn011 kernel: IEC: 026000003: LASSERT: { "pid": "25042", "ext_pid": "0", "filename": "niobuf.c", "line": "772", "func_name": "ptl_send_rpc", "assert_info": "( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: " }
May 1 05:09:25 scratchn011 kernel: IEC: 026000004: LBUG: { "pid": "25042", "ext_pid": "0", "filename": "niobuf.c", "line": "772", "func_name": "ptl_send_rpc" }
May 1 05:09:25 scratchn011 kernel: Pid: 25042, comm: ptlrpcd_06_02 3.10.0-957.1.3957.1.3.x4.4.25.x86_64 #1 SMP Mon Sep 20 16:59:46 PDT 2021
May 1 05:09:25 scratchn011 kernel: Call Trace:
May 1 05:09:25 scratchn011 kernel: [<0>] libcfs_call_trace+0x8e/0xf0 [libcfs]
May 1 05:09:25 scratchn011 kernel: [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
May 1 05:09:25 scratchn011 kernel: [<0>] ptl_send_rpc+0xcfd/0xf10 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpc_check_set.part.25+0x18ec/0x1e50 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] ptlrpcd+0x4b8/0x560 [ptlrpc]
May 1 05:09:25 scratchn011 kernel: [<0>] kthread+0xd1/0xe0

crash> obd_import.imp_state,imp_msghdr_flags,imp_connect_data ffff94044a276000
  imp_state = LUSTRE_IMP_CONNECTING
  imp_msghdr_flags = (unknown: 0)
  imp_connect_data = {
    ocd_connect_flags = 2323857477600284832,
  }
crash> p/x 2323857477600284832&0x1000000ULL
$3 = 0x1000000
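
The bit test from the crash session above can be reproduced outside of crash. The following is a minimal sketch: the helper name connect_at_negotiated() and the SKETCH_ prefix are made up for illustration, and the mask value is the one the assertion message prints in place of OBD_CONNECT_AT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value printed by the assertion message where the source uses
 * OBD_CONNECT_AT; the SKETCH_ name is hypothetical. */
#define SKETCH_OBD_CONNECT_AT 0x1000000ULL

/* Hypothetical helper: true when the adaptive-timeout connect flag
 * is set in a raw ocd_connect_flags value. */
static bool connect_at_negotiated(uint64_t ocd_connect_flags)
{
	return (ocd_connect_flags & SKETCH_OBD_CONNECT_AT) != 0;
}
```

Passing the value from the dump, 2323857477600284832ULL, returns true, which matches the non-zero result of the p/x command.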

This is a race between the connect and re-send threads.

769         LASSERT(AT_OFF || imp->imp_state != LUSTRE_IMP_FULL ||
770                 (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) ||
771                 !(imp->imp_connect_data.ocd_connect_flags &
772                 OBD_CONNECT_AT));

The assertion checks four conditions, and they are not evaluated atomically. If a connection happens while the assertion is being evaluated, the later checks see import state that has changed since the earlier ones, so the assertion fails even though nothing is actually wrong. Making these checks valid would require evaluating them atomically under a spin lock, but this is a hot path and a spin lock would hurt performance. I therefore prefer changing the assertion to a warning.
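
As a rough user-space illustration of the warning approach (not the merged patch), the four conditions can be evaluated the same way but reported without halting. Everything here is an assumption made for the sketch: struct import_sketch, the IMP_ enum values, the SKETCH_ macros (including the placeholder value for the message header flag), and the function names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the import fields used by the check. */
enum imp_state_sketch { IMP_CONNECTING, IMP_FULL };

#define SKETCH_MSGHDR_AT_SUPPORT 0x1u         /* placeholder value */
#define SKETCH_CONNECT_AT        0x1000000ULL /* value from the log */

struct import_sketch {
	enum imp_state_sketch state;
	uint32_t msghdr_flags;
	uint64_t connect_flags;
};

/* Same four conditions as the assertion, in the same order. */
static bool at_state_consistent(unsigned int at_max,
				const struct import_sketch *imp)
{
	return at_max == 0 ||
	       imp->state != IMP_FULL ||
	       (imp->msghdr_flags & SKETCH_MSGHDR_AT_SUPPORT) ||
	       !(imp->connect_flags & SKETCH_CONNECT_AT);
}

/* Warn instead of asserting: a racing reconnect can make the fields
 * look inconsistent for a moment, so log it and carry on. */
static bool check_at_state_or_warn(unsigned int at_max,
				   const struct import_sketch *imp)
{
	bool ok = at_state_consistent(at_max, imp);

	if (!ok)
		fprintf(stderr,
			"warning: AT state inconsistent (state=%d msghdr=%#x connect=%#llx)\n",
			(int)imp->state, (unsigned int)imp->msghdr_flags,
			(unsigned long long)imp->connect_flags);
	return ok;
}
```

With a non-zero at_max, an import in the full state whose header flags are clear but whose connect flags still carry the AT bit makes the check return false and print a warning; the same import in the connecting state passes. No lock is taken, so the result is only a hint, which is the trade-off the description argues for on this hot path.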



 Comments   
Comment by Gerrit Updater [ 03/Nov/22 ]

"Alexander <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49029
Subject: LU-16297 ptlrpc: don't panic during reconnection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 10bb3582bceb8107ed552d5554faf49e4586858d

Comment by Gerrit Updater [ 03/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49029/
Subject: LU-16297 ptlrpc: don't panic during reconnection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df31c4c0b39b8845911344e6fadc008bcba40bb1

Comment by Peter Jones [ 03/Jan/23 ]

Landed for 2.16
