[LU-15435] __req_capsule_get(): ASSERTION( msg != ((void *)0) ) failed Created: 11/Jan/22  Updated: 21/Jan/23  Resolved: 27/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-15112 attempt to register an OST with dupli... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[ 8032.711092] Lustre: DEBUG MARKER: == replay-single test 120: DNE fail abort should stop both normal and DNE replay ========================================================== 02:17:06 (1641539826)
[ 8035.011569] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 8039.713015] LustreError: 3379:0:(client.c:1256:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@ffff8800b8280e40 x1721270907084672/t0(0) o41->lustre-MDT0001-osp-MDT0000@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'osp-pre-1-0.0'
[ 8044.896335] LDISKFS-fs (dm-0): recovery complete
[ 8044.897515] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 8044.930912] LustreError: 17886:0:(layout.c:2135:__req_capsule_get()) ASSERTION( msg != ((void *)0) ) failed: 
[ 8044.932996] LustreError: 17886:0:(layout.c:2135:__req_capsule_get()) LBUG
[ 8044.934449] Pid: 17886, comm: mount.lustre 3.10.0-7.9-debug #1 SMP Fri Dec 3 22:45:13 EST 2021
[ 8044.936411] Call Trace:
[ 8044.937112] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[ 8044.938341] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 8044.939612] [<0>] __req_capsule_get+0x5d6/0x6a0 [ptlrpc]
[ 8044.940945] [<0>] req_capsule_server_get+0x18/0x20 [ptlrpc]
[ 8044.942292] [<0>] mgc_target_register+0x219/0x530 [mgc]
[ 8044.943544] [<0>] mgc_set_info_async+0x3cb/0x1350 [mgc]
[ 8044.944799] [<0>] server_start_targets+0x12ae/0x2bc0 [obdclass]
[ 8044.946183] [<0>] server_fill_super+0x11fa/0x18a0 [obdclass]
[ 8044.947564] [<0>] lustre_fill_super+0x3e8/0x440 [lustre]
[ 8044.948776] [<0>] mount_nodev+0x4d/0xb0
[ 8044.949791] [<0>] lustre_mount+0x18/0x20 [lustre]
[ 8044.950876] [<0>] mount_fs+0x39/0x1b0
[ 8044.951675] [<0>] vfs_kern_mount+0x67/0x110
[ 8044.952573] [<0>] do_mount+0x1ef/0xce0
[ 8044.953666] [<0>] SyS_mount+0x83/0xd0
[ 8044.954575] [<0>] system_call_fastpath+0x1f/0x24
[ 8044.955738] [<0>] 0xfffffffffffffffe
[ 8044.956664] Kernel panic - not syncing: LBUG


 Comments   
Comment by Alexander Zarochentsev [ 11/Jan/22 ]

recently landed "LU-15112 mgc: do not ignore target registration failure" exposes a race in
ptlrpc between late reply for a target registration request (in our case) and ptlrpc_abort_inflight() from another thread. I.e. late reply sets only rq_replied flag and leaves rq_repmsg == NULL as the request already passed RQ_PHASE_INTERPRET w/o unlinking the reply buffer from the network stack.
It would be good to fix the race to prevent unexpected memory writes from LNET (regardless of the LU-15112 patch ).

Comment by Alexey Lyashkov [ 11/Jan/22 ]

http://testing.linuxhacker.ru/lustre-reports/20938/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

as example.

# grep -e ffff8801162e0000 -e 1721270907086464 oleg350-server-vmcore-lustredebug.txt
00000100:00100000:2.0:1641539838.802557:0:17886:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@ffff8801162e0000 pname:cluuid:pid:xid:nid:opc:job mount.lustre:f02f8fc1-f5c4-4b25-8bc1-6741dcc62ca5:17886:1721270907086464:0@lo:253:mount.lustre.0
00000100:00100000:1.0:1641539838.802613:0:17893:0:(service.c:2134:ptlrpc_server_handle_req_in()) got req x1721270907086464
00000100:00100000:1.0:1641539838.802622:0:17893:0:(service.c:2303:ptlrpc_server_handle_request()) Handling RPC req@ffff8800a466e850 pname:cluuid+ref:pid:xid:nid:opc:job ll_mgs_0002:0+-99:17886:x1721270907086464:12345-0@lo:253:mount.lustre.0
00000100:00100000:3.0:1641539838.802627:0:3378:0:(client.c:3314:ptlrpc_abort_inflight()) @@@ inflight  req@ffff8801162e0000 x1721270907086464/t0(0) o253->MGC192.168.203.150@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 1641539845 ref 2 fl Rpc:Qr/0/ffffffff rc 0/-1 job:'mount.lustre.0'
00000100:00100000:1.0:1641539838.802643:0:17893:0:(service.c:2352:ptlrpc_server_handle_request()) Handled RPC req@ffff8800a466e850 pname:cluuid+ref:pid:xid:nid:opc:job ll_mgs_0002:0+-99:17886:x1721270907086464:12345-0@lo:253:mount.lustre.0 Request processed in 20us (50us total) trans 0 rc -107/-107
00000100:00100000:2.0:1641539838.802643:0:17886:0:(client.c:2210:ptlrpc_check_set()) Completed RPC req@ffff8801162e0000 pname:cluuid:pid:xid:nid:opc:job mount.lustre:f02f8fc1-f5c4-4b25-8bc1-6741dcc62ca5:17886:1721270907086464:0@lo:253:mount.lustre.0
Comment by Gerrit Updater [ 14/Jan/22 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46132
Subject: LU-15435 ptlrpc: unregister reply buffer on rq_err
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fc2e3e4c66509a8bf44d4fc3e914440a4848a933

Comment by Gerrit Updater [ 27/Mar/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46132/
Subject: LU-15435 ptlrpc: unregister reply buffer on rq_err
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d8012811cc6ff9c7f0fb1ddfec9461e9ff963e54

Comment by Peter Jones [ 27/Mar/22 ]

Landed for 2.15

Generated at Sat Feb 10 03:18:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.