Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15435

__req_capsule_get(): ASSERTION( msg != ((void *)0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      [ 8032.711092] Lustre: DEBUG MARKER: == replay-single test 120: DNE fail abort should stop both normal and DNE replay ========================================================== 02:17:06 (1641539826)
      [ 8035.011569] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      [ 8039.713015] LustreError: 3379:0:(client.c:1256:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@ffff8800b8280e40 x1721270907084672/t0(0) o41->lustre-MDT0001-osp-MDT0000@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'osp-pre-1-0.0'
      [ 8044.896335] LDISKFS-fs (dm-0): recovery complete
      [ 8044.897515] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 8044.930912] LustreError: 17886:0:(layout.c:2135:__req_capsule_get()) ASSERTION( msg != ((void *)0) ) failed: 
      [ 8044.932996] LustreError: 17886:0:(layout.c:2135:__req_capsule_get()) LBUG
      [ 8044.934449] Pid: 17886, comm: mount.lustre 3.10.0-7.9-debug #1 SMP Fri Dec 3 22:45:13 EST 2021
      [ 8044.936411] Call Trace:
      [ 8044.937112] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [ 8044.938341] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 8044.939612] [<0>] __req_capsule_get+0x5d6/0x6a0 [ptlrpc]
      [ 8044.940945] [<0>] req_capsule_server_get+0x18/0x20 [ptlrpc]
      [ 8044.942292] [<0>] mgc_target_register+0x219/0x530 [mgc]
      [ 8044.943544] [<0>] mgc_set_info_async+0x3cb/0x1350 [mgc]
      [ 8044.944799] [<0>] server_start_targets+0x12ae/0x2bc0 [obdclass]
      [ 8044.946183] [<0>] server_fill_super+0x11fa/0x18a0 [obdclass]
      [ 8044.947564] [<0>] lustre_fill_super+0x3e8/0x440 [lustre]
      [ 8044.948776] [<0>] mount_nodev+0x4d/0xb0
      [ 8044.949791] [<0>] lustre_mount+0x18/0x20 [lustre]
      [ 8044.950876] [<0>] mount_fs+0x39/0x1b0
      [ 8044.951675] [<0>] vfs_kern_mount+0x67/0x110
      [ 8044.952573] [<0>] do_mount+0x1ef/0xce0
      [ 8044.953666] [<0>] SyS_mount+0x83/0xd0
      [ 8044.954575] [<0>] system_call_fastpath+0x1f/0x24
      [ 8044.955738] [<0>] 0xfffffffffffffffe
      [ 8044.956664] Kernel panic - not syncing: LBUG
      

      Attachments

        Issue Links

          Activity

            [LU-15435] __req_capsule_get(): ASSERTION( msg != ((void *)0) ) failed
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46132/
            Subject: LU-15435 ptlrpc: unregister reply buffer on rq_err
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d8012811cc6ff9c7f0fb1ddfec9461e9ff963e54

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46132/ Subject: LU-15435 ptlrpc: unregister reply buffer on rq_err Project: fs/lustre-release Branch: master Current Patch Set: Commit: d8012811cc6ff9c7f0fb1ddfec9461e9ff963e54

            "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46132
            Subject: LU-15435 ptlrpc: unregister reply buffer on rq_err
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fc2e3e4c66509a8bf44d4fc3e914440a4848a933

            gerrit Gerrit Updater added a comment - "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46132 Subject: LU-15435 ptlrpc: unregister reply buffer on rq_err Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: fc2e3e4c66509a8bf44d4fc3e914440a4848a933
            shadow Alexey Lyashkov added a comment - - edited

            http://testing.linuxhacker.ru/lustre-reports/20938/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

            as example.

            # grep -e ffff8801162e0000 -e 1721270907086464 oleg350-server-vmcore-lustredebug.txt
            00000100:00100000:2.0:1641539838.802557:0:17886:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@ffff8801162e0000 pname:cluuid:pid:xid:nid:opc:job mount.lustre:f02f8fc1-f5c4-4b25-8bc1-6741dcc62ca5:17886:1721270907086464:0@lo:253:mount.lustre.0
            00000100:00100000:1.0:1641539838.802613:0:17893:0:(service.c:2134:ptlrpc_server_handle_req_in()) got req x1721270907086464
            00000100:00100000:1.0:1641539838.802622:0:17893:0:(service.c:2303:ptlrpc_server_handle_request()) Handling RPC req@ffff8800a466e850 pname:cluuid+ref:pid:xid:nid:opc:job ll_mgs_0002:0+-99:17886:x1721270907086464:12345-0@lo:253:mount.lustre.0
            00000100:00100000:3.0:1641539838.802627:0:3378:0:(client.c:3314:ptlrpc_abort_inflight()) @@@ inflight  req@ffff8801162e0000 x1721270907086464/t0(0) o253->MGC192.168.203.150@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 1641539845 ref 2 fl Rpc:Qr/0/ffffffff rc 0/-1 job:'mount.lustre.0'
            00000100:00100000:1.0:1641539838.802643:0:17893:0:(service.c:2352:ptlrpc_server_handle_request()) Handled RPC req@ffff8800a466e850 pname:cluuid+ref:pid:xid:nid:opc:job ll_mgs_0002:0+-99:17886:x1721270907086464:12345-0@lo:253:mount.lustre.0 Request processed in 20us (50us total) trans 0 rc -107/-107
            00000100:00100000:2.0:1641539838.802643:0:17886:0:(client.c:2210:ptlrpc_check_set()) Completed RPC req@ffff8801162e0000 pname:cluuid:pid:xid:nid:opc:job mount.lustre:f02f8fc1-f5c4-4b25-8bc1-6741dcc62ca5:17886:1721270907086464:0@lo:253:mount.lustre.0
            
            shadow Alexey Lyashkov added a comment - - edited http://testing.linuxhacker.ru/lustre-reports/20938/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/ as example. # grep -e ffff8801162e0000 -e 1721270907086464 oleg350-server-vmcore-lustredebug.txt 00000100:00100000:2.0:1641539838.802557:0:17886:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@ffff8801162e0000 pname:cluuid:pid:xid:nid:opc:job mount.lustre:f02f8fc1-f5c4-4b25-8bc1-6741dcc62ca5:17886:1721270907086464:0@lo:253:mount.lustre.0 00000100:00100000:1.0:1641539838.802613:0:17893:0:(service.c:2134:ptlrpc_server_handle_req_in()) got req x1721270907086464 00000100:00100000:1.0:1641539838.802622:0:17893:0:(service.c:2303:ptlrpc_server_handle_request()) Handling RPC req@ffff8800a466e850 pname:cluuid+ref:pid:xid:nid:opc:job ll_mgs_0002:0+-99:17886:x1721270907086464:12345-0@lo:253:mount.lustre.0 00000100:00100000:3.0:1641539838.802627:0:3378:0:(client.c:3314:ptlrpc_abort_inflight()) @@@ inflight req@ffff8801162e0000 x1721270907086464/t0(0) o253->MGC192.168.203.150@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 1641539845 ref 2 fl Rpc:Qr/0/ffffffff rc 0/-1 job:'mount.lustre.0' 00000100:00100000:1.0:1641539838.802643:0:17893:0:(service.c:2352:ptlrpc_server_handle_request()) Handled RPC req@ffff8800a466e850 pname:cluuid+ref:pid:xid:nid:opc:job ll_mgs_0002:0+-99:17886:x1721270907086464:12345-0@lo:253:mount.lustre.0 Request processed in 20us (50us total) trans 0 rc -107/-107 00000100:00100000:2.0:1641539838.802643:0:17886:0:(client.c:2210:ptlrpc_check_set()) Completed RPC req@ffff8801162e0000 pname:cluuid:pid:xid:nid:opc:job mount.lustre:f02f8fc1-f5c4-4b25-8bc1-6741dcc62ca5:17886:1721270907086464:0@lo:253:mount.lustre.0

            recently landed "LU-15112 mgc: do not ignore target registration failure" exposes a race in
            ptlrpc between late reply for a target registration request (in our case) and ptlrpc_abort_inflight() from another thread. I.e. late reply sets only rq_replied flag and leaves rq_repmsg == NULL as the request already passed RQ_PHASE_INTERPRET w/o unlinking the reply buffer from the network stack.
            It would be good to fix the race to prevent unexpected memory writes from LNET (regardless of the LU-15112 patch ).

            zam Alexander Zarochentsev added a comment - recently landed " LU-15112 mgc: do not ignore target registration failure" exposes a race in ptlrpc between late reply for a target registration request (in our case) and ptlrpc_abort_inflight() from another thread. I.e. late reply sets only rq_replied flag and leaves rq_repmsg == NULL as the request already passed RQ_PHASE_INTERPRET w/o unlinking the reply buffer from the network stack. It would be good to fix the race to prevent unexpected memory writes from LNET (regardless of the LU-15112 patch ).

            People

              zam Alexander Zarochentsev
              zam Alexander Zarochentsev
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: