[LU-13195] replay-single test_118: dt_declare_record_write() ASSERTION( dt->do_body_ops ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0, Lustre 2.12.10
    • Affects Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for liuying <emoly.liu@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ca28353a-46ca-11ea-91a9-52540065bddc

      test_118 failed with the following error:

      == replay-single test 118: invalidate osp update will not cause update log corruption ================ 17:21:23 (1580750483)
      CMD: trevis-19vm4 lctl set_param fail_loc=0x1705
      

      and the following stack trace on the console:

      [10146.524712] LustreError: 19994:0:(dt_object.h:2191:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: 
      [10146.525745] LustreError: 19994:0:(dt_object.h:2191:dt_declare_record_write()) LBUG
      [10146.526542] Pid: 19994, comm: mdt_out00_000 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Sat Jan 18 23:01:59 UTC 2020
      [10146.527632] Call Trace:
      [10146.527905]  [<ffffffffc0c348ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [10146.528605]  [<ffffffffc0c3495c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [10146.529291]  [<ffffffffc10f65eb>] out_write_add_exec+0x13b/0x1b0 [ptlrpc]
      [10146.530275]  [<ffffffffc10eed43>] out_write+0x333/0x370 [ptlrpc]
      [10146.530971]  [<ffffffffc10f1086>] out_handle+0x1566/0x1bb0 [ptlrpc]
      [10146.531652]  [<ffffffffc10e7eca>] tgt_request_handle+0x95a/0x1610 [ptlrpc]
      [10146.532417]  [<ffffffffc108b816>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [10146.533234]  [<ffffffffc108f8a4>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
      
      

      This issue happened several times in Maloo testing but no more logs were collected.
      https://testing.whamcloud.com/sub_tests/121c5288-447b-11ea-bffa-52540065bddc
      https://testing.whamcloud.com/sub_tests/7cefb86c-4362-11ea-86b2-52540065bddc

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      replay-single test_118 - trevis-19vm4 crashed during replay-single test_118

          Activity

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47010/
            Subject: LU-13195 osp: osp_send_update_req() should check generation
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: b18246d8b78a308c32a5f78eee581f16dae5dc44

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46863/
            Subject: LU-13195 osp: invalidate object on write error
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: be237a523e1208888f8f7d10e2a88709ea823a74

            gerrit Gerrit Updater added a comment -

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47010
            Subject: LU-13195 osp: osp_send_update_req() should check generation
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: e555f223a6ad340e7f326478cd09ba36b4b8bbb2

            gerrit Gerrit Updater added a comment -

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46863
            Subject: LU-13195 osp: invalidate object on write error
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: a6555d5b7f9a76250e8460adcb3d8a089356f490

            adilger Andreas Dilger added a comment -

            Sorry, new failure was LU-15139, which is similar, but not identical.

            bzzz Alex Zhuravlev added a comment -

            damn... looking

            eaujames Etienne Aujames added a comment -

            +1 in runtest (with all the fix patches):
            https://testing.whamcloud.com/test_sets/7110dd40-b540-469d-a773-874769fe527b

            [ 4327.451873] Lustre: DEBUG MARKER: copying 607 files from /etc /bin to /mnt/lustre/d1.runtests/etc /bin at Fri Oct 22 22:51:39 UTC 2021
            [ 4334.578564] LustreError: 12949:0:(dt_object.h:2310:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: [0x200011571:0x1:0x0] doesn't exit
            [ 4334.581584] LustreError: 12949:0:(dt_object.h:2310:dt_declare_record_write()) LBUG
            [ 4334.583011] Pid: 12949, comm: mdt_out00_001 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Mon Oct 4 16:46:22 UTC 2021
            [ 4334.585079] Call Trace TBD:
            [ 4334.585913] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
            [ 4334.586882] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
            [ 4334.588281] [<0>] out_write_add_exec+0x17d/0x1e0 [ptlrpc]
            [ 4334.589374] [<0>] out_write+0x166/0x380 [ptlrpc]
            [ 4334.590282] [<0>] out_handle+0x16af/0x20e0 [ptlrpc]
            [ 4334.591293] [<0>] tgt_request_handle+0xc93/0x1a00 [ptlrpc]
            [ 4334.592391] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
            [ 4334.593680] [<0>] ptlrpc_main+0xc06/0x1550 [ptlrpc]
            [ 4334.594667] [<0>] kthread+0x112/0x130
            [ 4334.595389] [<0>] ret_from_fork+0x35/0x40

            pjones Peter Jones added a comment -

            So... complete it seems


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45042/
            Subject: LU-13195 osp: osp_send_update_req() should check generation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dff1e0d21c8c6bb20d63669252190795198bc49f

            bzzz Alex Zhuravlev added a comment -

            With the latest https://review.whamcloud.com/45042 I can't reproduce the LBUG() anymore. Basically it's a race: an intentionally failed create doesn't properly invalidate the request in progress, so the surviving request (which contains a write to the object that just failed to be created) goes to the remote MDT and we hit the LBUG().
            With the LBUG resolved I observed another issue: a few OSP structures from that improperly invalidated request can leak. I think this is a slightly different issue and plan to fix it with another patch.

            People

              bzzz Alex Zhuravlev
              maloo Maloo