
ASSERTION( !lod_obj_is_striped(child) ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.10.8
    • None
    • Clients: 2.12.0, CentOS 7.6
    • 3
    • 9223372036854775807

    Description

      LBUG today on oak-MDT0000; we have never seen this one before. We have had some big data transfers using dsync running on Sherlock (2.12.0 clients), which may or may not be related.

      [4954375.921845] LustreError: 15102:0:(tgt_handler.c:628:process_req_last_xid()) @@@ Unexpected xid 5d6425ffe4140 vs. last_xid 5d6425ffe418f
        req@ffffa1597f41f200 x1642955450237248/t0(0) o101->98bbe778-4f70-8a89-d80e-d6a8120c693b@10.8.2.23@o2ib6:663/0 lens 736/0 e 0 to 0 dl 1567111883 ref 1 fl Interpret:/2/ffffffff rc 0/-1
      [4954542.487326] LustreError: 15290:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [4954542.517377] LustreError: 15290:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Skipped 3754300 previous similar messages
      [4954874.316190] LustreError: 15347:0:(lod_object.c:3919:lod_ah_init()) ASSERTION( !lod_obj_is_striped(child) ) failed: 
      [4954874.351112] LustreError: 15347:0:(lod_object.c:3919:lod_ah_init()) LBUG
      [4954874.373452] Pid: 15347, comm: mdt01_049 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [4954874.406359] Call Trace:
      [4954874.414973]  [<ffffffffc08af7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [4954874.437035]  [<ffffffffc08af87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [4954874.459664]  [<ffffffffc135a89f>] lod_ah_init+0x23f/0xde0 [lod]
      [4954874.479751]  [<ffffffffc13d306b>] mdd_object_make_hint+0xcb/0x190 [mdd]
      [4954874.502388]  [<ffffffffc13bed50>] mdd_create_data+0x330/0x730 [mdd]
      [4954874.523606]  [<ffffffffc129140c>] mdt_mfd_open+0xc5c/0xe70 [mdt]
      [4954874.544523]  [<ffffffffc1291b9b>] mdt_finish_open+0x57b/0x690 [mdt]
      [4954874.565743]  [<ffffffffc1293478>] mdt_reint_open+0x17c8/0x3190 [mdt]
      [4954874.587229]  [<ffffffffc1288cb3>] mdt_reint_rec+0x83/0x210 [mdt]
      [4954874.607567]  [<ffffffffc126a19b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [4954874.630197]  [<ffffffffc126a6c2>] mdt_intent_reint+0x162/0x430 [mdt]
      [4954874.651677]  [<ffffffffc126d4cb>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
      [4954874.672619]  [<ffffffffc1275d68>] mdt_intent_policy+0x138/0x320 [mdt]
      [4954874.694668]  [<ffffffffc0be82dd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
      [4954874.719320]  [<ffffffffc0c11c03>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
      [4954874.743104]  [<ffffffffc0c977f2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [4954874.764026]  [<ffffffffc0c9b72a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [4954874.787245]  [<ffffffffc0c4404b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [4954874.813872]  [<ffffffffc0c47792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [4954874.835628]  [<ffffffff8babdf21>] kthread+0xd1/0xe0
      [4954874.852252]  [<ffffffff8c1255f7>] ret_from_fork_nospec_end+0x0/0x39
      [4954874.873448]  [<ffffffffffffffff>] 0xffffffffffffffff
      [4954874.890366] Kernel panic - not syncing: LBUG
      

      I do have a crash dump if you're interested. MDT failover was smooth so not a big deal:

      Aug 29 14:04:49 oak-md1-s1 kernel: Lustre: oak-MDT0000: Recovery over after 0:55, of 1464 clients 1464 recovered and 0 were evicted.
      

       


        Activity

          [LU-12717] ASSERTION( !lod_obj_is_striped(child) ) failed

          Hi! This issue hit us again today, even though we're now using SSDs on all of Oak's MDTs. I see that Lai's patch above (https://review.whamcloud.com/36100) was almost ready to land and even had Andreas' approval. It would probably be too much effort to port it to 2.10.8 (which we're still running on Oak), but could you look at the patch again so that it can land in master? That way, this rare issue would be avoided in the future. Thanks!

          Apr 30 15:28:38 oak-md1-s2 kernel: LustreError: 9033:0:(lod_object.c:3919:lod_ah_init()) ASSERTION( !lod_obj_is_striped(child) ) failed: 
          Apr 30 15:28:38 oak-md1-s2 kernel: LustreError: 9033:0:(lod_object.c:3919:lod_ah_init()) LBUG
          Apr 30 15:28:38 oak-md1-s2 kernel: Pid: 9033, comm: mdt01_088 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
          Apr 30 15:28:38 oak-md1-s2 kernel: Call Trace:
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc0ddb7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc0ddb87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc174fb4f>] lod_ah_init+0x23f/0xde0 [lod]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc17cc09b>] mdd_object_make_hint+0xcb/0x190 [mdd]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc17b7d50>] mdd_create_data+0x330/0x730 [mdd]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc168a3fc>] mdt_mfd_open+0xc5c/0xe70 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc168ab8b>] mdt_finish_open+0x57b/0x690 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc168d09d>] mdt_reint_open+0x23fd/0x3190 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc1681ca3>] mdt_reint_rec+0x83/0x210 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc166318b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc16636b2>] mdt_intent_reint+0x162/0x430 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc16664bb>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc166ed58>] mdt_intent_policy+0x138/0x320 [mdt]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc123d2dd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc1266c03>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc12ec892>] tgt_enqueue+0x62/0x210 [ptlrpc]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc12f07ca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc129905b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffc129c7a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffff866c2e81>] kthread+0xd1/0xe0
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffff86d77c37>] ret_from_fork_nospec_end+0x0/0x39
          Apr 30 15:28:38 oak-md1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
          
          sthiell Stephane Thiell added the comment above.

          Hi Lai,

          Thanks! As we're still running 2.10.8 on Oak, this patch will apparently require some backporting:

          Making all in .
          /tmp/rpmbuild-lustre-sthiell-SRrpJSP1/BUILD/lustre-2.10.8_4_g05af3ab/lustre/lod/lod_object.c: In function 'lod_invalidate':
          /tmp/rpmbuild-lustre-sthiell-SRrpJSP1/BUILD/lustre-2.10.8_4_g05af3ab/lustre/lod/lod_object.c:4883:2: error: implicit declaration of function 'lod_striping_free' [-Werror=implicit-function-declaration]
            lod_striping_free(env, lod_dt_obj(dt));
            ^
          cc1: all warnings being treated as errors
          

          In any case, we're in the process of migrating MDT0 to an SSD-backed RAID-10 volume (offline device-level copy + resize2fs). We'll do the same for MDT1 later (as soon as I get more SSDs). Meanwhile, we have tested NRS TBF to limit the number of op/s. This has been helpful: it didn't crash again even with dsync running, but of course the performance is not optimal.

          sthiell Stephane Thiell added the comment above.
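
          For readers unfamiliar with the idea behind TBF (token bucket filter) throttling, here is a minimal Python sketch of a token bucket. It is only a toy model: the class name, the rate and burst values, and the arrival pattern are all assumptions made for illustration, and it is not how Lustre NRS TBF is implemented or configured. A request that finds no token is simply counted as held back here.

          ```python
          class TokenBucket:
              """Toy token bucket: `rate` tokens per second, at most `burst` stored."""

              def __init__(self, rate, burst):
                  self.rate = float(rate)    # tokens added per second
                  self.burst = float(burst)  # bucket capacity
                  self.tokens = float(burst)
                  self.last = 0.0            # time of the previous refill

              def try_dispatch(self, now):
                  """Refill for the elapsed time, then spend one token if available."""
                  elapsed = now - self.last
                  self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
                  self.last = now
                  if self.tokens >= 1.0:
                      self.tokens -= 1.0
                      return True
                  return False


          def admitted(rate, burst, arrivals):
              """Count how many requests (given as arrival times in seconds) get a token."""
              bucket = TokenBucket(rate, burst)
              return sum(1 for t in arrivals if bucket.try_dispatch(t))


          # Hypothetical load: 1000 requests spread evenly over one second,
          # throttled to roughly 100 op/s with a burst allowance of 10.
          arrivals = [i / 1000.0 for i in range(1000)]
          print(admitted(rate=100, burst=10, arrivals=arrivals))
          ```

          The point of the sketch is the trade-off mentioned above: only about a tenth of the offered load gets through per second, which protects the server at the cost of throughput.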
          laisiyao Lai Siyao added a comment -

          It may be related to a full journal: when the journal is full, the LOV setting transaction may fail, but the striping allocated in the LOV declare_set step is not freed, so the next LOV setting triggers the lod_obj_is_striped() assertion.

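
          The sequence described in this comment can be sketched as a toy state machine in Python. This is not Lustre code: `ToyObject`, `declare_set_lov`, `set_lov`, and the `free_on_error` flag are hypothetical names invented for illustration, and the "journal full" failure is just a boolean. It only shows why striping left behind by a failed attempt would trip a "must not already be striped" check on the next attempt.

          ```python
          class ToyObject:
              """Stand-in for an object; `striped` plays the role of lod_obj_is_striped()."""

              def __init__(self):
                  self.striped = False


          def declare_set_lov(obj):
              # Mimics the check reported in the log: striping must not exist yet.
              if obj.striped:
                  raise AssertionError("ASSERTION( !lod_obj_is_striped(child) ) failed")
              obj.striped = True  # striping is allocated during the declare step


          def set_lov(obj, journal_full, free_on_error):
              """Declare, then 'run' the transaction; return True on success."""
              declare_set_lov(obj)
              if journal_full:
                  # The transaction fails. Without cleanup the allocated striping stays.
                  if free_on_error:
                      obj.striped = False  # assumed cleanup on the error path
                  return False
              return True


          def second_attempt_asserts(free_on_error):
              """Fail once on a full journal, retry, and report whether the retry asserts."""
              obj = ToyObject()
              set_lov(obj, journal_full=True, free_on_error=free_on_error)
              try:
                  set_lov(obj, journal_full=False, free_on_error=free_on_error)
              except AssertionError:
                  return True
              return False


          print(second_attempt_asserts(free_on_error=False))  # True: leaked striping trips the check
          print(second_attempt_asserts(free_on_error=True))   # False: cleanup lets the retry proceed
          ```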
          laisiyao Lai Siyao added a comment -

          Hi Stephane, I uploaded a patch; you can apply it and see whether dsync works with it.


          Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36100
          Subject: LU-12717 mdd: free striping upon LOV setting error
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 47dd1cae1c02f63840cb9e301239917ca2138de9

          gerrit Gerrit Updater added the comment above.

          Another hint:

          [root@oak-md1-s2 127.0.0.1-2019-08-31-18:02:59]# grep -c __jbd2_log_wait_for_space foreach_bt-crash-oak-md1-s2-2019-08-31-18-02-59 
          290
          

          That doesn't sound good... Still, it shouldn't LBUG. Let me know what you think.

          sthiell Stephane Thiell added the comment above.
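
          The `grep -c` above counts matching lines, not tasks. As a rough Python sketch, the same dump could be summarized per task, assuming the file is plain `foreach bt` output from the crash utility in which each task starts with a `PID: ... TASK: ...` header line. The sample text below is made up for illustration (addresses and thread names are not from the real dump), and `count_tasks_in` is a hypothetical helper name.

          ```python
          import re

          # Header line that starts each task's backtrace in crash "foreach bt" output.
          PID_LINE = re.compile(r"^PID:\s*(\d+)\s+TASK:")


          def count_tasks_in(lines, symbol):
              """Return (matching_lines, distinct_tasks) for `symbol` in backtrace text."""
              matching_lines = 0
              tasks = set()
              current_pid = None
              for line in lines:
                  header = PID_LINE.match(line)
                  if header:
                      current_pid = header.group(1)
                  if symbol in line:
                      matching_lines += 1  # same quantity that grep -c reports
                      if current_pid is not None:
                          tasks.add(current_pid)
              return matching_lines, len(tasks)


          # Made-up sample in the assumed format; not taken from the actual crash dump.
          SAMPLE = """\
          PID: 101  TASK: ffff880000000001  CPU: 0  COMMAND: "mdt01_001"
           #0 [ffff880000001000] __schedule at ffffffff81000001
           #1 [ffff880000001010] __jbd2_log_wait_for_space at ffffffffc0000001 [jbd2]

          PID: 102  TASK: ffff880000000002  CPU: 1  COMMAND: "mdt01_002"
           #0 [ffff880000002000] __schedule at ffffffff81000001
          """

          print(count_tasks_in(SAMPLE.splitlines(), "__jbd2_log_wait_for_space"))
          ```

          To run it on a real dump, pass the lines of the saved `foreach bt` file (for example `open(path).read().splitlines()`) instead of `SAMPLE`. If the two numbers differ, some tasks have the symbol on more than one frame.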

          People

            laisiyao Lai Siyao
            sthiell Stephane Thiell