
2.6 DNE stress testing: (lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0
    • Labels: None

    Description

      On the same system as LU-5204 (with OST38/0026 still not reachable from MDS1/MDT0), we hit this LBUG on MDS1 during stress testing:

      <0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed:
      <0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) LBUG
      <4>Pid: 26714, comm: mdt02_089
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0c55895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0c55e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa15d70e0>] lod_declare_attr_set+0x600/0x660 [lod]
      <4> [<ffffffffa16338b8>] mdd_declare_object_initialize+0xa8/0x290 [mdd]
      <4> [<ffffffffa1635018>] mdd_create+0xb88/0x1870 [mdd]
      <4> [<ffffffffa1506217>] mdt_reint_create+0xcf7/0xed0 [mdt]
      <4> [<ffffffffa1500a81>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa14e5e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
      <4> [<ffffffffa14e671b>] mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
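
      For context, the failed check is a hard LASSERT-style assertion: when it trips, the thread dumps its stack and the node goes down (the trace above shows libcfs_debug_dumpstack and lbug_with_loc), rather than just failing the single request. Below is a minimal, self-contained sketch of that pattern; it is illustrative only, and all demo_* names are hypothetical stand-ins, not the lod_object.c source.

      /*
       * Illustrative only: a self-contained sketch of the assertion pattern
       * behind this LBUG.  All demo_* names are hypothetical; the real check
       * is LASSERT(lo->ldo_stripe) in lod_object.c:lod_declare_attr_set().
       */
      #include <stdio.h>
      #include <stdlib.h>

      struct demo_lod_object {
          int   ldo_stripenr;   /* number of stripes */
          void *ldo_stripe;     /* per-stripe sub-objects; NULL if never set up */
      };

      /* Mimics an LASSERT(): on failure, report the location and abort
       * (the real LBUG path dumps the stack and takes down the node). */
      #define demo_assert(cond)                                             \
          do {                                                              \
              if (!(cond)) {                                                \
                  fprintf(stderr, "%s:%d:%s() ASSERTION( %s ) failed\n",    \
                          __FILE__, __LINE__, __func__, #cond);             \
                  abort();                                                  \
              }                                                             \
          } while (0)

      /* A declare-attr-set step that assumes striping is always populated. */
      static int demo_declare_attr_set(struct demo_lod_object *lo)
      {
          demo_assert(lo->ldo_stripe != NULL);   /* trips if stripes not set */
          /* ... would declare the attribute update on each stripe ... */
          return 0;
      }

      int main(void)
      {
          /* A striped object whose stripe array was never initialized. */
          struct demo_lod_object lo = { .ldo_stripenr = 2, .ldo_stripe = NULL };

          return demo_declare_attr_set(&lo);
      }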

      Additionally, the following thread had been stuck for some time before the LBUG:
      <3>INFO: task mdt01_020:26426 blocked for more than 120 seconds.
      <3> Not tainted 2.6.32-431.5.1.el6.x86_64 #1
      <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      <6>mdt01_020 D 000000000000000a 0 26426 2 0x00000000
      <4> ffff880ffa4d7af0 0000000000000046 0000000000000000 ffffffffa0c6bd75
      <4> 0000000100000000 ffffc9003aa25030 0000000000000246 0000000000000246
      <4> ffff88100aaae638 ffff880ffa4d7fd8 000000000000fbc8 ffff88100aaae638
      <4>Call Trace:
      <4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      <4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
      <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
      <4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
      <4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
      <4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
      <4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
      <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

      In all of these instances, the thread is stuck in a rather odd spot in cfs_hash_bd_lookup_intent:
      match = intent_add ? NULL : hnode;
      hlist_for_each(ehnode, hhead) {
              if (!cfs_hash_keycmp(hs, key, ehnode))
                      continue;

      Specifically, it reports as being stuck on the cfs_hash_keycmp() line. It's not clear to me how a thread could get stuck there; I may be missing some operation it does as part of that.
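
      Note that in the hung-task trace this frame is printed with a "? " prefix, meaning the unwinder could not confirm it is part of the live call chain; the loop above only walks a single hash chain, so the actual wait is more plausibly further down the stack (e.g. in lu_object_find_at, the first unquestioned frame). For reference, here is a minimal, self-contained sketch of this kind of keyed chain scan; it is illustrative only, and the demo_* names are hypothetical, not the libcfs cfs_hash code.

      /*
       * Illustrative only: a hypothetical sketch of a keyed hash-chain scan
       * of the kind quoted above (demo_* names are not the libcfs source).
       */
      #include <stddef.h>
      #include <string.h>

      struct demo_node {
          struct demo_node *next;      /* next entry on this hash chain */
          char              key[16];   /* key stored in the node */
      };

      /* Compare a lookup key against the key embedded in a chain node. */
      static int demo_keycmp(const char *key, const struct demo_node *node)
      {
          return strcmp(key, node->key) == 0;
      }

      /* Walk one bucket's chain looking for a node with a matching key. */
      static struct demo_node *demo_bucket_lookup(struct demo_node *head,
                                                  const char *key)
      {
          struct demo_node *node;

          for (node = head; node != NULL; node = node->next) {
              if (!demo_keycmp(key, node))
                  continue;        /* key mismatch, keep walking the chain */
              return node;         /* match found */
          }
          return NULL;             /* no such key in this bucket */
      }

      int main(void)
      {
          struct demo_node b = { .next = NULL, .key = "bar" };
          struct demo_node a = { .next = &b,   .key = "foo" };

          return demo_bucket_lookup(&a, "bar") == &b ? 0 : 1;
      }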

      I'll make the dump available shortly.

          Activity


            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to Master. Please reopen ticket if there is more work needed.
            di.wang Di Wang added a comment -

            http://review.whamcloud.com/10772
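
            Purely as an illustration of the general shape such a fix can take (an assumption, not a summary of the patch in the review above), the usual alternative to a hard assertion is to (re)load the striping before use and fail the single operation instead of taking the whole MDS down:

            /*
             * Hypothetical sketch only -- NOT the change in the review above.
             * It shows one generic way to turn a hard assertion into a
             * recoverable error path.
             */
            #include <errno.h>
            #include <stddef.h>

            struct demo_lod_object {
                void *ldo_stripe;   /* NULL until striping has been loaded */
            };

            /* Stand-in for "read the layout"; may legitimately find nothing. */
            static int demo_load_striping(struct demo_lod_object *lo)
            {
                /* ... fetch layout from disk; leave ldo_stripe NULL if absent ... */
                return 0;
            }

            static int demo_declare_attr_set(struct demo_lod_object *lo)
            {
                int rc;

                if (lo->ldo_stripe == NULL) {
                    rc = demo_load_striping(lo);
                    if (rc != 0)
                        return rc;           /* propagate the load failure */
                    if (lo->ldo_stripe == NULL)
                        return -ENOENT;      /* still no striping: fail the op */
                }

                /* ... declare the attribute update on each stripe ... */
                return 0;
            }

            int main(void)
            {
                struct demo_lod_object lo = { .ldo_stripe = NULL };

                /* Returns -ENOENT instead of asserting when striping is absent. */
                return demo_declare_attr_set(&lo) == -ENOENT ? 0 : 1;
            }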
            di.wang Di Wang added a comment -

            Jodi:

            Yes, since it is an LBUG, it could probably be a blocker, or at least a critical one. But I think I know the reason; I will cook a patch soon.

            jlevi Jodi Levi (Inactive) added a comment -

            Di,
            Can you please have a look at this one and complete an initial assessment to determine if this should be a blocker for 2.6?

            paf Patrick Farrell (Inactive) added a comment -

            There was also a client which was stuck waiting on a reply from MDS001/MDT000 before the MDS crashed (there were obviously many timeouts after the crash, but this one was before it), and the times match roughly with those for the stuck thread. The stuck thread is probably a separate issue from the LBUG, but I don't want to separate them until we're further along.

            Here's the client bug information:
            At 23:33:48, MDS0 died with an LBUG. (LU-5233)

            One of the client nodes got stuck before that. This is a thread refusing to exit because it is stuck in Lustre (many other client threads were also stuck behind this one, waiting for the MDC rpc lock in mdc_close; see the sketch below):
            console-20140618:2014-06-18T23:07:16.160830-05:00 c0-0c1s4n2 <node_health:5.1> APID:1236942 (Application_Exited_Check) WARNING: Stack trace for process 13769:
            console-20140618:2014-06-18T23:07:16.261778-05:00 c0-0c1s4n2 <node_health:5.1> APID:1236942 (Application_Exited_Check) STACK:
            ptlrpc_set_wait+0x2e5/0x8c0 [ptlrpc];
            ptlrpc_queue_wait+0x8b/0x230 [ptlrpc];
            mdc_close+0x1ed/0xa50 [mdc];
            lmv_close+0x242/0x5b0 [lmv];
            ll_close_inode_openhandle+0x2fa/0x10a0 [lustre];
            ll_md_real_close+0xb0/0x210 [lustre];
            ll_file_release+0x68c/0xb60 [lustre];
            fput+0xe2/0x200;
            filp_close+0x63/0x90;
            put_files_struct+0x84/0xe0;
            exit_files+0x53/0x70;
            do_exit+0x1ec/0x990;
            do_group_exit+0x4c/0xc0;
            get_signal_to_deliver+0x243/0x490;
            do_notify_resume+0xe0/0x7f0;
            int_signal+0x12/0x17;
            0x20061a87;
            0xffffffffffffffff;

            The client is waiting for a ptlrpc reply. I strongly suspect this corresponds to the stuck thread messages on the MDS.
            Unfortunately, by the time the node was dumped, the client had given up waiting and all of the tasks had exited (and the dk log is empty), so there's no way to confirm from the client side.
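
            As background on why the other client threads pile up behind this one: the close path serializes on a per-client rpc lock around the synchronous request, so a single unanswered reply blocks every later closer. A minimal sketch of that single-flight pattern follows; it is illustrative only, and the demo_* names are hypothetical, not the mdc source.

            /*
             * Illustrative only: a hypothetical sketch of the "rpc lock"
             * serialization described above.  One thread holds a client-wide
             * mutex across a blocking request/reply; if the reply never comes,
             * every later caller queues behind it.
             */
            #include <pthread.h>
            #include <stdio.h>

            static pthread_mutex_t demo_rpc_lock = PTHREAD_MUTEX_INITIALIZER;

            /* Stand-in for sending the close RPC and waiting for the reply. */
            static int demo_send_and_wait(int handle)
            {
                printf("close(%d): waiting for reply\n", handle);
                /* ... a ptlrpc_queue_wait()-style blocking wait goes here ... */
                return 0;
            }

            /* Close path that serializes on the shared rpc lock. */
            static int demo_close(int handle)
            {
                int rc;

                pthread_mutex_lock(&demo_rpc_lock);    /* later closers block here */
                rc = demo_send_and_wait(handle);       /* first closer blocks here */
                pthread_mutex_unlock(&demo_rpc_lock);

                return rc;
            }

            int main(void)
            {
                return demo_close(42);
            }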

            The first stuck thread messages on the MDS come here:

            Jun 18 23:16:36 galaxy-esf-mds001 kernel: INFO: task mdt01_020:26426 blocked for more than 120 seconds.
            <4>Call Trace:
            <4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
            <4> [<ffffffffa0d21fc4>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
            <4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
            <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
            <4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
            <4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
            <4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
            <4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
            <4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
            <4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
            <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
            <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
            <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
            <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
            <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
            <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

            These messages are repeated up until the LBUG (always for the same task).

            The stuck thread message from the client comes at task exit, so it had already been stuck for some amount of time. The first stuck thread message on the MDS (stuck for 600 seconds) comes roughly 9 minutes after the client reports a stuck thread, so the time frames line up reasonably well.

            Without digging through data structures on the MDS I can't be sure, but it seems likely the stuck thread on the MDS is the cause of the problem on the client.


            paf Patrick Farrell (Inactive) added a comment -

            The MDS dump will be here in < 10 minutes:
            ftp.cray.com
            u: anonymous
            p: anonymous

            Then:
            cd outbound/LU-5233/
            And then the file is:
            mds001_mdt000_LU5233.tar.gz


            People

              Assignee: di.wang Di Wang
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 3
