  1. Lustre
  2. LU-5233

2.6 DNE stress testing: (lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.6.0
    • Labels:
    • Severity:
      3
    • Rank (Obsolete):
      14584

      Description

      On the same system as LU-5204 (with OST38/0026 still not reachable from MDS1/MDT0), we hit this LBUG on MDS1 during stress testing:

      <0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed:
      <0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) LBUG
      <4>Pid: 26714, comm: mdt02_089
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0c55895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0c55e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa15d70e0>] lod_declare_attr_set+0x600/0x660 [lod]
      <4> [<ffffffffa16338b8>] mdd_declare_object_initialize+0xa8/0x290 [mdd]
      <4> [<ffffffffa1635018>] mdd_create+0xb88/0x1870 [mdd]
      <4> [<ffffffffa1506217>] mdt_reint_create+0xcf7/0xed0 [mdt]
      <4> [<ffffffffa1500a81>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa14e5e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
      <4> [<ffffffffa14e671b>] mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

      Additionally, we had the following stuck thread:
      <3>INFO: task mdt01_020:26426 blocked for more than 120 seconds.
      <3> Not tainted 2.6.32-431.5.1.el6.x86_64 #1
      <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      <6>mdt01_020 D 000000000000000a 0 26426 2 0x00000000
      <4> ffff880ffa4d7af0 0000000000000046 0000000000000000 ffffffffa0c6bd75
      <4> 0000000100000000 ffffc9003aa25030 0000000000000246 0000000000000246
      <4> ffff88100aaae638 ffff880ffa4d7fd8 000000000000fbc8 ffff88100aaae638
      <4>Call Trace:
      <4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      <4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
      <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
      <4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
      <4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
      <4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
      <4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
      <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

      This thread had been stuck for some time before the LBUG. In all of these instances, it is stuck in a rather odd spot in cfs_hash_bd_lookup_intent:
      match = intent_add ? NULL : hnode;
      hlist_for_each(ehnode, hhead) {
              if (!cfs_hash_keycmp(hs, key, ehnode))
                      continue;

      Specifically, the trace reports it as stuck on the cfs_hash_keycmp line. It's not clear to me how a thread could get stuck there; I may be missing some operation it performs as part of that call.
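
      To make the question concrete, here is a minimal user-space model of the kind of bucket walk quoted above. It is a sketch only, not the libcfs code: struct node, keycmp, and bucket_lookup are hypothetical stand-ins, and it assumes a plain NULL-terminated singly linked chain with string keys. Under those assumptions the walk is just pointer chasing plus a comparison, so it visits each node at most once and contains no point at which it would sleep.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a hash bucket entry; not the libcfs type. */
struct node {
	struct node *next;
	const char  *key;
};

/* Plays the role of cfs_hash_keycmp() in this model: nonzero on a match. */
static int keycmp(const char *key, const struct node *n)
{
	return strcmp(key, n->key) == 0;
}

/*
 * Walk one bucket chain and return the matching node, or NULL if none.
 * If steps is non-NULL it receives the number of nodes visited.
 */
static struct node *bucket_lookup(struct node *head, const char *key,
				  size_t *steps)
{
	struct node *found = NULL;
	struct node *n;
	size_t visited = 0;

	assert(key != NULL);

	for (n = head; n != NULL; n = n->next) {
		visited++;
		if (!keycmp(key, n))
			continue;	/* same shape as the quoted loop */
		found = n;
		break;
	}

	if (steps != NULL)
		*steps = visited;
	return found;
}
```

      In this model a lookup on a chain of three nodes visits at most three nodes whether or not the key is present. Whether the real chain can violate the assumptions above is not established by this sketch.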

      I'll make the dump available shortly.

              People

              • Assignee:
                di.wang Di Wang
                Reporter:
                paf Patrick Farrell (Inactive)
