LU-12136: mdt threads blocked in mdd_create

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.10.6
    • None
    • 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • 3

    Description

      We had an issue yesterday on Oak storage running Lustre 2.10.6. MDT0000 did not crash, but the filesystem hung. Several stack traces showed up on oak-md1-s2 (serving MDT0000). Note: Oak uses DNE1, and a second MDT, MDT0001, is mounted on oak-md1-s1, but I did not find any stack traces on that server. Restarting MDT0000 fixed the issue (after applying a workaround to mitigate LU-8992).

      My short-term plan is to upgrade Oak to 2.10.7 in a rolling fashion, but I thought it would be useful to have a ticket to track this issue. I'm also attaching the kernel logs from this server as oak-md1-s2-kernel.log, where all the stack traces can be seen.

      The first call trace was:

      Mar 29 09:38:38 oak-md1-s2 kernel: INFO: task mdt00_003:3491 blocked for more than 120 seconds.
      Mar 29 09:38:38 oak-md1-s2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Mar 29 09:38:38 oak-md1-s2 kernel: mdt00_003       D ffffffff00000000     0  3491      2 0x00000080
      Mar 29 09:38:38 oak-md1-s2 kernel: ffff88201e3f74b8 0000000000000046 ffff88201e3c3f40 ffff88201e3f7fd8
      Mar 29 09:38:38 oak-md1-s2 kernel: ffff88201e3f7fd8 ffff88201e3f7fd8 ffff88201e3c3f40 ffff88201e3c3f40
      Mar 29 09:38:38 oak-md1-s2 kernel: ffff88101fc13248 ffff88101fc13240 fffffffe00000001 ffffffff00000000
      Mar 29 09:38:38 oak-md1-s2 kernel: Call Trace:
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816aadd5>] rwsem_down_write_failed+0x225/0x3a0
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff81332047>] call_rwsem_down_write_failed+0x17/0x30
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a87cd>] down_write+0x2d/0x3d
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121b34f>] lod_alloc_qos.constprop.17+0x1af/0x1590 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0fa49a1>] ? qsd_op_begin0+0x181/0x940 [lquota]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ed322f>] ? ldiskfs_xattr_ibody_get+0xef/0x1a0 [ldiskfs]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc12204d1>] lod_qos_prep_create+0x1291/0x17f0 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1220bf9>] ? lod_prepare_inuse+0x1c9/0x2e0 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1220f6d>] lod_prepare_create+0x25d/0x360 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121578e>] lod_declare_striped_create+0x1ee/0x970 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1217c04>] lod_declare_create+0x1e4/0x540 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc12828cf>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1274023>] mdd_declare_create+0x53/0xe20 [mdd]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1277ec9>] mdd_create+0x879/0x1400 [mdd]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc114ab93>] mdt_reint_open+0x2173/0x3190 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0931dde>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112fad3>] ? ucred_set_jobid+0x53/0x70 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc113fa40>] mdt_reint_rec+0x80/0x210 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112131b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1121842>] mdt_intent_reint+0x162/0x430 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112c5ae>] mdt_intent_policy+0x43e/0xc70 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0afc12f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0af5277>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b1e9e3>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b46bc0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ba3e92>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ba7d95>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b50bf6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b4d228>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b54332>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b538a0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
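
To go through all the hung-task reports in a kernel log like the attached one, a small script along the following lines may help. This is only a minimal sketch, not something that was run against the attached log: the names `parse_hung_tasks` and `first_module_frame` are hypothetical, and the regular expressions assume the exact line format shown in the trace above. It groups frames under each "blocked for more than N seconds" header, skips the unreliable frames marked with `?`, and reports the innermost frame that belongs to a kernel module.

```python
import re

# Header printed by the kernel hung-task detector, as in the trace above.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One stack frame: "[<addr>] func+0xOFF/0xSIZE [module]".
# A "? " before the function name marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?(?P<func>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def parse_hung_tasks(lines):
    """Group reliable stack frames under each hung-task header.

    Limitation (assumption of this sketch): any frame seen after a header
    is attributed to that header until the next header appears.
    """
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header["name"],
                "pid": int(header["pid"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame["unreliable"]:
            current["frames"].append((frame["func"], frame["module"]))
    return reports


def first_module_frame(report):
    """Return the innermost reliable frame that lives in a module, or None."""
    for func, module in report["frames"]:
        if module is not None:
            return func, module
    return None


# A few lines copied from the trace above, used as sample input.
SAMPLE = [
    "Mar 29 09:38:38 oak-md1-s2 kernel: INFO: task mdt00_003:3491 blocked for more than 120 seconds.",
    "Mar 29 09:38:38 oak-md1-s2 kernel: Call Trace:",
    "Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70",
    "Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a87cd>] down_write+0x2d/0x3d",
    "Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121b34f>] lod_alloc_qos.constprop.17+0x1af/0x1590 [lod]",
    "Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0fa49a1>] ? qsd_op_begin0+0x181/0x940 [lquota]",
]

for report in parse_hung_tasks(SAMPLE):
    print(report["task"], report["pid"], first_module_frame(report))
```

For the first trace, the innermost module frame is the `lod_alloc_qos` call in `lod`, which is the function that called `down_write` and is waiting on the semaphore. Running the same grouping over the full log would show whether the other blocked threads are waiting at the same place.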
      

      Attachments

        1. LU-12136-OSS-logs.tar.gz
          49 kB
          Stephane Thiell
        2. oak-md1-s2-kernel.log
          450 kB
          Stephane Thiell

            People

              bzzz Alex Zhuravlev
              sthiell Stephane Thiell
              Votes: 1
              Watchers: 13