Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.6
- Labels: None
- Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
- Severity: 3
Description
We had an issue yesterday on the Oak storage system running Lustre 2.10.6. MDT0000 didn't crash, but the filesystem got stuck, and several stack traces showed up on oak-md1-s2 (the server for MDT0000). Note that Oak uses DNE1 and a second MDT (MDT0001) is mounted on oak-md1-s1, but I didn't find any stack traces on that server. Restarting MDT0000 fixed the issue (after a workaround to mitigate LU-8992).
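For the record, the restart itself was just the usual Lustre target remount on the MDS. A minimal sketch, with hypothetical device and mount point names (the LU-8992 mitigation step is not shown):

    # on oak-md1-s2; device and mount point names are examples only
    umount /mnt/mdt0
    mount -t lustre /dev/mapper/mdt0 /mnt/mdt0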
My short-term plan is to upgrade Oak to 2.10.7 in a rolling fashion, but I thought it would be useful to have a ticket to track this issue. I'm attaching the kernel logs from this server as oak-md1-s2-kernel.log, where all of the stack traces can be seen.
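For context, the messages below come from the standard Linux hung-task detector. A minimal sketch of the relevant sysctls (stock kernel knobs; the values shown are examples, not Oak settings):

    # threshold behind the "blocked for more than 120 seconds" messages
    sysctl kernel.hung_task_timeout_secs
    # silence the messages entirely, as the log hint itself suggests
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs
    # the detector stops after a fixed warning budget (default 10); reset it
    echo 10 > /proc/sys/kernel/hung_task_warnings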
The first call trace was:
Mar 29 09:38:38 oak-md1-s2 kernel: INFO: task mdt00_003:3491 blocked for more than 120 seconds.
Mar 29 09:38:38 oak-md1-s2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 29 09:38:38 oak-md1-s2 kernel: mdt00_003 D ffffffff00000000 0 3491 2 0x00000080
Mar 29 09:38:38 oak-md1-s2 kernel: ffff88201e3f74b8 0000000000000046 ffff88201e3c3f40 ffff88201e3f7fd8
Mar 29 09:38:38 oak-md1-s2 kernel: ffff88201e3f7fd8 ffff88201e3f7fd8 ffff88201e3c3f40 ffff88201e3c3f40
Mar 29 09:38:38 oak-md1-s2 kernel: ffff88101fc13248 ffff88101fc13240 fffffffe00000001 ffffffff00000000
Mar 29 09:38:38 oak-md1-s2 kernel: Call Trace:
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816aadd5>] rwsem_down_write_failed+0x225/0x3a0
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff81332047>] call_rwsem_down_write_failed+0x17/0x30
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a87cd>] down_write+0x2d/0x3d
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121b34f>] lod_alloc_qos.constprop.17+0x1af/0x1590 [lod]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0fa49a1>] ? qsd_op_begin0+0x181/0x940 [lquota]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ed322f>] ? ldiskfs_xattr_ibody_get+0xef/0x1a0 [ldiskfs]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc12204d1>] lod_qos_prep_create+0x1291/0x17f0 [lod]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1220bf9>] ? lod_prepare_inuse+0x1c9/0x2e0 [lod]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1220f6d>] lod_prepare_create+0x25d/0x360 [lod]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121578e>] lod_declare_striped_create+0x1ee/0x970 [lod]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1217c04>] lod_declare_create+0x1e4/0x540 [lod]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc12828cf>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1274023>] mdd_declare_create+0x53/0xe20 [mdd]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1277ec9>] mdd_create+0x879/0x1400 [mdd]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc114ab93>] mdt_reint_open+0x2173/0x3190 [mdt]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0931dde>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112fad3>] ? ucred_set_jobid+0x53/0x70 [mdt]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc113fa40>] mdt_reint_rec+0x80/0x210 [mdt]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112131b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1121842>] mdt_intent_reint+0x162/0x430 [mdt]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112c5ae>] mdt_intent_policy+0x43e/0xc70 [mdt]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0afc12f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0af5277>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b1e9e3>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b46bc0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ba3e92>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ba7d95>] tgt_request_handle+0x925/0x1370 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b50bf6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b4d228>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b54332>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b538a0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
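The trace shows the mdt00_003 service thread blocked in down_write() called from lod_alloc_qos(), i.e. stuck waiting for a rwsem taken during QoS object allocation at file create time. If this happens again before the upgrade, the following standard commands should capture more state from the MDS (PID 3491 is the thread from the trace above; the output file name is an example):

    # kernel stack of the blocked mdt thread
    cat /proc/3491/stack
    # dump all task states to the kernel ring buffer (verbose; needs sysrq enabled)
    echo t > /proc/sysrq-trigger
    # save the Lustre debug log for the incident window
    lctl dk /tmp/oak-md1-s2-debug.log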
Attachments
- oak-md1-s2-kernel.log
Issue Links
- is related to: LU-10697 MDT locking issues after failing over OSTs from hung OSS (Open)