Details
Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.9.0
Labels: None
Environment: 3.10.0-514.10.2.el7_lustre.x86_64, lustre-2.9.0_srcc6-1.el7.centos.x86_64
Severity: 3
Description
Our MDT has been stuck or barely usable twice in a row recently. The second time, we took a crash dump, which shows several threads blocked in lod_qos_prep_create...
PID: 291558 TASK: ffff88203c7b2f10 CPU: 9 COMMAND: "mdt01_030"
#0 [ffff881a157f7588] __schedule at ffffffff8168b6a5
#1 [ffff881a157f75f0] schedule at ffffffff8168bcf9
#2 [ffff881a157f7600] rwsem_down_write_failed at ffffffff8168d4a5
#3 [ffff881a157f7688] call_rwsem_down_write_failed at ffffffff81327067
#4 [ffff881a157f76d0] down_write at ffffffff8168aebd
#5 [ffff881a157f76e8] lod_qos_prep_create at ffffffffa124031c [lod]
#6 [ffff881a157f77a8] lod_declare_striped_object at ffffffffa1239a8c [lod]
#7 [ffff881a157f77f0] lod_declare_object_create at ffffffffa123b0f1 [lod]
#8 [ffff881a157f7838] mdd_declare_object_create_internal at ffffffffa129d21f [mdd]
#9 [ffff881a157f7880] mdd_declare_create at ffffffffa1294133 [mdd]
#10 [ffff881a157f78f0] mdd_create at ffffffffa1295689 [mdd]
#11 [ffff881a157f79e8] mdt_reint_open at ffffffffa1176f05 [mdt]
#12 [ffff881a157f7ad8] mdt_reint_rec at ffffffffa116c4a0 [mdt]
#13 [ffff881a157f7b00] mdt_reint_internal at ffffffffa114edc2 [mdt]
#14 [ffff881a157f7b38] mdt_intent_reint at ffffffffa114f322 [mdt]
#15 [ffff881a157f7b78] mdt_intent_policy at ffffffffa1159b9c [mdt]
#16 [ffff881a157f7bd0] ldlm_lock_enqueue at ffffffffa0b461e7 [ptlrpc]
#17 [ffff881a157f7c28] ldlm_handle_enqueue0 at ffffffffa0b6f3a3 [ptlrpc]
#18 [ffff881a157f7cb8] tgt_enqueue at ffffffffa0befe12 [ptlrpc]
#19 [ffff881a157f7cd8] tgt_request_handle at ffffffffa0bf4275 [ptlrpc]
#20 [ffff881a157f7d20] ptlrpc_server_handle_request at ffffffffa0ba01fb [ptlrpc]
#21 [ffff881a157f7de8] ptlrpc_main at ffffffffa0ba42b0 [ptlrpc]
#22 [ffff881a157f7ec8] kthread at ffffffff810b06ff
#23 [ffff881a157f7f50] ret_from_fork at ffffffff81696b98
The disk array (from Dell) that we use for the MDT doesn't report any issues. The load was not particularly high. `kmem -i` does report 76 GB of free memory (60% of TOTAL MEM).
Attaching the output of `foreach bt`, in case somebody has a clue.
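In case it helps whoever digs into the dump: the contended rw_semaphore can usually be located and inspected from within crash itself. This is only a sketch (the address is a placeholder that has to be fished out of the blocked thread's saved stack frames):

crash> bt -f 291558
crash> struct rw_semaphore <address from the frames around rwsem_down_write_failed>
crash> foreach bt | grep -B 2 rwsem_down

The struct output (count, wait_list and, if this kernel tracks it, owner) should show how many waiters are queued and which task is holding the semaphore for write.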
Each time, failing over the MDT resumed operations, but recovery was fairly long, with a few evictions.
Lustre: oak-MDT0000: Recovery over after 13:39, of 1144 clients 1134 recovered and 10 were evicted.
Thanks!
Stephane
Hi Stephane,
That's good news. If OSTs fail to create objects due to a backend storage problem, object creation on the MDT will block; in that situation there is not much we can do except wait for the storage to recover. Can we close this ticket now? Thanks.
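For reference, the usual way to confirm this failure mode from the MDS is to look at the OSP precreation state for each OST. A minimal sketch, assuming the standard parameter names (they can differ a little between Lustre versions):

mds# lctl get_param osp.*.prealloc_status
mds# lctl get_param osp.*.prealloc_next_id osp.*.prealloc_last_id

A negative errno in prealloc_status, or prealloc_next_id stuck against prealloc_last_id, means object precreation to that OST has stalled, and MDT threads creating striped files will then queue up in lod_qos_prep_create just like in the stack trace above.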