Details
Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.9.0
Labels: None
Environment: 3.10.0-514.10.2.el7_lustre.x86_64, lustre-2.9.0_srcc6-1.el7.centos.x86_64
Severity: 3
Description
Our MDT has been stuck or barely usable twice in a row recently. The second time, we took a crash dump, which shows several threads blocked in lod_qos_prep_create...
PID: 291558 TASK: ffff88203c7b2f10 CPU: 9 COMMAND: "mdt01_030"
#0 [ffff881a157f7588] __schedule at ffffffff8168b6a5
#1 [ffff881a157f75f0] schedule at ffffffff8168bcf9
#2 [ffff881a157f7600] rwsem_down_write_failed at ffffffff8168d4a5
#3 [ffff881a157f7688] call_rwsem_down_write_failed at ffffffff81327067
#4 [ffff881a157f76d0] down_write at ffffffff8168aebd
#5 [ffff881a157f76e8] lod_qos_prep_create at ffffffffa124031c [lod]
#6 [ffff881a157f77a8] lod_declare_striped_object at ffffffffa1239a8c [lod]
#7 [ffff881a157f77f0] lod_declare_object_create at ffffffffa123b0f1 [lod]
#8 [ffff881a157f7838] mdd_declare_object_create_internal at ffffffffa129d21f [mdd]
#9 [ffff881a157f7880] mdd_declare_create at ffffffffa1294133 [mdd]
#10 [ffff881a157f78f0] mdd_create at ffffffffa1295689 [mdd]
#11 [ffff881a157f79e8] mdt_reint_open at ffffffffa1176f05 [mdt]
#12 [ffff881a157f7ad8] mdt_reint_rec at ffffffffa116c4a0 [mdt]
#13 [ffff881a157f7b00] mdt_reint_internal at ffffffffa114edc2 [mdt]
#14 [ffff881a157f7b38] mdt_intent_reint at ffffffffa114f322 [mdt]
#15 [ffff881a157f7b78] mdt_intent_policy at ffffffffa1159b9c [mdt]
#16 [ffff881a157f7bd0] ldlm_lock_enqueue at ffffffffa0b461e7 [ptlrpc]
#17 [ffff881a157f7c28] ldlm_handle_enqueue0 at ffffffffa0b6f3a3 [ptlrpc]
#18 [ffff881a157f7cb8] tgt_enqueue at ffffffffa0befe12 [ptlrpc]
#19 [ffff881a157f7cd8] tgt_request_handle at ffffffffa0bf4275 [ptlrpc]
#20 [ffff881a157f7d20] ptlrpc_server_handle_request at ffffffffa0ba01fb [ptlrpc]
#21 [ffff881a157f7de8] ptlrpc_main at ffffffffa0ba42b0 [ptlrpc]
#22 [ffff881a157f7ec8] kthread at ffffffff810b06ff
#23 [ffff881a157f7f50] ret_from_fork at ffffffff81696b98
The disk array (from Dell) that we use for the MDT doesn't report any issues. The load was not particularly high. `kmem -i` does report 76 GB of free memory (60% of TOTAL MEM).
Attaching the output of `foreach bt`, in case somebody has a clue.
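In case it helps whoever digs into the dump: the contended rw_semaphore can usually be located and inspected from within crash itself. This is only a sketch (the address is a placeholder that has to be fished out of the blocked thread's saved stack frames):

crash> bt -f 291558
crash> struct rw_semaphore <address from the frames around rwsem_down_write_failed>
crash> foreach bt | grep -B 2 rwsem_down

The struct output (count, wait_list and, if this kernel tracks it, owner) should show how many waiters are queued and which task is holding the semaphore for write.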
Each time, failing over the MDT resumed operations, but recovery was fairly long, with a few evictions.
Lustre: oak-MDT0000: Recovery over after 13:39, of 1144 clients 1134 recovered and 10 were evicted.
Thanks!
Stephane
Hi Stephane,
That's good news. If OSTs fail to create objects due to a backend storage problem, object creation on the MDT will block; in that situation there is not much we can do except wait for the storage to recover. Can we close this ticket now? Thanks.
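For reference, the usual way to confirm this failure mode from the MDS is to look at the OSP precreation state for each OST. A minimal sketch, assuming the standard parameter names (they can differ a little between Lustre versions):

mds# lctl get_param osp.*.prealloc_status
mds# lctl get_param osp.*.prealloc_next_id osp.*.prealloc_last_id

A negative errno in prealloc_status, or prealloc_next_id stuck against prealloc_last_id, means object precreation to that OST has stalled, and MDT threads creating striped files will then queue up in lod_qos_prep_create just like in the stack trace above.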