Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.5
- Labels: None
- Severity: 4
- Rank: 11731
Description
After recovery of a crashed MDS, the system load climbs above 800.
The filesystem is DOWN. We need help to bring the filesystem back online!
Here is the error:
Lustre: Skipped 2 previous similar messages
Lustre: Service thread pid 7014 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7014, comm: mdt_01
Call Trace:
[<ffffffff8151d552>] schedule_timeout+0x192/0x2e0
[<ffffffff8107bf80>] ? process_timeout+0x0/0x10
[<ffffffffa04e45e1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa0da2508>] osc_create+0x528/0xdc0 [osc]
[<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
[<ffffffffa0e13337>] lov_check_and_create_object+0x187/0x570 [lov]
[<ffffffffa0e13a1b>] qos_remedy_create+0x1db/0x220 [lov]
[<ffffffffa0e1059a>] lov_fini_create_set+0x24a/0x1200 [lov]
[<ffffffffa0dfa0f2>] lov_create+0x792/0x1400 [lov]
[<ffffffffa11000d6>] ? mdd_get_md+0x96/0x2f0 [mdd]
[<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
[<ffffffffa1120916>] ? mdd_read_unlock+0x26/0x30 [mdd]
[<ffffffffa110490e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
[<ffffffffa1116871>] mdd_create+0xf81/0x1a90 [mdd]
[<ffffffffa121edf3>] ? osd_oi_lookup+0x83/0x110 [osd_ldiskfs]
[<ffffffffa121956c>] ? osd_object_init+0xdc/0x3e0 [osd_ldiskfs]
[<ffffffffa124f3f7>] cml_create+0x97/0x250 [cmm]
[<ffffffffa118b5e1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
[<ffffffffa11a106e>] mdt_reint_open+0x1aae/0x28a0 [mdt]
[<ffffffffa077a724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
[<ffffffffa111956e>] ? md_ucred+0x1e/0x60 [mdd]
[<ffffffffa1189c81>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa1180ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa118153d>] mdt_intent_reint+0x1ed/0x530 [mdt]
[<ffffffffa117fc09>] mdt_intent_policy+0x379/0x690 [mdt]
[<ffffffffa0736351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[<ffffffffa075c1ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[<ffffffffa1180586>] mdt_enqueue+0x46/0x130 [mdt]
[<ffffffffa1175772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa1176665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa078ab4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa0789f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0789f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa0789f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Issue Links
- is related to LU-4335 "MDS hangs due to mdt thread hung/inactive" (Resolved)
Sure Bob, this last thread should be the one blocking all the others in the JBD2 layer! And again, as in one of my original updates on 19/Nov/2013, the device concerned is "dm-3-8", which Mahmoud confirmed to be the MDT.
And yes, it looks like a duplicate of
LU-4794, but also of earlier tickets that were simply closed because no new occurrences were reported... What would be useful now is to determine whether this last thread has been scheduled recently; if not, why not, and if so, why it keeps looping and re-schedule()ing (t_updates != 0?).