Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
3
-
5859
Description
replay-single test 48 hangs consistently for me.
There's a stack trace for a hung task in the log, and it looks like that task never finishes:
[246707.608040] LNet: Service thread pid 16278 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[246707.608566] Pid: 16278, comm: mdt00_001
[246707.608714]
[246707.608715] Call Trace:
[246707.609128] [<ffffffffa0f9b7ae>] cfs_waitq_wait+0xe/0x10 [libcfs]
[246707.609381] [<ffffffffa09d26d4>] osp_precreate_reserve+0x3a4/0x620 [osp]
[246707.609664] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[246707.609914] [<ffffffffa09d1633>] osp_declare_object_create+0x163/0x540 [osp]
[246707.610746] [<ffffffffa098a4bd>] lod_qos_declare_object_on+0xed/0x4c0 [lod]
[246707.611049] [<ffffffffa098c094>] lod_alloc_rr.clone.2+0x624/0xd90 [lod]
[246707.611313] [<ffffffffa098db8c>] lod_qos_prep_create+0xe5c/0x1848 [lod]
[246707.611610] [<ffffffffa098886b>] lod_declare_striped_object+0x14b/0x920 [lod]
[246707.612053] [<ffffffffa0989348>] lod_declare_object_create+0x308/0x4f0 [lod]
[246707.612465] [<ffffffffa07364bf>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
[246707.612926] [<ffffffffa07475ea>] mdd_create+0x39a/0x1550 [mdd]
[246707.613334] [<ffffffffa08cd759>] mdt_reint_open+0x1079/0x1860 [mdt]
[246707.613649] [<ffffffffa1075140>] ? lu_ucred+0x20/0x30 [obdclass]
[246707.613897] [<ffffffffa0898655>] ? mdt_ucred+0x15/0x20 [mdt]
[246707.614105] [<ffffffffa08b8651>] mdt_reint_rec+0x41/0xe0 [mdt]
[246707.614347] [<ffffffffa08b1b13>] mdt_reint_internal+0x4e3/0x7e0 [mdt]
[246707.614559] [<ffffffffa08b20dd>] mdt_intent_reint+0x1ed/0x500 [mdt]
[246707.614854] [<ffffffffa08adca5>] mdt_intent_policy+0x3c5/0x800 [mdt]
[246707.615163] [<ffffffffa11c643a>] ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc]
[246707.615486] [<ffffffffa11ef3b7>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc]
[246707.615812] [<ffffffffa08ad7f6>] mdt_enqueue+0x46/0x130 [mdt]
[246707.616091] [<ffffffffa08a1822>] mdt_handle_common+0x932/0x1750 [mdt]
[246707.616327] [<ffffffffa08a2715>] mdt_regular_handle+0x15/0x20 [mdt]
[246707.616560] [<ffffffffa121d953>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[246707.616994] [<ffffffffa0f9b66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[246707.617304] [<ffffffffa1216621>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[246707.617595] [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[246707.617888] [<ffffffffa122048d>] ptlrpc_main+0xb3d/0x18e0 [ptlrpc]
[246707.618203] [<ffffffffa121f950>] ? ptlrpc_main+0x0/0x18e0 [ptlrpc]
[246707.618431] [<ffffffff8100c14a>] child_rip+0xa/0x20
[246707.618628] [<ffffffffa121f950>] ? ptlrpc_main+0x0/0x18e0 [ptlrpc]
[246707.618944] [<ffffffffa121f950>] ? ptlrpc_main+0x0/0x18e0 [ptlrpc]
[246707.619190] [<ffffffff8100c140>] ? child_rip+0x0/0x20
I have a crash dump for such an occurrence as well.
This dump was taken with the patch from lu2285 applied, but the hang also happens without that patch.
Landed for 2.4