[LU-2500] replay-single test 48 lockup Created: 15/Dec/12  Updated: 22/Dec/12  Resolved: 22/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: HB

Severity: 3
Rank (Obsolete): 5859

 Description   

I am seeing replay-single test 48 hang consistently.
The log contains a stack trace for a hung task, and that task apparently never finishes:

[246707.608040] LNet: Service thread pid 16278 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[246707.608566] Pid: 16278, comm: mdt00_001
[246707.608714] 
[246707.608715] Call Trace:
[246707.609128]  [<ffffffffa0f9b7ae>] cfs_waitq_wait+0xe/0x10 [libcfs]
[246707.609381]  [<ffffffffa09d26d4>] osp_precreate_reserve+0x3a4/0x620 [osp]
[246707.609664]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[246707.609914]  [<ffffffffa09d1633>] osp_declare_object_create+0x163/0x540 [osp]
[246707.610746]  [<ffffffffa098a4bd>] lod_qos_declare_object_on+0xed/0x4c0 [lod]
[246707.611049]  [<ffffffffa098c094>] lod_alloc_rr.clone.2+0x624/0xd90 [lod]
[246707.611313]  [<ffffffffa098db8c>] lod_qos_prep_create+0xe5c/0x1848 [lod]
[246707.611610]  [<ffffffffa098886b>] lod_declare_striped_object+0x14b/0x920 [lod]
[246707.612053]  [<ffffffffa0989348>] lod_declare_object_create+0x308/0x4f0 [lod]
[246707.612465]  [<ffffffffa07364bf>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
[246707.612926]  [<ffffffffa07475ea>] mdd_create+0x39a/0x1550 [mdd]
[246707.613334]  [<ffffffffa08cd759>] mdt_reint_open+0x1079/0x1860 [mdt]
[246707.613649]  [<ffffffffa1075140>] ? lu_ucred+0x20/0x30 [obdclass]
[246707.613897]  [<ffffffffa0898655>] ? mdt_ucred+0x15/0x20 [mdt]
[246707.614105]  [<ffffffffa08b8651>] mdt_reint_rec+0x41/0xe0 [mdt]
[246707.614347]  [<ffffffffa08b1b13>] mdt_reint_internal+0x4e3/0x7e0 [mdt]
[246707.614559]  [<ffffffffa08b20dd>] mdt_intent_reint+0x1ed/0x500 [mdt]
[246707.614854]  [<ffffffffa08adca5>] mdt_intent_policy+0x3c5/0x800 [mdt]
[246707.615163]  [<ffffffffa11c643a>] ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc]
[246707.615486]  [<ffffffffa11ef3b7>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc]
[246707.615812]  [<ffffffffa08ad7f6>] mdt_enqueue+0x46/0x130 [mdt]
[246707.616091]  [<ffffffffa08a1822>] mdt_handle_common+0x932/0x1750 [mdt]
[246707.616327]  [<ffffffffa08a2715>] mdt_regular_handle+0x15/0x20 [mdt]
[246707.616560]  [<ffffffffa121d953>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[246707.616994]  [<ffffffffa0f9b66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[246707.617304]  [<ffffffffa1216621>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[246707.617595]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[246707.617888]  [<ffffffffa122048d>] ptlrpc_main+0xb3d/0x18e0 [ptlrpc]
[246707.618203]  [<ffffffffa121f950>] ? ptlrpc_main+0x0/0x18e0 [ptlrpc]
[246707.618431]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[246707.618628]  [<ffffffffa121f950>] ? ptlrpc_main+0x0/0x18e0 [ptlrpc]
[246707.618944]  [<ffffffffa121f950>] ? ptlrpc_main+0x0/0x18e0 [ptlrpc]
[246707.619190]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

I have a crash dump for such an occurrence as well.
The dump was taken with the patch from LU-2285 applied, but the hang also happens without that patch.
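For context on what the trace shows (a sketch of the wait pattern only, not the actual Lustre code): the MDT service thread is parked in osp_precreate_reserve(), waiting for the OSP precreate machinery to replenish the pool of objects precreated on the OST. If that producer side never delivers, the waiter sleeps indefinitely and the service-thread watchdog fires, as above. A minimal userspace model of this producer/consumer wait, with hypothetical names:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;
static int precreated;                  /* objects available to hand out */

/* Consumer: roughly the wait that osp_precreate_reserve() performs. */
static void reserve(void)
{
        pthread_mutex_lock(&lock);
        while (precreated == 0)                  /* nothing precreated yet */
                pthread_cond_wait(&more, &lock); /* the thread sleeps here */
        precreated--;
        pthread_mutex_unlock(&lock);
}

/* Producer: the precreate thread is expected to do this eventually. */
static void replenish(int n)
{
        pthread_mutex_lock(&lock);
        precreated += n;
        pthread_cond_broadcast(&more);
        pthread_mutex_unlock(&lock);
}

static void *service_thread(void *arg)
{
        (void)arg;
        reserve();
        puts("object reserved, request can proceed");
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, service_thread, NULL);
        sleep(1);       /* let the service thread block in reserve() */
        replenish(1);   /* in the reported hang, this step never happens,
                         * so reserve() sleeps forever and the watchdog
                         * dumps the stack after 40 s of inactivity */
        pthread_join(t, NULL);
        return 0;
}

In the failure reported here, the analogue of replenish() never runs, so reserve() never returns; that matches the 40.00s inactivity warning at the top of the trace.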



 Comments   
Comment by Alex Zhuravlev [ 17/Dec/12 ]

Can you attach the full dmesg and Lustre logs, please?

Comment by Alex Zhuravlev [ 18/Dec/12 ]

I was able to reproduce this.

Comment by Alex Zhuravlev [ 18/Dec/12 ]

please try with http://review.whamcloud.com/4846

Comment by Peter Jones [ 22/Dec/12 ]

Landed for 2.4
