[LU-6143] Client hangs with Lustre 2.6 Created: 20/Jan/15  Updated: 19/Mar/15  Resolved: 19/Mar/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL6.5 clients running Lustre 2.6.0


Severity: 3
Rank (Obsolete): 17122

 Description   

Recently, users have run into conditions where their file transfers never complete. The backtrace observed is as follows:

Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: INFO: task tar:12650 blocked for more than 120 seconds.
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: Tainted: G W --------------- 2.6.32-504.el6.x86_64 #1
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: tar D 0000000000000002 0 12650 11138 0x00000080
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: ffff8805f6919aa8 0000000000000086 ffff8805f6919a48 ffff8805f6919a28
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: ffff8805f6919a28 0000000000000082 ffff880122eebaa8 ffff8800282919a0
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: ffffffff81aac480 0000000000000282 ffff8808398b7058 ffff8805f6919fd8
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: Call Trace:
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a41e25>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152b346>] __mutex_lock_slowpath+0x96/0x210
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a696ce>] ? req_capsule_get_size+0x4e/0x90 [ptlrpc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152ae6b>] mutex_lock+0x2b/0x50
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0c1a96c>] mdc_reint+0x3c/0x3b0 [mdc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0c1c6f0>] mdc_setattr+0x2a0/0xa00 [mdc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0bc6962>] lmv_setattr+0x232/0x5c0 [lmv]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0cfcfe6>] ll_md_setattr+0x116/0x940 [lustre]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a1fc60>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152c166>] ? down_read+0x16/0x30
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0cfde6f>] ll_setattr_raw+0x24f/0x10d0 [lustre]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0cfed55>] ll_setattr+0x65/0xd0 [lustre]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811ad0a8>] notify_change+0x168/0x340
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811c15ac>] utimes_common+0xdc/0x1b0
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff810f036e>] ? call_rcu+0xe/0x10
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811c1750>] do_utimes+0xd0/0x170
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811c18f2>] sys_utimensat+0x32/0x90
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: Call Trace:
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a41e25>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152b346>] __mutex_lock_slowpath+0x96/0x210
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a696ce>] ? req_capsule_get_size+0x4e/0x90 [ptlrpc]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152ae6b>] mutex_lock+0x2b/0x50
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0c1a96c>] mdc_reint+0x3c/0x3b0 [mdc]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0c1c6f0>] mdc_setattr+0x2a0/0xa00 [mdc]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0bc6962>] lmv_setattr+0x232/0x5c0 [lmv]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0cfcfe6>] ll_md_setattr+0x116/0x940 [lustre]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a1fc60>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152c166>] ? down_read+0x16/0x30
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0cfde6f>] ll_setattr_raw+0x24f/0x10d0 [lustre]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0cfed55>] ll_setattr+0x65/0xd0 [lustre]
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811ad0a8>] notify_change+0x168/0x340
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811c15ac>] utimes_common+0xdc/0x1b0
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff810f036e>] ? call_rcu+0xe/0x10
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811c1750>] do_utimes+0xd0/0x170
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff811c18f2>] sys_utimensat+0x32/0x90
Jan 2 23:12:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
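
For readers triaging traces like the two above, the following is a minimal Python sketch that pulls the frames out of a hung-task report and keeps only the ones the kernel did not mark with "?" (frames marked "?" are addresses found on the stack that may not belong to the actual call chain). The names `Frame`, `parse_frames`, and `reliable_path` are hypothetical helpers invented for this illustration, not part of Lustre or any kernel tooling, and the regular expression assumes the frame layout seen in this ticket. It tolerates frames that are wrapped across two lines.

```python
import re
from typing import List, NamedTuple

# One frame looks like: [<addr>] [?] symbol+0xOFF/0xSIZE [module]
# \s+ also matches newlines, so frames wrapped across lines still parse.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


class Frame(NamedTuple):
    symbol: str
    module: str     # "kernel" when the frame carries no [module] tag
    reliable: bool  # False when the kernel printed "?" before the symbol


def parse_frames(log_text: str) -> List[Frame]:
    """Return every stack frame found in log_text, in printed order."""
    return [
        Frame(m["symbol"], m["module"] or "kernel", m["unreliable"] is None)
        for m in FRAME_RE.finditer(log_text)
    ]


def reliable_path(frames: List[Frame]) -> List[str]:
    """Innermost-first call chain, skipping frames marked with '?'."""
    return [f"{f.module}:{f.symbol}" for f in frames if f.reliable]


# A few frames copied from the trace in this ticket, in wrapped form.
SAMPLE = """\
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0a41e25>] ?
lustre_msg_buf+0x55/0x60 [ptlrpc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152b346>]
__mutex_lock_slowpath+0x96/0x210
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffff8152ae6b>]
mutex_lock+0x2b/0x50
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0c1a96c>]
mdc_reint+0x3c/0x3b0 [mdc]
Jan 2 23:10:28 dtn01.ccs.ornl.gov kernel: [<ffffffffa0c1c6f0>]
mdc_setattr+0x2a0/0xa00 [mdc]
"""

if __name__ == "__main__":
    for entry in reliable_path(parse_frames(SAMPLE)):
        print(entry)
```

On the sample, the reliable chain starts at `__mutex_lock_slowpath` and `mutex_lock` and continues into `mdc_reint`, which matches the observation that tar is waiting on a mutex taken from the MDC setattr path.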



 Comments   
Comment by James A Simmons [ 20/Jan/15 ]

Looks like LU-4427.

Comment by Jian Yu [ 20/Jan/15 ]

Hi Jinshan,

Is this similar to LU-6085 / LU-5968 ?

Comment by Jinshan Xiong (Inactive) [ 20/Jan/15 ]

Hi Yujian,

They are different. This one is stuck on acquiring mdc_get_rpc_lock().

Comment by Jinshan Xiong (Inactive) [ 21/Jan/15 ]

Hi James, is the node still alive? I'd like to get the stack traces of all processes on that node. If the node has already been rebooted, please collect them the next time the problem is reproduced.
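
One common way to capture the stacks requested here is the kernel's SysRq interface: writing `t` to `/proc/sysrq-trigger` asks the kernel to dump the state of all tasks to the kernel log, and `w` limits the dump to tasks in uninterruptible (blocked) state. The sketch below is only an illustration under those assumptions; the helper name `request_task_dump` and its parameters are hypothetical, it must run as root on the affected client, and the ticket does not say which method was intended.

```python
from pathlib import Path

# SysRq keys used here: "t" dumps all tasks, "w" dumps blocked (D state) tasks.
ALLOWED_KEYS = {"t", "w"}


def request_task_dump(key: str = "t",
                      trigger: str = "/proc/sysrq-trigger") -> None:
    """Ask the kernel to print task stack traces to the kernel log.

    The `trigger` path is a parameter only so the function can be exercised
    against a scratch file; on a real node keep the default and run as root.
    """
    if key not in ALLOWED_KEYS:
        raise ValueError(f"unsupported SysRq key: {key!r}")
    Path(trigger).write_text(key)
```

After calling `request_task_dump("t")` on the hung node, the traces should appear in the kernel ring buffer (for example via `dmesg`) and in the syslog file that recorded the hung-task messages, from where they can be attached to the ticket.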

Comment by Jian Yu [ 12/Mar/15 ]

Hi James,
Has the issue occurred again recently? If not, can we close it as "Cannot Reproduce" for now and reopen it if it recurs, once enough logs have been gathered?

Comment by Jian Yu [ 19/Mar/15 ]

The ticket is closed. If it occurs again, please feel free to reopen it.
