[LU-3217] Client hung in cl_io_loop/cl_io_lock path Created: 24/Apr/13  Updated: 03/Sep/13  Resolved: 03/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Wally Wang (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Cray XE compute node client running with SLES11 SP1 or SP2


Severity: 2
Rank (Obsolete): 7850

 Description   

A Lustre 2.3.63 client node appeared to deadlock and hang, causing the node to lose its heartbeat. The client OS is SLES11 SP1 or SP2. Backtraces of two of the stuck "read2_01" tasks follow.

PID: 9665 TASK: ffff880105662100 CPU: 16 COMMAND: "read2_01"
#0 [ffff880105a2bb38] schedule at ffffffff812db5e5
#1 [ffff880105a2bbd0] libcfs_debug_msg at ffffffffa017bd81
#2 [ffff880105a2bc30] cl_lock_trace0 at ffffffffa02d4063
#3 [ffff880105a2bcd0] cl_lock_mutex_tail at ffffffffa02d43ad
#4 [ffff880105a2bcf0] cl_lock_mutex_get at ffffffffa02d5ba2
#5 [ffff880105a2bd20] cl_lock_release at ffffffffa02d6ba1
#6 [ffff880105a2bd50] cl_lock_link_fini at ffffffffa02ddc52
#7 [ffff880105a2bd80] cl_io_unlock at ffffffffa02dde25
#8 [ffff880105a2bdc0] cl_io_loop at ffffffffa02deb55
#9 [ffff880105a2bdf0] ll_file_io_generic at ffffffffa07a9978
#10 [ffff880105a2be60] ll_file_aio_write at ffffffffa07a9d61
#11 [ffff880105a2beb0] ll_file_write at ffffffffa07ab422
#12 [ffff880105a2bf10] vfs_write at ffffffff81117f3b
#13 [ffff880105a2bf40] sys_write at ffffffff81118105
#14 [ffff880105a2bf80] system_call_fastpath at ffffffff8100305b
RIP: 0000000020013000 RSP: 00007fffffffa368 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffffffff8100305b RCX: fefefefefefefeff
RDX: 0000000000000019 RSI: 00000000400e1ce0 RDI: 0000000000000004
RBP: 00007fffffffa4f0 R8: 00000000400e1ce0 R9: 6165722f74736574
R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000020020dc0
R13: 0000000020020d80 R14: 0000000000000000 R15: 0000000000000004
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

....

PID: 9655 TASK: ffff880105bc1820 CPU: 6 COMMAND: "read2_01"
#0 [ffff880105d63b38] schedule at ffffffff812db5e5
#1 [ffff880105d63bd0] libcfs_debug_msg at ffffffffa017bd81
#2 [ffff880105d63c30] our_vma at ffffffffa07e3914
#3 [ffff880105d63c60] vvp_io_rw_lock at ffffffffa0802282
#4 [ffff880105d63d30] vvp_io_write_lock at ffffffffa0802636
#5 [ffff880105d63d40] cl_io_lock at ffffffffa02de535
#6 [ffff880105d63dc0] cl_io_loop at ffffffffa02deb2a
#7 [ffff880105d63df0] ll_file_io_generic at ffffffffa07a9978
#8 [ffff880105d63e60] ll_file_aio_write at ffffffffa07a9d61
#9 [ffff880105d63eb0] ll_file_write at ffffffffa07ab422
#10 [ffff880105d63f10] vfs_write at ffffffff81117f3b
#11 [ffff880105d63f40] sys_write at ffffffff81118105
#12 [ffff880105d63f80] system_call_fastpath at ffffffff8100305b
RIP: 0000000020013000 RSP: 00007fffffffa368 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffffffff8100305b RCX: fefefefefefefeff
RDX: 0000000000000019 RSI: 00000000400e1ce0 RDI: 0000000000000004
RBP: 00007fffffffa4f0 R8: 00000000400e1ce0 R9: 6165722f74736574
R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000020020dc0
R13: 0000000020020d80 R14: 0000000000000000 R15: 0000000000000004
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

A dump is available at:

ftp.cray.com:/outbound/mas63-sp1-down.tar.bz2
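
Both backtraces above are parked in schedule, reached from libcfs_debug_msg, and differ only in the frames beneath that. When a dump contains many such tasks, the same comparison can be scripted. What follows is a minimal Python sketch, not something taken from the dump analysis: the helper names parse_backtraces and common_inner_frames are hypothetical, the regexes assume the frame layout shown in the traces above, and SAMPLE is abbreviated to the three innermost frames of each task.

```python
import re

# Frame line, e.g.: #1 [ffff880105a2bbd0] libcfs_debug_msg at ffffffffa017bd81
FRAME_RE = re.compile(r"^#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)$")
# Task header, e.g.: PID: 9665 TASK: ffff880105662100 CPU: 16 COMMAND: "read2_01"
TASK_RE = re.compile(
    r'^PID:\s+(\d+)\s+TASK:\s+([0-9a-f]+)\s+CPU:\s+(\d+)\s+COMMAND:\s+"([^"]*)"$'
)


def parse_backtraces(text):
    """Return {pid: [function names, innermost frame first]}."""
    traces = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        task = TASK_RE.match(line)
        if task:
            current = int(task.group(1))
            traces[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            traces[current].append(frame.group(3))
    return traces


def common_inner_frames(traces):
    """Return the run of innermost frames that every task shares."""
    stacks = list(traces.values())
    if not stacks:
        return []
    shared = []
    # zip() walks all stacks in step, starting from frame #0.
    for frames in zip(*stacks):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared


# Abbreviated copy of the two traces in this ticket (innermost three frames).
SAMPLE = """\
PID: 9665 TASK: ffff880105662100 CPU: 16 COMMAND: "read2_01"
#0 [ffff880105a2bb38] schedule at ffffffff812db5e5
#1 [ffff880105a2bbd0] libcfs_debug_msg at ffffffffa017bd81
#2 [ffff880105a2bc30] cl_lock_trace0 at ffffffffa02d4063
PID: 9655 TASK: ffff880105bc1820 CPU: 6 COMMAND: "read2_01"
#0 [ffff880105d63b38] schedule at ffffffff812db5e5
#1 [ffff880105d63bd0] libcfs_debug_msg at ffffffffa017bd81
#2 [ffff880105d63c30] our_vma at ffffffffa07e3914
"""

if __name__ == "__main__":
    traces = parse_backtraces(SAMPLE)
    print(common_inner_frames(traces))  # ['schedule', 'libcfs_debug_msg']
```

The sketch only reports which innermost frames the tasks have in common; it says nothing about why they are blocked there.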



 Comments   
Comment by Jodi Levi (Inactive) [ 24/Apr/13 ]

Jinshan,
Could you have a look at this one?
Thank you!

Comment by Cory Spitz [ 10/May/13 ]

Wally reported: "We haven't run into this problem with the latest 2.4 (tag 2.3.65), but we need more testing on our bigger test systems." We'll continue to test and keep you posted.

Comment by Cory Spitz [ 03/Sep/13 ]

Sorry for the very slow follow-up. Our reproducer stopped reproducing the hang when we moved to 2.3.65. Further testing on subsequent RCs and on 2.4.0 was fine as well. We should close this ticket now.

Comment by Peter Jones [ 03/Sep/13 ]

ok thanks Cory!
