[LU-6346] reading a file for a client hangs but is ok from others clients Created: 06/Mar/15  Updated: 13/Mar/15  Resolved: 13/Mar/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: File clientdbug.out.gz    
Severity: 3
Rank (Obsolete): 17763

 Description   

Reading a specific file (fort.1261) hangs from one client but works fine from other clients. I captured a debug trace from the client side; the file's FID is [0x20009104a:0x79c0:0x0].

I can upload the mds debug logs to the ftp site if needed.
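
For reference, a minimal sketch of how the FID above could be mapped back to a pathname and a fresh client debug log captured with standard Lustre tools; the mount point /mnt/lustre below is a placeholder, not the actual one:

 # resolve the reported FID to a path on the client mount (placeholder mount point)
 lfs fid2path /mnt/lustre '[0x20009104a:0x79c0:0x0]'
 # dump the client-side Lustre debug buffer to a file for upload
 lctl dk /tmp/clientdbug.out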



 Comments   
Comment by Mahmoud Hanafi [ 06/Mar/15 ]

Uploaded the MDS debug log to /uploads/LU6346/mdsdebug.out.gz

Comment by Peter Jones [ 06/Mar/15 ]

Bobijam

Could you please advise on this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 09/Mar/15 ]

Do the other "ok" clients have the same Lustre/kernel version as the problematic client?

Can you get the stack trace of the hung process?
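
For example (a sketch; <pid> is a placeholder for the hung process's pid):

 # stack of the single hung process
 cat /proc/<pid>/stack
 # or dump all blocked tasks' stacks to the kernel log
 echo w > /proc/sysrq-trigger
 dmesg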

Comment by Mahmoud Hanafi [ 12/Mar/15 ]

It looks like a process is stuck trying to open the file:

[0]kdb> btp 6718
Stack traceback for pid 6718
0xffff8809acffc2c0     6718    63813  0   11   D  0xffff8809acffc930  read
 [<ffffffff8147356b>] thread_return+0x0/0x295
 [<ffffffff81475725>] rwsem_down_failed_common+0xb5/0x160
 [<ffffffff81273e44>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff81474a1e>] down_read+0xe/0x10
 [<ffffffffa0cc16cb>] ll_glimpse_size+0x2b/0x70 [lustre]
 [<ffffffffa0cc6b88>] ll_inode_revalidate_it+0x198/0x1a0 [lustre]
 [<ffffffffa0cc6bce>] ll_getattr_it+0x3e/0x160 [lustre]
 [<ffffffffa0cc6d1f>] ll_getattr+0x2f/0x40 [lustre]
 [<ffffffff81161dc7>] vfs_fstat+0x37/0x60
 [<ffffffff81161e0f>] sys_newfstat+0x1f/0x50
 [<ffffffff8147d792>] system_call_fastpath+0x16/0x1b
 [<00007fffed1a2d84>] 0x7fffed1a2d84

Doing an ls -lr got stuck the same way:

[0]kdb> btp 65618
Stack traceback for pid 65618
0xffff880937de6080    65618    62532  0   23   D  0xffff880937de66f0  ls
 [<ffffffff8147356b>] thread_return+0x0/0x295
 [<ffffffff81475725>] rwsem_down_failed_common+0xb5/0x160
 [<ffffffff81273e44>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff81474a1e>] down_read+0xe/0x10
 [<ffffffffa0cc16cb>] ll_glimpse_size+0x2b/0x70 [lustre]
 [<ffffffffa0cc6b88>] ll_inode_revalidate_it+0x198/0x1a0 [lustre]
 [<ffffffffa0cc6bce>] ll_getattr_it+0x3e/0x160 [lustre]
 [<ffffffffa0cc6d1f>] ll_getattr+0x2f/0x40 [lustre]
 [<ffffffff81161b77>] vfs_fstatat+0x67/0xb0
 [<ffffffff81161c4f>] sys_newlstat+0x1f/0x50
 [<ffffffff8147d792>] system_call_fastpath+0x16/0x1b
 [<00007fffece04df5>] 0x7fffece04df5

Comment by Zhenyu Xu [ 13/Mar/15 ]

Can you upload the backtrace of all processes?
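
For example (a sketch; either path works, depending on whether kdb is still usable on the node):

 # from the kdb prompt: backtrace all processes
 [0]kdb> bta
 # or, from a shell, dump every task's stack to the kernel ring buffer
 echo t > /proc/sysrq-trigger
 dmesg > /tmp/all-backtraces.txt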

Comment by Mahmoud Hanafi [ 13/Mar/15 ]

Found the issue. It was a connectivity problem between the client and a single OST. Please close this ticket.
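
For the record, a minimal sketch of how per-OST connectivity can be checked from the client (standard Lustre tools; output details vary by version):

 # report any OSTs the client cannot currently reach
 lfs check osts
 # show the import state of each OSC; anything other than FULL indicates a problem
 lctl get_param osc.*.import | grep state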

Comment by Peter Jones [ 13/Mar/15 ]

ok - thanks Mahmoud!
