[LU-6346] reading a file for a client hangs but is ok from others clients Created: 06/Mar/15 Updated: 13/Mar/15 Resolved: 13/Mar/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Zhenyu Xu |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 17763 |
| Description |
|
Reading a specific file (fort.1261) hangs on one client but works from other clients. I got a debug trace from the client side for fid [0x20009104a:0x79c0:0x0]. I can upload the MDS debug logs to the FTP site if needed. |
| Comments |
| Comment by Mahmoud Hanafi [ 06/Mar/15 ] |
|
Uploaded the MDS debug log to /uploads/LU6346/mdsdebug.out.gz |
| Comment by Peter Jones [ 06/Mar/15 ] |
|
Bobijam, could you please advise on this issue? Thanks, Peter |
| Comment by Zhenyu Xu [ 09/Mar/15 ] |
|
Do the other "ok" clients have the same Lustre/kernel version as the problematic client? Can you get the stack trace of the hung process? |
| Comment by Mahmoud Hanafi [ 12/Mar/15 ] |
|
It looks like a process is stuck trying to open the file:
[0]kdb> btp 6718
Stack traceback for pid 6718
0xffff8809acffc2c0 6718 63813 0 11 D 0xffff8809acffc930 read
[<ffffffff8147356b>] thread_return+0x0/0x295
[<ffffffff81475725>] rwsem_down_failed_common+0xb5/0x160
[<ffffffff81273e44>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81474a1e>] down_read+0xe/0x10
[<ffffffffa0cc16cb>] ll_glimpse_size+0x2b/0x70 [lustre]
[<ffffffffa0cc6b88>] ll_inode_revalidate_it+0x198/0x1a0 [lustre]
[<ffffffffa0cc6bce>] ll_getattr_it+0x3e/0x160 [lustre]
[<ffffffffa0cc6d1f>] ll_getattr+0x2f/0x40 [lustre]
[<ffffffff81161dc7>] vfs_fstat+0x37/0x60
[<ffffffff81161e0f>] sys_newfstat+0x1f/0x50
[<ffffffff8147d792>] system_call_fastpath+0x16/0x1b
[<00007fffed1a2d84>] 0x7fffed1a2d84
Running ls -lr gets stuck in the same way:
[0]kdb> btp 65618
Stack traceback for pid 65618
0xffff880937de6080 65618 62532 0 23 D 0xffff880937de66f0 ls
[<ffffffff8147356b>] thread_return+0x0/0x295
[<ffffffff81475725>] rwsem_down_failed_common+0xb5/0x160
[<ffffffff81273e44>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81474a1e>] down_read+0xe/0x10
[<ffffffffa0cc16cb>] ll_glimpse_size+0x2b/0x70 [lustre]
[<ffffffffa0cc6b88>] ll_inode_revalidate_it+0x198/0x1a0 [lustre]
[<ffffffffa0cc6bce>] ll_getattr_it+0x3e/0x160 [lustre]
[<ffffffffa0cc6d1f>] ll_getattr+0x2f/0x40 [lustre]
[<ffffffff81161b77>] vfs_fstatat+0x67/0xb0
[<ffffffff81161c4f>] sys_newlstat+0x1f/0x50
[<ffffffff8147d792>] system_call_fastpath+0x16/0x1b
[<00007fffece04df5>] 0x7fffece04df5
|
| Comment by Zhenyu Xu [ 13/Mar/15 ] |
|
Can you upload the backtrace of all processes? |
| Comment by Mahmoud Hanafi [ 13/Mar/15 ] |
|
Found the issue. It was a connectivity issue between the client and a single OST. Please close this ticket. |
| Comment by Peter Jones [ 13/Mar/15 ] |
|
ok - thanks Mahmoud! |