[LU-4552] osc_cache.c:899:osc_extent_wait() timeout quite often Created: 28/Jan/14 Updated: 05/Mar/14 Resolved: 07/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 12438 | ||||||||
| Description |
|
We hit client hangs quite often on the all login nodes and following Lustre error messages printed out. It can't be recovery until client reboots. Jan 22 17:23:23 ff01 kernel: LustreError: 84026:0,(osc_cache.c:899:osc_extent_wait()) extent ffff8831a49b0678@{[0 > 0/255], [3|0|+|rpc|wihY|ffff88283005bc48], [4096|1|+||ffff8828fb76b228|256|ffff88319695e040]} home2-OST000b-osc-ffff883fdbbd8800: wait ext to 0 timedout, recovery in progress?
|
| Comments |
| Comment by Peter Jones [ 28/Jan/14 ] |
|
Hongchao Could you please advise on this ticket? Thanks Peter |
| Comment by Shuichi Ihara (Inactive) [ 28/Jan/14 ] |
|
OK, reproduced many times. And we confirmed this hang happened when a software installer (install binaries to Lustre) is running and "du" command is also running as background job on same client. As far as we tested, finally, the same hang happened (running installer and du commands were hang). However, the filesystem was still accessible. (It was possible to do IO to Lustre and it worked) So, this is different situation before apply However, we are still hitting application hangs (not rater than node hang) even applied btw, we saw following call trace messages before osc_extent_wait() timeout messages, but this doesn't mapper whether applied Jan 29 03:30:51 ff04 kernel: INFO: task ldlm_bl_32:122671 blocked for more than 120 seconds. Jan 29 03:30:51 ff04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 29 03:30:51 ff04 kernel: ldlm_bl_32 D 000000000000000b 0 122671 2 0x00000000 Jan 29 03:30:51 ff04 kernel: ffff883f1f083ce0 0000000000000046 0000000000000000 0000000100000020 Jan 29 03:30:51 ff04 kernel: 52e7f60b0000000b 0000000000052daf 0001df2f00000000 0000015f00000000 Jan 29 03:30:51 ff04 kernel: ffff883f1f081ab8 ffff883f1f083fd8 000000000000fb88 ffff883f1f081ab8 Jan 29 03:30:51 ff04 kernel: Call Trace: Jan 29 03:30:51 ff04 kernel: [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180 Jan 29 03:30:51 ff04 kernel: [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50 Jan 29 03:30:51 ff04 kernel: [<ffffffffa0d4225f>] cl_lock_mutex_get+0x6f/0xd0 [obdclass] Jan 29 03:30:51 ff04 kernel: [<ffffffffa1286b9a>] osc_ldlm_blocking_ast+0x7a/0x350 [osc] Jan 29 03:30:51 ff04 kernel: [<ffffffffa0c75a81>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jan 29 03:30:51 ff04 kernel: [<ffffffffa0f06680>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc] Jan 29 03:30:51 ff04 kernel: [<ffffffffa0f06bb1>] ldlm_bl_thread_main+0x261/0x3c0 [ptlrpc] Jan 29 03:30:51 ff04 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20 Jan 29 03:30:51 ff04 kernel: [<ffffffffa0f06950>] ? ldlm_bl_thread_main+0x0/0x3c0 [ptlrpc] Jan 29 03:30:51 ff04 kernel: [<ffffffff81096916>] kthread+0x96/0xa0 Jan 29 03:30:51 ff04 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jan 29 03:30:51 ff04 kernel: [<ffffffff81096880>] ? kthread+0x0/0xa0 Jan 29 03:30:51 ff04 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Jinshan Xiong (Inactive) [ 29/Jan/14 ] |
|
Will you show us the backtrace of all processes on the node when this occurs? |
| Comment by Shuichi Ihara (Inactive) [ 29/Jan/14 ] |
|
Hi Jinshan, |
| Comment by Jinshan Xiong (Inactive) [ 30/Jan/14 ] |
|
is it easy to be reproduced? In that case, it'll be a good idea to share us the reproduce program. |
| Comment by Hongchao Zhang [ 30/Jan/14 ] |
|
HI Ihara, do you use rpm as the software installer? Thanks |
| Comment by Shuichi Ihara (Inactive) [ 30/Jan/14 ] |
|
Jinshan, Hongchao |
| Comment by Shuichi Ihara (Inactive) [ 30/Jan/14 ] |
|
Here is backtrace of 1) before dump calltrace, 2) after dump calltrace and 3) after printout of osc_extent_wait() timeout messages. |
| Comment by Bruno Faccini (Inactive) [ 31/Jan/14 ] |
|
Hello Shuichi, |
| Comment by Shuichi Ihara (Inactive) [ 31/Jan/14 ] |
|
Thanks, Bruno! |
| Comment by Bruno Faccini (Inactive) [ 07/Feb/14 ] |
|
Hello Shuichi, do you agree if I close this ticket as a dup of |
| Comment by Shuichi Ihara (Inactive) [ 07/Feb/14 ] |
|
Bruno, Yes, as far as we tested, I think it's duplicated issue of |
| Comment by Peter Jones [ 07/Feb/14 ] |
|
Thanks Ihara |