[LU-3655] Reoccurrence of permanent eviction scenario Created: 29/Jul/13 Updated: 18/Jul/17 Resolved: 18/Jul/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9404 |
| Description |
|
Hi, I am afraid we suffer again from the issue described in So those 5 patches might not be enough to fix this problem.

Here is the information collected from the crash:

crash> dmesg
...
LustreError: 65257:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65257:0:(cl_io.c:967:cl_io_cancel()) Canceling ongoing page trasmission
...

crash> ps | grep 65257
  65257      2   5  ffff880fe2ac27d0  IN   0.0       0      0  [ldlm_bl_62]

crash> bt 65257
PID: 65257  TASK: ffff880fe2ac27d0  CPU: 5   COMMAND: "ldlm_bl_62"
 #0 [ffff880fe32a7ae0] schedule at ffffffff81484c15
 #1 [ffff880fe32a7ba8] cfs_waitq_wait at ffffffffa055a6de [libcfs]
 #2 [ffff880fe32a7bb8] cl_sync_io_wait at ffffffffa067f3cb [obdclass]
 #3 [ffff880fe32a7c58] cl_io_submit_sync at ffffffffa067f643 [obdclass]
 #4 [ffff880fe32a7cb8] cl_lock_page_out at ffffffffa0676997 [obdclass]
 #5 [ffff880fe32a7d28] osc_lock_flush at ffffffffa0a6abaf [osc]
 #6 [ffff880fe32a7d78] osc_lock_cancel at ffffffffa0a6acbf [osc]
 #7 [ffff880fe32a7dc8] cl_lock_cancel0 at ffffffffa0675575 [obdclass]
 #8 [ffff880fe32a7df8] cl_lock_cancel at ffffffffa067639b [obdclass]
 #9 [ffff880fe32a7e18] osc_ldlm_blocking_ast at ffffffffa0a6bd9a [osc]
#10 [ffff880fe32a7e88] ldlm_handle_bl_callback at ffffffffa07a0293 [ptlrpc]
#11 [ffff880fe32a7eb8] ldlm_bl_thread_main at ffffffffa07a06d1 [ptlrpc]
#12 [ffff880fe32a7f48] kernel_thread at ffffffff8100412a

crash> dmesg | grep 'SYNC IO'
LustreError: 3140:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 63611:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65257:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65316:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65235:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65277:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 63605:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages

Sebastien. |
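The repeated error code -110 in the SYNC IO messages is the negated Linux errno value 110, i.e. ETIMEDOUT, which is consistent with the lock callback timing out. A quick sanity check (a minimal sketch, assuming a Linux client):

```python
import errno
import os

# Lustre returns negated errno values; -110 therefore maps to errno 110.
code = 110
print(errno.errorcode[code])   # symbolic name of the errno
print(os.strerror(code))       # human-readable message
```

On Linux this prints `ETIMEDOUT` and "Connection timed out".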
| Comments |
| Comment by Peter Jones [ 29/Jul/13 ] |
|
Niu Could you please comment on this one? Thanks Peter |
| Comment by Niu Yawei (Inactive) [ 30/Jul/13 ] |
|
Hi Sebastien, Is there any log from the OST? Are there any other abnormal messages in the client log besides the "SYNC IO failed with error: -110 ..." ones? Thanks. |
| Comment by Diego Moreno (Inactive) [ 08/Aug/13 ] |
|
2 files in attachment:
No more information in the log... and nothing on the MDS |
| Comment by Niu Yawei (Inactive) [ 09/Aug/13 ] |
|
There are only a few lines of messages in the attached logs. It looks like the client has been evicted by the OST, so the sync write to the OST failed, but I can't see from the log why the client was evicted. Maybe there was a network problem between the client and the OST? |
| Comment by Alexandre Louvet [ 23/Aug/13 ] |
|
Niu, Unfortunately, we don't have more info in the server log. The OSS is quiet for hours until we get the lock callback timeout. There is nothing in the OSS syslog before the callback message, and the physical network (InfiniBand) shows no errors. Note that we are running Lustre 2.1.5 plus some patches that touch the networking code:
Alex. |
| Comment by Alexandre Louvet [ 18/Apr/14 ] |
|
Back on stage. We are seeing this issue more and more often. Any idea of what we could collect? Regards, |
| Comment by Niu Yawei (Inactive) [ 21/Apr/14 ] |
|
I think you should collect logs from both the client and the OSS, and get a full stack trace on the client. |
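The collection suggested above could be scripted along these lines; a minimal sketch, assuming a stock Lustre client with `lctl` installed and root access (the output file paths are illustrative, and the function is a no-op on machines without the Lustre tools):

```python
import shutil
import subprocess

def collect_client_debug():
    """Dump the Lustre debug buffer and all kernel task stacks on a client."""
    if shutil.which("lctl") is None:
        return "lctl not found; nothing collected"
    # Widen the Lustre debug mask, then dump the kernel debug buffer.
    subprocess.run(["lctl", "set_param", "debug=-1"], check=True)
    subprocess.run(["lctl", "dk", "/tmp/lustre-client-debug.log"], check=True)
    # 'echo t > /proc/sysrq-trigger' dumps every task's stack into dmesg,
    # which gives the full client stack traces requested here.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("t")
    return "collected"

print(collect_client_debug())
```

The sysrq-t dump in particular would show whether other threads are stuck in `cl_sync_io_wait()` like the `ldlm_bl_62` thread in the original backtrace.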
| Comment by Niu Yawei (Inactive) [ 18/Jul/17 ] |
|
Close old 2.1 issue. |