[LU-7067] vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed Created: 31/Aug/15 Updated: 07/Sep/17 Resolved: 22/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Hyperion/SWL |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Running SWL clients timeout and LBUG after server (OST) failover Aug 31 11:22:37 iwc98 kernel: Lustre: 5465:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 8 previous similar messages Aug 31 11:23:04 iwc98 kernel: Lustre: lustre-OST000a-osc-ffff8808713e3c00: Connection restored to lustre-OST000a (at 192.168.120.24@o2ib) Aug 31 11:33:32 iwc98 kernel: Lustre: 5457:0:(client.c:2014:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1441045957/real 1441045957] req@ffff8806e1ec2080 x1510687803291588/t0(0) o8->stzfs-OST000a-osc-ffff880871117c00@192.168.120.23@o2ib:28/4 lens 520/544 e 0 to 1 dl 1441046012 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Aug 31 11:33:32 iwc98 kernel: Lustre: 5457:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 32 previous similar messages Aug 31 11:34:55 iwc98 kernel: LustreError: 11-0: lustre-OST000a-osc-ffff8808713e3c00: operation ost_write to node 192.168.120.24@o2ib failed: rc = -107 Aug 31 11:34:55 iwc98 kernel: Lustre: lustre-OST000a-osc-ffff8808713e3c00: Connection to lustre-OST000a (at 192.168.120.24@o2ib) was lost; in progress operations using this service will wait for recovery to complete Aug 31 11:34:55 iwc98 kernel: Lustre: Skipped 2 previous similar messages Aug 31 11:35:19 iwc98 kernel: LustreError: 167-0: lustre-OST000a-osc-ffff8808713e3c00: This client was evicted by lustre-OST000a; in progress operations using this service will fail. Aug 31 11:35:22 iwc98 kernel: LustreError: 107401:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: Aug 31 11:35:22 iwc98 kernel: LustreError: 107401:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) LBUG Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) LBUG Aug 31 11:35:22 iwc98 kernel: Pid: 107399, comm: ior Aug 31 11:35:22 iwc98 kernel: Aug 31 11:35:22 iwc98 kernel: Call Trace: Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0470875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0470e77>] lbug_with_loc+0x47/0xb0 [libcfs] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a5f27c>] vvp_io_write_start+0x5bc/0x5c0 [lustre] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa047cbf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa060444a>] cl_io_start+0x6a/0x140 [obdclass] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0608994>] cl_io_loop+0xb4/0x1b0 [obdclass] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a077b7>] ll_file_io_generic+0x317/0xab0 [lustre] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a09c9b>] ll_file_aio_write+0x20b/0x860 [lustre] Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a0a41b>] ll_file_write+0x12b/0x260 [lustre] Aug 31 11:35:22 iwc98 kernel: [<ffffffff8118e3a8>] vfs_write+0xb8/0x1a0 Aug 31 11:35:22 iwc98 kernel: [<ffffffff8118ed71>] sys_write+0x51/0x90 Aug 31 11:35:22 iwc98 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Cliff White (Inactive) [ 31/Aug/15 ] |
|
Lustre log dump by LBUG |
| Comment by Zhenyu Xu [ 01/Sep/15 ] |
$ cat iwc98.lbug.txt Debug log: 431576 lines, 431576 kept, 0 dropped, 0 bad. lustre log does not have relevant info. |
| Comment by Cliff White (Inactive) [ 04/Nov/15 ] |
|
Ran 48 hours SWL- did not recur |
| Comment by James A Simmons [ 28/Mar/17 ] |
|
Just started to see this bug in production with latest lustre 2.8. Its one particular app and the bug only manifest some times. I will see if I can collect some logs for you |
| Comment by Zhenyu Xu [ 12/May/17 ] |
|
it looks like |
| Comment by James A Simmons [ 22/Jun/17 ] |
|
Yes it is |