[LU-7067] vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed Created: 31/Aug/15  Updated: 07/Sep/17  Resolved: 22/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Hyperion/SWL


Attachments: File iwc98.lbug.txt.gz    
Issue Links:
Duplicate
Related
is related to LU-7382 (vvp_io.c:573:vvp_io_update_iov()) AS... Reopened
is related to LU-9491 v2.9: silent data corruption with wri... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running SWL clients timeout and LBUG after server (OST) failover

Aug 31 11:22:37 iwc98 kernel: Lustre: 5465:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Aug 31 11:23:04 iwc98 kernel: Lustre: lustre-OST000a-osc-ffff8808713e3c00: Connection restored to lustre-OST000a (at 192.168.120.24@o2ib)
Aug 31 11:33:32 iwc98 kernel: Lustre: 5457:0:(client.c:2014:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1441045957/real 1441045957]  req@ffff8806e1ec2080 x1510687803291588/t0(0) o8->stzfs-OST000a-osc-ffff880871117c00@192.168.120.23@o2ib:28/4 lens 520/544 e 0 to 1 dl 1441046012 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Aug 31 11:33:32 iwc98 kernel: Lustre: 5457:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 32 previous similar messages
Aug 31 11:34:55 iwc98 kernel: LustreError: 11-0: lustre-OST000a-osc-ffff8808713e3c00: operation ost_write to node 192.168.120.24@o2ib failed: rc = -107
Aug 31 11:34:55 iwc98 kernel: Lustre: lustre-OST000a-osc-ffff8808713e3c00: Connection to lustre-OST000a (at 192.168.120.24@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug 31 11:34:55 iwc98 kernel: Lustre: Skipped 2 previous similar messages
Aug 31 11:35:19 iwc98 kernel: LustreError: 167-0: lustre-OST000a-osc-ffff8808713e3c00: This client was evicted by lustre-OST000a; in progress operations using this service will fail.
Aug 31 11:35:22 iwc98 kernel: LustreError: 107401:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
Aug 31 11:35:22 iwc98 kernel: LustreError: 107401:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) LBUG
Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) LBUG
Aug 31 11:35:22 iwc98 kernel: Pid: 107399, comm: ior
Aug 31 11:35:22 iwc98 kernel:
Aug 31 11:35:22 iwc98 kernel: Call Trace:
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0470875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0470e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a5f27c>] vvp_io_write_start+0x5bc/0x5c0 [lustre]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa047cbf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa060444a>] cl_io_start+0x6a/0x140 [obdclass]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0608994>] cl_io_loop+0xb4/0x1b0 [obdclass]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a077b7>] ll_file_io_generic+0x317/0xab0 [lustre]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a09c9b>] ll_file_aio_write+0x20b/0x860 [lustre]
Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a0a41b>] ll_file_write+0x12b/0x260 [lustre]
Aug 31 11:35:22 iwc98 kernel: [<ffffffff8118e3a8>] vfs_write+0xb8/0x1a0
Aug 31 11:35:22 iwc98 kernel: [<ffffffff8118ed71>] sys_write+0x51/0x90
Aug 31 11:35:22 iwc98 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Cliff White (Inactive) [ 31/Aug/15 ]

Lustre log dump by LBUG

Comment by Zhenyu Xu [ 01/Sep/15 ]
$ cat iwc98.lbug.txt 
Debug log: 431576 lines, 431576 kept, 0 dropped, 0 bad.

lustre log does not have relevant info.

Comment by Cliff White (Inactive) [ 04/Nov/15 ]

Ran 48 hours SWL- did not recur

Comment by James A Simmons [ 28/Mar/17 ]

Just started to see this bug in production with latest lustre 2.8. Its one particular app and the bug only manifest some times. I will see if I can collect some logs for you

Comment by Zhenyu Xu [ 12/May/17 ]

it looks like LU-9491.

Comment by James A Simmons [ 22/Jun/17 ]

Yes it is LU-9491. Since running the patch in production this problem has gone away. I pushed the fix to https://review.whamcloud.com/#/c/27782

Generated at Sat Feb 10 02:05:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.