Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7067

vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • Hyperion/SWL
    • 3
    • 9223372036854775807

    Description

      Running SWL clients timeout and LBUG after server (OST) failover

      Aug 31 11:22:37 iwc98 kernel: Lustre: 5465:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      Aug 31 11:23:04 iwc98 kernel: Lustre: lustre-OST000a-osc-ffff8808713e3c00: Connection restored to lustre-OST000a (at 192.168.120.24@o2ib)
      Aug 31 11:33:32 iwc98 kernel: Lustre: 5457:0:(client.c:2014:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1441045957/real 1441045957]  req@ffff8806e1ec2080 x1510687803291588/t0(0) o8->stzfs-OST000a-osc-ffff880871117c00@192.168.120.23@o2ib:28/4 lens 520/544 e 0 to 1 dl 1441046012 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Aug 31 11:33:32 iwc98 kernel: Lustre: 5457:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 32 previous similar messages
      Aug 31 11:34:55 iwc98 kernel: LustreError: 11-0: lustre-OST000a-osc-ffff8808713e3c00: operation ost_write to node 192.168.120.24@o2ib failed: rc = -107
      Aug 31 11:34:55 iwc98 kernel: Lustre: lustre-OST000a-osc-ffff8808713e3c00: Connection to lustre-OST000a (at 192.168.120.24@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Aug 31 11:34:55 iwc98 kernel: Lustre: Skipped 2 previous similar messages
      Aug 31 11:35:19 iwc98 kernel: LustreError: 167-0: lustre-OST000a-osc-ffff8808713e3c00: This client was evicted by lustre-OST000a; in progress operations using this service will fail.
      Aug 31 11:35:22 iwc98 kernel: LustreError: 107401:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
      Aug 31 11:35:22 iwc98 kernel: LustreError: 107401:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
      Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
      Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
      Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) LBUG
      Aug 31 11:35:22 iwc98 kernel: LustreError: 107399:0:(vvp_io.c:1076:vvp_io_write_start()) LBUG
      Aug 31 11:35:22 iwc98 kernel: Pid: 107399, comm: ior
      Aug 31 11:35:22 iwc98 kernel:
      Aug 31 11:35:22 iwc98 kernel: Call Trace:
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0470875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0470e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a5f27c>] vvp_io_write_start+0x5bc/0x5c0 [lustre]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa047cbf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa060444a>] cl_io_start+0x6a/0x140 [obdclass]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0608994>] cl_io_loop+0xb4/0x1b0 [obdclass]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a077b7>] ll_file_io_generic+0x317/0xab0 [lustre]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a09c9b>] ll_file_aio_write+0x20b/0x860 [lustre]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffffa0a0a41b>] ll_file_write+0x12b/0x260 [lustre]
      Aug 31 11:35:22 iwc98 kernel: [<ffffffff8118e3a8>] vfs_write+0xb8/0x1a0
      Aug 31 11:35:22 iwc98 kernel: [<ffffffff8118ed71>] sys_write+0x51/0x90
      Aug 31 11:35:22 iwc98 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Issue Links

          Activity

            [LU-7067] vvp_io.c:1076:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed

            Yes it is LU-9491. Since running the patch in production this problem has gone away. I pushed the fix to https://review.whamcloud.com/#/c/27782

            simmonsja James A Simmons added a comment - Yes it is LU-9491 . Since running the patch in production this problem has gone away. I pushed the fix to https://review.whamcloud.com/#/c/27782
            bobijam Zhenyu Xu added a comment -

            it looks like LU-9491.

            bobijam Zhenyu Xu added a comment - it looks like LU-9491 .

            Just started to see this bug in production with latest lustre 2.8. Its one particular app and the bug only manifest some times. I will see if I can collect some logs for you

            simmonsja James A Simmons added a comment - Just started to see this bug in production with latest lustre 2.8. Its one particular app and the bug only manifest some times. I will see if I can collect some logs for you

            Ran 48 hours SWL- did not recur

            cliffw Cliff White (Inactive) added a comment - Ran 48 hours SWL- did not recur
            bobijam Zhenyu Xu added a comment -
            $ cat iwc98.lbug.txt 
            Debug log: 431576 lines, 431576 kept, 0 dropped, 0 bad.
            

            lustre log does not have relevant info.

            bobijam Zhenyu Xu added a comment - $ cat iwc98.lbug.txt Debug log: 431576 lines, 431576 kept, 0 dropped, 0 bad. lustre log does not have relevant info.

            Lustre log dump by LBUG

            cliffw Cliff White (Inactive) added a comment - Lustre log dump by LBUG

            People

              bobijam Zhenyu Xu
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: