LustreError: 19435:0:(vvp_io.c:1056:vvp_io_write_start()) LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.10.6, Lustre 2.12.2
    • Labels: None
    • Environment: Server: PowerEdge R640 with 64 GB memory and Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
      OS: CentOS 7.5.1804
      Lustre client: 2.12.2
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We are running our Lustre file system on 1 MDS and 8 OSS nodes, with Lustre 2.10.6 on both the servers and the clients.

      One of the clients exports Lustre via NFSv3 and SMB. This worked fine for more than a year, but recently that client started to crash due to a Lustre bug, as follows:

       

      [ 2014.148312] LustreError: 19435:0:(vvp_io.c:1056:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: ki_pos 1209601876 [1209597952, 1210056704)
      [ 2014.148338] LustreError: 19435:0:(vvp_io.c:1056:vvp_io_write_start()) LBUG
      [ 2014.148352] Pid: 19435, comm: nfsd 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
      [ 2014.148353] Call Trace:
      [ 2014.148376] [<ffffffffc0a0d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [ 2014.148389] [<ffffffffc0a0d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 2014.148394] [<ffffffffc1061270>] vvp_io_write_start+0x790/0x820 [lustre]
      [ 2014.148419] [<ffffffffc0cb5328>] cl_io_start+0x68/0x130 [obdclass]
      [ 2014.148449] [<ffffffffc0cb74fc>] cl_io_loop+0xcc/0x1c0 [obdclass]
      [ 2014.148462] [<ffffffffc101765b>] ll_file_io_generic+0x63b/0xcb0 [lustre]
      [ 2014.148470] [<ffffffffc10182f2>] ll_file_aio_write+0x442/0x590 [lustre]
      [ 2014.148476] [<ffffffff8d040e6b>] do_sync_readv_writev+0x7b/0xd0
      [ 2014.148480] [<ffffffff8d042aae>] do_readv_writev+0xce/0x260
      [ 2014.148482] [<ffffffff8d042cd5>] vfs_writev+0x35/0x60
      [ 2014.148484] [<ffffffffc0699f90>] nfsd_vfs_write+0xc0/0x3a0 [nfsd]
      [ 2014.148492] [<ffffffffc069c962>] nfsd_write+0x112/0x2a0 [nfsd]
      [ 2014.148498] [<ffffffffc06a3070>] nfsd3_proc_write+0xc0/0x160 [nfsd]
      [ 2014.148504] [<ffffffffc0694810>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [ 2014.148509] [<ffffffffc0610cf3>] svc_process_common+0x493/0x760 [sunrpc]
      [ 2014.148523] [<ffffffffc06110c3>] svc_process+0x103/0x190 [sunrpc]
      [ 2014.148531] [<ffffffffc069416f>] nfsd+0xdf/0x150 [nfsd]
      [ 2014.148535] [<ffffffff8cec1da1>] kthread+0xd1/0xe0
      [ 2014.148539] [<ffffffff8d575c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [ 2014.148543] [<ffffffffffffffff>] 0xffffffffffffffff
      [ 2014.148551] Kernel panic - not syncing: LBUG
      [ 2014.148561] CPU: 2 PID: 19435 Comm: nfsd Kdump: loaded Tainted: G OE ------------ 3.10.0-957.21.3.el7.x86_64 #1
      [ 2014.148579] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.8 05/21/2018
      [ 2014.148592] Call Trace:
      [ 2014.148603] [<ffffffff8d563107>] dump_stack+0x19/0x1b
      [ 2014.148615] [<ffffffff8d55c810>] panic+0xe8/0x21f
      [ 2014.148629] [<ffffffffc0a0d8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [ 2014.148650] [<ffffffffc1061270>] vvp_io_write_start+0x790/0x820 [lustre]
      [ 2014.148675] [<ffffffffc0cb3357>] ? cl_lock_request+0x67/0x1f0 [obdclass]
      [ 2014.148699] [<ffffffffc0cb5328>] cl_io_start+0x68/0x130 [obdclass]
      [ 2014.148722] [<ffffffffc0cb74fc>] cl_io_loop+0xcc/0x1c0 [obdclass]
      [ 2014.148739] [<ffffffffc101765b>] ll_file_io_generic+0x63b/0xcb0 [lustre]
      [ 2014.148753] [<ffffffff8ced3250>] ? check_preempt_curr+0x80/0xa0
      [ 2014.148771] [<ffffffffc10182f2>] ll_file_aio_write+0x442/0x590 [lustre]
      [ 2014.148784] [<ffffffff8d040e6b>] do_sync_readv_writev+0x7b/0xd0
      [ 2014.148914] [<ffffffff8d042aae>] do_readv_writev+0xce/0x260
      [ 2014.149049] [<ffffffffc1017eb0>] ? ll_file_splice_read+0x1e0/0x1e0 [lustre]
      [ 2014.149185] [<ffffffffc1018440>] ? ll_file_aio_write+0x590/0x590 [lustre]
      [ 2014.149318] [<ffffffff8d11e003>] ? ima_get_action+0x23/0x30
      [ 2014.149447] [<ffffffff8d11d51e>] ? process_measurement+0x8e/0x250
      [ 2014.149578] [<ffffffff8d03f087>] ? do_dentry_open+0x1e7/0x2e0
      [ 2014.149708] [<ffffffff8d042cd5>] vfs_writev+0x35/0x60
      [ 2014.149841] [<ffffffffc0699f90>] nfsd_vfs_write+0xc0/0x3a0 [nfsd]
      [ 2014.149975] [<ffffffffc069c962>] nfsd_write+0x112/0x2a0 [nfsd]
      [ 2014.150109] [<ffffffffc06a3070>] nfsd3_proc_write+0xc0/0x160 [nfsd]
      [ 2014.150243] [<ffffffffc0694810>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [ 2014.150381] [<ffffffffc0610cf3>] svc_process_common+0x493/0x760 [sunrpc]
      [ 2014.150489] LustreError: 19462:0:(vvp_io.c:1056:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: ki_pos 1211699028 [1211695104, 1212153856)
      [ 2014.150491] LustreError: 19462:0:(vvp_io.c:1056:vvp_io_write_start()) LBUG
      [ 2014.150492] Pid: 19462, comm: nfsd 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
      [ 2014.150492] Call Trace:
      [ 2014.150514] [<ffffffffc0a0d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [ 2014.150519] [<ffffffffc0a0d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 2014.150533] [<ffffffffc1061270>] vvp_io_write_start+0x790/0x820 [lustre]
      [ 2014.150551] [<ffffffffc0cb5328>] cl_io_start+0x68/0x130 [obdclass]
      [ 2014.150564] [<ffffffffc0cb74fc>] cl_io_loop+0xcc/0x1c0 [obdclass]
      [ 2014.150571] [<ffffffffc101765b>] ll_file_io_generic+0x63b/0xcb0 [lustre]
      [ 2014.150577] [<ffffffffc10182f2>] ll_file_aio_write+0x442/0x590 [lustre]
      [ 2014.150580] [<ffffffff8d040e6b>] do_sync_readv_writev+0x7b/0xd0
      [ 2014.150581] [<ffffffff8d042aae>] do_readv_writev+0xce/0x260
      [ 2014.150583] [<ffffffff8d042cd5>] vfs_writev+0x35/0x60
      [ 2014.150589] [<ffffffffc0699f90>] nfsd_vfs_write+0xc0/0x3a0 [nfsd]
      [ 2014.150594] [<ffffffffc069c962>] nfsd_write+0x112/0x2a0 [nfsd]
      [ 2014.150599] [<ffffffffc06a3070>] nfsd3_proc_write+0xc0/0x160 [nfsd]
      [ 2014.150603] [<ffffffffc0694810>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [ 2014.150613] [<ffffffffc0610cf3>] svc_process_common+0x493/0x760 [sunrpc]
      [ 2014.150621] [<ffffffffc06110c3>] svc_process+0x103/0x190 [sunrpc]
      [ 2014.150625] [<ffffffffc069416f>] nfsd+0xdf/0x150 [nfsd]
      [ 2014.150627] [<ffffffff8cec1da1>] kthread+0xd1/0xe0
      [ 2014.150630] [<ffffffff8d575c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [ 2014.150634] [<ffffffffffffffff>] 0xffffffffffffffff
      [ 2014.152515] LustreError: 19480:0:(vvp_io.c:1056:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: ki_pos 1213796180 [1213792256, 1214251008)
      [ 2014.152517] LustreError: 19480:0:(vvp_io.c:1056:vvp_io_write_start()) LBUG
      [ 2014.152518] Pid: 19480, comm: nfsd 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
      [ 2014.152519] Call Trace:
      [ 2014.152542] [<ffffffffc0a0d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [ 2014.152548] [<ffffffffc0a0d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 2014.152569] [<ffffffffc1061270>] vvp_io_write_start+0x790/0x820 [lustre]
      [ 2014.152593] [<ffffffffc0cb5328>] cl_io_start+0x68/0x130 [obdclass]
      [ 2014.152610] [<ffffffffc0cb74fc>] cl_io_loop+0xcc/0x1c0 [obdclass]
      [ 2014.152620] [<ffffffffc101765b>] ll_file_io_generic+0x63b/0xcb0 [lustre]
      [ 2014.152630] [<ffffffffc10182f2>] ll_file_aio_write+0x442/0x590 [lustre]
      [ 2014.152632] [<ffffffff8d040e6b>] do_sync_readv_writev+0x7b/0xd0
      [ 2014.152634] [<ffffffff8d042aae>] do_readv_writev+0xce/0x260
      [ 2014.152635] [<ffffffff8d042cd5>] vfs_writev+0x35/0x60
      [ 2014.152643] [<ffffffffc0699f90>] nfsd_vfs_write+0xc0/0x3a0 [nfsd]
      [ 2014.152649] [<ffffffffc069c962>] nfsd_write+0x112/0x2a0 [nfsd]
      [ 2014.152655] [<ffffffffc06a3070>] nfsd3_proc_write+0xc0/0x160 [nfsd]
      [ 2014.152661] [<ffffffffc0694810>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [ 2014.152671] [<ffffffffc0610cf3>] svc_process_common+0x493/0x760 [sunrpc]
      [ 2014.152679] [<ffffffffc06110c3>] svc_process+0x103/0x190 [sunrpc]
      [ 2014.152685] [<ffffffffc069416f>] nfsd+0xdf/0x150 [nfsd]
      [ 2014.152687] [<ffffffff8cec1da1>] kthread+0xd1/0xe0
      [ 2014.152689] [<ffffffff8d575c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [ 2014.152693] [<ffffffffffffffff>] 0xffffffffffffffff
      [ 2014.157437] [<ffffffffc06110c3>] svc_process+0x103/0x190 [sunrpc]
      [ 2014.157572] [<ffffffffc069416f>] nfsd+0xdf/0x150 [nfsd]
      [ 2014.157704] [<ffffffffc0694090>] ? nfsd_destroy+0x80/0x80 [nfsd]
      [ 2014.157835] [<ffffffff8cec1da1>] kthread+0xd1/0xe0
      [ 2014.157963] [<ffffffff8cec1cd0>] ? insert_kthread_work+0x40/0x40
      [ 2014.158094] [<ffffffff8d575c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [ 2014.158224] [<ffffffff8cec1cd0>] ? insert_kthread_work+0x40/0x40
      (END)

       

      We have updated that client to Lustre 2.12.2, but it did not help.
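
The three assertion messages in the log can be compared numerically. The sketch below is not part of the original report: `parse_assertion` and `SAMPLE_LINES` are names made up for illustration, and it assumes the bracketed pair in each message is the half-open write range [pos, pos + count). It only restates the numbers already in the log and does not diagnose the cause.

```python
import re

# Matches the tail of the LASSERT message: "ki_pos N [START, END)".
ASSERT_RE = re.compile(r"ki_pos (?P<ki_pos>\d+) \[(?P<start>\d+), (?P<end>\d+)\)")

# The three assertion tails copied from the log above.
SAMPLE_LINES = [
    "failed: ki_pos 1209601876 [1209597952, 1210056704)",
    "failed: ki_pos 1211699028 [1211695104, 1212153856)",
    "failed: ki_pos 1213796180 [1213792256, 1214251008)",
]


def parse_assertion(line):
    """Return the offsets from one assertion line, or None if it does not match."""
    match = ASSERT_RE.search(line)
    if match is None:
        return None
    ki_pos = int(match.group("ki_pos"))
    start = int(match.group("start"))
    end = int(match.group("end"))
    return {
        "ki_pos": ki_pos,
        "start": start,
        "end": end,
        "delta": ki_pos - start,       # how far ki_pos is ahead of the range start
        "length": end - start,         # size of the range (assumed to be the write count)
        "start_mod_4k": start % 4096,  # 0 means the range start is 4 KiB aligned
    }


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_assertion(line))
```

For all three messages the delta is 3924 bytes, the range length is 458752 bytes (448 KiB), and the range start is 4 KiB aligned, so ki_pos is ahead of the expected position by the same amount each time.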

          Activity

            [LU-12503] LustreError: 19435:0:(vvp_io.c:1056:vvp_io_write_start()) LBUG

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37034
            Subject: LU-12503 llite: file write pos mimatch
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7e79fc11ed73b291b8e7a4805b3f1144d71ff83f

            pjones Peter Jones added a comment -

            Landed for 2.14

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36021/
            Subject: LU-12503 llite: file write pos mimatch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1d2aa1513dc4e65813ad0bea138966a55244dbde

            bobijam Zhenyu Xu added a comment -

            Yes, I hope the fix patch can handle the issue. I added the debug patch to catch more information in case that is not the right fix for this issue.

            pjones Peter Jones added a comment -

            bobijam

            It looks like you have turned your original debug patch into a fix and have now added a new debug patch. Are you hoping for halifu to use both of these?

            Peter

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/36021
            Subject: LU-12503 llite: debug file pos mimatch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bf11eb0f6c4ead18897b14d3ff2b8ef09a72f97a

            simmonsja James A Simmons added a comment -

            I agree with Patrick. It's a fix, but not one that handles this problem. I was hoping it might address this issue.

            pfarrell Patrick Farrell (Inactive) added a comment -

            Ah, right.

            James, I'm almost certain the code your patch touches is only called in the dump page cache path, which is strictly a special, extreme debug path, and there's basically no way it would be invoked here. Have I missed something? Otherwise it can't be the fix for this. (It's still correct and useful, it's just not a fix for this.)

            pjones Peter Jones added a comment -

            OK, so James's patch has landed to master. Can we port it to b2_12 so that halifu can verify whether it is a fix? Or, simmonsja, does the latest crash info confirm that this is indeed the issue?

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35765/
            Subject: LU-12503 vvp_dev: increment *pos in .next
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 02336a9a5d096dc9a603ed0e77e0c7cf7b41ffb3
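
The subject of the patch above, "increment *pos in .next", refers to the Linux seq_file iterator convention, in which the .next callback is expected to advance *pos so that a later .start call can resume from the right record. What follows is only a toy Python model of that convention, not Lustre or kernel code: `ITEMS`, `start`, `next_correct`, `next_buggy` and `read_all` are all hypothetical names, and the chunked reader is a deliberate simplification of how iteration is restarted between reads.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical records that the iterator walks over.
ITEMS = ["page-0", "page-1", "page-2", "page-3", "page-4"]

NextFn = Callable[[int, int], Tuple[Optional[int], int]]


def start(pos: int) -> Optional[int]:
    """Model of .start: map a position to a record index, or None at the end."""
    return pos if pos < len(ITEMS) else None


def next_correct(idx: int, pos: int) -> Tuple[Optional[int], int]:
    """Model of a well-behaved .next: advance pos, then look up the record."""
    pos += 1
    return start(pos), pos


def next_buggy(idx: int, pos: int) -> Tuple[Optional[int], int]:
    """Model of a .next that moves to the next record but forgets to advance pos."""
    nxt = idx + 1
    return (nxt if nxt < len(ITEMS) else None), pos


def read_all(next_fn: NextFn, records_per_read: int = 2, max_reads: int = 10) -> List[str]:
    """Read in small chunks; each chunk restarts the iteration from the saved pos."""
    out: List[str] = []
    pos = 0
    for _ in range(max_reads):  # bounded so the buggy case terminates
        idx = start(pos)
        if idx is None:
            break
        emitted = 0
        while idx is not None and emitted < records_per_read:
            out.append(ITEMS[idx])
            emitted += 1
            idx, pos = next_fn(idx, pos)
    return out


if __name__ == "__main__":
    print(read_all(next_correct))
    print(read_all(next_buggy)[:4])
```

In this model `next_correct` returns every record exactly once, while `next_buggy` keeps restarting from position 0 and repeats the first records until the read limit is hit. As Patrick's comment in this thread points out, that code is only reached through a debug path, so the sketch illustrates the patch and not the LBUG itself.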

            halifu Saerda Halifu (Inactive) added a comment -

            My server crashed today. I managed to upload the vmcore-dmesg.txt file.

            Let me know if you need more information.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: halifu Saerda Halifu (Inactive)
              Votes: 0
              Watchers: 8
