[LU-12503] LustreError: 19435:0:(vvp_io.c:1056:vvp_io_write_start()) LBUG Created: 02/Jul/19 Updated: 01/Jun/20 Resolved: 14/Dec/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.6, Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Saerda Halifu | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Server: PowerEdge R640 with 64 GB memory and Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Epic/Theme: | NFS | ||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We are running our Lustre file system on 1 MDS and 8 OSS nodes, with Lustre 2.10.6 on the servers and clients. One of the clients exports Lustre via NFSv3 and SMB. It has been working fine for more than a year, but recently this client started to crash due to a Lustre bug, as follows:
[2014.148312] LustreError: 19435:0:(vvp_io.c:1056:vvp_io_write_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed: ki_pos 1209601876 [1209597952, 1210056704)
We have updated that client to lustre 2.12.2, but it did not help |
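The numbers in the assertion message above are easier to compare in hexadecimal. The following is a minimal sketch, not part of the original report, that decodes them. It assumes the bracketed range is `[pos, pos + count)` and that the client uses 4 KiB pages; the names `describe` and `PAGE_SIZE` are illustrative only.

```python
# Values copied verbatim from the assertion message in the description.
KI_POS = 1209601876       # vio->vui_iocb->ki_pos
RANGE_START = 1209597952  # start of the bracketed range (assumed to be pos)
RANGE_END = 1210056704    # exclusive end of the bracketed range

PAGE_SIZE = 4096  # assumption: 4 KiB pages, typical for x86_64


def describe(ki_pos, start, end, page_size=PAGE_SIZE):
    """Derive the quantities needed to compare ki_pos with the expected range."""
    return {
        "ki_pos_hex": hex(ki_pos),
        "start_hex": hex(start),
        "end_hex": hex(end),
        "delta": ki_pos - start,        # how far ki_pos is past the expected pos
        "range_len": end - start,       # length of the IO range in bytes
        "start_page_aligned": start % page_size == 0,
        "ki_pos_page_aligned": ki_pos % page_size == 0,
    }


info = describe(KI_POS, RANGE_START, RANGE_END)
for name, value in info.items():
    print(f"{name}: {value}")
```

Under these assumptions, the expected position is page-aligned while ki_pos sits 3924 bytes (0xF54) past it, which is why the equality check in the assertion fails.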
| Comments |
| Comment by Saerda Halifu [ 05/Jul/19 ] |
|
Dear Zhenyu Xu,
Thanks for looking into this issue. Will it be possible to get a patch for this bug before the 2.13.0 release? Is there any way to avoid it? I would also very much like to know what is actually causing this bug.
Best Regards
Saerda |
| Comment by Zhenyu Xu [ 12/Jul/19 ] |
|
Observations so far:
thread 19435: iocb->ki_pos = 0x4819,0F54, io->u.ci_rw.crw_pos = 0x4819,0000, write count = 64K (0x1,0000)
thread 19462: iocb->ki_pos = 0x4839,0F54, io->u.ci_rw.crw_pos = 0x4839,0000, write count = 64K
thread 19480: iocb->ki_pos = 0x4859,0F54, io->u.ci_rw.crw_pos = 0x4859,0000, write count = 64K
I don't know why ki_pos is not updated in a page-aligned way (it should be, IMO). The iocb's ki_pos can be updated during __generic_file_write_iter() in vvp_io_write_start(), and is later updated again with crw_pos. This code was changed in commit 2b0a34fe43bf4fce5560af61a45e5393c96070a9; before that commit, ll_file_io_generic() used its own iocb and pos, and only advanced the caller's kiocb ki_pos by the number of bytes written after the IO loop had finished. I'm not sure whether that change is responsible (i.e. the root cause hasn't been found), but my hunch is that using variables local to the function should avoid the complexity. |
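The pattern described in the comment above (keep a function-local iocb and pos during the IO loop, and advance the caller's ki_pos only once at the end) can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not Lustre or kernel code: `Kiocb`, `lower_write`, and `write_with_local_pos` are hypothetical names, and the model says nothing about the actual root cause or the patch that eventually landed.

```python
from dataclasses import dataclass


@dataclass
class Kiocb:
    """Hypothetical stand-in for struct kiocb; only ki_pos is modelled."""
    ki_pos: int


def lower_write(iocb, pos, buf, backing):
    """Toy lower layer: check the ki_pos == pos invariant, then write.

    Mirrors the shape of the failed assertion and advances iocb.ki_pos
    by the number of bytes written.
    """
    assert iocb.ki_pos == pos, f"ki_pos {iocb.ki_pos} [{pos}, {pos + len(buf)})"
    backing[pos:pos + len(buf)] = buf
    iocb.ki_pos += len(buf)
    return len(buf)


def write_with_local_pos(caller_iocb, data, backing, chunk=4096):
    """Write data in chunks using a private iocb and pos.

    The caller's ki_pos is touched exactly once, after the loop, so
    nothing outside this function can move the position mid-loop.
    """
    local_iocb = Kiocb(caller_iocb.ki_pos)  # private copy used inside the loop
    pos = caller_iocb.ki_pos
    written = 0
    while written < len(data):
        buf = data[written:written + chunk]
        n = lower_write(local_iocb, pos, buf, backing)
        pos += n
        written += n
    caller_iocb.ki_pos += written  # single update after the IO loop
    return written


if __name__ == "__main__":
    backing = bytearray(16384)
    iocb = Kiocb(4096)
    count = write_with_local_pos(iocb, b"a" * 10240, backing)
    print(count, iocb.ki_pos)
```

In this model the invariant checked by `lower_write` can only break if something other than the loop itself changes the iocb it is handed, which is the kind of complexity the comment suggests avoiding with function-local variables.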
| Comment by Peter Jones [ 14/Jul/19 ] |
|
This issue was initially seen on 2.10.6 which pre-dates the Peter |
| Comment by Gerrit Updater [ 15/Jul/19 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/35516 |
| Comment by Zhenyu Xu [ 15/Jul/19 ] |
|
Hi Saerda Halifu, the debug patch (https://review.whamcloud.com/35516) is for the client only. Please apply it to one client node and see whether the issue hits again (hopefully it's easy to reproduce). If it does, collect the logs and upload them here. Thank you. |
| Comment by Saerda Halifu [ 15/Jul/19 ] |
|
Hi Zhenyu Xu,
Thanks, I will apply it, and will let you know. Best Regards
Saerda |
| Comment by Peter Jones [ 22/Jul/19 ] |
|
halifu any change in the frequency of the occurrence of the crash with the debug patch applied? |
| Comment by Peter Jones [ 31/Jul/19 ] |
|
halifu any news? |
| Comment by Saerda Halifu [ 01/Aug/19 ] |
|
Hi Peter,
Sorry for the late answer, I was away on vacation. I downloaded lustre-2.12.1-1.src.rpm, unpacked it, and replaced the vvp_io.c file with the new one from the patch. I was able to create a new source rpm, but I am not able to rebuild the Lustre client rpm from this new source. When I run: rpmbuild --rebuild --without servers lustre-2.12.1-1.el7.src.rpm I get the following error message: Making all in .
To me it looks like it fails before it gets to the changes in the code. Am I doing something wrong?
Best Regards Saerda |
| Comment by Peter Jones [ 01/Aug/19 ] |
|
I would have suggested that it would be simpler to just use the build products from the Jenkins build - https://build.whamcloud.com/job/lustre-reviews/67091/arch=x86_64,build_type=client,distro=el7.6,ib_stack=inkernel/artifact/artifacts/ - but I see that you are using RHEL 7.5, so perhaps building from the SRPMs in the same location would be easier. That, or temporarily move just one client to a kernel version that allows you to use the build products (then reset it back to what you want to use after running the test). |
| Comment by Saerda Halifu [ 05/Aug/19 ] |
|
Hi Peter, Thanks for the tips, I managed to get the right build and installed it. I have the following rpms installed on this Lustre client: rpm -qa |grep lustre
I have now started the NFS service on this Lustre client. I will see when/if it crashes; if it does, I will upload the dump.
Best Regards Saerda |
| Comment by Peter Jones [ 11/Aug/19 ] |
|
How frequently was the crash occurring before the debug patch was applied? |
| Comment by Gerrit Updater [ 11/Aug/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35765 |
| Comment by James A Simmons [ 11/Aug/19 ] |
|
Just wondering if this is a side effect of a bug found upstream that was fixed. Can you give it a try? |
| Comment by Saerda Halifu [ 12/Aug/19 ] |
|
Hi, My server has been running without problems for the last 6 days since I applied the debug patch. Before that, the server used to crash quite often, sometimes a couple of hours after I started the NFS export, sometimes after a day or so. I also went through the logs and didn't see anything suspicious.
I think this might have something to do with user activities, for example how users are reading and writing data to NFS exports.
Best Regards
Saerda
|
| Comment by Peter Jones [ 12/Aug/19 ] |
|
ok, well, considering that this issue has been seen on older versions too, I think that we should drop the priority from Blocker to Critical. |
| Comment by Saerda Halifu [ 15/Aug/19 ] |
|
My server crashed today. I managed to upload the vmcore-dmesg.txt file. Let me know if you need more information.
|
| Comment by Gerrit Updater [ 21/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35765/ |
| Comment by Peter Jones [ 21/Aug/19 ] |
|
ok so James's patch has landed to master. Can we port it to b2_12 so that halifu can verify whether it is a fix? Or simmonsja does the latest crash info confirm that this is indeed the issue? |
| Comment by Patrick Farrell (Inactive) [ 21/Aug/19 ] |
|
Ah, right. James, I'm almost certain the code your patch touches is only called in the dump page cache path, which is strictly a special, extreme debug path, and there's basically no way it would be invoked here. Have I missed something? Otherwise it can't be the fix for this. (It's still correct and useful, it's just not a fix for this.) |
| Comment by James A Simmons [ 21/Aug/19 ] |
|
I agree with Patrick. It's a fix but not one to handle this problem. I was hoping it might address this issue |
| Comment by Gerrit Updater [ 02/Sep/19 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/36021 |
| Comment by Peter Jones [ 02/Sep/19 ] |
|
It looks like you have turned your original debug patch into a fix and now have added a new debug patch. Are you hoping for halifu to use both of these? Peter |
| Comment by Zhenyu Xu [ 03/Sep/19 ] |
|
Yes, I hope the fix patch handles the issue; the debug patch is added to catch more information if that turns out not to be the right fix for this issue. |
| Comment by Gerrit Updater [ 14/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36021/ |
| Comment by Peter Jones [ 14/Dec/19 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 16/Dec/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37034 |
| Comment by Gerrit Updater [ 16/Dec/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37035 |
| Comment by Gerrit Updater [ 20/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37034/ |
| Comment by Gerrit Updater [ 20/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37035/ |
| Comment by Gerrit Updater [ 14/Mar/20 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/37921 |