[LU-869] sanity.sh test_81b failed with "dd should fail for -ENOSPC, but succeed" Created: 21/Nov/11 Updated: 29/Sep/15 Resolved: 29/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 4249 | ||||||||||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/d3d44298-114c-11e1-ad46-52540025f9af. |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ] |
|
Actually, the write failed with EIO instead of ENOSPC: == sanity test 81b: OST should return -ENOSPC when retry still fails ================================= 15:41:11 (1321486871) fail_loc=0x228 write: Input/output error sanity test_81b: @@@@@@ FAIL: dd should fail for -ENOSPC, but succeed. But the client still got ENOSPC: 00000100:02020000:0.0:1321486871.819507:0:4759:0:(client.c:1141:ptlrpc_check_status()) 11-0: an error occurred while communicating with 10.10.4.77@tcp. The ost_write operation failed with -28 Looking closer, ENOSPC was returned on an async write and the following fsync got EIO: 00000080:00200000:0.0:1321486871.744302:0:30812:0:(vvp_io.c:601:vvp_io_write_start()) write: [0, 4096) 00000080:00200000:0.0:1321486871.744305:0:30812:0:(xattr.c:419:ll_getxattr()) VFS Op:inode=144115238843711498/33554443(ffff8800751fe630), xattr security.capability 00000080:00200000:0.0:1321486871.744309:0:30812:0:(xattr.c:284:ll_getxattr_common()) VFS Op:inode=144115238843711498/33554443(ffff8800751fe630) 00000080:00200000:0.0:1321486871.744326:0:30812:0:(rw.c:218:ll_cl_init()) 0@[0x200000bd2:0xa:0x0] -> 0 ffff88007561e878 ffff880075153de8 00000080:00200000:0.0:1321486871.744344:0:30812:0:(vvp_io.c:1033:vvp_io_commit_write()) [0x200000bd2:0xa:0x0] updating i_size 4096 00000100:00100000:0.0:1321486871.744406:0:4759:0:(client.c:1411:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:cce1ed61-bd46-1d91-8ece-ef709b91e2ba:4759:1385676188533420:10.10.4.77@tcp:4 00000100:02020000:0.0:1321486871.819507:0:4759:0:(client.c:1141:ptlrpc_check_status()) 11-0: an error occurred while communicating with 10.10.4.77@tcp. The ost_write operation failed with -28 00000100:00100000:0.0:1321486871.819541:0:4759:0:(client.c:1745:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:cce1ed61-bd46-1d91-8ece-ef709b91e2ba:4759:1385676188533420:10.10.4.77@tcp:4 I think the problem could be probably be avoided if using synchronous write (like direct i/o) in the test. |
| Comment by Peter Jones [ 24/Nov/11 ] |
|
Niu Could you please investigate more? Thanks Peter |
| Comment by Niu Yawei (Inactive) [ 24/Nov/11 ] |
|
Hi, Sarah Can this be reproduced on kernels other than 2.1.52--PRISTINE-2.6.32.36-0.5-default? I suspect it because some special kernel treat the ENOSPC differently in filemap_fdatawrite_range(), if that's true, I think we can just relax the test error condition a little bit, all kinds of write error can be regarded as pass. Where can we get the source of 2.6.32.36? |
| Comment by Sarah Liu [ 27/Nov/11 ] |
|
No, this error only shown on SLES kernel |
| Comment by Niu Yawei (Inactive) [ 28/Nov/11 ] |
|
I checked the source of 2.6.32.36, but didn't see why the ENOSPC isn't returned for the write syscall. I'll provide a patch changing the test from O_SYNC write to O_DIRECT write as Johann suggested. |
| Comment by Niu Yawei (Inactive) [ 28/Nov/11 ] |
|
http://review.whamcloud.com/1748 Sarah, could you try if this works for the sles kernel? |
| Comment by Niu Yawei (Inactive) [ 29/Nov/11 ] |
|
I rechecked the code, seems the culprit is that we don't use writeback bit for pages, which breaks the kernel wait page writeback APIs. Kernel 2.6.32.36 doesn't have writev, so it's write code path is different with other kernels having writev. For O_SYNC write, 2.6.32.36 calls vfs_fsync_range() to sync pages and wait, and the vfs_fsync_range() calls filemap_write_and_wait_range() to wait for io completion then check the AS_ENOSPC on mapping->flags, however, since we never set writeback bit for writing pages, this wait then check function might be finished before io completed! That's why the AS_ENOSPC set on io completion isn't detected. Other kernerls which have writev goes different code path, they call sync_page_range() to wait & check, the sync_page_range() always call generic_osync_inode() before calling the wait_on_page_writeback_range(), so io should probably has completed before calling the wait_on_page_writeback_range(), and the AS_ENOSPC is deteced. But I don't see why write returns -EIO for 2.6.32.36 so far. Hi Sarah, could you reproduce the bug with D_TRACE log enabled? Then I could probably figure out where the -EIO is come from. Thanks in advance. |
| Comment by Sarah Liu [ 29/Nov/11 ] |
|
ok, will keep you updated |
| Comment by Niu Yawei (Inactive) [ 29/Nov/11 ] |
Hmm, looks we do use writeback bit. Anyway, let's see what can we find in D_TRACE log. |
| Comment by Sarah Liu [ 29/Nov/11 ] |
|
Here is the report with D_TRACE enabled. |
| Comment by Niu Yawei (Inactive) [ 29/Nov/11 ] |
|
Thank you, Sarah. The log shows that ll_fsync() returned ENOSPC as well, but vvp_io_write_start() returned -EIO then. 00000080:00000010:3.0:1322551344.059576:0:7829:0:(file.c:1992:ll_fsync()) slab-freed '((oinfo->oi_oa))': 208 at ffff88031cbede50. 00000080:00000010:3.0:1322551344.059577:0:7829:0:(file.c:1993:ll_fsync()) kfreed 'oinfo': 104 at ffff88031ca61c40. 00000080:00000001:3.0:1322551344.059578:0:7829:0:(file.c:1997:ll_fsync()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4) 00000080:00000001:3.0:1322551344.059580:0:7829:0:(vvp_io.c:616:vvp_io_write_start()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb) Look at the vfs_fsync_range() err = fop->fsync(file, dentry, datasync);
if (!ret)
ret = err;
so if the 'ret' already been set as -EIO, the ENOSPC returned by fsync will be ignored, but I didn't figure out why the 'ret' is -EIO yet. |
| Comment by Andreas Dilger [ 29/Nov/11 ] |
|
Jinshan is just working on implementing proper PG_writeback support for CLIO. Possibly this will fix the problem as well. I've added him to this bug to get some advice on whether this is correct. If we know the problem is confined to this specific kernel version, and Jinshan will (I think) be delivering a fix for this soon, I would be OK with adding this test to ALWAYS_EXCEPT for this specific kernel version. That way, the test is still run on other kernels, and if we move to a new kernel and the bug is not fixed it will appear again, instead of being forgotten (like some other bugs in ALWAYS_EXCEPT today). |
| Comment by Niu Yawei (Inactive) [ 29/Nov/11 ] |
Sorry, my earlier analysis about writeback flag is wrong, we do set/clear PG_writeback for page transfer. |
| Comment by Niu Yawei (Inactive) [ 30/Nov/11 ] |
|
Ah, I realized that the kernel version is 2.6.32.43-0.4-default but not 2.6.32.36-0.5, the versions: 2.1.52--PRISTINE-2.6.32.36-0.5-default (x86_64) displayed on maloo confused me... Hi, Sarah, where can I find the source of 2.6.32.43-0.4? Thanks. |
| Comment by Peter Jones [ 04/Jan/12 ] |
|
Sarah Are you able to answer Niu's question about the 2.6.32.43-0.4 source? If not, is there someone else who can point you in the right direction Niu? |
| Comment by Sarah Liu [ 04/Jan/12 ] |
|
I tried to find the source on brent several weeks ago and failed. Niu knows that. |
| Comment by Niu Yawei (Inactive) [ 04/Jan/12 ] |
|
Maybe Chris or Mike knows where we saved the kernel source? |
| Comment by Michael MacDonald (Inactive) [ 25/Jan/12 ] |
|
SLES kernel source is hard to find, if you don't know where to look. I have downloaded the requested SRPM: mjmac@brent:~$ ls -l /scratch/k_srpm/sles11/kernel-source-2.6.32.43-0.4.1.src.rpm total 82080 -rw-r--r-- 1 mjmac mjmac 84043443 2011-07-17 19:33 kernel-source-2.6.32.43-0.4.1.src.rpm For the future, others can download kernel sources directly from novell's update server using special mirror-only credentials (Note the single-quotes around the URL – necessary to protect $RCE from shell expansion): # e.g. for the kernel requested in this ticket: wget 'https://UC1714458:a31da933d6@nu.novell.com/repo/$RCE/SLES11-SP1-Updates/sle-11-x86_64/rpm/src/kernel-source-2.6.32.43-0.4.1.src.rpm' These credentials can be entered into a web browser and used for browsing the novell updates mirror. They can't be used for logging into the main novell partner portal or anything other than nu.novell.com. |
| Comment by Niu Yawei (Inactive) [ 30/Jan/12 ] |
|
Thank you, Mike. I checked the source (2.6.32.43-0.4.1), but still don't see why the write returned -EIO. Sarah, how often did you see this bug on sles (2.6.32.43-0.4-default) client? did it appeared in every test? |
| Comment by Sarah Liu [ 30/Jan/12 ] |
|
Niu, As I remembered, it was easy to reproduced. |
| Comment by Peter Jones [ 08/Feb/12 ] |
|
Sarah Have you repeated this test with 2.1.55 yet? Peter |
| Comment by Sarah Liu [ 08/Feb/12 ] |
|
yes, hit the same error: https://maloo.whamcloud.com/test_sets/efca1288-51a8-11e1-8783-5254004bbbd3 |
| Comment by Niu Yawei (Inactive) [ 08/Feb/12 ] |
|
Hi, Sarah |
| Comment by Sarah Liu [ 09/Feb/12 ] |
|
sure, will keep you updated. it may take some time due to the 2.2 testing, sorry about that. |
| Comment by Jian Yu [ 10/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC1 The same issue occurred: |
| Comment by Sarah Liu [ 16/Feb/12 ] |
Hi Niu, do you mean this patch? http://review.whamcloud.com/#change,1748 |
| Comment by Niu Yawei (Inactive) [ 16/Feb/12 ] |
I meant the patch attached here: |
| Comment by Sarah Liu [ 17/Feb/12 ] |
|
Hi Niu, could you please just submit the patch to review and tell me the build number? Thanks. |
| Comment by Niu Yawei (Inactive) [ 17/Feb/12 ] |
|
ok, I uploaded the diagnostic patch in the patchset 2 of http://review.whamcloud.com/#change,1748 |
| Comment by Sarah Liu [ 17/Feb/12 ] |
|
https://maloo.whamcloud.com/test_sets/6e7dd01e-599d-11e1-a26b-5254004bbbd3 For server log please the attached. |
| Comment by Niu Yawei (Inactive) [ 20/Feb/12 ] |
|
Thanks, Sarah. The log shows that AS_ENOSPC was set on the proper i_mapping, and I can't tell why it's not transfered to -ENOSPCE in wait_on_page_writeback_range() from the kernel code. |
| Comment by Andreas Dilger [ 02/Mar/12 ] |
|
Niu, could you please run this test locally to try and debug it. I'm repeatedly seeing IO errors running sanity.sh test_27[opqv] during file object creations running on a RHEL6 kernel. |
| Comment by Niu Yawei (Inactive) [ 02/Mar/12 ] |
|
Hi, Andreas, did your tests pass or not? I tried the tests locally, and they all passed, I think the EIO error messages are what we expected, since these tests are trying to create file after exhausting the precreation, and that should get EIO error from the server. (see qos_prep_create()->alloc_idx_array(), when we fail to create any stripe, -EIO will be returned the client) |
| Comment by Zhenyu Xu [ 18/Mar/12 ] |
|
another hit for 2.2 https://maloo.whamcloud.com/test_sets/9d70bce8-6ed7-11e1-b737-5254004bbbd3 == sanity test 81b: OST should return -ENOSPC when retry still fails ================================= 21:47:12 (1331786832) fail_loc=0x228 write: Input/output error sanity test_81b: @@@@@@ FAIL: dd should fail for -ENOSPC, but succeed. |
| Comment by Sarah Liu [ 19/Mar/12 ] |
|
Got this issue on 2.2-RC1 test, RHEL6 server/SUSE client: |
| Comment by Jinshan Xiong (Inactive) [ 09/Apr/12 ] |
|
it looks like this issue can only be seen on suse clients. I don't know what's the difference between rhel and suse clients. |
| Comment by Sarah Liu [ 04/Jun/12 ] |
|
tag-2.2.54 RHEL6 server/SLES client hit this issue again: https://maloo.whamcloud.com/test_sets/f0c1b98c-ad58-11e1-8152-52540035b04c |
| Comment by Jian Yu [ 04/Jun/12 ] |
|
Lustre Tag: v2_1_2_RC2 The same issue occurred: https://maloo.whamcloud.com/test_sets/a3243c4a-ae52-11e1-ae0d-52540035b04c |
| Comment by Elena Gryaznova [ 17/Mar/15 ] |
|
This looks a duplicate of |
| Comment by Andreas Dilger [ 29/Sep/15 ] |
|
Closing as a duplicate |