[LU-10059] sanityn test_32a: wrong file size Created: 02/Oct/17 Updated: 16/May/22 Resolved: 17/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.14.0, Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | DNE | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2.

The sub-test test_32a failed with the following error:

wrong file size

Please provide additional information about the failure here.

Info required for matching: sanityn 32a |
| Comments |
| Comment by Jian Yu [ 23/Oct/17 ] |
|
Another failure instance: |
| Comment by nasf (Inactive) [ 20/Nov/17 ] |
|
+1 on master: |
| Comment by Mikhail Pershin [ 07/Dec/17 ] |
|
https://testing.hpdd.intel.com/test_sets/7fef1724-db22-11e7-9c63-52540065bddc |
| Comment by Andreas Dilger [ 13/May/18 ] |
|
+1 on master |
| Comment by Patrick Farrell (Inactive) [ 20/Jan/19 ] |
|
This test does not disable lockless truncate on failure, which is not good, since it's not generally safe. I realized this from: Which panicked with an error on truncate. From 32a: That's almost certainly why |
| Comment by Patrick Farrell (Inactive) [ 20/Jan/19 ] |
|
Two thoughts.
|
| Comment by Gerrit Updater [ 20/Jan/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34070 |
| Comment by Andreas Dilger [ 20/Jan/19 ] |
|
We introduced lockless truncate to handle cases where lots of clients were (incorrectly) opening the same file with O_TRUNC, which caused the server to thrash. I don't know that lockless truncate has wide applicability, but it would be good (if practical) to fix it. More important than lockless truncate, I think, is lockless small/unaligned writes, to get low-latency IOPS for poorly-formed IO. |
| Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ] |
|
I dug through it a bit, and the proximate cause of the failures in 32a seems to be an eviction. I can't find it in the server logs currently... If I can manage that (or reproduce it locally - I've got the test looping), I'll take at least a quick look.

I took a look at lockless truncate and convinced myself I mostly understand the basic design, but the implementation has clearly bit-rotted. It never worked fully in 2.x (and in fact, maybe not even in 1.8 - there are comments about it being disabled there too).

I'm not sure about the lockless i/o situation more generally - that design is murkier to me without investing more time. It's compelling stuff, but unless we want to try a quick-ish fix for just lockless truncate, I think it might be a lot of work to fix. And maybe even then. |
| Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ] |
|
OK, I reproduced and have mostly figured out the lockless truncate issue. At least the issue I'm seeing here happens 100% of the time in this situation: truncate on the same client that did the writing, to a size in the middle of an extent (for example, in the test, write 8 MiB and truncate to 8000000 bytes; this happens if you have 8 or more OSTs). This results in a partial extent remaining on the client after the truncate, in osc_cache_truncate_start:
[...] That extent is then stored in the oio->oi_trunc pointer, to be added back to the cache at the end of the i/o (osc_io_setattr_end): osc_cache_truncate_end(env, oio->oi_trunc); The key thing is this: it looks like we also have to avoid writing back this extent:
This does suggest one possible route forward: this bug, at least, would be avoided by limiting lockless truncate to "truncate to zero". That's probably still valuable, since, as you noted, the O_TRUNC case is of significant interest. |
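To make the scenario concrete, here is a minimal reproduction sketch assembled from the description above and from the sizes in the test logs quoted later in this ticket. The mount points and file name match test_32a; the exact dd/truncate sequence is illustrative, not the test script itself.

    # Two client mounts of the same filesystem, as in sanityn.
    MOUNT1=/mnt/lustre
    MOUNT2=/mnt/lustre2
    FILE=f32a.sanityn

    # Write a full 8 MiB (8388608 bytes) from the first client so the
    # last cached extent extends past the eventual truncate point.
    dd if=/dev/zero of=$MOUNT1/$FILE bs=1M count=8

    # Truncate to 8000000 bytes, landing in the middle of a cached
    # extent; the partial extent left behind is what
    # osc_cache_truncate_start keeps around.
    truncate -s 8000000 $MOUNT1/$FILE

    # Check the size through the second mount; a stale partial extent
    # written back after the lockless truncate shows up as a wrong
    # size here.
    stat -c %s $MOUNT2/$FILE    # expected: 8000000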
| Comment by Andreas Dilger [ 21/Jan/19 ] |
|
It also doesn't make sense to do "lockless" truncate for an extent that is already cached locally under a DLM lock. In that case the client should just send the truncate and advertise that it already has the lock for that extent. |
| Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ] |
|
Right, but the locking has already been done by the time we learn this (cl_io_lock vs cl_io_start), and there's no guarantee the lock the client holds is sufficient for the truncate. We'd have to restart the i/o here to get the locking, and after looking at it... It's a decent bit of work.
I would consider reviving lockless truncate, but only for truncate to zero. I think that would work as-is, with minimal effort. I'm going to push a patch enabling it for that case only, along with a quick test, just to see if anything interesting happens in autotest. |
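A sketch of the kind of quick test this implies, assuming the osc.*.lockless_truncate parameter quoted in the logs below can be toggled with lctl set_param (the file name tfile and this exact sequence are hypothetical, not the pushed patch):

    # Enable lockless truncate; the parameter name comes from the test
    # logs in this ticket, but enabling it this way is an assumption.
    lctl set_param osc.*.lockless_truncate=1

    # Truncate to zero is the one case expected to be safe: no partial
    # extent can survive the truncate, so nothing stale gets written back.
    dd if=/dev/zero of=/mnt/lustre/tfile bs=1M count=8
    truncate -s 0 /mnt/lustre/tfile

    # Verify the size from the second client mount.
    SIZE=$(stat -c %s /mnt/lustre2/tfile)
    [ "$SIZE" -eq 0 ] || echo "FAIL: expected size 0, got $SIZE"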
| Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ] |
|
Alternatively, it could both deadlock and panic in local testing. I'm going to table this and push a patch adding the lockless truncate test to ALWAYS_EXCEPT. |
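For reference, skipping a test via ALWAYS_EXCEPT in the Lustre test framework looks roughly like this; a sketch of the mechanism, not the exact change in the patch below:

    # Near the top of sanityn.sh: tests listed in ALWAYS_EXCEPT are
    # skipped unconditionally when the test filter is built.
    ALWAYS_EXCEPT="$ALWAYS_EXCEPT 32a"
    build_test_filter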
| Comment by Gerrit Updater [ 21/Jan/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34081 |
| Comment by Gerrit Updater [ 30/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34070/ |
| Comment by Andreas Dilger [ 15/Oct/19 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/e9e439c2-eedd-11e9-add9-52540065bddc |
| Comment by Andreas Dilger [ 30/Oct/19 ] |
|
This hit 6x in the past week. |
| Comment by Emoly Liu [ 27/Feb/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/a345510a-c777-4bc1-8c30-2413be63a24a |
| Comment by James Nunez (Inactive) [ 30/Oct/20 ] |
|
It looks like we are still seeing this issue. We see two different errors/situations in the suite log.

One of the errors is 'Can't lstat' with no complaint from truncate:

== sanityn test 32a: lockless truncate =============================================================== 17:20:44 (1603992044)
CMD: trevis-9vm6 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
CMD: trevis-9vm3.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate); [[ -z \"\" ]] && param= || param=\$(grep <<< \"\$params\"); [[ -z \$param ]] && param=\"\$params\"; while read s; do echo client \$s; done <<< \"\$param\"
checking cached lockless truncate
Can't lstat /mnt/lustre2/f32a.sanityn: Input/output error
sanityn test_32a: @@@@@@ FAIL: wrong file size

We see this error for

The other error we are seeing is a truncate error and a report of a different size:

== sanityn test 32a: lockless truncate =============================================================== 18:43:28 (1601491408)
CMD: trevis-65vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
CMD: trevis-65vm1.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate); [[ -z \"\" ]] && param= || param=\$(grep <<< \"\$params\"); [[ -z \$param ]] && param=\"\$params\"; while read s; do echo client \$s; done <<< \"\$param\"
checking cached lockless truncate
truncate: cannot truncate '/mnt/lustre/f32a.sanityn' to length 8000000: Input/output error
/mnt/lustre2/f32a.sanityn has size 7340032, not 8000000
sanityn test_32a: @@@@@@ FAIL: wrong file size

We see this error for |
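Stripped of the remote-execution wrapper, the quoted CMD lines boil down to querying the lockless-truncate setting on each client, roughly:

    # Report the lockless_truncate setting of every OSC device on this
    # client; the wrapper in the log runs this per client node and
    # prefixes each output line with "client".
    /usr/sbin/lctl get_param osc.*.lockless_truncate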
| Comment by Gerrit Updater [ 30/Oct/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40496 |
| Comment by Gerrit Updater [ 07/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40496/ |
| Comment by Xiaolin Zang [ 18/Jan/22 ] |
|
We see the failure occasionally. Behind the error message "Input/output error" is the failure of lstat("/mnt/lustre2/f32a.sanityn"). But the file's size, stat, and layout look normal when observed after the test failure.

From the debug logs on mds0 (the test driver), ldlm fails to connect to oss0 (-107 is -ENOTCONN):

00000100:02020000:11.0:1642177889.168635:0:9731:0:(client.c:1371:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107

Slightly earlier by timestamp, the oss0 debug log has the following (operation 101 is ldlm_enqueue):

00000020:00080000:10.0:1642177889.168118:0:14028:0:(tgt_handler.c:770:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.6.4.19@tcp

Also from the dmesg on mds0:

[ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107

It seems mds0 and oss0 had a temporary connection error. It's unlikely to be a random network issue, because the other tests pass when test_32a fails, as we have observed many times.

Uploaded the following files. The debug logs are denoted "xxxxx".
sanityn.test_32a.debug_log.mds0.32a_only
sanityn.test_32a.dmesg.mds0
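For anyone retracing this, the quoted lines can be pulled out of dumped debug logs roughly as follows ("lctl dk" dumps the kernel debug buffer; the output file names here are placeholders):

    # Dump the debug buffer to a file on each node of interest.
    lctl dk /tmp/debug.mds0

    # The client-side enqueue failure (-107 is -ENOTCONN) ...
    grep 'ldlm_enqueue.*rc = -107' /tmp/debug.mds0

    # ... and the matching server-side refusal (operation 101 is
    # LDLM_ENQUEUE) in the oss0 log.
    grep 'operation 101 on unconnected OST' /tmp/debug.oss0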
|
| Comment by Patrick Farrell [ 17/Mar/22 ] |
|
Test removed in |