[LU-10059] sanityn test_32a: wrong file size Created: 02/Oct/17  Updated: 16/May/22  Resolved: 17/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.14.0, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Fixed Votes: 1
Labels: DNE

Attachments: File sanityn.test_32a.debug_log.mds0.32a_only     File sanityn.test_32a.debug_log.oss0.32a_only     File sanityn.test_32a.dmesg.mds0     File sanityn.test_32a.test_log.mds0    
Issue Links:
Related
is related to LU-10891 sanityn test 77a, 77b, 77c, 77d, 77e,... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2.

The sub-test test_32a failed with the following error:

wrong file size

Please provide additional information about the failure here.

Info required for matching: sanityn 32a



 Comments   
Comment by Jian Yu [ 23/Oct/17 ]

More failure instance:
https://testing.hpdd.intel.com/test_sets/937fd378-b3f4-11e7-a282-5254006e85c2

Comment by nasf (Inactive) [ 20/Nov/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/4b6b05f2-cd8b-11e7-a066-52540065bddc

Comment by Mikhail Pershin [ 07/Dec/17 ]

https://testing.hpdd.intel.com/test_sets/7fef1724-db22-11e7-9c63-52540065bddc

Comment by Andreas Dilger [ 13/May/18 ]

+1 on master
https://testing.hpdd.intel.com/test_sets/45ffb738-569e-11e8-b303-52540065bddc

Comment by Patrick Farrell (Inactive) [ 20/Jan/19 ]

This test does not disable lockless truncate on failure, which is not good, since it's not generally safe.  I realized this from:
https://testing.whamcloud.com/test_sets/b5edcc08-1b7c-11e9-8388-52540065bddc

Which panicked with an error on truncate.

From 32a:
" log "checking cached lockless truncate"
$TRUNCATE $DIR1/$tfile 8000000
$CHECKSTAT -s 8000000 $DIR2/$tfile || error "wrong file size""

That's almost certainly why LU-10891 is happening after this failure.

Comment by Patrick Farrell (Inactive) [ 20/Jan/19 ]

Two thoughts.

  1. Fix the test.  That's easy.  I'll push a patch.
  2. Should we remove lockless truncate?  It's not safe, so it's off, so it's not used AFAIK...  I'll dig in to this a little more before I go further, but I think we should consider just removing it.  (Especially since its own test isn't 100% reliable.  That may be a bug in lockless truncate too...)
Comment by Gerrit Updater [ 20/Jan/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34070
Subject: LU-10059 tests: sanityn 32a restore parameters
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7dc5954230ef2933242bfcb3312f3e761021f3b2

Comment by Andreas Dilger [ 20/Jan/19 ]

We introduced lockless truncate to handle cases where lots of clients were (incorrectly) opening the same file with O_TRUNC, but it caused the server to thrash. I don't know that lockless truncate has a wide applicability, but it would be good (if practical) to fix it. I think more important than lockless truncate is lockless small/unaligned write to have low latency IOPS for poorly-formed IO.

Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ]

I dug through it a bit, and the proximate cause of the failures in 32a seems to be an eviction.  Can't find it in the server logs currently...  If I can manage that (or reproduce it locally, I've got the test looping...), I'll take at least a quick look.

I took a look at lockless truncate and convinced myself I mostly understand the basic design, but the implementation has clearly bit rotted a bit.  It never worked fully in 2.x (and in fact, maybe not even in 1.8 - there's comments about it being disabled there too).  I'm not sure about the lockless i/o situation more generally - That design is a bit murkier to me without investing more time.

It's compelling stuff, but unless we want to try to do a quick-ish fix for just lockless truncate, I think it might be a lot of work to fix.  And maybe even then.

Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ]

OK, I reproduced and have mostly figured out the lockless truncate issue.

At least the issue I'm seeing here happens 100% of the time in this situation:

Truncate on the same client which did the writing, and truncate a size in the middle of an extent (for example, in the test, write 8 MiB and truncate to binary 8 million.  This happens if you have 8 or more OSTs).

This results in a partial extent remaining on the client after the truncate:

in osc_cache_truncate_start:
" } else {
/* this must be an overlapped extent which means only

  • part of pages in this extent have been truncated.
    */
    EASSERTF(ext->oe_start <= index, ext,
    "trunc index = %lu/%d.\n", index, partial);
    /* fix index to skip this partially truncated extent */
    index = ext->oe_end + 1;
    partial = false;"

[...]

That extent is then placed in the &oio->oi_trunc pointer, to be added back to the cache at the end of the i/o (osc_io_setattr_end) :

osc_cache_truncate_end(env, oio->oi_trunc);

The key thing is this:
That extent still exists and is attached to the relevant LDLM lock (the one used to write it out).  But since we're doing a lockless truncate, we send the punch request to the server without any LDLM locking locally, so the server tries to take the lock, and tries to call back the client write lock so it can do the truncate.

It looks like we also have to avoid writing back this extent:
" /* we need to hold this extent in OES_TRUNC state so

  • that no writeback will happen. This is to avoid
  • BUG 17397.
  • Only partial truncate can reach here, if @size is
  • not zero, the caller should provide a valid @extp. */
    LASSERT(*extp == NULL);
    *extp = osc_extent_get(ext);
    OSC_EXTENT_DUMP(D_CACHE, ext,
    "trunc at %llu\n", size);"

This does suggest one possible route forward - This bug at least would be avoided by limiting lockless truncate to "truncate to zero".

That's probably valuable, as you noted the O_TRUNC case is of significant interest.

Comment by Andreas Dilger [ 21/Jan/19 ]

It also doesn't make sense to do "lockless" truncate for an extent that is already cached locally under a DLM lock. In that case the client should just send the truncate and advertise that it already has the lock for that extent.

Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ]

Right, but the locking has already been done by the time we learn this (cl_io_lock vs cl_io_start), and there's no guarantee the lock the client holds is sufficient for the truncate.  We'd have to restart the i/o here to get the locking, and after looking at it...  It's a decent bit of work.

 

I would consider reviving lockless truncate, but only for truncate to zero.  I think that would work as is, with minimal effort.  I'm going to push a patch to enable lockless truncate but only for truncate to zero, along with a quick test, just to see if anything interesting happens in autotest.

Comment by Patrick Farrell (Inactive) [ 21/Jan/19 ]

Alternately, it could both deadlock and panic in local testing.

I'm going to table this and push a patch to add lockless truncate to ALWAYS_EXCEPT.

Comment by Gerrit Updater [ 21/Jan/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34081
Subject: LU-10059 tests: Disable lockless truncate test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d7f4f322514b522d2a23ce3a698e56a768e4bfbb

Comment by Gerrit Updater [ 30/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34070/
Subject: LU-10059 tests: sanityn 32a restore parameters
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 62b57e34d9a0df1ce4b82650d7e328db5d048b39

Comment by Andreas Dilger [ 15/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sets/e9e439c2-eedd-11e9-add9-52540065bddc

Comment by Andreas Dilger [ 30/Oct/19 ]

This hit 6x in the past week.

Comment by Emoly Liu [ 27/Feb/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/a345510a-c777-4bc1-8c30-2413be63a24a

Comment by James Nunez (Inactive) [ 30/Oct/20 ]

It looks like we are still seeing this issue. We see two different errors/situations in the suite log.

One of the errors is ‘can’t lstat’with no complaint from truncate

== sanityn test 32a: lockless truncate =============================================================== 17:20:44 (1603992044)
CMD: trevis-9vm6 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
CMD: trevis-9vm3.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate);
			 [[ -z \"\" ]] && param= ||
			 param=\$(grep  <<< \"\$params\");
			 [[ -z \$param ]] && param=\"\$params\";
			 while read s; do echo client \$s;
			 done <<< \"\$param\"
checking cached lockless truncate
Can't lstat /mnt/lustre2/f32a.sanityn: Input/output error
 sanityn test_32a: @@@@@@ FAIL: wrong file size 

We see this error for
2.12.5.67 - https://testing.whamcloud.com/test_sets/9881eb9e-1130-4da5-9312-a4451d67c59c
2.13.55.104 - https://testing.whamcloud.com/test_sets/7ad28649-b4a5-458a-8b3f-a08820a4b85c

The other error we are seeing is a truncate error and the report on different size

== sanityn test 32a: lockless truncate =============================================================== 18:43:28 (1601491408)
CMD: trevis-65vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
CMD: trevis-65vm1.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate);
			 [[ -z \"\" ]] && param= ||
			 param=\$(grep  <<< \"\$params\");
			 [[ -z \$param ]] && param=\"\$params\";
			 while read s; do echo client \$s;
			 done <<< \"\$param\"
checking cached lockless truncate
truncate: cannot truncate '/mnt/lustre/f32a.sanityn' to length 8000000: Input/output error
/mnt/lustre2/f32a.sanityn has size 7340032, not 8000000
 sanityn test_32a: @@@@@@ FAIL: wrong file size 

We see this error for
2.12.5.50 - https://testing.whamcloud.com/test_sets/7c82a5a3-67f9-4d9e-996b-e6584cbad2d3

Comment by Gerrit Updater [ 30/Oct/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40496
Subject: LU-10059 tests: sanityn 32a error messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ec9dafe19fe31b5a19151a33cdb388f359fa7c1

Comment by Gerrit Updater [ 07/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40496/
Subject: LU-10059 tests: sanityn 32a error messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3274e573957e8b8a067ae28c3f7d7788d40f310e

Comment by Xiaolin Zang [ 18/Jan/22 ]

We see the failure occasionally.  Behind the error message

Input/output error
 sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size    

is the failure of lstat("/mnt/lustre2/f32a.sanityn").  But the file's size, stat and layout look normal, observed after the test failure.

From the debug logs on mds0 (test driver), ldlm fails to connect to oss0: (-107 is no-conn)

00000100:02020000:11.0:1642177889.168635:0:9731:0:(client.c:1371:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107

Slightly earlier by timestamp, the oss0 debug log has the following (op 101 is ldlm_enqueue):

00000020:00080000:10.0:1642177889.168118:0:14028:0:(tgt_handler.c:770:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.6.4.19@tcp

Also from the dmesg on mds0:

[ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
[ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)
[ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size

It seems the mds0 and oss0 have a temporary connection error. It's unlikely due to a random network issue because other tests are OK when test_32a's fails, as we have observed many times.

Uploaded the following files.  The debug logs are denoted "xxxxx".

sanityn.test_32a.debug_log.mds0.32a_only    sanityn.test_32a.dmesg.mds0
sanityn.test_32a.debug_log.oss0.32a_only    sanityn.test_32a.test_log.mds0

 

Comment by Patrick Farrell [ 17/Mar/22 ]

Test removed in 

Generated at Sat Feb 10 02:31:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.