[LU-10059] sanityn test_32a: wrong file size - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.14.0, Lustre 2.12.6
Labels: DNE
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2.

The sub-test test_32a failed with the following error:

wrong file size

Please provide additional information about the failure here.

Info required for matching: sanityn 32a

Attachments

- sanityn.test_32a.debug_log.mds0.32a_only (2.01 MB, 18/Jan/22 8:50 PM)
- sanityn.test_32a.debug_log.oss0.32a_only (132 kB, 18/Jan/22 8:50 PM)
- sanityn.test_32a.dmesg.mds0 (63 kB, 18/Jan/22 8:50 PM)
- sanityn.test_32a.test_log.mds0 (6 kB, 18/Jan/22 8:50 PM)

Issue Links

is related to

LU-10891 sanityn test 77a, 77b, 77c, 77d, 77e, 77f, 77j and 77k all fail after 32a with 'dd at *MB on client failed (2)'

Resolved

Activity

Gerrit Updater added a comment - 30/Oct/20 5:48 PM

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40496
Subject: LU-10059 tests: sanityn 32a error messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ec9dafe19fe31b5a19151a33cdb388f359fa7c1

James Nunez (Inactive) added a comment - 30/Oct/20 5:34 PM

It looks like we are still seeing this issue. We see two different errors/situations in the suite log.

One of the errors is "Can't lstat", with no complaint from truncate:

== sanityn test 32a: lockless truncate =============================================================== 17:20:44 (1603992044)
CMD: trevis-9vm6 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
CMD: trevis-9vm3.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate);
			 [[ -z \"\" ]] && param= ||
			 param=\$(grep  <<< \"\$params\");
			 [[ -z \$param ]] && param=\"\$params\";
			 while read s; do echo client \$s;
			 done <<< \"\$param\"
checking cached lockless truncate
Can't lstat /mnt/lustre2/f32a.sanityn: Input/output error
 sanityn test_32a: @@@@@@ FAIL: wrong file size

We see this error for
2.12.5.67 - https://testing.whamcloud.com/test_sets/9881eb9e-1130-4da5-9312-a4451d67c59c
2.13.55.104 - https://testing.whamcloud.com/test_sets/7ad28649-b4a5-458a-8b3f-a08820a4b85c

The other error we are seeing is a truncate failure, followed by a report of an unexpected file size:

== sanityn test 32a: lockless truncate =============================================================== 18:43:28 (1601491408)
CMD: trevis-65vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
CMD: trevis-65vm1.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate);
			 [[ -z \"\" ]] && param= ||
			 param=\$(grep  <<< \"\$params\");
			 [[ -z \$param ]] && param=\"\$params\";
			 while read s; do echo client \$s;
			 done <<< \"\$param\"
checking cached lockless truncate
truncate: cannot truncate '/mnt/lustre/f32a.sanityn' to length 8000000: Input/output error
/mnt/lustre2/f32a.sanityn has size 7340032, not 8000000
 sanityn test_32a: @@@@@@ FAIL: wrong file size

We see this error for
2.12.5.50 - https://testing.whamcloud.com/test_sets/7c82a5a3-67f9-4d9e-996b-e6584cbad2d3
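The two sizes in the second failure can be compared with a little arithmetic. The sketch below assumes 1 MiB units, which is an assumption: the log queries the stripe size but does not show its value. The helper name `describe` is hypothetical.

```python
MIB = 1 << 20  # assumed unit; the actual stripe size is not shown in the log

reported_size = 7340032   # from "has size 7340032, not 8000000"
requested_size = 8000000  # length passed to truncate in the log


def describe(size, unit=MIB):
    """Return (whole units, leftover bytes) for a byte count."""
    return divmod(size, unit)


print(describe(reported_size))   # (7, 0)
print(describe(requested_size))  # (7, 659968)
```

Under that assumption the reported size is exactly 7 MiB, while the requested length falls partway into the eighth MiB. That would be consistent with the failed truncate leaving the file at an earlier size, though the log alone does not confirm this.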

Emoly Liu added a comment - 27/Feb/20 4:32 AM

+1 on master: https://testing.whamcloud.com/test_sets/a345510a-c777-4bc1-8c30-2413be63a24a

Andreas Dilger added a comment - 30/Oct/19 1:45 AM

This hit 6x in the past week.

Andreas Dilger added a comment - 15/Oct/19 12:52 AM

+1 on master https://testing.whamcloud.com/test_sets/e9e439c2-eedd-11e9-add9-52540065bddc

Gerrit Updater added a comment - 30/Jan/19 2:41 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34070/
Subject: LU-10059 tests: sanityn 32a restore parameters
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 62b57e34d9a0df1ce4b82650d7e328db5d048b39

Gerrit Updater added a comment - 21/Jan/19 9:53 PM

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34081
Subject: LU-10059 tests: Disable lockless truncate test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d7f4f322514b522d2a23ce3a698e56a768e4bfbb

Patrick Farrell (Inactive) added a comment - 21/Jan/19 9:50 PM

Alternately, it could both deadlock and panic in local testing.

I'm going to table this and push a patch to add lockless truncate to ALWAYS_EXCEPT.

Patrick Farrell (Inactive) added a comment - 21/Jan/19 9:02 PM - edited

Right, but the locking has already been done by the time we learn this (cl_io_lock vs cl_io_start), and there's no guarantee the lock the client holds is sufficient for the truncate. We'd have to restart the i/o here to get the locking, and after looking at it... It's a decent bit of work.

I would consider reviving lockless truncate, but only for truncate to zero. I think that would work as is, with minimal effort. I'm going to push a patch to enable lockless truncate but only for truncate to zero, along with a quick test, just to see if anything interesting happens in autotest.

Andreas Dilger added a comment - 21/Jan/19 7:16 PM

It also doesn't make sense to do "lockless" truncate for an extent that is already cached locally under a DLM lock. In that case the client should just send the truncate and advertise that it already has the lock for that extent.

Patrick Farrell (Inactive) added a comment - 21/Jan/19 6:16 PM

OK, I reproduced and have mostly figured out the lockless truncate issue.

At least the issue I'm seeing here happens 100% of the time in this situation:

Truncate on the same client that did the writing, and truncate to a size in the middle of an extent. For example, in the test, write 8 MiB and truncate to 8000000 bytes (decimal 8 million); this happens if you have 8 or more OSTs.

This results in a partial extent remaining on the client after the truncate:

in osc_cache_truncate_start:
	} else {
		/* this must be an overlapped extent which means only
		 * part of pages in this extent have been truncated.
		 */
		EASSERTF(ext->oe_start <= index, ext,
			 "trunc index = %lu/%d.\n", index, partial);
		/* fix index to skip this partially truncated extent */
		index = ext->oe_end + 1;
		partial = false;

[...]

That extent is then stored in oio->oi_trunc, to be added back to the cache at the end of the i/o (osc_io_setattr_end):

osc_cache_truncate_end(env, oio->oi_trunc);

The key thing is this:
That extent still exists and is attached to the relevant LDLM lock (the one used to write it out). But since we're doing a lockless truncate, we send the punch request to the server without any LDLM locking locally, so the server tries to take the lock, and tries to call back the client write lock so it can do the truncate.

It looks like we also have to avoid writing back this extent:
	/* we need to hold this extent in OES_TRUNC state so
	 * that no writeback will happen. This is to avoid
	 * BUG 17397.
	 * Only partial truncate can reach here, if @size is
	 * not zero, the caller should provide a valid @extp. */
	LASSERT(*extp == NULL);
	*extp = osc_extent_get(ext);
	OSC_EXTENT_DUMP(D_CACHE, ext,
			"trunc at %llu\n", size);

This does suggest one possible route forward: this bug, at least, would be avoided by limiting lockless truncate to "truncate to zero".

That's probably valuable, as you noted the O_TRUNC case is of significant interest.
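To make the partial-extent case concrete, here is a toy model in Python. It is not the Lustre code: the `Extent` class, the `truncate_extents` function, the 4096-byte page size, and the extent layout are all hypothetical, chosen only to illustrate why a truncate to 8000000 can leave an overlapped extent behind while a truncate to zero cannot.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass
class Extent:
    start: int  # first page index covered (inclusive)
    end: int    # last page index covered (inclusive)


def truncate_extents(extents, size, page_size=PAGE_SIZE):
    """Classify cached extents for a truncate to `size` bytes.

    Returns (kept, dropped, overlapped):
      kept       - every page lies wholly below the new size
      dropped    - every page lies wholly at or beyond the new size
      overlapped - the extent straddles the truncate point
    """
    index = size // page_size          # page containing the truncate point
    partial = size % page_size != 0    # True if that page is cut mid-page
    first_gone = index + 1 if partial else index
    kept, dropped, overlapped = [], [], []
    for ext in extents:
        if ext.end < index:
            kept.append(ext)
        elif ext.start >= first_gone:
            dropped.append(ext)
        else:
            overlapped.append(ext)
    return kept, dropped, overlapped


# Hypothetical layout: 8 MiB cached as two 4 MiB extents (pages 0..2047).
cache = [Extent(0, 1023), Extent(1024, 2047)]

print(truncate_extents(cache, 8000000))  # second extent is overlapped
print(truncate_extents(cache, 0))        # nothing is overlapped
```

In this model a truncate to zero can only drop extents, so no partially truncated extent remains attached to a client lock. That matches the reasoning for limiting lockless truncate to the truncate-to-zero case, but it says nothing about the other paths in the real code.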

People

Assignee: WC Triage

Reporter: Maloo

Votes: 1

Watchers: 12

Dates

Created: 02/Oct/17 8:37 PM

Updated: 16/May/22 9:51 PM

Resolved: 17/Mar/22 4:06 PM