[LU-12757] sanity-lfsck test 36a fails with '(N) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2' Created: 12/Sep/19 Updated: 16/Jan/23 Resolved: 12/Aug/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HPv2 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We see sanity-lfsck test_36a fail in resync for the last two of the following three calls to ‘lfs mirror resync’ from test 36a:

	$LFS mirror resync $DIR/$tdir/f0 ||
		error "(6) Fail to resync $DIR/$tdir/f0"
	$LFS mirror resync $DIR/$tdir/f1 ||
		error "(7) Fail to resync $DIR/$tdir/f1"
	$LFS mirror resync $DIR/$tdir/f2 ||
		error "(8) Fail to resync $DIR/$tdir/f2"

It looks like this test started failing with these two errors on 07-September-2019 with Lustre master version 2.12.57.54.

Looking at the suite_log for https://testing.whamcloud.com/test_sets/a5f2b938-d438-11e9-a2b6-52540065bddc, we see

lfs mirror mirror: component 131075 not synced : No space left on device (28)
lfs mirror mirror: component 131076 not synced : No space left on device (28)
lfs mirror mirror: component 196613 not synced : No space left on device (28)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f1' llapi_mirror_resync_many: No space left on device.
sanity-lfsck test_36a: @@@@@@ FAIL: (7) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f1

Similarly, looking at the suite_log for https://testing.whamcloud.com/test_sets/42fbb9fe-d575-11e9-9fc9-52540065bddc, we see

lfs mirror mirror: component 131075 not synced : No space left on device (28)
lfs mirror mirror: component 131076 not synced : No space left on device (28)
lfs mirror mirror: component 196613 not synced : No space left on device (28)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f2' llapi_mirror_resync_many: No space left on device.
sanity-lfsck test_36a: @@@@@@ FAIL: (8) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2

It is possible that we are running out of disk space on an OST, but it seems strange that this just started earlier this month.

Logs for other failures are at |
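For context, a minimal standalone sketch of the pattern the test exercises. This is not the test_36a body (which corrupts the LOV EA and uses the LFSCK machinery); the mount point, directory name, mirror count, and sizes here are illustrative assumptions only.

```bash
# Illustrative reproduction sketch, not the actual sanity-lfsck test_36a code.
MNT=/mnt/lustre            # assumed client mount point
DIR=$MNT/flr-resync-demo   # illustrative directory name
mkdir -p "$DIR"
# create a file with three mirrors (only meaningful with at least 3 OSTs)
lfs mirror create -N3 "$DIR/f0"
# overwrite the file; the mirrors that were not written become stale
dd if=/dev/urandom of="$DIR/f0" bs=1M count=4 conv=notrunc
# free space as the OSTs report it, for comparison with any ENOSPC below
lfs df "$MNT"
# resync copies data to the stale mirrors; on tiny OSTs this is where
# "No space left on device" shows up
lfs mirror resync "$DIR/f0" || echo "resync failed"
```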
| Comments |
| Comment by James Nunez (Inactive) [ 12/Sep/19 ] |
|
I added 'lfs df' to a patch that hits this problem consistently, patch https://review.whamcloud.com/33919/ , and it doesn't look like we are out of space:

== sanity-lfsck test 36a: rebuild LOV EA for mirrored file (1) ======================================= 22:37:15 (1568327835)
##### The target MDT-object's LOV EA corrupted as to lose one of the mirrors information. The layout LFSCK should rebuild the LOV EA with the PFID EA of related OST-object(s) belong to the mirror. #####
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.126361 s, 33.2 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0600958 s, 69.8 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0601316 s, 69.8 MB/s
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID        43584        2168       37428   6% /mnt/lustre[MDT:0]
lustre-OST0000_UUID        71100        3452       57504   6% /mnt/lustre[OST:0]
lustre-OST0001_UUID        71100        1272       62396   2% /mnt/lustre[OST:1]
lustre-OST0002_UUID        71100        1272       61948   3% /mnt/lustre[OST:2]
lustre-OST0003_UUID        71100        1880       61748   3% /mnt/lustre[OST:3]
filesystem_summary:       284400        7876      243596   4% /mnt/lustre
lfs mirror mirror: component 131075 not synced : No space left on device (28)
lfs mirror mirror: component 131076 not synced : No space left on device (28)
lfs mirror mirror: component 196613 not synced : No space left on device (28)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f2' llapi_mirror_resync_many: No space left on device.
sanity-lfsck test_36a: @@@@@@ FAIL: (8) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2 |
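For anyone else chasing this, a hedged sketch of additional diagnostics that could be gathered around the failing resync. The file name is taken from the log above; the 'lfs df -i' and verbose-layout checks are extra data points I am suggesting, not part of the patch.

```bash
# Extra diagnostics around the failing resync (sketch; not part of the patch).
F=/mnt/lustre/d36a.sanity-lfsck/f2    # file name taken from the log above
lfs df /mnt/lustre       # per-target block usage
lfs df -i /mnt/lustre    # per-target inode usage, in case inodes run out first
lfs getstripe -v "$F"    # verbose layout; stale components are what resync rewrites
lfs mirror resync "$F" || echo "resync of $F failed"
```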
| Comment by Andreas Dilger [ 12/Sep/19 ] |
|
It is very likely that the culprit is:
commit 0f670d1ca9dd5af697bfbf3b95a301c61a8b4447
Author: Bobi Jam <bobijam@whamcloud.com>
AuthorDate: Wed Oct 10 14:23:55 2018 +0800
LU-11239 lfs: fix mirror resync error handling
This patch returns error for partially successful mirror resync.
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Change-Id: I9d6c9ef5aca1674ceb7a9cbc6b790f3f7276ff5d
Reviewed-on: https://review.whamcloud.com/33537
though this is just returning the error, it isn't causing the error, AFAICS. |
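For scripts, the practical effect of that change is that a partially successful resync now surfaces as a non-zero exit status, which is what trips the `|| error` in the test. A minimal sketch (the file name is only taken from the logs above; this is not code from any patch) of how one might confirm which components were left unsynced:

```bash
# Sketch only: after a failed resync, the components that were not synced
# should still carry the "stale" flag in the verbose layout output.
F=/mnt/lustre/d36a.sanity-lfsck/f1    # file name taken from the logs above
if ! lfs mirror resync "$F"; then
        echo "resync of $F failed; components still marked stale:"
        # stale components show up in the lcme_flags of the verbose layout
        lfs getstripe -v "$F" | grep -i stale
fi
```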
| Comment by Gerrit Updater [ 13/Sep/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36176 |
| Comment by Andreas Dilger [ 13/Sep/19 ] |
|
The above patch does not fix the test failure here; it is just cosmetic, fixing the error message so that it does not have a newline in the middle:

lfs mirror mirror: component 131075 not synced : No space left on device (28)

I also notice that progname is "lfs mirror mirror", which is also not correct. That is because one or more of the following happen: lfs_setstripe_internal() appends argv[0] to progname internally (via the cmd[] buffer) and prints both progname and argv[0] explicitly in error messages, lfs_mirror() appends argv[0] to progname, and llapi_error()->error_callback_default() appends liblustreapi_cmd as well. That is very confusing. That should be fixed in a separate patch. |
| Comment by Arshad Hussain [ 06/Oct/19 ] |
|
Again detected under -> https://testing.whamcloud.com/test_sets/b004f920-e795-11e9-b62b-52540065bddc |
| Comment by Sarah Liu [ 18/Oct/19 ] |
|
Found a similar error on a PPC client running 2.12.3:

== sanity-lfsck test 36a: rebuild LOV EA for mirrored file (1) ======================================= 22:20:32 (1570573232)
##### The target MDT-object's LOV EA corrupted as to lose one of the mirrors information. The layout LFSCK should rebuild the LOV EA with the PFID EA of related OST-object(s) belong to the mirror. #####
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.147001 s, 28.5 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0381554 s, 110 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0381345 s, 110 MB/s
lfs mirror mirror: cannot get WRITE lease, ext 1: Device or resource busy (16)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f0' llapi_lease_get_ext resync failed: Device or resource busy.
sanity-lfsck test_36a: @@@@@@ FAIL: (6) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f0 |
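Not from the ticket, but for this lease-busy variant a simple hedged workaround is to retry the resync a few times, since the WRITE lease can be held transiently by another opener. The file name below is taken from the log above; the retry loop itself is just an illustrative sketch, not part of sanity-lfsck.sh.

```bash
# Hedged workaround sketch (not in sanity-lfsck.sh): retry the resync when the
# WRITE lease is transiently unavailable (EBUSY).
F=/mnt/lustre/d36a.sanity-lfsck/f0    # file name taken from the log above
for attempt in 1 2 3; do
        lfs mirror resync "$F" && break
        echo "resync attempt $attempt failed; retrying"
        sleep 2
done
```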
| Comment by Andreas Dilger [ 28/Nov/19 ] |
|
It looks like the small filesystems used by sanity-lfsck.sh are causing problems in this test: the client is consuming all of the free space as grant, while "lfs mirror resync" is using O_DIRECT to do the writes, which currently does not consume grants. |
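One hedged way to check that hypothesis on a live client is to compare the grant the client holds with the free space the OSTs report (the parameter name is the standard osc one; exact output format varies by version):

```bash
# Compare the grant this client holds per OST with the free space the OSTs
# report; if the grants account for most of the free space, O_DIRECT writes
# (which bypass grant) can hit ENOSPC even though "lfs df" looks healthy.
lctl get_param osc.*.cur_grant_bytes
lfs df /mnt/lustre
```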
| Comment by Gerrit Updater [ 06/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36176/ |
| Comment by Peter Jones [ 06/Dec/19 ] |
|
Landed for 2.14 |
| Comment by James Nunez (Inactive) [ 06/Dec/19 ] |
|
Reopening this ticket because the patch that landed changes the error message formatting but does not address the problem described in this ticket. |
| Comment by Andreas Dilger [ 08/Jan/20 ] |
|
To fix this, we need one of the patches from |
| Comment by Andreas Dilger [ 09/Jul/20 ] |
|
+1 on master: Patch https://review.whamcloud.com/35896 " |
| Comment by Vikentsi Lapa [ 10/Jul/20 ] |
|
Additional details about the test that I hope will be useful for reproducing the issue. In most cases the failed tests ran with only a single MDT (the list from the link below was analyzed for sanity-lfsck.sh/test_36a). However, sometimes tests with one MDT passed as well.

For example, output from a failed test:

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID        43584        2444       37152   7% /mnt/lustre[MDT:0]
lustre-OST0000_UUID        71100        7556       56544  12% /mnt/lustre[OST:0]
lustre-OST0001_UUID        71100        5376       58724   9% /mnt/lustre[OST:1]
lustre-OST0002_UUID        71100       10496       50508  18% /mnt/lustre[OST:2]
lustre-OST0003_UUID        71100        1280       61948   3% /mnt/lustre[OST:3]

Output from a test where multiple MDTs were used, which passed:

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       283520        3968      277504   2% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       283520        3200      278272   2% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID       283520        3200      278272   2% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID       283520        3200      278272   2% /mnt/lustre[MDT:3]
lustre-OST0000_UUID       282624       16384      264192   6% /mnt/lustre[OST:0]
lustre-OST0001_UUID       282624       10240      270336   4% /mnt/lustre[OST:1]
lustre-OST0002_UUID       282624       22528      258048   9% /mnt/lustre[OST:2]
lustre-OST0003_UUID       282624        4096      276480   2% /mnt/lustre[OST:3]

When the number of OSTs is less than 3, the test is skipped, so there is no failure in those cases.
|
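Building on the observation above, here is a hedged sketch of how the configuration dependence can be checked on a test node. This is not code from sanity-lfsck.sh; MDSCOUNT/OSTCOUNT are the usual test-framework variables, and the real skip message in the test may be worded differently.

```bash
# Not the actual test code: a sketch of checking the configuration that the
# observation above points at. MDSCOUNT/OSTCOUNT are assumed to be set by the
# Lustre test framework; the defaults here are only for standalone runs.
echo "MDTs: ${MDSCOUNT:-1}  OSTs: ${OSTCOUNT:-0}"
# per the comment above, test_36a is skipped (not failed) with fewer than 3 OSTs
if [ "${OSTCOUNT:-0}" -lt 3 ]; then
        echo "test_36a would be skipped: needs >= 3 OSTs"
fi
lfs df /mnt/lustre    # per-target sizes; the failing single-MDT runs above have ~70 MB OSTs
```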
| Comment by James Nunez (Inactive) [ 12/Aug/20 ] |
|
The patch that Andreas referenced as the fix for this issue, https://review.whamcloud.com/#/c/35896/, landed to master on July 10 and we haven’t seen this issue since July 9. It looks like this issue is fixed and we can close this ticket. |