[LU-12757] sanity-lfsck test 36a fails with '(N) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2' Created: 12/Sep/19  Updated: 16/Jan/23  Resolved: 12/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: HPv2

Issue Links:
Related
is related to LU-4664 sync write should consume grant on cl... Resolved
is related to LU-11239 sanity-lfsck test 36a fails with 'Fai... Resolved
is related to LU-12687 Fast ENOSPC on direct I/O Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We see sanity-lfsck test_36a fail in resync for the last two of the following three 'lfs mirror resync' calls in the test:

$LFS mirror resync $DIR/$tdir/f0 ||
        error "(6) Fail to resync $DIR/$tdir/f0"
$LFS mirror resync $DIR/$tdir/f1 ||
        error "(7) Fail to resync $DIR/$tdir/f1"
$LFS mirror resync $DIR/$tdir/f2 ||
        error "(8) Fail to resync $DIR/$tdir/f2"
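For reference, after a failed resync the stale replica can be identified from the composite layout; a minimal sketch (the file name and mount point are illustrative, not taken from the logs):

# Stale components show "stale" in their lcme_flags field
lfs getstripe -v /mnt/lustre/d36a.sanity-lfsck/f2 | grep lcme_flags
# Read-only consistency check of all mirrors of the file
lfs mirror verify -v /mnt/lustre/d36a.sanity-lfsck/f2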

It looks like this test started failing with these two errors on 07-September-2019 with Lustre master version 2.12.57.54.

Looking at the suite_log for https://testing.whamcloud.com/test_sets/a5f2b938-d438-11e9-a2b6-52540065bddc, we see

lfs mirror mirror: component 131075 not synced
: No space left on device (28)
lfs mirror mirror: component 131076 not synced
: No space left on device (28)
lfs mirror mirror: component 196613 not synced
: No space left on device (28)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f1' llapi_mirror_resync_many: No space left on device.
 sanity-lfsck test_36a: @@@@@@ FAIL: (7) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f1 

Similarly, looking at the suite_log for https://testing.whamcloud.com/test_sets/42fbb9fe-d575-11e9-9fc9-52540065bddc, we see

lfs mirror mirror: component 131075 not synced
: No space left on device (28)
lfs mirror mirror: component 131076 not synced
: No space left on device (28)
lfs mirror mirror: component 196613 not synced
: No space left on device (28)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f2' llapi_mirror_resync_many: No space left on device.
 sanity-lfsck test_36a: @@@@@@ FAIL: (8) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2 
 

It is possible that we are running out of disk space on an OST, but it seems strange that this just started earlier this month.

Logs for other failures are at
https://testing.whamcloud.com/test_sessions/279dd05c-e122-4f8f-bafe-b8299e8e0e61
https://testing.whamcloud.com/test_sessions/fe936f3a-df7d-4d23-9d28-721da7ab8f76



 Comments   
Comment by James Nunez (Inactive) [ 12/Sep/19 ]

I added 'lfs df' to a patch that hits this problem consistently, https://review.whamcloud.com/33919/, and it doesn't look like we are out of space:

== sanity-lfsck test 36a: rebuild LOV EA for mirrored file (1) ======================================= 22:37:15 (1568327835)
#####
The target MDT-object's LOV EA corrupted as to lose one of the 
mirrors information. The layout LFSCK should rebuild the LOV EA 
with the PFID EA of related OST-object(s) belong to the mirror.
#####
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.126361 s, 33.2 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0600958 s, 69.8 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0601316 s, 69.8 MB/s
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID        43584        2168       37428   6% /mnt/lustre[MDT:0]
lustre-OST0000_UUID        71100        3452       57504   6% /mnt/lustre[OST:0]
lustre-OST0001_UUID        71100        1272       62396   2% /mnt/lustre[OST:1]
lustre-OST0002_UUID        71100        1272       61948   3% /mnt/lustre[OST:2]
lustre-OST0003_UUID        71100        1880       61748   3% /mnt/lustre[OST:3]

filesystem_summary:       284400        7876      243596   4% /mnt/lustre

lfs mirror mirror: component 131075 not synced
: No space left on device (28)
lfs mirror mirror: component 131076 not synced
: No space left on device (28)
lfs mirror mirror: component 196613 not synced
: No space left on device (28)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f2' llapi_mirror_resync_many: No space left on device.
 sanity-lfsck test_36a: @@@@@@ FAIL: (8) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2 
Comment by Andreas Dilger [ 12/Sep/19 ]

It is very likely that the culprit is:

commit 0f670d1ca9dd5af697bfbf3b95a301c61a8b4447
Author:     Bobi Jam <bobijam@whamcloud.com>
AuthorDate: Wed Oct 10 14:23:55 2018 +0800

    LU-11239 lfs: fix mirror resync error handling
    
    This patch returns error for partially successful mirror resync.
    
    Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
    Change-Id: I9d6c9ef5aca1674ceb7a9cbc6b790f3f7276ff5d
    Reviewed-on: https://review.whamcloud.com/33537

though this patch is only returning the error, it isn't causing the error, AFAICS.

Comment by Gerrit Updater [ 13/Sep/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36176
Subject: LU-12757 utils: avoid newline inside error message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4d39bed4489f3b38b388ec69e449fdc65afe6f19

Comment by Andreas Dilger [ 13/Sep/19 ]

The above patch does not fix the test failure here; it is purely cosmetic, fixing the error message so it no longer has a newline in the middle:

lfs mirror mirror: component 131075 not synced
: No space left on device (28)
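With the patch applied, the errno text stays on the same line, so the message would read something like (illustrative, not taken from a log):

lfs mirror mirror: component 131075 not synced: No space left on device (28)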

I also notice that progname is "lfs mirror mirror", which is also not correct. That happens because one or more code paths append argv[0] to progname: lfs_setstripe_internal() does so internally (via the cmd[] buffer) and also prints both progname and argv[0] explicitly in error messages, lfs_mirror() appends argv[0] to progname, and llapi_error()->error_callback_default() appends liblustreapi_cmd as well. That is very confusing and should be fixed in a separate patch.

Comment by Arshad Hussain [ 06/Oct/19 ]

Detected again under https://testing.whamcloud.com/test_sets/b004f920-e795-11e9-b62b-52540065bddc

Comment by Sarah Liu [ 18/Oct/19 ]

Found a similar error on a PPC client running 2.12.3:
https://testing.whamcloud.com/test_sets/d9ac1a0c-eb0e-11e9-b62b-52540065bddc

== sanity-lfsck test 36a: rebuild LOV EA for mirrored file (1) ======================================= 22:20:32 (1570573232)
#####
The target MDT-object's LOV EA corrupted as to lose one of the 
mirrors information. The layout LFSCK should rebuild the LOV EA 
with the PFID EA of related OST-object(s) belong to the mirror.
#####
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.147001 s, 28.5 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0381554 s, 110 MB/s
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.0381345 s, 110 MB/s
lfs mirror mirror: cannot get WRITE lease, ext 1: Device or resource busy (16)
lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f0' llapi_lease_get_ext resync failed: Device or resource busy.
 sanity-lfsck test_36a: @@@@@@ FAIL: (6) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f0 
Comment by Andreas Dilger [ 28/Nov/19 ]

It looks like the small filesystems used by sanity-lfsck.sh are causing problems in this test: the client consumes all of the free space as grant, while "lfs mirror resync" uses O_DIRECT for its writes, which currently do not consume grant. The sync writes therefore need ungranted space on the OST and fail with ENOSPC even though the space is not actually in use.
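For anyone reproducing this, the grant versus free-space balance can be inspected with the standard tunables; a minimal sketch (stock osc/obdfilter parameter names, not values taken from these logs):

# On the client: grant currently held for each OST
lctl get_param osc.*.cur_grant_bytes
# On the OSS: total grant outstanding per OST; when this approaches the
# free space reported by 'lfs df', sync/O_DIRECT writes that bypass grant
# hit ENOSPC(28) even though the OST still has free blocks
lctl get_param obdfilter.*.tot_granted
lfs df /mnt/lustre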

Comment by Gerrit Updater [ 06/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36176/
Subject: LU-12757 utils: avoid newline inside error message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3f3a241498be7e043df7e416da7fc8722a559498

Comment by Peter Jones [ 06/Dec/19 ]

Landed for 2.14

Comment by James Nunez (Inactive) [ 06/Dec/19 ]

Reopening this ticket because the patch that landed changes the error message formatting but does not address the underlying problem described in this ticket.

Comment by Andreas Dilger [ 08/Jan/20 ]

To fix this, we need one of the patches from LU-4664 and/or LU-12687 to land, so that the O_DIRECT writes used by resync consume grant properly instead of requiring ungranted space on the OSTs.

Comment by Andreas Dilger [ 09/Jul/20 ]

+1 on master:
https://testing.whamcloud.com/test_sets/18633cde-dc95-4d48-add3-405591582c3f

Patch https://review.whamcloud.com/35896 "LU-12687 osc: consume grants for direct I/O" is in master-next and should resolve this issue. This ticket can be closed once that patch lands and this problem is no longer seen.

Comment by Vikentsi Lapa [ 10/Jul/20 ]

Additional details about the test; I hope they will be useful for reproducing the issue. In most cases the failed tests contain only a single MDT (the list from the link below was analyzed for sanity-lfsck.sh test_36a).

https://testing.whamcloud.com/search?horizon=2332800&test_set_script_id=4f25830c-64fe-11e2-bfb2-52540035b04c&sub_test_script_id=1bd8f58e-6f10-11e8-a55d-52540065bddc&source=sub_tests#redirect

However, tests with a single MDT sometimes passed as well.

For example, 'lfs df' output from a failed test:

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID        43584        2444       37152   7% /mnt/lustre[MDT:0]
lustre-OST0000_UUID        71100        7556       56544  12% /mnt/lustre[OST:0]
lustre-OST0001_UUID        71100        5376       58724   9% /mnt/lustre[OST:1]
lustre-OST0002_UUID        71100       10496       50508  18% /mnt/lustre[OST:2]
lustre-OST0003_UUID        71100        1280       61948   3% /mnt/lustre[OST:3]

'lfs df' output from a passed test where multiple MDTs were used:
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       283520        3968      277504   2% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       283520        3200      278272   2% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID       283520        3200      278272   2% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID       283520        3200      278272   2% /mnt/lustre[MDT:3]
lustre-OST0000_UUID       282624       16384      264192   6% /mnt/lustre[OST:0]
lustre-OST0001_UUID       282624       10240      270336   4% /mnt/lustre[OST:1]
lustre-OST0002_UUID       282624       22528      258048   9% /mnt/lustre[OST:2]
lustre-OST0003_UUID       282624        4096      276480   2% /mnt/lustre[OST:3]

When the number of OSTs is less than 3 the test is skipped, so those configurations never hit this failure.
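
The guard in question is the usual OST-count check near the top of the test; roughly (a sketch of the typical pattern, not the verbatim test code):

[ $OSTCOUNT -lt 3 ] && skip "needs >= 3 OSTs" && return 0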
Comment by James Nunez (Inactive) [ 12/Aug/20 ]

The patch that Andreas references that should fix this issue, https://review.whamcloud.com/#/c/35896/, landed to master on July 10 and we haven’t seen this issue since July 9. It looks like this issue is fixed and we can close this ticket.
