[LU-14658] conf-sanity test_122b: failed to write file Created: 30/Apr/21  Updated: 02/Jun/22  Resolved: 06/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.9, Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14598 Too many FIDs to precreate OST replac... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8d081b66-0a6e-453a-b3b7-a74c663621c6

test_122b failed with the following error:

failed to write file
CMD: trevis-205vm14.trevis.whamcloud.com,trevis-205vm15 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: trevis-205vm14.trevis.whamcloud.com,trevis-205vm15 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: trevis-205vm16 /usr/sbin/lctl --device lustre-OST0000 notransno
CMD: trevis-205vm16 /usr/sbin/lctl --device lustre-OST0000 readonly
CMD: trevis-205vm16 /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
dd: failed to open '/mnt/lustre/d122b.conf-sanity/f122b.conf-sanity': Numerical result out of range
 conf-sanity test_122b: @@@@@@ FAIL: failed to write file 

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_122b - failed to write file



 Comments   
Comment by Andreas Dilger [ 02/Jun/21 ]

Hi Alexander,
it looks like the new conf-sanity.sh test_122b that was added with patch https://review.whamcloud.com/43248 "LU-14598 ofd: fix for IDIF sequence at ofd_preprw_write" has been failing intermittently since it was landed on 2021-04-28 (37 times in the past 35 days):
https://testing.whamcloud.com/search?server_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&status%5B%5D=FAIL&test_set_script_id=7f66aa20-3db2-11e0-80c0-52540025f9af&sub_test_script_id=59bb51ec-a2ae-11e6-8b77-5254006e85c2&start_date=2021-03-01&end_date=2021-06-02&source=sub_tests#redirect

for example, the most recent two failures are at:
https://testing.whamcloud.com/sub_tests/00c3de0d-a963-4913-b117-9b69ef27b604
https://testing.whamcloud.com/sub_tests/a7dc2041-7315-416b-8472-fbe86791bc9f

Could you please take a look and see how the test/code can be changed so that the test is more robust. Thanks.

Comment by Andreas Dilger [ 02/Jun/21 ]

Note that all of the actual failures of this subtest have been with ZFS, so it may be something specific with osd-zfs, or just a timing issue because it is slower?

There are two failures for the same subtest with ldiskfs, but they reported a different error message and were both caused by the (same) patch that was being tested.

Comment by Andreas Dilger [ 27/Oct/21 ]

About 12x failures of conf-sanity test_122b in the past week, of 129 test runs:
https://testing.whamcloud.com/sub_tests/06b6cedf-12ce-4873-8c73-2b1cdf3e6e5d
https://testing.whamcloud.com/sub_tests/dcec0d91-1cbe-40ad-a142-1043b7bb9f41

Looking at the debug logs, it appears that the OSTs are being marked with state=2=OS_STATFS_READONLY:

1635276069.251190:0:2262865:0:(osp_precreate.c:1050:osp_pre_update_msfs()) lustre-OST0000-osc-MDT0002: blocks=95744 free=94720 avail=89344 avail_mb=349 hwm_mb=3 files=16043 ffree=11840 state=2: rc = 0
1635276069.759773:0:2262864:0:(osp_precreate.c:1050:osp_pre_update_msfs()) lustre-OST0000-osc-MDT0000: blocks=95744 free=94720 avail=89344 avail_mb=349 hwm_mb=3 files=16043 ffree=11840 state=2: rc = 0

and then the object allocation fails because there are no OSTs available:

1635276069.784058:0:2432715:0:(lod_qos.c:142:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
1635276069.784058:0:2432715:0:(lod_qos.c:170:lod_statfs_and_check()) Process leaving (rc=18446744073709551586 : -30 : ffffffffffffffe2)
1635276069.784059:0:2432715:0:(lod_qos.c:231:lod_qos_statfs_update()) Process leaving
1635276069.784066:0:2432715:0:(lod_qos.c:2541:lod_qos_prep_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784067:0:2432715:0:(lod_qos.c:2639:lod_qos_prep_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784067:0:2432715:0:(lod_qos.c:2695:lod_prepare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784071:0:2432715:0:(lod_object.c:5804:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784072:0:2432715:0:(lod_lov.c:492:lod_free_comp_buffer()) kfreed 'entries': 128 at 00000000b6ba0656.
1635276069.784073:0:2432715:0:(lod_object.c:5839:lod_declare_striped_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784074:0:2432715:0:(lod_object.c:5985:lod_declare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784075:0:2432715:0:(mdd_object.c:576:mdd_declare_create_object_internal()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784076:0:2432715:0:(mdd_dir.c:2167:mdd_declare_create_object()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784077:0:2432715:0:(mdd_dir.c:2267:mdd_declare_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784078:0:2432715:0:(mdd_dir.c:2684:mdd_create()) Process leaving via out_stop (rc=18446744073709551582 : -34 : 0xffffffffffffffde)

The -34 = -ERANGE = Numerical result out of range is because:

__u16 lod_get_stripe_count(struct lod_device *lod, struct lod_object *lo,
                           int comp_idx, __u16 stripe_count, bool overstriping)
{
        :
        if (!stripe_count)
                stripe_count =
                        lod->lod_ost_descs.ltd_lov_desc.ld_default_stripe_count;
        if (!stripe_count)
                stripe_count = 1;
        if (stripe_count >
                lod->lod_ost_descs.ltd_lov_desc.ld_active_tgt_count &&
            !overstriping)
                stripe_count =
                        lod->lod_ost_descs.ltd_lov_desc.ld_active_tgt_count;

so it appears that ld_active_tgt_count = 0, and this is converted to -ERANGE in the caller:

int lod_qos_prep_create(const struct lu_env *env, struct lod_object *lo,
                        struct lu_attr *attr, struct thandle *th,
                        int comp_idx, __u64 reserve)
{
                :
                stripe_len = lod_get_stripe_count(d, lo, comp_idx,
                                                  lod_comp->llc_stripe_count,
                                                  lod_comp->llc_pattern &
                                                  LOV_PATTERN_OVERSTRIPING);
                if (stripe_len == 0)
                        GOTO(out, rc = -ERANGE);
Comment by Alexander Boyko [ 08/Dec/21 ]

With a fleeting glance the problem starts at readonly simulation. For ldiskfs it is a drop write IO, and for zfs lctl --device readonly. I have no idea how others tests with replay_barrier() works, the logic is too diferrent between zfs/ldiskfs. For ldiskfs environment OS_STATFS_READONLY flag is not set after replay_barrier(). The quick fix is skipping test for zfs env. I'm not familiar with zfs, does it support readonly mode similar to flakey?

#
# Set the device of a given facet as "read-only".
#
set_dev_readonly() {
        local facet=$1
        local svc=${facet}_svc

        if [[ $(facet_fstype $facet) = zfs ]] ||
           ! dm_flakey_supported $facet; then
                do_facet $facet $LCTL --device ${!svc} readonly
        else
                dm_set_dev_readonly $facet
        fi
}
Comment by Andreas Dilger [ 09/Dec/21 ]

I think ZFS supports read-only, otherwise that would cause a lot more tests to fail on ZFS?

Comment by Gerrit Updater [ 13/Dec/21 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45841
Subject: LU-14658 tests: fix conf-sanity 122b test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b447c4fea70d799bf13ac1e5236ca2d83b148436

Comment by Gerrit Updater [ 06/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45841/
Subject: LU-14658 tests: fix conf-sanity 122b test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 276c22ae3869c1f775cded248d53b9c2a829c801

Comment by Peter Jones [ 06/Jan/22 ]

Landed for 2.15

Comment by Etienne Aujames [ 15/Mar/22 ]

+1 on b2_12: https://testing.whamcloud.com/test_sets/865cf735-e67b-4701-abc3-91d0c97013eb
https://testing.whamcloud.com/sub_tests/f1a6c894-d8ef-49ec-89bd-c33d57fdfafa

The message is different:

dd: failed to open '/mnt/lustre/d122b.conf-sanity/f122b.conf-sanity': No space left on device
00000002:00100000:1.0:1647298582.780569:0:3559:0:(mdc_locks.c:709:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -28  req@ffff9132ca930d80 x1727317208645248/t0(0) o101->lustre-MDT0000-mdc-ffff913275c8c000@10.240.44.32@tcp:12/10 lens 664/568 e 0 to 0 dl 1647298589 ref 1 fl Complete:R/0/0 rc 301/301
Comment by Gerrit Updater [ 21/Mar/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46873
Subject: LU-14658 tests: fix conf-sanity 122b test
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6af0c0f659a1d91785aa53be7c3ff056b44885c4

Comment by Gerrit Updater [ 02/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46873/
Subject: LU-14658 tests: fix conf-sanity 122b test
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3f18d493c0aca46982596fefcc798811680d2bd1

Generated at Sat Feb 10 03:11:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.