[LU-14658] conf-sanity test_122b: failed to write file
| Created: | 30/Apr/21 | Updated: | 02/Jun/22 | Resolved: | 06/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.9, Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/8d081b66-0a6e-453a-b3b7-a74c663621c6

test_122b failed with the following error:

failed to write file

CMD: trevis-205vm14.trevis.whamcloud.com,trevis-205vm15 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: trevis-205vm14.trevis.whamcloud.com,trevis-205vm15 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: trevis-205vm16 /usr/sbin/lctl --device lustre-OST0000 notransno
CMD: trevis-205vm16 /usr/sbin/lctl --device lustre-OST0000 readonly
CMD: trevis-205vm16 /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
dd: failed to open '/mnt/lustre/d122b.conf-sanity/f122b.conf-sanity': Numerical result out of range
conf-sanity test_122b: @@@@@@ FAIL: failed to write file
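
The CMD lines above are the test framework applying a replay barrier to ost1 just before the failing write. A minimal sketch of that sequence, assuming the usual test-framework.sh helpers (do_facet, $LCTL, error) and not quoting the real replay_barrier() code verbatim:

# approximation of the replay-barrier sequence shown in the CMD lines above
do_facet ost1 "sync; sync; sync"                       # flush outstanding writes (assumed step)
do_facet ost1 $LCTL --device lustre-OST0000 notransno  # stop assigning new transaction numbers
do_facet ost1 $LCTL --device lustre-OST0000 readonly   # simulate a read-only OST device
do_facet ost1 $LCTL mark "ost1 REPLAY BARRIER on lustre-OST0000"

# the client then creates and writes a new file; with the OST read-only the
# MDT cannot allocate objects for the layout and dd fails:
dd if=/dev/zero of=/mnt/lustre/d122b.conf-sanity/f122b.conf-sanity bs=1M count=1 ||
	error "failed to write file"
|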
| Comments |
| Comment by Andreas Dilger [ 02/Jun/21 ] |
|
Hi Alexander, for example, the most recent two failures are at:

Could you please take a look and see how the test/code can be changed so that the test is more robust? Thanks. |
| Comment by Andreas Dilger [ 02/Jun/21 ] |
|
Note that all of the actual failures of this subtest have been with ZFS, so it may be something specific to osd-zfs, or just a timing issue because it is slower? There are two failures for the same subtest with ldiskfs, but they reported a different error message, and both were caused by the (same) patch that was being tested. |
| Comment by Andreas Dilger [ 27/Oct/21 ] |
|
About 12x failures of conf-sanity test_122b in the past week, of 129 test runs:

Looking at the debug logs, it appears that the OSTs are being marked with state=2=OS_STATFS_READONLY:

1635276069.251190:0:2262865:0:(osp_precreate.c:1050:osp_pre_update_msfs()) lustre-OST0000-osc-MDT0002: blocks=95744 free=94720 avail=89344 avail_mb=349 hwm_mb=3 files=16043 ffree=11840 state=2: rc = 0
1635276069.759773:0:2262864:0:(osp_precreate.c:1050:osp_pre_update_msfs()) lustre-OST0000-osc-MDT0000: blocks=95744 free=94720 avail=89344 avail_mb=349 hwm_mb=3 files=16043 ffree=11840 state=2: rc = 0

and then the object allocation fails because there are no OSTs available:

1635276069.784058:0:2432715:0:(lod_qos.c:142:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
1635276069.784058:0:2432715:0:(lod_qos.c:170:lod_statfs_and_check()) Process leaving (rc=18446744073709551586 : -30 : ffffffffffffffe2)
1635276069.784059:0:2432715:0:(lod_qos.c:231:lod_qos_statfs_update()) Process leaving
1635276069.784066:0:2432715:0:(lod_qos.c:2541:lod_qos_prep_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784067:0:2432715:0:(lod_qos.c:2639:lod_qos_prep_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784067:0:2432715:0:(lod_qos.c:2695:lod_prepare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784071:0:2432715:0:(lod_object.c:5804:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784072:0:2432715:0:(lod_lov.c:492:lod_free_comp_buffer()) kfreed 'entries': 128 at 00000000b6ba0656.
1635276069.784073:0:2432715:0:(lod_object.c:5839:lod_declare_striped_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784074:0:2432715:0:(lod_object.c:5985:lod_declare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784075:0:2432715:0:(mdd_object.c:576:mdd_declare_create_object_internal()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
1635276069.784076:0:2432715:0:(mdd_dir.c:2167:mdd_declare_create_object()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784077:0:2432715:0:(mdd_dir.c:2267:mdd_declare_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1635276069.784078:0:2432715:0:(mdd_dir.c:2684:mdd_create()) Process leaving via out_stop (rc=18446744073709551582 : -34 : 0xffffffffffffffde)

The -34 = -ERANGE = Numerical result out of range is because:
__u16 lod_get_stripe_count(struct lod_device *lod, struct lod_object *lo,
			   int comp_idx, __u16 stripe_count, bool overstriping)
{
	:
	if (!stripe_count)
		stripe_count =
			lod->lod_ost_descs.ltd_lov_desc.ld_default_stripe_count;
	if (!stripe_count)
		stripe_count = 1;
	if (stripe_count >
	    lod->lod_ost_descs.ltd_lov_desc.ld_active_tgt_count &&
	    !overstriping)
		stripe_count =
			lod->lod_ost_descs.ltd_lov_desc.ld_active_tgt_count;
so it appears that ld_active_tgt_count = 0, and this is converted to -ERANGE in the caller:

int lod_qos_prep_create(const struct lu_env *env, struct lod_object *lo,
			struct lu_attr *attr, struct thandle *th,
			int comp_idx, __u64 reserve)
{
	:
	stripe_len = lod_get_stripe_count(d, lo, comp_idx,
					  lod_comp->llc_stripe_count,
					  lod_comp->llc_pattern &
					  LOV_PATTERN_OVERSTRIPING);
	if (stripe_len == 0)
		GOTO(out, rc = -ERANGE);
|
| Comment by Alexander Boyko [ 08/Dec/21 ] |
|
At a fleeting glance, the problem starts with the read-only simulation: for ldiskfs it drops write IO, while for zfs it is lctl --device readonly. I have no idea how other tests that use replay_barrier() work; the logic is too different between zfs and ldiskfs. In the ldiskfs environment the OS_STATFS_READONLY flag is not set after replay_barrier(). The quick fix is skipping the test in the zfs environment (a sketch follows the quoted helper below). I'm not familiar with zfs; does it support a readonly mode similar to flakey?

#
# Set the device of a given facet as "read-only".
#
set_dev_readonly() {
	local facet=$1
	local svc=${facet}_svc

	if [[ $(facet_fstype $facet) = zfs ]] ||
	   ! dm_flakey_supported $facet; then
		do_facet $facet $LCTL --device ${!svc} readonly
	else
		dm_set_dev_readonly $facet
	fi
}
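
A minimal sketch of the "quick fix" mentioned above, assuming the existing test-framework.sh helpers facet_fstype() and skip_env(); the placement and message are illustrative only:

# hypothetical guard at the top of conf-sanity test_122b (sketch only)
if [[ $(facet_fstype ost1) == zfs ]]; then
	skip_env "readonly simulation leaves OSTs in OS_STATFS_READONLY on zfs"
fi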
|
| Comment by Andreas Dilger [ 09/Dec/21 ] |
|
I think ZFS supports read-only; otherwise that would cause a lot more tests to fail on ZFS? |
| Comment by Gerrit Updater [ 13/Dec/21 ] |
|
"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45841 |
| Comment by Gerrit Updater [ 06/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45841/ |
| Comment by Peter Jones [ 06/Jan/22 ] |
|
Landed for 2.15 |
| Comment by Etienne Aujames [ 15/Mar/22 ] |
|
+1 on b2_12: https://testing.whamcloud.com/test_sets/865cf735-e67b-4701-abc3-91d0c97013eb

The message is different:

dd: failed to open '/mnt/lustre/d122b.conf-sanity/f122b.conf-sanity': No space left on device

00000002:00100000:1.0:1647298582.780569:0:3559:0:(mdc_locks.c:709:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -28 req@ffff9132ca930d80 x1727317208645248/t0(0) o101->lustre-MDT0000-mdc-ffff913275c8c000@10.240.44.32@tcp:12/10 lens 664/568 e 0 to 0 dl 1647298589 ref 1 fl Complete:R/0/0 rc 301/301
|
| Comment by Gerrit Updater [ 21/Mar/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46873 |
| Comment by Gerrit Updater [ 02/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46873/ |