[LU-16442] obdfilter-survey test_3a: Error: 'set mdt quota type failed' Created: 04/Jan/23  Updated: 08/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Vitaliy Kuznetsov
Resolution: Unresolved Votes: 0
Labels: topfail

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a6ea0c49-fb20-45e0-97f9-7055b311ef6a

test_3a failed with the following error:

set mdt quota type failed
onyx-80vm4: losetup: /dev/mapper/mds1_flakey: failed to set up loop device: No such file or directory
CMD: onyx-80vm4 test -b /dev/mapper/mds1_flakey
pdsh@onyx-80vm1: onyx-80vm4: ssh exited with exit code 1
CMD: onyx-80vm4 e2label /dev/mapper/mds1_flakey
onyx-80vm4: e2label: No such file or directory while trying to open /dev/mapper/mds1_flakey
onyx-80vm4: Couldn't find valid filesystem superblock.
pdsh@onyx-80vm1: onyx-80vm4: ssh exited with exit code 1
Starting mds1: -o localrecov,loop  /dev/mapper/mds1_flakey /mnt/lustre-mds1
CMD: onyx-80vm4 mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov,loop  /dev/mapper/mds1_flakey /mnt/lustre-mds1
onyx-80vm4: mount: /mnt/lustre-mds1: failed to setup loop device for /dev/mapper/mds1_flakey.
pdsh@onyx-80vm1: onyx-80vm4: ssh exited with exit code 32
Start of /dev/mapper/mds1_flakey on mds1 failed 32
CMD: onyx-80vm3 mkdir -p /mnt/lustre-ost1
CMD: onyx-80vm3 dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
pdsh@onyx-80vm1: onyx-80vm3: ssh exited with exit code 1
CMD: onyx-80vm3 test -b /dev/mapper/ost1_flakey
pdsh@onyx-80vm1: onyx-80vm3: ssh exited with exit code 1
CMD: onyx-80vm3 loop_dev=\$(losetup -j /dev/mapper/ost1_flakey | cut -d : -f 1);
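
The log above suggests this sequence: "test -b" fails on /dev/mapper/mds1_flakey, the mount is attempted with "-o loop", and losetup then fails because the path does not exist at all. A minimal sketch of detecting the missing-device case before mounting follows. classify_dev is a hypothetical helper, not an existing test-framework function, and the loop fallback logic is an assumption inferred from the log.

```shell
# classify_dev PATH: print how PATH could be mounted (hypothetical helper).
classify_dev() {
    dev=$1
    if [ -b "$dev" ]; then
        echo block      # real block device: mount directly
    elif [ -f "$dev" ]; then
        echo file       # regular file: would need "-o loop"
    else
        echo missing    # nothing to mount: fail early with a clear message
    fi
}

# Example with the device name from the log above.
case $(classify_dev /dev/mapper/mds1_flakey) in
    missing) echo "mds1_flakey does not exist, not attempting mount" >&2 ;;
esac
```

Failing early here would report the missing dm-flakey device directly instead of surfacing later as "set mdt quota type failed".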

first occurred: https://testing.whamcloud.com/sub_tests/11e3ecf2-6ba6-4222-8c36-301ce21f203f
coming from https://build.whamcloud.com/job/lustre-master/4369 on 12/13/2022

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
obdfilter-survey test_3a - set mdt quota type failed



 Comments   
Comment by Minh Diep [ 04/Jan/23 ]

failing ~30% of the time

Comment by Andreas Dilger [ 04/Jan/23 ]

It looks like the "set mdt quota type failed" error started on 2022-12-13.

Patches landed around that date:

$ git log --oneline --after 2022-12-10 --before 2022-12-14 --color=never
dce487f53a6f LU-16366 build: Add LCME_FL_PARITY to wirecheck
dbedb9a5f0bd LU-16364 llite: Move d_u.d_alias compat define
fedf1e8bd70c LU-16363 build: fiemap flexible array
bb951b90268b LU-16359 build: RHEL use Module.symvers during find-provides
b455b42fa875 LU-13705 utils: fix llstat -n option
c8a33e5322b0 LU-16353 config: enable_foo variables mustn't contains space
30c5421ad567 LU-16346 utils: fix lctl stack smashing
d56ea0c80a95 LU-14992 tests: add more mkdir_on_mdt0 calls
6e66cbdb5c8c LU-15816 tests: use correct ost host to manage failure
51851705e936 LU-16334 llite: update statx size/ctime for fallocate
624e78ae80cd LU-930 docs: add lfs-rm_entry.8 man page
d1dbf26afd66 LU-16291 build: make kobj_type constant
6f74bb60ff6c LU-16205 sec: reserve flag for fid2path for encrypted files
b054fcd7852f LU-16159 lod: cancel update llogs upon recovery abort
1819f6006ff5 LU-15801 ldiskfs: Server support for RHEL9
88bccc4fa4dd LU-16114 build: Update security_dentry_init_security args
c13eccf71dde LU-16112 build: ki_complete removed unused argument
99d1f12c7c5e LU-15581 utils: add check_iam util
c95973fea184 LU-6142 lustre: fix minor typos in comments
6b69d22e4cb7 LU-15707 lod: force creation of a component without a pool
e42efe35eec7 LU-16231 misc: fix stats snapshot_time to use wallclock
e96cb6ff1fea LU-16110 lprocfs: make job_stats and rename_stats valid YAML
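
One way to narrow a list like this is to filter by subject component, for example to look at test-framework changes first. A minimal sketch, assuming the "LU-NNNN component: summary" subject format seen above. filter_component is a hypothetical helper, not part of Lustre.

```shell
# filter_component NAME: read "git log --oneline" output on stdin and print
# only commits whose subject component (third field, e.g. "tests:") is NAME.
filter_component() {
    awk -v want="$1" '{
        comp = $3
        sub(/:$/, "", comp)     # strip the trailing colon
        if (comp == want) print
    }'
}

# Example using three of the commits listed above.
printf '%s\n' \
    'd56ea0c80a95 LU-14992 tests: add more mkdir_on_mdt0 calls' \
    '6e66cbdb5c8c LU-15816 tests: use correct ost host to manage failure' \
    '1819f6006ff5 LU-15801 ldiskfs: Server support for RHEL9' |
    filter_component tests
```

In a checkout, the same filter could be fed directly from the git log command shown above.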

Comment by Colin Faber [ 07/Mar/23 ]

Hi mdiep, is this still failing regularly?

Comment by Andreas Dilger [ 08/Mar/23 ]

Colin, you can check this easily in Maloo by doing a subtest search for obdfilter-survey test_3a (this is automatically generated by clicking on the subtest number in any Maloo failure report, and then expanding the "Within" date range as needed):
https://testing.whamcloud.com/search?horizon=7776000&status%5B%5D=FAIL&test_set_script_id=11a69f28-4a54-11e0-a7f6-52540025f9af&sub_test_script_id=12143074-4a54-11e0-a7f6-52540025f9af&source=sub_tests#redirect

It looks like the last reported failure of this kind was on 2023-01-05. Patches that landed on 2023-01-06 are:

$ git log --oneline --after 2023-01-05 --before 2023-01-07 master
5b06ba9d46 LU-16439 socklnd: clarify error message on timeout
557bb0004d LU-16438 llite: remove false outdated comment
4f0273b3bc LU-16413 osd-ldiskfs: fix T10PI for CentOS 8.x
374f12ba11 LU-14409 ldiskfs: remove stray tracing code
41bed753b3 LU-16387 lustre: switch OBD_ALLOC_LARGE to vmalloc faster
25d6e3ca63 LU-15626 tests: Fix shellcheck warning for acceptance-small
d5fe41a02a LU-16335 test: add fail_abort_cleanup()
d622b26d8d LU-16322: build: Add client build support for openEuler
44e2f44f29 LU-16279 lnet: improve error reporting in LUTF
34556ca18a LU-16268 mdd: set effective changelog mask correctly
445f85de2b LU-16117 build: Avoid excessive modpost warnings
61e83a6f13 LU-16113 build: Fix configure tests for lock_page_memcg
009faf132d LU-16116 build: Configure tests for rhltable, bitmap_alloc...
d54e8e95de LU-16118 build: Use pde_data() when available
14cdcd6198 LU-13642 lnet: Allow IP specification
18b4e28f18 LU-15288 lnet: increase transaction timeout
5cd5a49c72 LU-16321 osd: Allow fiemap on kernel buffers
4b9a39d3ed LU-14645 tests: test lfs setdirstripe with '/$'

That said, I was about to say this could be closed with "Cannot Reproduce", but from the test output it isn't clear whether this test is working correctly, even for runs that report PASS, because it prints a large number of errors:

+ NETTYPE=tcp thrlo=2 nobjhi=1 thrhi=4 size=1024 case=network rslt_loc=/tmp targets="10.240.43.246" /usr/bin/obdfilter-survey
Tue Mar  7 00:40:54 UTC 2023 Obdfilter-survey for case=network from trevis-97vm1.trevis.whamcloud.com
ost  1 sz  1048576K rsz 1024K obj    1 thr    2 write 115686.24             ERROR rewrite 114559.96             ERROR read 113192.26             ERROR 
ost  1 sz  1048576K rsz 1024K obj    1 thr    4 write 108180.12             ERROR rewrite 106411.16             ERROR read 103443.34             ERROR 
done!
=======================> ost  1 sz  1048576K rsz 1024K obj    1 thr    2 
=============> Create 1 on localhost:echotmp_ecc
create: 1 objects
create: #1 is object id 0x10000001
=============> write localhost:echotmp_ecc
Print status every 1 seconds
--threads: starting 2 threads on device 1 running test_brw 512 wx q 256 2t268435457 g256
error: test_brw-2: #1 - Invalid argument on write
error: test_brw-1: #1 - Invalid argument on write
--threads: PID 32664 had rc=22
--threads: PID 32663 had rc=22
Total: total 2 threads 2 sec 0.000149 13422.818792/second
:
:
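
The rc=22 above is EINVAL ("Invalid argument"), matching the test_brw write errors. A test wrapper could treat any ERROR field in the summary lines as a failure instead of reporting PASS. A minimal sketch, assuming the summary line format shown above. survey_has_errors is a hypothetical helper, not an existing test-framework function.

```shell
# survey_has_errors: read obdfilter-survey output on stdin; return 0 (true)
# if any per-OST summary line (first field "ost") contains an ERROR field.
survey_has_errors() {
    awk '$1 == "ost" { for (i = 2; i <= NF; i++) if ($i == "ERROR") found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Example using one summary line from the output above.
survey_has_errors <<'EOF' && echo "survey reported ERROR results"
ost  1 sz  1048576K rsz 1024K obj    1 thr    2 write 115686.24             ERROR rewrite 114559.96             ERROR read 113192.26             ERROR
EOF
```

Only lines whose first field is "ost" are checked, so the "=====>" detail headers and the per-thread "error:" lines are ignored by this sketch.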