[LU-14406] replay-dual test 22d fails with “Remote creation failed 1” Created: 09/Feb/21  Updated: 08/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment: DNE/ZFS

Issue Links:
Related
is related to LU-7372 replay-dual test_26: test failed to r... Resolved
is related to LU-6006 replay-dual test_22a: Remote creation... Resolved
is related to LU-14749 runtests test 1 hangs on MDS unmount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-dual test_22d started failing on 30 SEPT 2020 while testing the patch for LU-13417 ‘mdd: default DNE MDT balance on new filesystems’ https://review.whamcloud.com/38553; logs are at https://testing.whamcloud.com/test_sets/b4dc132e-a4d6-4abb-a81c-753d8f23a18e. That patch has not itself landed, so it cannot be the source of the test failures. Since that time, this test has failed 10 times during review/patch testing. On 03 FEB 2021, the test failed with the same error message in branch/full testing, for DNE with ZFS; logs are at https://testing.whamcloud.com/test_sets/17948bab-e647-4f32-874a-0fe07a464353.

Looking at this DNE/ZFS failure, we see the following in the suite_log:

CMD: trevis-66vm8 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: trevis-66vm8 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: trevis-66vm8 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
CMD: trevis-66vm6 mkdir /mnt/lustre2/d22d.replay-dual/remote_dir/dir
trevis-66vm6: mkdir: cannot create directory '/mnt/lustre2/d22d.replay-dual/remote_dir/dir': No such file or directory
pdsh@trevis-66vm5: trevis-66vm6: ssh exited with exit code 1
 replay-dual test_22d: @@@@@@ FAIL: Remote creation failed 1 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/replay-dual.sh:725:test_22d()

Looking at replay-dual test 22d, we see that the error comes from the mkdir call in create_remote_dir_files_22(), which is the step that returns 1:

create_remote_dir_files_22() {
        do_node $CLIENT2 mkdir ${MOUNT2}/$remote_dir/dir || return 1
        do_node $CLIENT1 createmany -o $MOUNT1/$remote_dir/dir/$tfile- 2 ||
                                                            return 2
        do_node $CLIENT2 createmany -o $MOUNT2/$remote_dir/$tfile- 2 ||
                                                            return 3
        return 0
}
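
To make the meaning of the trailing "1" in the failure message concrete, here is a minimal local sketch of the same step/return-code layout. It does not touch Lustre: do_node below is a hypothetical stand-in that just runs the command locally, createmany is replaced by touch, and the directory names are illustrative. It only shows that a missing parent directory makes the first step fail with ENOENT and return 1, which matches the mkdir error in the log.

```shell
#!/bin/bash
# Hypothetical stand-in for the test framework's do_node: drop the node
# name and run the command locally.
do_node() { shift; "$@"; }

# Same three steps and return codes as create_remote_dir_files_22(),
# with createmany replaced by touch for illustration only.
create_files_sketch() {
    local base=$1
    do_node client2 mkdir "$base/remote_dir/dir" || return 1
    do_node client1 touch "$base/remote_dir/dir/f-0" || return 2
    do_node client2 touch "$base/remote_dir/f-0" || return 3
    return 0
}

tmp=$(mktemp -d)

# Case 1: remote_dir does not exist, so mkdir fails with
# "No such file or directory" and the function returns 1.
rc_missing=0
create_files_sketch "$tmp" 2>/dev/null || rc_missing=$?

# Case 2: the parent exists, so all three steps succeed.
mkdir "$tmp/remote_dir"
rc_present=0
create_files_sketch "$tmp" || rc_present=$?

echo "parent missing: rc=$rc_missing, parent present: rc=$rc_present"
rm -rf "$tmp"
```

In other words, a return code of 1 points at the parent remote_dir not being visible from the second client's mount at the time of the mkdir, rather than at either createmany step.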

Logs for more failures are at
https://testing.whamcloud.com/test_sets/260d3237-ec78-46c1-88a4-f5455a9265ce
https://testing.whamcloud.com/test_sets/4b7ef1a1-cb32-42b5-a2c2-9a2c7604900b
https://testing.whamcloud.com/test_sets/3e23af01-fb55-4bcf-8667-ea706fa084b3



 Comments   
Comment by Andreas Dilger [ 11/Jun/21 ]

The very first failure on patch https://review.whamcloud.com/38553 was part of a series of 3 patches: the first was commit 96bf2abf4a, its parent was patch https://review.whamcloud.com/39568 "LU-13852 pcc: don't alloc FID in LLITE for pcc open" commit 952a0754d9, and its grandparent was patch https://review.whamcloud.com/40103 "LU-14004 llite: default lsm update may memory leak" commit cc4825176. Both ancestors only landed to master on 2021-05-27, so they could not have been the source of the test failures on master.

It may be noteworthy that there is a 5-month gap between the first failure on 2020-09-30 (or 2020-10-01, depending on timezone) and the second failure on 2021-03-01. After 03-01, there were 55 TIMEOUT failures over a 14-week period, averaging about one failure every 2 days, but often with gaps of up to 4 days between failures. Within the most recent 4 weeks, there were 13 TIMEOUT failures, still averaging about one failure every 2 days (out of 591 total runs, or a 2.2% overall TIMEOUT rate). There were a total of 540 replay-dual runs in the 4-week period from 2020-09-30 to 2020-10-28, so it is not the case that this test was being skipped or only run in full sessions at the time.
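
As a cross-check of the figures quoted here, the following small sketch recomputes the per-run failure rate and the average spacing between failures from the counts given in this comment (13 failures in 591 runs over 4 weeks, and 55 failures over 14 weeks). The helper names pct and days_between are made up for this illustration.

```shell
#!/bin/bash
# pct FAILURES RUNS -> failure rate as a percentage, one decimal place
pct() {
    LC_ALL=C awk -v f="$1" -v r="$2" 'BEGIN { printf "%.1f\n", 100 * f / r }'
}

# days_between FAILURES WEEKS -> average number of days between failures
days_between() {
    LC_ALL=C awk -v f="$1" -v w="$2" 'BEGIN { printf "%.1f\n", 7 * w / f }'
}

echo "recent 4 weeks: $(pct 13 591)% of runs failed"
echo "recent 4 weeks: one failure every $(days_between 13 4) days"
echo "14 weeks after 2021-03-01: one failure every $(days_between 55 14) days"
```

Both periods come out close to one failure every 2 days, which supports treating them as the same ongoing failure.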

This indicates that while 2020-10-01 was the first such failure, it might have been an unrelated/coincidental/patch-induced failure, and the real failures started on 2021-03-01 on patch https://review.whamcloud.com/39302 commit e982ade5d, which didn't land to master until 03-10, by which time there were 7 other failures on unrelated patches. The parent of this patch was commit v2_14_50-145-g99d9638d6c, which was one of 90 patches landed on 2021-02-26:

f55fdfff5d LU-11085 nodemap: switch interval tree to in-kernel impl.
03b7befcc0 LU-13485 libcfs: FIELD_SIZEOF macro removed
c66668387a LU-12678 o2iblnd: convert peers hash table to hashtable.h
aa57e82986 LU-12678 lnet: discard LNET_MD_PHYS
dd0e7523e1 LU-12678 lnet: use init_wait() rather than init_waitqueue_entry()
0269ac4a00 LU-9859 libcfs: use wait_event_timeout() in tracefiled().
6ae187404a LU-12678 lnet: discard WIRE_ATTR
5bb641fa61 LU-13239 ldiskfs: pass inode timestamps at initial creation
cb3c65d4a1 LU-12780 ofd: don't use ptlrpc_thread for consistency verification
ec138c5c58 LU-11085 ldlm: change lock_matches() to return bool.
3db4d9a69e LU-6142 libcfs: discard cfs_strrstr()
ee5eb07d2f LU-6142 libcfs: discard cfs_firststr
6c3f0cfb4a LU-6142 libcfs: discard PO2_ROUNDUP_TYPED, LOWEST_BIT_SET
7d68bfb991 LU-6142 llite: ll_lookup_finish_locks clean up
8034d85f2b LU-6142 llite: don't cast arg to d_lustre_invalid()
fca56be02b LU-6142 lustre: use is_root_inode()
aaf0eb8696 LU-6142 llite: remove ll_dir_chain
f0736a6a52 LU-6142 lustre: remove non-static 'inline' markings.
a03765b2da LU-6142 lustre: convert snprintf to scnprintf as appropriate
734d6eb11b LU-6142 lustre: mark strings in char arrays as const
3e76334402 LU-6142 mdc: minor function cleanups.
9c4fbd1766 LU-6142 osc: minor function cleanups.
c20b866ba3 LU-6142 lustre: change various operations structs to const
68cd9825d5 LU-6142 lfsck: make all 'struct lfsck_operations' to const
c5b9054073 LU-6142 lustre: change all 'struct seq_operations' to const
140b9e6d73 LU-6142 lustre: change super/file/inode operations to const
3ae81448da LU-6142 obdclass: use cl_object_for_each more broadly.
8d8e87a5ac LU-6142 lustre: remove module_vars arg to class_register_type()
cab152600e LU-6142 lov: style cleanups in lov_set_osc_active()
33265fe88b LU-6142 lustre: change obd_ioctl_getdata() args
7b237bd306 LU-6142 lov: chnage lsm_op_find() to a non-inline function.
950200a21f LU-6142 lustre: make various 'struct file_operations' static
bf7f08479f LU-9859 libcfs: discard TCD_MAX_TYPES
fb40f0b62d LU-10391 lnet: allow lnet_connect() to use IPv6 addresses.
e4fa181abf LU-10391 lnet: allow creation of IPv6 socket.
dcc8b9c00d LU-9679 ptlrpc: list_for_each improvements.
977217520e LU-14275 tests: add ior_CLEANUP
79642e0896 LU-14388 utils: always enable ldiskfs project quota
8fc5fc5889 LU-14353 obd: move debug.c to obdecho
c2fd5297b4 LU-14305 ldiskfs: add parameters for mb_c123_threshold
f7f0b104bc LU-14289 ptlrpc: move heap.c from libcfs to ptlrpc
7a8fafe2a1 LU-14291 lustre: only include nrs headers when needed
daa388b539 LU-14291 ptlrpc: support nrs_delay for client-only builds
c90e3d8d3f LU-14285 utils: Add error message when osd_init fails
647c96562b LU-12477 llite: remove unused ll_teardown_mmaps()
c5bec6a88a LU-930 doc: fix format man page sections for lctl
1129afd348 LU-14272 tests: different mpirun options for different users
32e96e1a48 LU-14271 tests: add new node crash method
ce0b7ed044 LU-14270 tests: delay node's power up
0354fa9896 LU-14262 utils: lfs to set component flags by pool name
de60e7767c LU-14195 lustre: remove 'fs' from 'struct lvfs_run_ctxt'
d0337cab8e LU-14195 osd: don't use set_fs() for ->fiemap() calls.
9b9e19ca50 LU-14195 build: Adjust Makefile for Linux build changes.
e9c3b89bda LU-14178 ldlm: return error from ldlm_namespace_new()
9d2776f02b LU-14073 ofd: remove use of smp_read_barrier_depends()
a7f48e6c15 LU-14047 lustre: change EWOULDBLOCK to EAGAIN
5309e10858 LU-13783 libcfs: switch from ->mmap_sem to mmap_lock()
e520b6a7fa LU-9325 osd-ldisk: replace simple_strto* with kstr* functions
dd15646cc5 LU-9859 lod: use linux kernel bitmap API
a076975f9f LU-9859 libcfs: replace all CFS_CAP_* macros with CAP_*
2070e9bcc0 LU-13100 lov: grant deadlock if same OSC in two components
3bae39f0a5 LU-7853 lod: fixes bitfield in lod qos code
83e38bba62 LU-14180 utils: verify setstripe comp_end is valid
8910291fc5 LU-14207 mgs: delete "add failnid" sections on replace_nids
82c6e42d61 LU-13974 llog: check stale osp object
c5165557f5 LU-12961 mdd: avoid double call to mdd_changelog_fini()
6873482608 LU-14439 build: require a newer version of e2fsprogs
e3f17defc1 LU-13609 mgs: fix config_log buffer handling
c35c1babc7 LU-10391 socklnd: use sockaddr instead of __u32 addresses.
e5a8f3fc12 LU-13929 lnet: modify assertion in lnet_post_send_locked
437e6bea0c LU-14362 tests: sanity-flr to prepare stuff before checks
2eaa49ef0f LU-14423 osd: recognize holes in osd_is_mapped()
c45558bf56 LU-14398 llapi: add llapi_fid2path_at()
4cfe77df6f LU-14398 llapi: simplify llapi_fid2path()
3117913e21 LU-14390 gnilnd: Use DIV_ROUND_UP to calculate niov
e00733f0f8 LU-14301 lustre: add ENOTSUPP to spelling.txt
1a2b381616 LU-12766 test: convert time to seconds properly
d498d1b9cc LU-13903 build: make lustre-devel buildable for Linux client
7af92d8843 LU-14313 utils: mount error when no server support
ffa858b165 LU-14268 lod: fix layout generation inc for mirror split
910eb97c1b LU-14098 obdclass: try to skip corrupted llog records
fc8f138169 LU-9820 osd-ldiskfs: OI scrub speed limit fix
58ac9d3f18 LU-14099 build: Fix for unconfigured arch_stackwalk
6df76d3357 LU-14044 llog: check fid after convert
262b6f9c60 LU-13620 tests: pool_add_targets() fix
7ea369783f LU-13584 tests: gather_logs() fix
4bba67075a LU-13513 osp: make neterr not fatal for precreate_reserve
e45e8a92a2 LU-13453 osd-ldiskfs: do not leak inode if OI insertion fails
124b31f13e Merge "LU-9121 lnet: User Defined Selection Policy (UDSP)"
15d44e787e LU-12682 llite: fake symlink type of foreign file/dir
dfe87b089b LU-14444 gss: handle empty reqmsg in sptlrpc_req_ctx_switch
a54ecd2c2d LU-14455 mdt: fix DoM lock prolong logic
f44413717e LU-14436 tgt: only use T10PI guard when doing full sector read
ece23db121 LU-14435 doc: include lfs-flushctx manpage inside packages
f3d03bc38a LU-14430 mdd: fix inheritance of big default ACLs
7c0f6912e6 New tag 2.14.50

This huge batch of landings was the start of the 2.14.50 development branch, so it is likely that one of them is the main culprit for this failure. Likely candidates are the bottom few patches (LU-14430, LU-14455, LU-12682, LU-13453), but nothing obviously stands out. It may be possible to isolate this with a bisect, but the scarcity of failures (roughly 1-in-45 runs, going by the 13 failures in 591 runs above) means it would take a lot of test iterations, and the failure may also have an unknown pre-dependency on another subtest.
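
To put a rough number on "a lot of test iterations", here is a back-of-the-envelope sketch. It assumes each replay-dual run fails independently with probability p = 13/591 (taken from the counts above), and asks how many consecutive clean runs are needed before a commit can be called "good" with 95% confidence, i.e. the smallest n with 1 - (1 - p)^n >= 0.95. The 95% threshold, the independence assumption, and the helper names runs_needed and bisect_steps are all choices made for this illustration, not measured properties of the test.

```shell
#!/bin/bash
# runs_needed FAILURES RUNS CONFIDENCE
# Smallest n such that a bad commit would have failed at least once in
# n runs with probability >= CONFIDENCE, given p = FAILURES / RUNS.
runs_needed() {
    LC_ALL=C awk -v f="$1" -v r="$2" -v c="$3" 'BEGIN {
        p = f / r
        n = log(1 - c) / log(1 - p)   # awk log() is the natural logarithm
        printf "%d\n", (n == int(n)) ? n : int(n) + 1
    }'
}

# bisect_steps COMMITS -> number of halvings needed to isolate one commit
bisect_steps() {
    local k=$1 s=0 n=1
    while [ "$n" -lt "$k" ]; do
        n=$((n * 2))
        s=$((s + 1))
    done
    echo "$s"
}

per_step=$(runs_needed 13 591 0.95)
steps=$(bisect_steps 90)
echo "clean runs needed per 'good' verdict: $per_step"
echo "bisect rounds for 90 commits: $steps"
echo "upper bound if every tested commit looks good: $((per_step * steps)) runs"
```

A commit that fails can be marked bad as soon as the failure shows up, so the real cost would be lower than the upper bound, but each "good" verdict still needs on the order of a hundred-plus clean runs under these assumptions.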

Comment by Andreas Dilger [ 11/Jun/21 ]

Further note - while test_22d is not a TIMEOUT itself, there were 13 test_22d FAIL and 13 test_23b TIMEOUT in the most recent 4-week period, so they appear strongly related at this point. The test_22d FAIL was hit 10 times in the period between 2020-09-30 and 2021-01-30 (with no failures again until 2021-03-01), with only 1 timeout on the initial LU-13417 patch (which hasn't itself landed). However, many of those failures were on another test patch, https://review.whamcloud.com/40925 "LU-13417 test: replay-dual 22d with 'lfs mkdir'". This lends credence to the theory that the failures before 2021-03-01 were caused by some other issue, and that the main cause of the timeout was one of the above patches landed shortly before that date.

Comment by Alex Zhuravlev [ 06/Aug/21 ]

https://testing.whamcloud.com/test_sessions/29cfba25-f5b3-4fb9-984d-9a2be9592ad1
