[LU-14406] replay-dual test 22d fails with “Remote creation failed 1” Created: 09/Feb/21 Updated: 08/Mar/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
DNE/ZFS |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
replay-dual test_22d started failing on 30 SEPT 2020 when testing the patch for
Looking at this DNE/ZFS failure, we see the following in the suite_log:
CMD: trevis-66vm8 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: trevis-66vm8 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: trevis-66vm8 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
CMD: trevis-66vm6 mkdir /mnt/lustre2/d22d.replay-dual/remote_dir/dir
trevis-66vm6: mkdir: cannot create directory '/mnt/lustre2/d22d.replay-dual/remote_dir/dir': No such file or directory
pdsh@trevis-66vm5: trevis-66vm6: ssh exited with exit code 1
replay-dual test_22d: @@@@@@ FAIL: Remote creation failed 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6273:error()
= /usr/lib64/lustre/tests/replay-dual.sh:725:test_22d()
Looking at replay-dual test 22d, we see that the error is in create_remote_dir_files_22() in mkdir
607 create_remote_dir_files_22() {
608 do_node $CLIENT2 mkdir ${MOUNT2}/$remote_dir/dir || return 1
609 do_node $CLIENT1 createmany -o $MOUNT1/$remote_dir/dir/$tfile- 2 ||
610 return 2
611 do_node $CLIENT2 createmany -o $MOUNT2/$remote_dir/$tfile- 2 ||
612 return 3
613 return 0
614 }
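For readers unfamiliar with the `|| return N` pattern in this helper: each step returns a distinct code, so "Remote creation failed 1" points at the first `mkdir`. Because `mkdir` is called without `-p`, a "No such file or directory" error normally means a parent path component (here `remote_dir`) was not visible at that moment. The following is a minimal stand-alone sketch of that behaviour using a temporary local directory instead of the Lustre mounts. `create_files_sketch` and the file names are hypothetical and are not part of replay-dual.sh.

```shell
#!/bin/bash
# Hypothetical stand-in for create_remote_dir_files_22(): each step
# returns a distinct code so the caller can tell which step failed.
create_files_sketch() {
    local parent=$1

    # mkdir without -p fails with ENOENT if $parent does not exist
    mkdir "$parent/dir" 2>/dev/null || return 1
    touch "$parent/dir/f-0" "$parent/dir/f-1" || return 2
    touch "$parent/f-0" "$parent/f-1" || return 3
    return 0
}

tmp=$(mktemp -d)

# Parent directory is missing: the first step fails with code 1
rc_missing=0
create_files_sketch "$tmp/remote_dir" || rc_missing=$?

# Parent directory exists: all three steps succeed
mkdir "$tmp/remote_dir"
rc_present=0
create_files_sketch "$tmp/remote_dir" || rc_present=$?

echo "missing parent: rc=$rc_missing, existing parent: rc=$rc_present"
rm -rf "$tmp"
```

This only reproduces the shell-level symptom. It does not explain why the parent directory would be missing on the second mount after the replay barrier.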
Logs for more failures are at |
| Comments |
| Comment by Andreas Dilger [ 11/Jun/21 ] |
|
The very first failure on patch https://review.whamcloud.com/38553 was part of a series of 3 patches; the first one was commit 96bf2abf4a, which had parent patch https://review.whamcloud.com/39568 "

Within the most recent 4 weeks, there were 13 TIMEOUT failures, still averaging about one failure every 2 days (out of 591 total runs, or 2.2% overall TIMEOUT rate).

It may be noteworthy that there is a 5-month gap between the first failure on 2020-09-30 (or 2020-10-01, depending on timezone) and the second failure on 2021-03-01. After 03-01, there were 55 TIMEOUT failures over a 14-week period, averaging about one failure every 2 days, but often with gaps of up to 4 days between failures. There were a total of 540 replay-dual runs in the 4-week period between 2020-09-30 and 2020-10-28, so it is not the case that this test was being skipped or only run in full sessions at the time.

This indicates that while 2020-10-01 was the first such failure, it might have been an unrelated/coincidental/patch-induced failure, and the real failures started on 2021-03-01 on patch https://review.whamcloud.com/39302 commit e982ade5d, which didn't land to master until 03-10, by which time there were 7 other failures on unrelated patches. The parent of this patch was commit v2_14_50-145-g99d9638d6c, which was one of 90 patches landed on 2021-02-26:

f55fdfff5d LU-11085 nodemap: switch interval tree to in-kernel impl.
03b7befcc0 LU-13485 libcfs: FIELD_SIZEOF macro removed
c66668387a LU-12678 o2iblnd: convert peers hash table to hashtable.h
aa57e82986 LU-12678 lnet: discard LNET_MD_PHYS
dd0e7523e1 LU-12678 lnet: use init_wait() rather than init_waitqueue_entry()
0269ac4a00 LU-9859 libcfs: use wait_event_timeout() in tracefiled().
6ae187404a LU-12678 lnet: discard WIRE_ATTR
5bb641fa61 LU-13239 ldiskfs: pass inode timestamps at initial creation
cb3c65d4a1 LU-12780 ofd: don't use ptlrpc_thread for consistency verification
ec138c5c58 LU-11085 ldlm: change lock_matches() to return bool.
3db4d9a69e LU-6142 libcfs: discard cfs_strrstr()
ee5eb07d2f LU-6142 libcfs: discard cfs_firststr
6c3f0cfb4a LU-6142 libcfs: discard PO2_ROUNDUP_TYPED, LOWEST_BIT_SET
7d68bfb991 LU-6142 llite: ll_lookup_finish_locks clean up
8034d85f2b LU-6142 llite: don't cast arg to d_lustre_invalid()
fca56be02b LU-6142 lustre: use is_root_inode()
aaf0eb8696 LU-6142 llite: remove ll_dir_chain
f0736a6a52 LU-6142 lustre: remove non-static 'inline' markings.
a03765b2da LU-6142 lustre: convert snprintf to scnprintf as appropriate
734d6eb11b LU-6142 lustre: mark strings in char arrays as const
3e76334402 LU-6142 mdc: minor function cleanups.
9c4fbd1766 LU-6142 osc: minor function cleanups.
c20b866ba3 LU-6142 lustre: change various operations structs to const
68cd9825d5 LU-6142 lfsck: make all 'struct lfsck_operations' to const
c5b9054073 LU-6142 lustre: change all 'struct seq_operations' to const
140b9e6d73 LU-6142 lustre: change super/file/inode operations to const
3ae81448da LU-6142 obdclass: use cl_object_for_each more broadly.
8d8e87a5ac LU-6142 lustre: remove module_vars arg to class_register_type()
cab152600e LU-6142 lov: style cleanups in lov_set_osc_active()
33265fe88b LU-6142 lustre: change obd_ioctl_getdata() args
7b237bd306 LU-6142 lov: chnage lsm_op_find() to a non-inline function.
950200a21f LU-6142 lustre: make various 'struct file_operations' static
bf7f08479f LU-9859 libcfs: discard TCD_MAX_TYPES
fb40f0b62d LU-10391 lnet: allow lnet_connect() to use IPv6 addresses.
e4fa181abf LU-10391 lnet: allow creation of IPv6 socket.
dcc8b9c00d LU-9679 ptlrpc: list_for_each improvements.
977217520e LU-14275 tests: add ior_CLEANUP
79642e0896 LU-14388 utils: always enable ldiskfs project quota
8fc5fc5889 LU-14353 obd: move debug.c to obdecho
c2fd5297b4 LU-14305 ldiskfs: add parameters for mb_c123_threshold
f7f0b104bc LU-14289 ptlrpc: move heap.c from libcfs to ptlrpc
7a8fafe2a1 LU-14291 lustre: only include nrs headers when needed
daa388b539 LU-14291 ptlrpc: support nrs_delay for client-only builds
c90e3d8d3f LU-14285 utils: Add error message when osd_init fails
647c96562b LU-12477 llite: remove unused ll_teardown_mmaps()
c5bec6a88a LU-930 doc: fix format man page sections for lctl
1129afd348 LU-14272 tests: different mpirun options for different users
32e96e1a48 LU-14271 tests: add new node crash method
ce0b7ed044 LU-14270 tests: delay node's power up
0354fa9896 LU-14262 utils: lfs to set component flags by pool name
de60e7767c LU-14195 lustre: remove 'fs' from 'struct lvfs_run_ctxt'
d0337cab8e LU-14195 osd: don't use set_fs() for ->fiemap() calls.
9b9e19ca50 LU-14195 build: Adjust Makefile for Linux build changes.
e9c3b89bda LU-14178 ldlm: return error from ldlm_namespace_new()
9d2776f02b LU-14073 ofd: remove use of smp_read_barrier_depends()
a7f48e6c15 LU-14047 lustre: change EWOULDBLOCK to EAGAIN
5309e10858 LU-13783 libcfs: switch from ->mmap_sem to mmap_lock()
e520b6a7fa LU-9325 osd-ldisk: replace simple_strto* with kstr* functions
dd15646cc5 LU-9859 lod: use linux kernel bitmap API
a076975f9f LU-9859 libcfs: replace all CFS_CAP_* macros with CAP_*
2070e9bcc0 LU-13100 lov: grant deadlock if same OSC in two components
3bae39f0a5 LU-7853 lod: fixes bitfield in lod qos code
83e38bba62 LU-14180 utils: verify setstripe comp_end is valid
8910291fc5 LU-14207 mgs: delete "add failnid" sections on replace_nids
82c6e42d61 LU-13974 llog: check stale osp object
c5165557f5 LU-12961 mdd: avoid double call to mdd_changelog_fini()
6873482608 LU-14439 build: require a newer version of e2fsprogs
e3f17defc1 LU-13609 mgs: fix config_log buffer handling
c35c1babc7 LU-10391 socklnd: use sockaddr instead of __u32 addresses.
e5a8f3fc12 LU-13929 lnet: modify assertion in lnet_post_send_locked
437e6bea0c LU-14362 tests: sanity-flr to prepare stuff before checks
2eaa49ef0f LU-14423 osd: recognize holes in osd_is_mapped()
c45558bf56 LU-14398 llapi: add llapi_fid2path_at()
4cfe77df6f LU-14398 llapi: simplify llapi_fid2path()
3117913e21 LU-14390 gnilnd: Use DIV_ROUND_UP to calculate niov
e00733f0f8 LU-14301 lustre: add ENOTSUPP to spelling.txt
1a2b381616 LU-12766 test: convert time to seconds properly
d498d1b9cc LU-13903 build: make lustre-devel buildable for Linux client
7af92d8843 LU-14313 utils: mount error when no server support
ffa858b165 LU-14268 lod: fix layout generation inc for mirror split
910eb97c1b LU-14098 obdclass: try to skip corrupted llog records
fc8f138169 LU-9820 osd-ldiskfs: OI scrub speed limit fix
58ac9d3f18 LU-14099 build: Fix for unconfigured arch_stackwalk
6df76d3357 LU-14044 llog: check fid after convert
262b6f9c60 LU-13620 tests: pool_add_targets() fix
7ea369783f LU-13584 tests: gather_logs() fix
4bba67075a LU-13513 osp: make neterr not fatal for precreate_reserve
e45e8a92a2 LU-13453 osd-ldiskfs: do not leak inode if OI insertion fails
124b31f13e Merge "LU-9121 lnet: User Defined Selection Policy (UDSP)"
15d44e787e LU-12682 llite: fake symlink type of foreign file/dir
dfe87b089b LU-14444 gss: handle empty reqmsg in sptlrpc_req_ctx_switch
a54ecd2c2d LU-14455 mdt: fix DoM lock prolong logic
f44413717e LU-14436 tgt: only use T10PI guard when doing full sector read
ece23db121 LU-14435 doc: include lfs-flushctx manpage inside packages
f3d03bc38a LU-14430 mdd: fix inheritance of big default ACLs
7c0f6912e6 New tag 2.14.50

This huge batch of landings was the start of the 2.14.50 development branch, so it is likely one of them is the main culprit for this failure. Likely candidates are the bottom few patches |
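The rates in this comment can be rechecked from the counts it gives (13 failures in 591 runs over 4 weeks, 55 failures over 14 weeks). The sketch below does that arithmetic with `awk`. The helper names `fail_pct` and `days_per_failure` are hypothetical and exist only for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: failures as a percentage of runs, one decimal place.
fail_pct() {
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.1f", 100 * f / n }'
}

# Hypothetical helper: average number of days between failures.
days_per_failure() {
    awk -v d="$1" -v f="$2" 'BEGIN { printf "%.1f", d / f }'
}

# Counts taken from the comment above: 13 of 591 runs in the last 4 weeks,
# and 55 failures in the 14 weeks (98 days) after 2021-03-01.
echo "TIMEOUT rate, last 4 weeks: $(fail_pct 13 591)%"
echo "days per failure, last 4 weeks: $(days_per_failure 28 13)"
echo "days per failure, 14 weeks after 03-01: $(days_per_failure 98 55)"
```

Thirteen failures in 591 runs is about 2.2%, and both periods work out to roughly one failure every 2 days, which matches the averages quoted.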
| Comment by Andreas Dilger [ 11/Jun/21 ] |
|
Further note - while test_22d is not a TIMEOUT itself, there were 13 test_22d FAIL and 13 test_23b TIMEOUT in the most recent 4-week period, so they appear strongly related at this point. While the test_22d FAIL was hit 10 times in the period between 2020-09-30 and 2021-01-30 (and no failures again until 2021-03-01), with only 1 timeout on the initial |
| Comment by Alex Zhuravlev [ 06/Aug/21 ] |
|
https://testing.whamcloud.com/test_sessions/29cfba25-f5b3-4fb9-984d-9a2be9592ad1 |