
replay-dual test 22d fails with “Remote creation failed 1”

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0, Lustre 2.15.0
    • None
    • DNE/ZFS
    • 3

    Description

      replay-dual test_22d started failing on 30 SEPT 2020 during testing of the patch for LU-13417 ‘mdd: default DNE MDT balance on new filesystems’ https://review.whamcloud.com/38553 (which has not itself landed and so cannot be the source of the test failures) with logs at https://testing.whamcloud.com/test_sets/b4dc132e-a4d6-4abb-a81c-753d8f23a18e. Since then, this test has failed 10 times during review/patch testing. On 03 FEB 2021, the test failed with the same error message in branch/full testing, in a DNE/ZFS configuration, with logs at https://testing.whamcloud.com/test_sets/17948bab-e647-4f32-874a-0fe07a464353.

      Looking at this DNE/ZFS failure, we see the following in the suite_log:

      CMD: trevis-66vm8 /usr/sbin/lctl --device lustre-MDT0000 notransno
      CMD: trevis-66vm8 /usr/sbin/lctl --device lustre-MDT0000 readonly
      CMD: trevis-66vm8 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      CMD: trevis-66vm6 mkdir /mnt/lustre2/d22d.replay-dual/remote_dir/dir
      trevis-66vm6: mkdir: cannot create directory '/mnt/lustre2/d22d.replay-dual/remote_dir/dir': No such file or directory
      pdsh@trevis-66vm5: trevis-66vm6: ssh exited with exit code 1
       replay-dual test_22d: @@@@@@ FAIL: Remote creation failed 1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/replay-dual.sh:725:test_22d()
      

      Looking at replay-dual test 22d, we see that the failure comes from the first command in create_remote_dir_files_22(), the mkdir on the second client, which is the only step that returns 1:

       create_remote_dir_files_22() {
               do_node $CLIENT2 mkdir ${MOUNT2}/$remote_dir/dir || return 1
               do_node $CLIENT1 createmany -o $MOUNT1/$remote_dir/dir/$tfile- 2 ||
                                                                   return 2
               do_node $CLIENT2 createmany -o $MOUNT2/$remote_dir/$tfile- 2 ||
                                                                   return 3
               return 0
       }
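
The way the trailing "1" in the error message identifies the failing step can be illustrated with a small stand-alone sketch. It assumes the caller appends the helper's return status to the "Remote creation failed" message, which the log output suggests but which is not shown in this ticket. The function name `run_steps` and its arguments are hypothetical stand-ins, so that the sketch runs without a Lustre mount.

```shell
# Hypothetical stand-in for create_remote_dir_files_22(): each argument is a
# command that plays the role of one step (here simply "true" or "false").
run_steps() {
    local step1=$1 step2=$2 step3=$3

    $step1 || return 1    # stands in for: mkdir on the second client
    $step2 || return 2    # stands in for: createmany on the first client
    $step3 || return 3    # stands in for: createmany on the second client
    return 0
}

# Simulate the failure from the log: the first step (mkdir) fails.
rc=0
run_steps false true true || rc=$?
echo "Remote creation failed $rc"
```

Because each step returns a distinct status, a message ending in 1 points at the mkdir rather than at either createmany call.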
      

      Logs for more failures are at
      https://testing.whamcloud.com/test_sets/260d3237-ec78-46c1-88a4-f5455a9265ce
      https://testing.whamcloud.com/test_sets/4b7ef1a1-cb32-42b5-a2c2-9a2c7604900b
      https://testing.whamcloud.com/test_sets/3e23af01-fb55-4bcf-8667-ea706fa084b3

          Activity

            [LU-14406] replay-dual test 22d fails with “Remote creation failed 1”
            bzzz Alex Zhuravlev added a comment - https://testing.whamcloud.com/test_sessions/29cfba25-f5b3-4fb9-984d-9a2be9592ad1

            adilger Andreas Dilger added a comment - Further note: while test_22d is not itself a TIMEOUT, there were 13 test_22d FAIL and 13 test_23b TIMEOUT results in the most recent 4-week period, so the two appear strongly related at this point. The test_22d FAIL was hit 10 times between 2020-09-30 and 2021-01-30 (with no failures again until 2021-03-01), with only 1 timeout on the initial LU-13417 patch (which hasn't itself landed), but many of those failures were in fact on another test patch https://review.whamcloud.com/40925 "LU-13417 test: replay-dual 22d with 'lfs mkdir'". This lends credence to the theory that the failures before 2021-03-01 were caused by some other issue, and that the main cause of the timeout is one of the above patches that landed shortly before that date.
            adilger Andreas Dilger added a comment - edited

            The very first failure was on patch https://review.whamcloud.com/38553 which was part of a series of 3 patches. The first one was commit 96bf2abf4a, which had parent patch https://review.whamcloud.com/39568 "LU-13852 pcc: don't alloc FID in LLITE for pcc open" commit 952a0754d9 (only landed to master 2021-05-27), and grandparent patch https://review.whamcloud.com/40103 "LU-14004 llite: default lsm update may memory leak" commit cc4825176 (also landed to master 2021-05-27), so those could not have been the source of the test failures on master.

            Within the most recent 4 weeks, there were 13 TIMEOUT failures, still averaging about one failure every 2 days (13 out of 591 total runs, or about a 2.2% overall TIMEOUT rate).

            It may be noteworthy that there is a 5-month gap between the first failure on 2020-09-30 (or 2020-10-01, depending on timezone) and the second failure on 2021-03-01. After 03-01, there were 55 TIMEOUT failures over a 14-week period, averaging about one failure every 2 days, but often with gaps of up to 4 days between failures. There were a total of 540 replay-dual runs in the 4-week period from 2020-09-30 to 2020-10-28, so it is not the case that this test was being skipped or only run in full sessions at the time.

            This indicates that while 2020-10-01 was the first such failure, it might have been an unrelated/coincidental/patch-induced failure, and the real failures started on 2021-03-01 on patch https://review.whamcloud.com/39302 commit e982ade5d, which didn't land to master until 03-10, by which time there were 7 other failures on unrelated patches. The parent of this patch was commit v2_14_50-145-g99d9638d6c, which was one of 90 patches landed on 2021-02-26:

            f55fdfff5d LU-11085 nodemap: switch interval tree to in-kernel impl.
            03b7befcc0 LU-13485 libcfs: FIELD_SIZEOF macro removed
            c66668387a LU-12678 o2iblnd: convert peers hash table to hashtable.h
            aa57e82986 LU-12678 lnet: discard LNET_MD_PHYS
            dd0e7523e1 LU-12678 lnet: use init_wait() rather than init_waitqueue_entry()
            0269ac4a00 LU-9859 libcfs: use wait_event_timeout() in tracefiled().
            6ae187404a LU-12678 lnet: discard WIRE_ATTR
            5bb641fa61 LU-13239 ldiskfs: pass inode timestamps at initial creation
            cb3c65d4a1 LU-12780 ofd: don't use ptlrpc_thread for consistency verification
            ec138c5c58 LU-11085 ldlm: change lock_matches() to return bool.
            3db4d9a69e LU-6142 libcfs: discard cfs_strrstr()
            ee5eb07d2f LU-6142 libcfs: discard cfs_firststr
            6c3f0cfb4a LU-6142 libcfs: discard PO2_ROUNDUP_TYPED, LOWEST_BIT_SET
            7d68bfb991 LU-6142 llite: ll_lookup_finish_locks clean up
            8034d85f2b LU-6142 llite: don't cast arg to d_lustre_invalid()
            fca56be02b LU-6142 lustre: use is_root_inode()
            aaf0eb8696 LU-6142 llite: remove ll_dir_chain
            f0736a6a52 LU-6142 lustre: remove non-static 'inline' markings.
            a03765b2da LU-6142 lustre: convert snprintf to scnprintf as appropriate
            734d6eb11b LU-6142 lustre: mark strings in char arrays as const
            3e76334402 LU-6142 mdc: minor function cleanups.
            9c4fbd1766 LU-6142 osc: minor function cleanups.
            c20b866ba3 LU-6142 lustre: change various operations structs to const
            68cd9825d5 LU-6142 lfsck: make all 'struct lfsck_operations' to const
            c5b9054073 LU-6142 lustre: change all 'struct seq_operations' to const
            140b9e6d73 LU-6142 lustre: change super/file/inode operations to const
            3ae81448da LU-6142 obdclass: use cl_object_for_each more broadly.
            8d8e87a5ac LU-6142 lustre: remove module_vars arg to class_register_type()
            cab152600e LU-6142 lov: style cleanups in lov_set_osc_active()
            33265fe88b LU-6142 lustre: change obd_ioctl_getdata() args
            7b237bd306 LU-6142 lov: chnage lsm_op_find() to a non-inline function.
            950200a21f LU-6142 lustre: make various 'struct file_operations' static
            bf7f08479f LU-9859 libcfs: discard TCD_MAX_TYPES
            fb40f0b62d LU-10391 lnet: allow lnet_connect() to use IPv6 addresses.
            e4fa181abf LU-10391 lnet: allow creation of IPv6 socket.
            dcc8b9c00d LU-9679 ptlrpc: list_for_each improvements.
            977217520e LU-14275 tests: add ior_CLEANUP
            79642e0896 LU-14388 utils: always enable ldiskfs project quota
            8fc5fc5889 LU-14353 obd: move debug.c to obdecho
            c2fd5297b4 LU-14305 ldiskfs: add parameters for mb_c123_threshold
            f7f0b104bc LU-14289 ptlrpc: move heap.c from libcfs to ptlrpc
            7a8fafe2a1 LU-14291 lustre: only include nrs headers when needed
            daa388b539 LU-14291 ptlrpc: support nrs_delay for client-only builds
            c90e3d8d3f LU-14285 utils: Add error message when osd_init fails
            647c96562b LU-12477 llite: remove unused ll_teardown_mmaps()
            c5bec6a88a LU-930 doc: fix format man page sections for lctl
            1129afd348 LU-14272 tests: different mpirun options for different users
            32e96e1a48 LU-14271 tests: add new node crash method
            ce0b7ed044 LU-14270 tests: delay node's power up
            0354fa9896 LU-14262 utils: lfs to set component flags by pool name
            de60e7767c LU-14195 lustre: remove 'fs' from 'struct lvfs_run_ctxt'
            d0337cab8e LU-14195 osd: don't use set_fs() for ->fiemap() calls.
            9b9e19ca50 LU-14195 build: Adjust Makefile for Linux build changes.
            e9c3b89bda LU-14178 ldlm: return error from ldlm_namespace_new()
            9d2776f02b LU-14073 ofd: remove use of smp_read_barrier_depends()
            a7f48e6c15 LU-14047 lustre: change EWOULDBLOCK to EAGAIN
            5309e10858 LU-13783 libcfs: switch from ->mmap_sem to mmap_lock()
            e520b6a7fa LU-9325 osd-ldisk: replace simple_strto* with kstr* functions
            dd15646cc5 LU-9859 lod: use linux kernel bitmap API
            a076975f9f LU-9859 libcfs: replace all CFS_CAP_* macros with CAP_*
            2070e9bcc0 LU-13100 lov: grant deadlock if same OSC in two components
            3bae39f0a5 LU-7853 lod: fixes bitfield in lod qos code
            83e38bba62 LU-14180 utils: verify setstripe comp_end is valid
            8910291fc5 LU-14207 mgs: delete "add failnid" sections on replace_nids
            82c6e42d61 LU-13974 llog: check stale osp object
            c5165557f5 LU-12961 mdd: avoid double call to mdd_changelog_fini()
            6873482608 LU-14439 build: require a newer version of e2fsprogs
            e3f17defc1 LU-13609 mgs: fix config_log buffer handling
            c35c1babc7 LU-10391 socklnd: use sockaddr instead of __u32 addresses.
            e5a8f3fc12 LU-13929 lnet: modify assertion in lnet_post_send_locked
            437e6bea0c LU-14362 tests: sanity-flr to prepare stuff before checks
            2eaa49ef0f LU-14423 osd: recognize holes in osd_is_mapped()
            c45558bf56 LU-14398 llapi: add llapi_fid2path_at()
            4cfe77df6f LU-14398 llapi: simplify llapi_fid2path()
            3117913e21 LU-14390 gnilnd: Use DIV_ROUND_UP to calculate niov
            e00733f0f8 LU-14301 lustre: add ENOTSUPP to spelling.txt
            1a2b381616 LU-12766 test: convert time to seconds properly
            d498d1b9cc LU-13903 build: make lustre-devel buildable for Linux client
            7af92d8843 LU-14313 utils: mount error when no server support
            ffa858b165 LU-14268 lod: fix layout generation inc for mirror split
            910eb97c1b LU-14098 obdclass: try to skip corrupted llog records
            fc8f138169 LU-9820 osd-ldiskfs: OI scrub speed limit fix
            58ac9d3f18 LU-14099 build: Fix for unconfigured arch_stackwalk
            6df76d3357 LU-14044 llog: check fid after convert
            262b6f9c60 LU-13620 tests: pool_add_targets() fix
            7ea369783f LU-13584 tests: gather_logs() fix
            4bba67075a LU-13513 osp: make neterr not fatal for precreate_reserve
            e45e8a92a2 LU-13453 osd-ldiskfs: do not leak inode if OI insertion fails
            124b31f13e Merge "LU-9121 lnet: User Defined Selection Policy (UDSP)"
            15d44e787e LU-12682 llite: fake symlink type of foreign file/dir
            dfe87b089b LU-14444 gss: handle empty reqmsg in sptlrpc_req_ctx_switch
            a54ecd2c2d LU-14455 mdt: fix DoM lock prolong logic
            f44413717e LU-14436 tgt: only use T10PI guard when doing full sector read
            ece23db121 LU-14435 doc: include lfs-flushctx manpage inside packages
            f3d03bc38a LU-14430 mdd: fix inheritance of big default ACLs
            7c0f6912e6 New tag 2.14.50
            

            This huge batch of landings was the start of the 2.14.50 development branch, so it is likely that one of them is the main culprit for this failure. Likely candidates are the bottom few patches LU-14430, LU-14455, LU-12682, LU-13453, but nothing obviously stands out. It may be possible to isolate this with a bisect, but the scarcity of failures (1-in-500) means it would take a lot of test iterations, and the failure may also have an unknown pre-dependency on another subtest.
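
A rough estimate of the cost of such a bisect can be sketched as follows. It assumes each run fails independently with a fixed probability, takes the 1-in-500 rate and the count of 90 patches from the comment above, and uses an arbitrarily chosen 95% confidence level. The function names `runs_needed` and `bisect_steps` are hypothetical and not part of the Lustre test framework.

```shell
# runs_needed P [CONFIDENCE]: number of consecutive passing runs needed before
# a commit that would fail with per-run probability P can be called "good"
# with the given confidence, i.e. the smallest n with (1 - P)^n <= 1 - CONFIDENCE.
runs_needed() {
    awk -v p="$1" -v c="${2:-0.95}" 'BEGIN {
        n = log(1 - c) / log(1 - p)
        r = int(n)
        if (r < n) r++        # round up to a whole number of runs
        print r
    }'
}

# bisect_steps N: number of halving rounds needed to isolate one commit
# among N candidates.
bisect_steps() {
    awk -v n="$1" 'BEGIN { s = 0; while (2 ^ s < n) s++; print s }'
}

echo "passing runs per 'good' verdict: $(runs_needed 0.002)"
echo "bisect rounds for 90 patches:    $(bisect_steps 90)"
```

Under these assumptions, each "good" verdict needs on the order of 1,500 clean runs and there are several bisect rounds, which is consistent with the concern that a bisect would need a lot of test iterations. A round that hits the failure ends early, so the product of the two numbers is only an upper bound.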


            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)