Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16404

sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/65516983-67f2-4883-b7ab-1cb5abdf58e5

      test_230n failed with the following error:

      Mirroring failed
      

      Test output is:

      == sanity test 230n: Dir migration with mirrored file ==== 22:36:31 (1671057391)
      lfs mirror mirror: cannot get UNLOCK lease, ext 4: No such file or directory (2)
      error: lfs mirror extend: /mnt/lustre/d230n.sanity/f230n.sanity: cannot merge layout: No such file or directory
       sanity test_230n: @@@@@@ FAIL: Mirroring failed 
      

      In client dmesg, we can see the client was evicted by the MDS:

      [13212.828154] LustreError: 866765:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff93f6048ef000:
          inode [0x200002b15:0x5b9b:0x0] mdc close failed: rc = -2
      [13212.831301] LustreError: 11-0: lustre-MDT0000-mdc-ffff93f6048ef000: operation ldlm_enqueue to node 10.240.38.63@tcp failed: rc = -107
      [13212.831308] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection to lustre-MDT0000 (at 10.240.38.63@tcp) was lost;
          in progress operations using this service will wait for recovery to complete
      [13212.833555] LustreError: Skipped 1 previous similar message
      [13212.838262] LustreError: 167-0: lustre-MDT0000-mdc-ffff93f6048ef000: This client was evicted by lustre-MDT0000;
          in progress operations using this service will fail.
      [13212.847340] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection restored to 10.240.38.63@tcp (at 10.240.38.63@tcp)
      [13213.270161] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_230n: @@@@@@ FAIL: Mirroring failed 
      

      The MDS claims the client was not responsive to the lock request:

      [13026.421567] Lustre: DEBUG MARKER: == sanity test 230n: Dir migration with mirrored file ==== 03:41:59 (1679283719)
      [13070.199369] Lustre: mdt_rdpg00_003: service thread pid 541635 was inactive for 42.970 seconds. The thread might be hung,
           or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [13070.203029] Pid: 541635, comm: mdt_rdpg00_003 4.18.0-348.23.1.el8_lustre.x86_64 #1 SMP Thu Mar 2 00:54:25 UTC 2023
      [13070.204965] Call Trace TBD:
      [13070.206179] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
      [13070.207308] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
      [13070.208596] mdt_object_local_lock+0x509/0xb30 [mdt]
      [13070.209666] mdt_object_lock_internal+0x18d/0x4a0 [mdt]
      [13070.210781] mdt_object_lock+0x1b/0x20 [mdt]
      [13070.211735] mdt_close_handle_layouts+0x935/0x10b0 [mdt]
      [13070.212865] mdt_mfd_close+0x510/0xbc0 [mdt]
      [13070.213811] mdt_close_internal+0xcc/0x250 [mdt]
      [13070.214833] mdt_close+0x2c0/0x8b0 [mdt]
      [13070.215745] tgt_request_handle+0xc8c/0x1950 [ptlrpc]
      [13070.216850] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
      [13070.218096] ptlrpc_main+0xc4e/0x1510 [ptlrpc]
      [13127.540308] LustreError: 482890:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 100s:
           evicting client at 10.240.38.60@tcp  ns: mdt-lustre-MDT0000_UUID lock: 0000000001a8df2b/0x3c9ac870599a2eb7
           lrc: 3/0,0 mode: CR/CR res: [0x200002b15:0x5b9b:0x0].0x0 bits 0xd/0x0 rrc: 4 type: IBT gid 0 flags: 0x60200400000020
           nid: 10.240.38.60@tcp remote: 0x5d158cc36e2e425f expref: 49 pid: 537456 timeout: 13128 lvb_type: 0
      [13128.435958] Lustre: DEBUG MARKER: sanity test_230n: @@@@@@ FAIL: Mirroring failed
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_230n - Mirroring failed

      Attachments

        Issue Links

          Activity

            [LU-16404] sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed

            The test patch passed 200x iterations of sanity test_230n with the LU-16633 patch.

            adilger Andreas Dilger added a comment - The test patch passed 200x iterations of sanity test_230n with the LU-16633 patch.

            If patch https://review.whamcloud.com/50261 "LU-16633 obdclass: fix rpc slot leakage" fixes this issue, it implies that the client AST reply can be blocked due to max_rpcs_in_flight limits. That makes me wonder if there we are blocking AST RPCs due to max_rpcs_in_flight, or if there should be a higher RPC limit for AST replies than regular MDC/OSC requests?

            The client shouldn't be evicted if it cannot send an AST reply because all of the RPC slots are used for long-running RPC requests (e.g. writes or LDLM enqueues that are blocked on different locks). The OST lock prolonging only helps for timeouts on locks actively undergoing writeout, but doesn't help if there is an AST on a different resource and the client can't even send writes under those locks because all of the RPC slots are full.

            adilger Andreas Dilger added a comment - If patch https://review.whamcloud.com/50261 " LU-16633 obdclass: fix rpc slot leakage " fixes this issue, it implies that the client AST reply can be blocked due to max_rpcs_in_flight limits. That makes me wonder if there we are blocking AST RPCs due to max_rpcs_in_flight , or if there should be a higher RPC limit for AST replies than regular MDC/OSC requests? The client shouldn't be evicted if it cannot send an AST reply because all of the RPC slots are used for long-running RPC requests (e.g. writes or LDLM enqueues that are blocked on different locks). The OST lock prolonging only helps for timeouts on locks actively undergoing writeout, but doesn't help if there is an AST on a different resource and the client can't even send writes under those locks because all of the RPC slots are full.

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50343
            Subject: LU-16404 tests: check if wakeup race fixes issue
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 69aa93cf93a7c2648d4fbb2441b96c5a53ba671f

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50343 Subject: LU-16404 tests: check if wakeup race fixes issue Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 69aa93cf93a7c2648d4fbb2441b96c5a53ba671f

            May possibly relate to LU-16633, which relates to a bug initially introduced by LU-15947.

            adilger Andreas Dilger added a comment - May possibly relate to LU-16633 , which relates to a bug initially introduced by LU-15947 .

            +12 failures in past 4 weeks, sometimes "Mirroring failed", sometimes "Migration failed".

            The first failures started on 2022-11-09 (a few earlier failures, but caused by previous subtest failures), and have been 21 on master since then (every 3-7 days), and not on any other branch. The first failure was for a patch that is fixing a later test in sanity, so very unlikely to be the source of the MDS eviction.

            There were 45 patches landed to master between 2022-11-01 and 2022-11-10:

            # git log --oneline --after 2022-11-01 --before 2022-11-10
            3737b28a9d LU-10003 lnet: to properly handle errors reported use ssize_t
            93784852c8 LU-15825 ldlm: remove linefeed from LDLM_DEBUG
            ed71af843d LU-16280 tests: conf-sanity/117 call setup when FS is not mounted
            9a0a89520e LU-16251 obdclass: fill jobid in a safe way
            b104c0a277 LU-14377 tests: make parallel-scale/rr_alloc less strict
            38a52f2fc3 LU-16243 tests: Specify ping source in test_218
            d55fe25a95 LU-16215 kfilnd: Use immediate for routed GETs
            78c681d9f4 LU-16207 build: add rpm-build BuildRequires for SLES15 SP3
            95f7ef6094 LU-16111 build: Fix include of stddef.h
            0171801df5 LU-8151 obd: Show correct shadow mountpoints for server
            26e765e0d2 LU-15011 tests: additional checks for pool spilling
            41610e6207 LU-8837 mgc: move server-only code out of mgc_request.c
            f32bdde057 LU-8837 lustre: make ldlm and target file lists
            ef90a02d12 LU-13135  quota: improve checks in OSDs to ignore quota
            896cd5b7bc LU-10391 lnet: fix build issue when IPv6 is disabled.
            122644ae19 LU-15847 target: report multiple transno to debug log
            cf121b1668 LU-16203 llog: skip bad records in llog
            53b3d1de99 LU-16152 utils: fix integer overflow in cYAML parser
            db0fb8f2b6 LU-10391 lnet: allow ping packet to contain large nids
            5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs
            8f8f6e2f36 LU-10003 lnet: use Netlink to support old and new NI APIs.
            85941b9fb9 LU-15117 ofd: no lock for dt_bufs_get() in read path
            293844d132 LU-16222 kernel: RHEL 8.7 client and server support
            d3fc875617 LU-16256 tests: sanity-flr to remove temp files
            f97ce54357 LU-15564 osd: add allocation time histogram
            9aaaebfa22 LU-16252 tests: test_129() fix for not combined_mgs_mds
            d41a508a49 LU-16241 obdclass: NL_SET_ERR_MSG work for older kernel
            aaab50dd88 LU-16249 sec: krb5_decrypt_bulk calls decryption primitive
            b533700add LU-16232 scripts: changelog/updatelog emergency cleanup
            b4e06ff1e4 LU-16172 o2iblnd: add verbose debug prints for rx/tx events
            c619b6d6a5 LU-16258 llite: Explicitly support .splice_write
            8f718df474 LU-15852 lnet: Don't modify uptodate peer with temp NI
            4f2914537c LU-15139 osp: block reads until the object is created
            ec1b0dc6e4 LU-15626 tests: Fix "error" reported by shellcheck for replay-dual
            8e563553c0 LU-15626 tests: Fix "error" reported by shellcheck
            b2274a7160 LU-15619 osc: Remove oap lock
            7e0a2b0737 LU-15046 osp: precreate thread vs connect race
            ed0f74cd63 LU-6142 obdclass: make ccc_users in cl_client_cache a refcount_t
            5b911e0326 LU-16160 llite: clear stale page's uptodate bit
            1a79d395dd LU-15935 target: keep track of multirpc slots in last_rcvd
            62fd8f9b49 Revert "LU-16046 revert: "LU-9964 llite: prevent mulitple group locks""
            dd609c6f31 Revert "LU-16046 ldlm: group lock fix"
            

            Nothing immediately jumps out as an obvious cause. One possibility might be "LU-15947 obdclass: improve precision of wakeups for mod_rpcs" if there was some problem with the wakeups.

            Patches committed the previous week seem even less likely:

            # git log --oneline --after 2022-10-25 --before 2022-11-01
            6aee406c84 LU-14719 lod: distributed transaction check space
            d4848d779b LU-16187 tests: Fix is_project_quota_supported()
            9309a32c1a LU-16240 build: Use new AS_HELP_STRING
            83d3f42118 LU-15305 obdclass: fix race in class_del_profile
            1df5199097 LU-15807 ksocklnd: fix irq lock inversion while calling sk_data_ready()
            8a2bd9661d LU-16177 kernel: kernel update RHEL9.0 [5.14.0-70.26.1.el9_0]
            5547158daa LU-16175 kernel: kernel update SLES12 SP5 [4.12.14-122.133.1]
            5d31fdaf68 LU-16174 kernel: kernel update SLES15 SP4 [5.14.21-150400.24.21.2]
            c3467db7e7 LU-16173 kernel: kernel update SLES15 SP3 [5.3.18-150300.59.93.1]
            3493db6a67 LU-16233 build: Add always target for SUSE15 SP3 LTSS
            950e59ced1 LU-15234 lnet: add mechanism for dumping lnd peer debug info
            0a639a32cf LU-15795 lbuild: enable KABI
            c5a436d849 LU-12130 test: pool inheritance for mdt component
            78dddb423f LU-15447 tests: sanity-flr/208 reset rotational status
            c24a38b583 LU-16183 test: sanity-hsm/70 should detect python
            04e5fa7b24 LU-13175 tests: sanity/803 to sync MDTs for actual statfs
            2e08974390 LU-16139 statahead: avoid to block ptlrpcd interpret context
            a966b624ac LU-16149 lnet: Discovery queue and deletion race
            0a317b171e LU-15847 tgt: move tti_ transaction params to tsi_
            4e2e8fd2fc LU-15847 tgt: reply always with the latest assigned transno
            4468f6c9d9 LU-16025 llite: adjust read count as file got truncated
            

            The LU-14719 patch has had a few issues, but those are mostly during recovery, which isn't the case here.

            It might be necessary

            adilger Andreas Dilger added a comment - +12 failures in past 4 weeks, sometimes "Mirroring failed", sometimes "Migration failed". The first failures started on 2022-11-09 (a few earlier failures, but caused by previous subtest failures), and have been 21 on master since then (every 3-7 days), and not on any other branch. The first failure was for a patch that is fixing a later test in sanity, so very unlikely to be the source of the MDS eviction. There were 45 patches landed to master between 2022-11-01 and 2022-11-10: # git log --oneline --after 2022-11-01 --before 2022-11-10 3737b28a9d LU-10003 lnet: to properly handle errors reported use ssize_t 93784852c8 LU-15825 ldlm: remove linefeed from LDLM_DEBUG ed71af843d LU-16280 tests: conf-sanity/117 call setup when FS is not mounted 9a0a89520e LU-16251 obdclass: fill jobid in a safe way b104c0a277 LU-14377 tests: make parallel-scale/rr_alloc less strict 38a52f2fc3 LU-16243 tests: Specify ping source in test_218 d55fe25a95 LU-16215 kfilnd: Use immediate for routed GETs 78c681d9f4 LU-16207 build: add rpm-build BuildRequires for SLES15 SP3 95f7ef6094 LU-16111 build: Fix include of stddef.h 0171801df5 LU-8151 obd: Show correct shadow mountpoints for server 26e765e0d2 LU-15011 tests: additional checks for pool spilling 41610e6207 LU-8837 mgc: move server-only code out of mgc_request.c f32bdde057 LU-8837 lustre: make ldlm and target file lists ef90a02d12 LU-13135 quota: improve checks in OSDs to ignore quota 896cd5b7bc LU-10391 lnet: fix build issue when IPv6 is disabled. 122644ae19 LU-15847 target: report multiple transno to debug log cf121b1668 LU-16203 llog: skip bad records in llog 53b3d1de99 LU-16152 utils: fix integer overflow in cYAML parser db0fb8f2b6 LU-10391 lnet: allow ping packet to contain large nids 5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs 8f8f6e2f36 LU-10003 lnet: use Netlink to support old and new NI APIs. 85941b9fb9 LU-15117 ofd: no lock for dt_bufs_get() in read path 293844d132 LU-16222 kernel: RHEL 8.7 client and server support d3fc875617 LU-16256 tests: sanity-flr to remove temp files f97ce54357 LU-15564 osd: add allocation time histogram 9aaaebfa22 LU-16252 tests: test_129() fix for not combined_mgs_mds d41a508a49 LU-16241 obdclass: NL_SET_ERR_MSG work for older kernel aaab50dd88 LU-16249 sec: krb5_decrypt_bulk calls decryption primitive b533700add LU-16232 scripts: changelog/updatelog emergency cleanup b4e06ff1e4 LU-16172 o2iblnd: add verbose debug prints for rx/tx events c619b6d6a5 LU-16258 llite: Explicitly support .splice_write 8f718df474 LU-15852 lnet: Don't modify uptodate peer with temp NI 4f2914537c LU-15139 osp: block reads until the object is created ec1b0dc6e4 LU-15626 tests: Fix "error" reported by shellcheck for replay-dual 8e563553c0 LU-15626 tests: Fix "error" reported by shellcheck b2274a7160 LU-15619 osc: Remove oap lock 7e0a2b0737 LU-15046 osp: precreate thread vs connect race ed0f74cd63 LU-6142 obdclass: make ccc_users in cl_client_cache a refcount_t 5b911e0326 LU-16160 llite: clear stale page's uptodate bit 1a79d395dd LU-15935 target: keep track of multirpc slots in last_rcvd 62fd8f9b49 Revert "LU-16046 revert: "LU-9964 llite: prevent mulitple group locks"" dd609c6f31 Revert "LU-16046 ldlm: group lock fix" Nothing immediately jumps out as an obvious cause. One possibility might be " LU-15947 obdclass: improve precision of wakeups for mod_rpcs " if there was some problem with the wakeups. Patches committed the previous week seem even less likely: # git log --oneline --after 2022-10-25 --before 2022-11-01 6aee406c84 LU-14719 lod: distributed transaction check space d4848d779b LU-16187 tests: Fix is_project_quota_supported() 9309a32c1a LU-16240 build: Use new AS_HELP_STRING 83d3f42118 LU-15305 obdclass: fix race in class_del_profile 1df5199097 LU-15807 ksocklnd: fix irq lock inversion while calling sk_data_ready() 8a2bd9661d LU-16177 kernel: kernel update RHEL9.0 [5.14.0-70.26.1.el9_0] 5547158daa LU-16175 kernel: kernel update SLES12 SP5 [4.12.14-122.133.1] 5d31fdaf68 LU-16174 kernel: kernel update SLES15 SP4 [5.14.21-150400.24.21.2] c3467db7e7 LU-16173 kernel: kernel update SLES15 SP3 [5.3.18-150300.59.93.1] 3493db6a67 LU-16233 build: Add always target for SUSE15 SP3 LTSS 950e59ced1 LU-15234 lnet: add mechanism for dumping lnd peer debug info 0a639a32cf LU-15795 lbuild: enable KABI c5a436d849 LU-12130 test: pool inheritance for mdt component 78dddb423f LU-15447 tests: sanity-flr/208 reset rotational status c24a38b583 LU-16183 test: sanity-hsm/70 should detect python 04e5fa7b24 LU-13175 tests: sanity/803 to sync MDTs for actual statfs 2e08974390 LU-16139 statahead: avoid to block ptlrpcd interpret context a966b624ac LU-16149 lnet: Discovery queue and deletion race 0a317b171e LU-15847 tgt: move tti_ transaction params to tsi_ 4e2e8fd2fc LU-15847 tgt: reply always with the latest assigned transno 4468f6c9d9 LU-16025 llite: adjust read count as file got truncated The LU-14719 patch has had a few issues, but those are mostly during recovery, which isn't the case here. It might be necessary

            People

              wc-triage WC Triage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: