
sanity-quota: test_7a Error: 'reintegration failed'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.4
    • Lustre 2.16.0, Lustre 2.15.3
    • None
    • 3

    Description

      This issue was created by maloo for paf <pfarrell@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ded8a9a9-b77c-41ca-bb53-105290b2709e

          Activity

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50688/
            Subject: LU-15123 tests: check quota reintegration after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4432b6e2824775e292f96e202d6fc0db231bc749

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50688
            Subject: LU-15123 tests: quota reintegration starts after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e6a2cb8c1aa0a96ec1d6e4603132635459ac0615

            bzzz Alex Zhuravlev added a comment -

            [13964.128411] Lustre: lustre-OST0000: Imperative Recovery enabled, recovery window shrunk from 60-180 down to 60-180
            [13965.655492] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
            ...
            [14061.885567] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-quota test_7a: @@@@@@ FAIL: reintegration failed
            [14067.469119] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
            [14067.470635] Lustre: lustre-OST0000: disconnecting 1 stale clients
            [14067.787175] Lustre: lustre-OST0000: Recovery over after 1:42, of 5 clients 4 recovered and 1 was evicted.

            Reintegration starts only when recovery is over. In this case recovery was stuck waiting for a missing client (which was evicted in the end) and took 102 seconds, while test_7a waits 90s at most.
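
            The timing argument in the comment above can be checked directly against the quoted dmesg timestamps. The following is a minimal sketch, not part of the Lustre test suite: the helper names are made up for illustration, the "Will be in recovery" line is treated as an approximate start of recovery, and the 90 s budget is taken from the comment rather than from the test source.

```python
import re

# Three dmesg lines copied from the OST0000 console log quoted in the comment.
LOG = """\
[13965.655492] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
[14061.885567] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-quota test_7a: @@@@@@ FAIL: reintegration failed
[14067.787175] Lustre: lustre-OST0000: Recovery over after 1:42, of 5 clients 4 recovered and 1 was evicted.
"""

# Assumption: the maximum wait of test_7a, as stated in the comment.
TEST_WAIT_SECONDS = 90


def timestamp_of(log, needle):
    """Return the dmesg timestamp (seconds) of the first line containing needle."""
    for line in log.splitlines():
        if needle in line:
            return float(re.match(r"\[\s*(\d+\.\d+)\]", line).group(1))
    raise ValueError(f"no log line contains {needle!r}")


def reported_recovery_seconds(log):
    """Convert the 'Recovery over after M:SS' message into seconds."""
    m = re.search(r"Recovery over after (\d+):(\d+)", log)
    return int(m.group(1)) * 60 + int(m.group(2))


recovery_start = timestamp_of(LOG, "Will be in recovery")
test_failed = timestamp_of(LOG, "FAIL: reintegration failed")
recovery_over = timestamp_of(LOG, "Recovery over")

# Duration between the two recovery messages, in seconds.
measured = recovery_over - recovery_start

print(f"recovery reported by OST:  {reported_recovery_seconds(LOG)} s")
print(f"recovery measured in log:  {measured:.1f} s")
print(f"test wait budget:          {TEST_WAIT_SECONDS} s")
print(f"test failed {recovery_over - test_failed:.1f} s before recovery ended")
```

            Both the reported and the measured durations exceed the 90 s budget, and the failure marker is logged before the "Recovery over" message, which is consistent with reintegration never having had a chance to start.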
            yujian Jian Yu added a comment - +1 on b2_15 branch: https://testing.whamcloud.com/test_sets/150ed79f-6d70-4048-b875-56a9bccc54cf
            nangelinas Nikitas Angelinas added a comment - +1 on master: https://testing.whamcloud.com/test_sets/3d837ba0-73c7-4737-9894-5bbca2c9b479
            nangelinas Nikitas Angelinas added a comment - +1 on master: https://testing.whamcloud.com/test_sets/dc29fa4d-ef8c-4838-a182-7a544385f4cc
            zam Alexander Zarochentsev added a comment - +1 on master: https://testing.whamcloud.com/test_sets/db2d7940-4287-4ff0-9435-81c6520360b7
            qian_wc Qian Yingjin added a comment - +1 on master: https://testing.whamcloud.com/test_sets/7cd51716-a038-4e24-a46b-4a14e83cc1a2

            adilger Andreas Dilger added a comment - Still being hit on master, 14/310 runs in the past week.
            adilger Andreas Dilger added a comment - +3 on master, all on the same patch: https://testing.whamcloud.com/test_sessions/dbc4c026-0863-41b3-b17c-b695415fa8aa https://testing.whamcloud.com/test_sessions/3db9b21f-fd9b-49a0-bbd7-5dda5aea1be9 https://testing.whamcloud.com/test_sessions/e22114c8-b11d-4494-ae19-030d82261043

            scherementsev Sergey Cheremencev added a comment -

            +1 on master - https://testing.whamcloud.com/test_sets/f07f8d61-0faa-41ad-9217-c238bd4c2bb0

            I guess it fails because OST0000 waits for client 1 during recovery, which blocks reintegration from starting (and finishing). In the end it evicts this client:

            [15386.910732] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
            ...
            [15485.424725] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-quota test_7a: @@@@@@ FAIL: reintegration failed
            [15485.858740] Lustre: DEBUG MARKER: sanity-quota test_7a: @@@@@@ FAIL: reintegration failed
            [15486.354991] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2021-12-15/lustre-reviews_review-dne-zfs-part-4_85112_1_13_4b065c95-2177-4a6a-b5c8-32b025199627//sanity-quota.test_7a.debug_log.$(hostname -s).1639605011.log;
                           dmesg > /autotest/autotest-1/2021-12-15/lustre-review
            [15488.876821] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
            [15488.878099] Lustre: lustre-OST0000: disconnecting 1 stale clients
            [15489.069798] Lustre: lustre-OST0000: Recovery over after 1:42, of 5 clients 4 recovered and 1 was evicted.

            Logs from client1 dmesg:

            [15387.000236] Lustre: lustre-OST0000-osc-ffff9247ca727000: Connection to lustre-OST0000 (at 10.240.26.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [15407.479024] Lustre: lustre-OST0001-osc-ffff9247ca727000: disconnect after 23s idle
            [15409.071515] Lustre: lustre-OST0000-osc-ffff9247ca727000: Connection restored to 10.240.26.143@tcp (at 10.240.26.143@tcp)
            [15474.038827] LustreError: 11-0: lustre-OST0000-osc-ffff9247ca727000: operation ost_disconnect to node 10.240.26.143@tcp failed: rc = -107
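
            Since recovery blocks reintegration, one way a test could avoid this race is to wait for recovery first and only then start the reintegration timer. The sketch below illustrates that ordering with a simulated clock. It is an assumption about the general approach, not the content of patch 50688: `wait_until`, `wait_reintegration`, the two predicates, and the 180 s recovery budget (taken from the upper bound of the 60-180 recovery window in the log) are all hypothetical.

```python
import time


def wait_until(predicate, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def wait_reintegration(recovery_done, reint_done,
                       recovery_timeout, reint_timeout, **kw):
    """Two-phase wait: recovery first, then reintegration with its own budget."""
    if not wait_until(recovery_done, recovery_timeout, **kw):
        return "recovery timed out"
    if not wait_until(reint_done, reint_timeout, **kw):
        return "reintegration failed"
    return "ok"


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(two_phase):
    clock = FakeClock()
    # Hypothetical timeline: recovery ends at 102 s (as in the logs),
    # reintegration is assumed to finish a few seconds later.
    recovery_done = lambda: clock.now >= 102
    reint_done = lambda: clock.now >= 110
    kw = dict(clock=clock, sleep=clock.sleep)
    if two_phase:
        return wait_reintegration(recovery_done, reint_done, 180, 90, **kw)
    # Single 90 s budget that also has to cover recovery.
    return "ok" if wait_until(reint_done, 90, **kw) else "reintegration failed"


print("single 90s wait:", simulate(two_phase=False))
print("recovery, then 90s wait:", simulate(two_phase=True))
```

            With a single 90 s budget the simulated test fails while recovery is still in progress. With the two-phase wait the same timeline passes, because the reintegration timer starts only after recovery is over.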

            People

              bzzz Alex Zhuravlev
              maloo Maloo