LU-12857

recovery-mds-scale test_failover_ost fails with “import is not in FULL state”

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Labels: None
    • Severity: 3

    Description

recovery-mds-scale test_failover_ost fails with “import is not in FULL state”. In the suite_log, we see

      trevis-50vm3:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have DISCONN 
      trevis-50vm3:   Trace dump:
      trevis-50vm3:   = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
      …
      recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/test-framework.sh:7245:wait_clients_import_state()
        = /usr/lib64/lustre/tests/recovery-mds-scale.sh:159:failover_target()
        = /usr/lib64/lustre/tests/recovery-mds-scale.sh:242:test_failover_ost()
      

The test suite then hangs and times out.
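      For context, the failing check comes from wait_clients_import_state() in test-framework.sh, which polls the client import state through lctl. Below is a minimal sketch of that polling loop; the function name, defaults, and output parsing are simplifications and assumptions, so consult test-framework.sh for the exact logic.

      # Sketch of the import-state wait, loosely modeled on the
      # test-framework.sh helpers (simplified; parsing is an assumption).
      wait_import_state_sketch() {
          local expect=$1                # e.g. FULL
          local param=$2                 # e.g. osc.lustre-OST0000-osc-*.ost_server_uuid
          local maxtime=${3:-1475}
          local state=""
          local i
          for ((i = 0; i < maxtime; i++)); do
              # ost_server_uuid reports the server UUID followed by the
              # current import state (assumed output format)
              state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}')
              [[ "$state" == "$expect" ]] && return 0
              sleep 1
          done
          echo "can't put import for $param into $expect state after $maxtime sec, have $state"
          return 1
      }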

      In the case of https://testing.whamcloud.com/test_sets/e816cdac-eb87-11e9-be86-52540065bddc, we fail over ost4 from vm5 to vm6 and the OST failover looks successful. Checking the clients after failover, we see

      [86033.161862] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      [86033.416887] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      [86035.230364] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost4 has failed over 1 times, and counting...
      [86035.476561] Lustre: DEBUG MARKER: ost4 has failed over 1 times, and counting...
      [86079.403400] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86079.406508] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 8 previous similar messages
      [86344.411364] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86344.413633] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
      [86827.848075] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=408 DURATION=86400 PERIOD=1200
      [86828.072043] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
      [86829.772372] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost6 recovery complete before doing next failover...
      [86830.015147] Lustre: DEBUG MARKER: Wait ost6 recovery complete before doing next failover...
      

      We check recovery_status for each of the OSTs and find there is an issue

      [86841.794013] Lustre: DEBUG MARKER: trevis-50vm6.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0006.recovery_status 1475
      [86842.076196] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
      [86842.297866] Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      [86859.446351] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86859.448604] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages
      [87464.461375] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [87464.463680] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [88064.484412] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [88064.486869] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [88350.603831] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state 
      [88350.872125] Lustre: DEBUG MARKER: recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
      
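      (For reference, recovery status can be polled directly on the OSS; a minimal example, assuming obdfilter targets and that the output format matches this release:)

      # Manually checking OST recovery status on the OSS. The parameter name
      # is standard; the exact output fields vary by release.
      lctl get_param obdfilter.lustre-OST0006.recovery_status
      # status: COMPLETE            <- what _wait_recovery_complete() waits for
      # completed_clients: ...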

      We then try to exit the test but continue to see LNet errors

      [88354.805689] Server failover period: 1200 seconds
      [88354.805689] Exited after:           408 seconds
      [88354.805689] Number of failovers before exit:
      [88354.805689] mds1: 0 times
      [88354.805689] ost1: 0 times
      [88354.805689] ost2: 0 times
      [88354.805689] ost3: 0 times
      [88354.805689] ost4: 1 times
      [88354.805689] ost5: 0 times
      [88354.805689] ost6: 0 times
      [88354.805689] o
      [88355.027288] Lustre: DEBUG MARKER: Duration: 86400
      [88355.647928] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_7-x86_64--failover--1_25__52___8d15cfb4-3473-4202-941b-c914ac734bd4/recovery-mds-scale.test_failover_ost.debug_log.$(hostname -s).1570642546.log;
      [88355.647928]          dmesg > /autotest/autotest2/2019
      [88360.734604] Lustre: lustre-OST0001: Connection restored to d9d4bfb9-3fbd-d3f0-7667-bf145a641dfe (at 10.9.50.1@tcp)
      [88360.736451] Lustre: Skipped 7 previous similar messages
      [88669.517330] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [88669.519558] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [89279.521338] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      

      Call traces for all threads are then dumped. We see ll_ost00_00*, ll_ost_io00_00*, and ll_ost_create00 call traces for some OSTs. For example:

      [172080.643164] ll_ost00_000    S ffff92afba003150     0 24093      2 0x00000080
      [172080.644487] Call Trace:
      [172080.644949]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.645903]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.647064]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.648125]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.649264]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.650356]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.651428]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.652739]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.653638]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.654737]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.655906]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.682767] ll_ost_create00 S ffff92afbc14a0e0     0 24096      2 0x00000080
      [172080.684064] Call Trace:
      [172080.684525]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.685421]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.686731]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.687779]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.688908]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.689999]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.691013]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.692314]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.693223]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.694371]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.695537]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.710767] ll_ost_io00_000 S ffff92afb8075230     0 24098      2 0x00000080
      [172080.712129] Call Trace:
      [172080.712586]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.713487]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.714784]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.715887]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.717015]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.718104]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.719126]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.720443]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.721352]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.722448]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.723612]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.752571] ll_ost_seq00_00 S ffff92afb8071070     0 24101      2 0x00000080
      [172080.753858] Call Trace:
      [172080.754332]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.755326]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.756527]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.757566]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.758745]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.759848]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.760868]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.762189]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.763085]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.764193]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.765353]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      …
      

      Activity

            pjones Peter Jones added a comment -

            Landed for 2.15


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45318/
            Subject: LU-12857 tests: allow clients to be IDLE after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: af666bef058c5b7997527fc851a84a89375912fb

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45318/ Subject: LU-12857 tests: allow clients to be IDLE after recovery Project: fs/lustre-release Branch: master Current Patch Set: Commit: af666bef058c5b7997527fc851a84a89375912fb

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45318
            Subject: LU-12857 tests: allow clients to be IDLE after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 22573acdaf7fa4dfbdd86067971a170a45ae6a6f

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45318 Subject: LU-12857 tests: allow clients to be IDLE after recovery Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 22573acdaf7fa4dfbdd86067971a170a45ae6a6f
            adilger Andreas Dilger added a comment - edited

            In a recovery-mds-scale test failure I looked at, I see that the client import state check is failing because the import is IDLE instead of FULL. That is normal if osc.*.idle_disconnect is enabled, which is the default on master:
            https://testing.whamcloud.com/test_sets/df019c9d-d46a-48ea-90fd-f6fa0665ebe1

            trevis-55vm7:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0003-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have IDLE 
            
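            A minimal sketch of the direction taken by patch 45318 ("allow clients to be IDLE after recovery"): treat IDLE as acceptable alongside FULL when idle disconnect is enabled. The helper name and parsing below are illustrative, not the actual test-framework.sh code.

            # Illustrative only: accept FULL or IDLE as a "ready" import state.
            client_import_ready() {
                local param=$1         # e.g. osc.lustre-OST0003-osc-*.ost_server_uuid
                local state
                state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}')
                [[ "$state" == "FULL" || "$state" == "IDLE" ]]
            }

            # Idle disconnect is active when idle_timeout is non-zero:
            lctl get_param osc.*.idle_timeout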
            pjones Peter Jones added a comment -

            I would strongly suggest that we re-disable health by default to avoid this issue for users who are not intending to use this feature.


            The OST thread stack traces with ptlrpc_wait_event() are normal - these are idle threads waiting for requests to process, so it looks like the OST is mounted, but not getting requests.

            AFAIK, the messages "lnet_peer_ni_add_to_recoveryq_locked() ... Health = 0" indicate that LNet identified that there were errors sending to that node, but it seems that it is never recovering from that problem. This might be a case where LNet Health is making a bad decision when there is only a single interface to a peer, and it can never recover from this case because it isn't sending any messages over that interface that would increase the health value?
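            (The stuck health value can be inspected from userspace; for example, assuming a release where lnetctl reports per-NI health statistics with -v:)

            # Show statistics, including the health value, for the peer NID
            # that keeps landing on the recovery queue.
            lnetctl peer show --nid 10.9.50.8@tcp -v
            # expect a "health value" stuck at 0 under the peer NI statistics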

            There were a few LNet-related patches landed recently to b2_12 that might be the source of this problem, in particular:

            https://review.whamcloud.com/34252 LU-11816 lnet: setup health timeout defaults
            https://review.whamcloud.com/34967 LU-12344 lnet: handle remote health error
            https://review.whamcloud.com/33304 LU-11478 lnet: misleading discovery seqno
            

            The first patch (34252) is enabling LNet Health on b2_12 for the first time, which may not be prudent given that this hasn't been widely deployed in production yet (it originally landed as v2_12_53-66-g8632e94aeb so 2.13 would be the first release to use it). Since we are trying to get 2.12.3 out the door ASAP, if we can't understand and resolve this quickly, it probably makes sense to revert 34252, or make a small patch that defaults to having health disabled until we can get more operational experience in this mode.
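            (A sketch of the "health disabled by default" workaround suggested above, assuming the standard tunable names; verify against the release in question:)

            # Disable health-based behavior at runtime by zeroing the sensitivity:
            lnetctl set health_sensitivity 0
            # or persistently via the module parameter:
            # options lnet lnet_health_sensitivity=0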


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)