LU-12857

recovery-mds-scale test_failover_ost fails with “import is not in FULL state”


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Labels: None
    • Severity: 3

    Description

      recovery-mds-scale test_failover_ost fails with “import is not in FULL state”. In the suite_log, we see

      trevis-50vm3:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have DISCONN 
      trevis-50vm3:   Trace dump:
      trevis-50vm3:   = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
      …
      recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/test-framework.sh:7245:wait_clients_import_state()
        = /usr/lib64/lustre/tests/recovery-mds-scale.sh:159:failover_target()
        = /usr/lib64/lustre/tests/recovery-mds-scale.sh:242:test_failover_ost()
      

      Then the test suite hangs and times out.
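
      For reference, the import state the test waits on can be checked by hand on a client. This is a hedged sketch using standard lctl parameters; the actual polling is done by wait_clients_import_state() in test-framework.sh, whose internals may differ:

      # Sketch only: the ost_server_uuid parameter named in the error reports the
      # target UUID and the import state (FULL, DISCONN, CONNECTING, ...).
      lctl get_param osc.lustre-OST0000-osc-*.ost_server_uuid
      # The full import record (state, connection/failover NIDs, retry counters)
      # is also available for deeper inspection:
      lctl get_param osc.lustre-OST0000-osc-*.import | grep -E 'state|failover_nids'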

      In the case of https://testing.whamcloud.com/test_sets/e816cdac-eb87-11e9-be86-52540065bddc, we fail over ost4 from vm5 to vm6 and the OST failover looks successful. Checking the clients after the failover, we see

      [86033.161862] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      [86033.416887] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      [86035.230364] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost4 has failed over 1 times, and counting...
      [86035.476561] Lustre: DEBUG MARKER: ost4 has failed over 1 times, and counting...
      [86079.403400] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86079.406508] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 8 previous similar messages
      [86344.411364] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86344.413633] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
      [86827.848075] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=408 DURATION=86400 PERIOD=1200
      [86828.072043] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
      [86829.772372] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost6 recovery complete before doing next failover...
      [86830.015147] Lustre: DEBUG MARKER: Wait ost6 recovery complete before doing next failover...
      

      We check recovery_status for each of the OSTs and find there is an issue

      [86841.794013] Lustre: DEBUG MARKER: trevis-50vm6.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0006.recovery_status 1475
      [86842.076196] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
      [86842.297866] Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      [86859.446351] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86859.448604] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages
      [87464.461375] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [87464.463680] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [88064.484412] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [88064.486869] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [88350.603831] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state 
      [88350.872125] Lustre: DEBUG MARKER: recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
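
      For reference, the recovery state that _wait_recovery_complete polls can be read directly on the OSS. A minimal sketch, assuming an ldiskfs OST whose target sits under obdfilter (the test itself uses the wildcard *.lustre-OST0006.recovery_status):

      # Sketch only: per-target recovery status on the OSS.
      lctl get_param obdfilter.lustre-OST0006.recovery_status
      # Key fields: status (COMPLETE / RECOVERING / INACTIVE), recovery_start,
      # recovery_duration, completed_clients, evicted_clients.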
      

      We then try to exit the test, but continue to see LNet errors

      [88354.805689] Server failover period: 1200 seconds
      [88354.805689] Exited after:           408 seconds
      [88354.805689] Number of failovers before exit:
      [88354.805689] mds1: 0 times
      [88354.805689] ost1: 0 times
      [88354.805689] ost2: 0 times
      [88354.805689] ost3: 0 times
      [88354.805689] ost4: 1 times
      [88354.805689] ost5: 0 times
      [88354.805689] ost6: 0 times
      [88354.805689] o
      [88355.027288] Lustre: DEBUG MARKER: Duration: 86400
      [88355.647928] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_7-x86_64--failover--1_25__52___8d15cfb4-3473-4202-941b-c914ac734bd4/recovery-mds-scale.test_failover_ost.debug_log.$(hostname -s).1570642546.log;
      [88355.647928]          dmesg > /autotest/autotest2/2019
      [88360.734604] Lustre: lustre-OST0001: Connection restored to d9d4bfb9-3fbd-d3f0-7667-bf145a641dfe (at 10.9.50.1@tcp)
      [88360.736451] Lustre: Skipped 7 previous similar messages
      [88669.517330] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [88669.519558] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [89279.521338] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
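
      The repeated lnet_peer_ni_add_to_recoveryq_locked() messages mean LNet health has marked the peer NI 10.9.50.8@tcp as unhealthy (Health = 0) and keeps requeueing it for recovery. A hedged sketch of how this can be inspected from the affected node with lnetctl (option names as in common usage; output layout varies by version):

      # Sketch only: health statistics for the problematic peer NI ...
      lnetctl peer show --nid 10.9.50.8@tcp -v
      # ... and local NI state plus the global LNet health settings.
      lnetctl net show -v
      lnetctl global show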
      

      The call traces are displayed. We see ll_ost00_00*, ll_ost_io00_00*, and ll_ost_create00 call traces for some OSTs. For example, we see

      [172080.643164] ll_ost00_000    S ffff92afba003150     0 24093      2 0x00000080
      [172080.644487] Call Trace:
      [172080.644949]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.645903]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.647064]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.648125]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.649264]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.650356]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.651428]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.652739]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.653638]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.654737]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.655906]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.682767] ll_ost_create00 S ffff92afbc14a0e0     0 24096      2 0x00000080
      [172080.684064] Call Trace:
      [172080.684525]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.685421]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.686731]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.687779]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.688908]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.689999]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.691013]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.692314]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.693223]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.694371]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.695537]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.710767] ll_ost_io00_000 S ffff92afb8075230     0 24098      2 0x00000080
      [172080.712129] Call Trace:
      [172080.712586]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.713487]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.714784]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.715887]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.717015]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.718104]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.719126]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.720443]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.721352]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.722448]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.723612]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.752571] ll_ost_seq00_00 S ffff92afb8071070     0 24101      2 0x00000080
      [172080.753858] Call Trace:
      [172080.754332]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.755326]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.756527]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.757566]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.758745]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.759848]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.760868]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.762189]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.763085]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.764193]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.765353]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      …
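
      For completeness, such task-state and stack dumps can be produced on demand on any node via the generic Linux SysRq facility; this is not necessarily how the test harness generated the traces above, just a way to reproduce the view:

      # Sketch only: dump all task states and kernel stacks to the kernel log,
      # then pull out the OST service threads.
      echo t > /proc/sysrq-trigger
      dmesg | grep -A 14 'll_ost'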
      


            People

              Assignee: Amir Shehata (ashehata) (Inactive)
              Reporter: James Nunez (jamesanunez) (Inactive)
              Votes: 0
              Watchers: 4
