Lustre / LU-11206

recovery-mds-scale test failover_ost fails with 'import is not in FULL state'


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.12.0
    • Component/s: None
    • Severity: 3

    Description

      recovery-mds-scale test_failover_ost fails with an OST in the IDLE state when it should be in the FULL state. From the test_log we can see that we successfully fail over OSTs 55 times and, on the 56th failover, one OST is left in the IDLE state (a way to query this state by hand is sketched after the log excerpt):

      trevis-4vm8: trevis-4vm8.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
      trevis-4vm7: trevis-4vm7.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
      trevis-4vm8: CMD: trevis-4vm8.trevis.whamcloud.com lctl get_param -n at_max
      trevis-4vm7: CMD: trevis-4vm7.trevis.whamcloud.com lctl get_param -n at_max
      trevis-4vm8: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm8: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm8: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm8: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm7: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      trevis-4vm8:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0004-osc-ffff*.ost_server_uuid into FULL state after 1475 sec, have IDLE 
      trevis-4vm8:   Trace dump:
      trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:5742:error()
      trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6846:_wait_import_state()
      trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6868:wait_import_state()
      trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6877:wait_import_state_mount()
      trevis-4vm8:   = rpc.sh:21:main()
      
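      For reference, the parameter the test framework is polling can be read directly on a client. A minimal sketch, assuming the same device naming as in the log above (the OST index and osc device pattern will differ on other systems):

        # Print just the import state for the affected OSC device; a healthy
        # import reports, e.g., "lustre-OST0004_UUID FULL":
        lctl get_param -n osc.lustre-OST0004-osc-ffff*.ost_server_uuid

        # Dump the full import record (connection state, connect flags, etc.)
        # for a closer look at why the import is not FULL:
        lctl get_param osc.lustre-OST0004-osc-ffff*.import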

      I can't find any stack traces or Lustre errors that would explain the issue.
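
      Note that IDLE is not an error state by itself: since Lustre 2.12, an OSC import can drop to IDLE when the idle-disconnect feature tears down an unused connection after osc.*.idle_timeout seconds (20 by default). If that feature is suspected of racing with the failover check, one way to rule it out on the clients is (a sketch, not part of the test framework):

        # Disable idle disconnect so imports should stay in FULL while idle:
        lctl set_param osc.*.idle_timeout=0

        # The current setting can be checked with:
        lctl get_param osc.*.idle_timeout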

      So far, we've only seen this for failover testing and only with ZFS.

      Logs for this failure are at
      https://testing.whamcloud.com/test_sets/2f7df696-91c2-11e8-8ee3-52540065bddc

      There are two older test sessions that fail in a similar way, but in those sessions there are no successful OST failovers before the test fails with this error:
      https://testing.whamcloud.com/test_sets/452cb77a-8da8-11e8-a9f7-52540065bddc
      https://testing.whamcloud.com/test_sets/ebaddda0-8ba9-11e8-b0aa-52540065bddc


            People

              Assignee: Patrick Farrell (pfarrell, Inactive)
              Reporter: James Nunez (jamesanunez, Inactive)
              Votes: 0
              Watchers: 6
