Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.12.0
-
None
-
3
Description
recovery-mds-scale test_failover_ost fails with an OST in the IDLE state when it should be in the FULL state. The test_log shows that OSTs were failed over successfully 55 times; on the 56th failover, one OST is in the IDLE state:
trevis-4vm8: trevis-4vm8.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
trevis-4vm7: trevis-4vm7.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
trevis-4vm8: CMD: trevis-4vm8.trevis.whamcloud.com lctl get_param -n at_max
trevis-4vm7: CMD: trevis-4vm7.trevis.whamcloud.com lctl get_param -n at_max
trevis-4vm8: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0004-osc-ffff*.ost_server_uuid into FULL state after 1475 sec, have IDLE
trevis-4vm8: Trace dump:
trevis-4vm8: = /usr/lib64/lustre/tests/test-framework.sh:5742:error()
trevis-4vm8: = /usr/lib64/lustre/tests/test-framework.sh:6846:_wait_import_state()
trevis-4vm8: = /usr/lib64/lustre/tests/test-framework.sh:6868:wait_import_state()
trevis-4vm8: = /usr/lib64/lustre/tests/test-framework.sh:6877:wait_import_state_mount()
trevis-4vm8: = rpc.sh:21:main()
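For triage, the log excerpt above can be summarized per client with a short script. This is only a sketch and not part of the Lustre test framework: the function names `summarize` and `missing` are hypothetical, and the regular expressions assume the message wording shown in this log. It collects which OST imports each client reported as FULL and extracts the failure record.

```python
import re
from collections import defaultdict

# Assumed message formats, taken from the log excerpt in this ticket.
FULL_RE = re.compile(
    r"(\S+): osc\.lustre-(OST[0-9a-fA-F]{4})-osc-\S+ in FULL state after \d+ sec"
)
FAIL_RE = re.compile(
    r"(\S+): rpc : @+ FAIL: can't put import for "
    r"osc\.lustre-(OST[0-9a-fA-F]{4})-osc-\S+ into (\w+) state after\s+(\d+) sec, have (\w+)"
)


def summarize(log_text):
    """Return ({client: set of OSTs reported FULL}, [failure tuples])."""
    full = defaultdict(set)
    for client, ost in FULL_RE.findall(log_text):
        full[client].add(ost)
    failures = [
        (client, ost, wanted, int(secs), actual)
        for client, ost, wanted, secs, actual in FAIL_RE.findall(log_text)
    ]
    return dict(full), failures


def missing(full, client, expected):
    """OSTs from `expected` that `client` never reported as FULL."""
    return sorted(set(expected) - full.get(client, set()))


# Lines copied from the log above (the long "executing" lines are omitted).
SAMPLE = """\
trevis-4vm8: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0004-osc-ffff*.ost_server_uuid into FULL state after 1475 sec, have IDLE
"""

if __name__ == "__main__":
    full, failures = summarize(SAMPLE)
    expected = ["OST%04d" % i for i in range(7)]  # OST0000..OST0006, as in the log
    for client in sorted(full):
        print(client, "not reported FULL:", missing(full, client, expected))
    for failure in failures:
        print("failure:", failure)
```

On this excerpt, trevis-4vm7 reports all seven imports as FULL, while trevis-4vm8 reports only OST0000 through OST0003 before the failure on OST0004. The log does not show the state of OST0005 and OST0006 on trevis-4vm8.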
I can’t find any stack traces or any Lustre errors that would explain what the issue is here.
So far, we've only seen this for failover testing and only with ZFS.
Logs for this failure are at
https://testing.whamcloud.com/test_sets/2f7df696-91c2-11e8-8ee3-52540065bddc
There are two older test sessions that fail in a similar way, but they have no successful OST failovers before failing with this error:
https://testing.whamcloud.com/test_sets/452cb77a-8da8-11e8-a9f7-52540065bddc
https://testing.whamcloud.com/test_sets/ebaddda0-8ba9-11e8-b0aa-52540065bddc