[LU-11206] recovery-mds-scale test failover_ost fails with 'import is not in FULL state' Created: 03/Aug/18  Updated: 01/Apr/19  Resolved: 27/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: Patrick Farrell (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related: is related to LU-7236 (connections on demand), Resolved
Severity: 3

Description

recovery-mds-scale test_failover_ost fails with an OST in the IDLE state when it should be in the FULL state. From the test_log, we can see that the OSTs fail over successfully 55 times and that, on the 56th failover, one OST is in the IDLE state:

trevis-4vm8: trevis-4vm8.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
trevis-4vm7: trevis-4vm7.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
trevis-4vm8: CMD: trevis-4vm8.trevis.whamcloud.com lctl get_param -n at_max
trevis-4vm7: CMD: trevis-4vm7.trevis.whamcloud.com lctl get_param -n at_max
trevis-4vm8: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0004-osc-ffff*.ost_server_uuid into FULL state after 1475 sec, have IDLE 
trevis-4vm8:   Trace dump:
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:5742:error()
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6846:_wait_import_state()
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6868:wait_import_state()
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6877:wait_import_state_mount()
trevis-4vm8:   = rpc.sh:21:main()

I can’t find any stack traces or Lustre errors that would explain the issue.

So far, we've only seen this for failover testing and only with ZFS.

Logs for this failure are at
https://testing.whamcloud.com/test_sets/2f7df696-91c2-11e8-8ee3-52540065bddc

There are two older test sessions that fail in a similar way, but they have no successful OST failovers when they fail with this error:
https://testing.whamcloud.com/test_sets/452cb77a-8da8-11e8-a9f7-52540065bddc
https://testing.whamcloud.com/test_sets/ebaddda0-8ba9-11e8-b0aa-52540065bddc



Comments
Comment by Jian Yu [ 25/Sep/18 ]

conf-sanity test 93 failed with the same issue in DNE ZFS test sessions on the master branch:

https://testing.whamcloud.com/test_sets/fa748b22-c0f6-11e8-b143-52540065bddc
https://testing.whamcloud.com/test_sets/65ac0bcc-bb70-11e8-9df3-52540065bddc
https://testing.whamcloud.com/test_sets/1571b9bc-bb99-11e8-9df3-52540065bddc
https://testing.whamcloud.com/test_sets/3a4fa378-bcea-11e8-a9d9-52540065bddc

It's affecting patch testing.

Comment by Patrick Farrell (Inactive) [ 11/Feb/19 ]

Master:
https://testing.whamcloud.com/test_sessions/2e75b03b-1a6a-4ff8-89a8-bdfe0336a0c2

Comment by Gerrit Updater [ 11/Feb/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34225
Subject: LU-11206 tests: Use import_ready to check IDLE
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 31bf8c928d6f45895c6ba2f02f4b857087aabd18
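
The ticket does not reproduce the patch, so the following is only a minimal sketch of the idea its subject line suggests. It assumes that an import reported as IDLE (presumably a consequence of the connections-on-demand work in LU-7236) should count as ready when a test waits for FULL. The names state_satisfies, wait_state_sketch, read_import_state, and the GET_STATE hook are hypothetical and are not the test-framework.sh functions; reading the state as the second tab-separated field of ost_server_uuid is also an assumption.

```shell
# Hypothetical helper, not the function from the patch: decide whether an
# observed import state satisfies the expected one.
# Assumption: IDLE is acceptable when FULL is expected.
state_satisfies() {
    expected=$1
    actual=$2

    [ "$actual" = "$expected" ] && return 0
    case "$expected:$actual" in
        FULL:IDLE) return 0 ;;
    esac
    return 1
}

# Default state reader for a real client.
# Assumption: the state is the second tab-separated field of the parameter.
read_import_state() {
    lctl get_param -n "$1" 2>/dev/null | cut -f2 | uniq
}

# Hypothetical polling loop. GET_STATE may name a replacement command that
# prints the current state, which lets the loop run without a Lustre client.
wait_state_sketch() {
    expected=$1
    param=$2
    maxtime=${3:-10}
    waited=0

    while :; do
        state=$(${GET_STATE:-read_import_state} "$param")
        if state_satisfies "$expected" "$state"; then
            echo "$param in $state state after $waited sec"
            return 0
        fi
        [ "$waited" -ge "$maxtime" ] && break
        sleep 1
        waited=$((waited + 1))
    done
    echo "can't put import for $param into $expected state after $waited sec, have $state"
    return 1
}
```

With a stub reader, for example GET_STATE=my_stub wait_state_sketch FULL osc.example 5, the loop can be exercised on a machine that has no Lustre client installed.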

Comment by Gerrit Updater [ 27/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34225/
Subject: LU-11206 tests: Use import_ready to check IDLE
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3ed6b8c2ea27b1a3a9fa073e19d77d7c317ae69f

Comment by Peter Jones [ 27/Feb/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 20/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34471
Subject: LU-11206 tests: Use import_ready to check IDLE
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 0e5ede9852562b67d0f23d2264cfa78cc3cbfc97

Comment by Gerrit Updater [ 01/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34471/
Subject: LU-11206 tests: Use import_ready to check IDLE
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 54f4c4341b8330468ad95abd2bf4051a3d17abba
