Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
Affects Version/s: Lustre 2.12.0
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

recovery-mds-scale test_failover_ost fails with an OST in IDLE state when it should be in FULL state. From the test_log, we can see that the OSTs fail over successfully 55 times and, on the 56th failover, one client (trevis-4vm8) reports one OST (lustre-OST0004) in the IDLE state:

trevis-4vm8: trevis-4vm8.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
trevis-4vm7: trevis-4vm7.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid,osc.lustre-OST0001-osc-ffff*.ost_server_uuid,osc.lustre-OST0002-osc-ffff*.ost_server_uuid,osc.lustre-OST0003-osc-ffff*.ost_server_uuid,osc.lustre-OST0004-osc-ffff*.ost_server_uuid,osc.lustre-OST0005-osc-ffff*.ost_server_uuid,osc.lustre-OST0006-osc-ffff*.ost_server_uuid
trevis-4vm8: CMD: trevis-4vm8.trevis.whamcloud.com lctl get_param -n at_max
trevis-4vm7: CMD: trevis-4vm7.trevis.whamcloud.com lctl get_param -n at_max
trevis-4vm8: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm7: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-4vm8:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0004-osc-ffff*.ost_server_uuid into FULL state after 1475 sec, have IDLE 
trevis-4vm8:   Trace dump:
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:5742:error()
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6846:_wait_import_state()
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6868:wait_import_state()
trevis-4vm8:   = /usr/lib64/lustre/tests/test-framework.sh:6877:wait_import_state_mount()
trevis-4vm8:   = rpc.sh:21:main()
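
For triage, a log like the one above can be summarized per client. The following is only a minimal sketch: the regular expressions are assumptions based on the line formats shown in this excerpt, and the `summarize` helper and `SAMPLE` data are hypothetical names, not part of test-framework.sh.

```python
import re
from collections import defaultdict

# Assumed format, taken from the excerpt above:
#   <node>: osc.<fs>-OSTxxxx-osc-ffff*.ost_server_uuid in FULL state after 0 sec
OK_RE = re.compile(
    r"^(?P<node>[\w.-]+):\s+osc\.\w+-(?P<ost>OST[0-9a-fA-F]{4})-osc-\S+"
    r" in (?P<state>\w+) state after (?P<sec>\d+) sec"
)
# Assumed format, taken from the excerpt above:
#   <node>: ... FAIL: can't put import for osc.<fs>-OSTxxxx-... into FULL
#   state after 1475 sec, have IDLE
FAIL_RE = re.compile(
    r"^(?P<node>[\w.-]+):.*FAIL: can.t put import for osc\.\w+-"
    r"(?P<ost>OST[0-9a-fA-F]{4})-osc-\S+ into (?P<want>\w+) state"
    r" after (?P<sec>\d+) sec, have (?P<have>\w+)"
)


def summarize(lines):
    """Return (reached, failures) for a wait_import_state log.

    reached maps node name -> set of OSTs that reported the awaited state;
    failures is a list of (node, ost, wanted, actual, seconds) tuples.
    """
    reached = defaultdict(set)
    failures = []
    for line in lines:
        m = FAIL_RE.match(line)
        if m:
            failures.append(
                (m["node"], m["ost"], m["want"], m["have"], int(m["sec"]))
            )
            continue
        m = OK_RE.match(line)
        if m:
            reached[m["node"]].add(m["ost"])
    return reached, failures


# A few lines copied from the log in this ticket.
SAMPLE = [
    "trevis-4vm8: trevis-4vm8.trevis.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid",
    "trevis-4vm8: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec",
    "trevis-4vm7: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec",
    "trevis-4vm8:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0004-osc-ffff*.ost_server_uuid into FULL state after 1475 sec, have IDLE ",
]

reached, failures = summarize(SAMPLE)
for node, ost, want, have, sec in failures:
    print(f"{node}: {ost} wanted {want}, have {have} after {sec} sec")
```

Run against the full test_log, a summary of this kind makes it easier to see that only trevis-4vm8 is affected and that trevis-4vm7 reaches FULL for the same OST.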

I can't find any stack traces or Lustre errors that would explain this failure.

So far, we've only seen this for failover testing and only with ZFS.

Logs for this failure are at
https://testing.whamcloud.com/test_sets/2f7df696-91c2-11e8-8ee3-52540065bddc

There are two older test sessions that fail in a similar way, but they have no successful OST failovers before they fail with this error:
https://testing.whamcloud.com/test_sets/452cb77a-8da8-11e8-a9f7-52540065bddc
https://testing.whamcloud.com/test_sets/ebaddda0-8ba9-11e8-b0aa-52540065bddc

Issue Links

is related to

LU-7236 OST connect and disconnect on demand

Resolved

People

Assignee: Patrick Farrell (Inactive)

Reporter: James Nunez (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 03/Aug/18 7:35 PM

Updated: 01/Apr/19 2:15 PM

Resolved: 27/Feb/19 5:38 AM