LU-12857

recovery-mds-scale test_failover_ost fails with “import is not in FULL state”

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Labels: None
    • Severity: 3

    Description

recovery-mds-scale test_failover_ost fails with “import is not in FULL state”. In the suite_log, we see

      trevis-50vm3:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have DISCONN 
      trevis-50vm3:   Trace dump:
      trevis-50vm3:   = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
      …
      recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/test-framework.sh:7245:wait_clients_import_state()
        = /usr/lib64/lustre/tests/recovery-mds-scale.sh:159:failover_target()
        = /usr/lib64/lustre/tests/recovery-mds-scale.sh:242:test_failover_ost()
      

The test suite then hangs and times out.
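      For context, the failing check comes from wait_clients_import_state() in test-framework.sh, which polls the client import state through lctl. Below is a minimal sketch of that polling loop; the function name, defaults, and output parsing are simplifications and assumptions, so consult test-framework.sh for the exact logic.

      # Sketch of the import-state wait, loosely modeled on the
      # test-framework.sh helpers (simplified; parsing is an assumption).
      wait_import_state_sketch() {
          local expect=$1                # e.g. FULL
          local param=$2                 # e.g. osc.lustre-OST0000-osc-*.ost_server_uuid
          local maxtime=${3:-1475}
          local state=""
          local i
          for ((i = 0; i < maxtime; i++)); do
              # ost_server_uuid reports the server UUID followed by the
              # current import state (assumed output format)
              state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}')
              [[ "$state" == "$expect" ]] && return 0
              sleep 1
          done
          echo "can't put import for $param into $expect state after $maxtime sec, have $state"
          return 1
      }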

      In the case of https://testing.whamcloud.com/test_sets/e816cdac-eb87-11e9-be86-52540065bddc, we fail over ost4 from vm5 to vm6 and the OST failover looks successful. Checking the clients after failover, we see

      [86033.161862] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      [86033.416887] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      [86035.230364] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost4 has failed over 1 times, and counting...
      [86035.476561] Lustre: DEBUG MARKER: ost4 has failed over 1 times, and counting...
      [86079.403400] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86079.406508] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 8 previous similar messages
      [86344.411364] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86344.413633] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
      [86827.848075] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=408 DURATION=86400 PERIOD=1200
      [86828.072043] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
      [86829.772372] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost6 recovery complete before doing next failover...
      [86830.015147] Lustre: DEBUG MARKER: Wait ost6 recovery complete before doing next failover...
      

      We check recovery_status for each of the OSTs and find there is an issue

      [86841.794013] Lustre: DEBUG MARKER: trevis-50vm6.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0006.recovery_status 1475
      [86842.076196] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
      [86842.297866] Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      [86859.446351] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [86859.448604] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages
      [87464.461375] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [87464.463680] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [88064.484412] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [88064.486869] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [88350.603831] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state 
      [88350.872125] Lustre: DEBUG MARKER: recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
      
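      (For reference, recovery status can be polled directly on the OSS; a minimal example, assuming obdfilter targets and that the output format matches this release:)

      # Manually checking OST recovery status on the OSS. The parameter name
      # is standard; the exact output fields vary by release.
      lctl get_param obdfilter.lustre-OST0006.recovery_status
      # status: COMPLETE            <- what _wait_recovery_complete() waits for
      # completed_clients: ...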

      We then try to exit the test but continue to see LNet errors

      [88354.805689] Server failover period: 1200 seconds
      [88354.805689] Exited after:           408 seconds
      [88354.805689] Number of failovers before exit:
      [88354.805689] mds1: 0 times
      [88354.805689] ost1: 0 times
      [88354.805689] ost2: 0 times
      [88354.805689] ost3: 0 times
      [88354.805689] ost4: 1 times
      [88354.805689] ost5: 0 times
      [88354.805689] ost6: 0 times
      [88354.805689] o
      [88355.027288] Lustre: DEBUG MARKER: Duration: 86400
      [88355.647928] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_7-x86_64--failover--1_25__52___8d15cfb4-3473-4202-941b-c914ac734bd4/recovery-mds-scale.test_failover_ost.debug_log.$(hostname -s).1570642546.log;
      [88355.647928]          dmesg > /autotest/autotest2/2019
      [88360.734604] Lustre: lustre-OST0001: Connection restored to d9d4bfb9-3fbd-d3f0-7667-bf145a641dfe (at 10.9.50.1@tcp)
      [88360.736451] Lustre: Skipped 7 previous similar messages
      [88669.517330] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      [88669.519558] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
      [89279.521338] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
      

      Call traces for all threads are then dumped. We see ll_ost00_00*, ll_ost_io00_00*, and ll_ost_create00 call traces for some OSTs. For example:

      [172080.643164] ll_ost00_000    S ffff92afba003150     0 24093      2 0x00000080
      [172080.644487] Call Trace:
      [172080.644949]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.645903]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.647064]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.648125]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.649264]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.650356]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.651428]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.652739]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.653638]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.654737]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.655906]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.682767] ll_ost_create00 S ffff92afbc14a0e0     0 24096      2 0x00000080
      [172080.684064] Call Trace:
      [172080.684525]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.685421]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.686731]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.687779]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.688908]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.689999]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.691013]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.692314]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.693223]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.694371]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.695537]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.710767] ll_ost_io00_000 S ffff92afb8075230     0 24098      2 0x00000080
      [172080.712129] Call Trace:
      [172080.712586]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.713487]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.714784]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.715887]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.717015]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.718104]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.719126]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.720443]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.721352]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.722448]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.723612]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.752571] ll_ost_seq00_00 S ffff92afb8071070     0 24101      2 0x00000080
      [172080.753858] Call Trace:
      [172080.754332]  [<ffffffff9e77f229>] schedule+0x29/0x70
      [172080.755326]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
      [172080.756527]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
      [172080.757566]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
      [172080.758745]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
      [172080.759848]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
      [172080.760868]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [172080.762189]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
      [172080.763085]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      [172080.764193]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [172080.765353]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
      …
      

      Activity

            pjones Peter Jones added a comment -

            Landed for 2.15


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45318/
            Subject: LU-12857 tests: allow clients to be IDLE after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: af666bef058c5b7997527fc851a84a89375912fb

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45318/ Subject: LU-12857 tests: allow clients to be IDLE after recovery Project: fs/lustre-release Branch: master Current Patch Set: Commit: af666bef058c5b7997527fc851a84a89375912fb

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45318
            Subject: LU-12857 tests: allow clients to be IDLE after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 22573acdaf7fa4dfbdd86067971a170a45ae6a6f

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45318 Subject: LU-12857 tests: allow clients to be IDLE after recovery Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 22573acdaf7fa4dfbdd86067971a170a45ae6a6f
            adilger Andreas Dilger added a comment - edited

            In a recovery-mds-scale test failure I looked at, I see that the client import state check is failing because the import is IDLE instead of FULL. That is normal if osc.*.idle_disconnect is enabled, which is the default on master:
            https://testing.whamcloud.com/test_sets/df019c9d-d46a-48ea-90fd-f6fa0665ebe1

            trevis-55vm7:  rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0003-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have IDLE 
            
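            A minimal sketch of the direction taken by patch 45318 ("allow clients to be IDLE after recovery"): treat IDLE as acceptable alongside FULL when idle disconnect is enabled. The helper name and parsing below are illustrative, not the actual test-framework.sh code.

            # Illustrative only: accept FULL or IDLE as a "ready" import state.
            client_import_ready() {
                local param=$1         # e.g. osc.lustre-OST0003-osc-*.ost_server_uuid
                local state
                state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}')
                [[ "$state" == "FULL" || "$state" == "IDLE" ]]
            }

            # Idle disconnect is active when idle_timeout is non-zero:
            lctl get_param osc.*.idle_timeout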
            pjones Peter Jones added a comment -

            I would strongly suggest that we re-disable health by default to avoid this issue for users who are not intending to use this feature.


            The OST thread stack traces with ptlrpc_wait_event() are normal - these are idle threads waiting for requests to process, so it looks like the OST is mounted, but not getting requests.

            AFAIK, the messages "lnet_peer_ni_add_to_recoveryq_locked() ... Health = 0" indicate that LNet identified that there were errors sending to that node, but it seems that it is never recovering from that problem. This might be a case where LNet Health is making a bad decision when there is only a single interface to a peer, and it can never recover from this case because it isn't sending any messages over that interface that would increase the health value?
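            (The stuck health value can be inspected from userspace; for example, assuming a release where lnetctl reports per-NI health statistics with -v:)

            # Show statistics, including the health value, for the peer NID
            # that keeps landing on the recovery queue.
            lnetctl peer show --nid 10.9.50.8@tcp -v
            # expect a "health value" stuck at 0 under the peer NI statistics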

            There were a few LNet-related patches landed recently to b2_12 that might be the source of this problem, in particular:

            https://review.whamcloud.com/34252 LU-11816 lnet: setup health timeout defaults
            https://review.whamcloud.com/34967 LU-12344 lnet: handle remote health error
            https://review.whamcloud.com/33304 LU-11478 lnet: misleading discovery seqno
            

            The first patch (34252) is enabling LNet Health on b2_12 for the first time, which may not be prudent given that this hasn't been widely deployed in production yet (it originally landed as v2_12_53-66-g8632e94aeb so 2.13 would be the first release to use it). Since we are trying to get 2.12.3 out the door ASAP, if we can't understand and resolve this quickly, it probably makes sense to revert 34252, or make a small patch that defaults to having health disabled until we can get more operational experience in this mode.
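            (A sketch of the "health disabled by default" workaround suggested above, assuming the standard tunable names; verify against the release in question:)

            # Disable health-based behavior at runtime by zeroing the sensitivity:
            lnetctl set health_sensitivity 0
            # or persistently via the module parameter:
            # options lnet lnet_health_sensitivity=0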


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)