[LU-12857] recovery-mds-scale test_failover_ost fails with “import is not in FULL state” Created: 14/Oct/19 Updated: 25/Apr/23 Resolved: 30/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
recovery-mds-scale test_failover_ost fails with “import is not in FULL state”. In the suite_log, we see

    trevis-50vm3: rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have DISCONN
    trevis-50vm3: Trace dump:
    trevis-50vm3: = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
    …
    recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
    Trace dump:
    = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
    = /usr/lib64/lustre/tests/test-framework.sh:7245:wait_clients_import_state()
    = /usr/lib64/lustre/tests/recovery-mds-scale.sh:159:failover_target()
    = /usr/lib64/lustre/tests/recovery-mds-scale.sh:242:test_failover_ost()

After this failure, the test suite hangs and times out.

In the case of https://testing.whamcloud.com/test_sets/e816cdac-eb87-11e9-be86-52540065bddc, we fail over ost4 from vm5 to vm6 and the OST failover looks successful. Checking clients after failover, we see

    [86033.161862] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
    [86033.416887] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
    [86035.230364] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost4 has failed over 1 times, and counting...
    [86035.476561] Lustre: DEBUG MARKER: ost4 has failed over 1 times, and counting...
    [86079.403400] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
    [86079.406508] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 8 previous similar messages
    [86344.411364] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
    [86344.413633] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
    [86827.848075] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
    [86828.072043] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
    [86829.772372] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost6 recovery complete before doing next failover...
    [86830.015147] Lustre: DEBUG MARKER: Wait ost6 recovery complete before doing next failover...

We check recovery_status for each of the OSTs and find there is an issue

    [86841.794013] Lustre: DEBUG MARKER: trevis-50vm6.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0006.recovery_status 1475
    [86842.076196] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
    [86842.297866] Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
    [86859.446351] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
    [86859.448604] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages
    [87464.461375] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
    [87464.463680] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
    [88064.484412] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
    [88064.486869] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
    [88350.603831] Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
    [88350.872125] Lustre: DEBUG MARKER: recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state

We then try to exit the test but experience LNet errors

    [88354.805689] Server failover period: 1200 seconds
    [88354.805689] Exited after: 408 seconds
    [88354.805689] Number of failovers before exit:
    [88354.805689] mds1: 0 times
    [88354.805689] ost1: 0 times
    [88354.805689] ost2: 0 times
    [88354.805689] ost3: 0 times
    [88354.805689] ost4: 1 times
    [88354.805689] ost5: 0 times
    [88354.805689] ost6: 0 times
    [88354.805689] o
    [88355.027288] Lustre: DEBUG MARKER: Duration: 86400
    [88355.647928] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_7-x86_64--failover--1_25__52___8d15cfb4-3473-4202-941b-c914ac734bd4/recovery-mds-scale.test_failover_ost.debug_log.$(hostname -s).1570642546.log;
    [88355.647928] dmesg > /autotest/autotest2/2019
    [88360.734604] Lustre: lustre-OST0001: Connection restored to d9d4bfb9-3fbd-d3f0-7667-bf145a641dfe (at 10.9.50.1@tcp)
    [88360.736451] Lustre: Skipped 7 previous similar messages
    [88669.517330] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
    [88669.519558] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
    [89279.521338] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0

The call traces are displayed. We see ll_ost00_00*, ll_ost_io00_00* and ll_ost_create00 call traces for some OSTs. For example, we see

    [172080.643164] ll_ost00_000    S ffff92afba003150     0 24093      2 0x00000080
    [172080.644487] Call Trace:
    [172080.644949]  [<ffffffff9e77f229>] schedule+0x29/0x70
    [172080.645903]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
    [172080.647064]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
    [172080.648125]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
    [172080.649264]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
    [172080.650356]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
    [172080.651428]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
    [172080.652739]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
    [172080.653638]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.654737]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
    [172080.655906]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.682767] ll_ost_create00 S ffff92afbc14a0e0     0 24096      2 0x00000080
    [172080.684064] Call Trace:
    [172080.684525]  [<ffffffff9e77f229>] schedule+0x29/0x70
    [172080.685421]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
    [172080.686731]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
    [172080.687779]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
    [172080.688908]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
    [172080.689999]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
    [172080.691013]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
    [172080.692314]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
    [172080.693223]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.694371]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
    [172080.695537]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.710767] ll_ost_io00_000 S ffff92afb8075230     0 24098      2 0x00000080
    [172080.712129] Call Trace:
    [172080.712586]  [<ffffffff9e77f229>] schedule+0x29/0x70
    [172080.713487]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
    [172080.714784]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
    [172080.715887]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
    [172080.717015]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
    [172080.718104]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
    [172080.719126]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
    [172080.720443]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
    [172080.721352]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.722448]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
    [172080.723612]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.752571] ll_ost_seq00_00 S ffff92afb8071070     0 24101      2 0x00000080
    [172080.753858] Call Trace:
    [172080.754332]  [<ffffffff9e77f229>] schedule+0x29/0x70
    [172080.755326]  [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
    [172080.756527]  [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
    [172080.757566]  [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
    [172080.758745]  [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
    [172080.759848]  [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
    [172080.760868]  [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
    [172080.762189]  [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
    [172080.763085]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    [172080.764193]  [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
    [172080.765353]  [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
    …
|
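For reference, the failing check is wait_clients_import_state(), which polls the client-side import state through lctl until it reports FULL. A minimal sketch of that kind of loop is below; it is not the actual test-framework.sh implementation, and the helper name wait_import_full and the 5-second poll interval are illustrative only:

    # Poll an import-state parameter until it reports FULL, or give up.
    # Illustrative sketch only; test-framework.sh has its own helpers.
    wait_import_full() {
        local param=$1           # e.g. osc.lustre-OST0000-osc-*.ost_server_uuid
        local maxtime=${2:-1475} # seconds, matching the timeout in the log above
        local elapsed=0
        local state=""

        while [ "$elapsed" -lt "$maxtime" ]; do
            # ost_server_uuid reports the server UUID followed by the
            # import state (FULL, DISCONN, IDLE, ...).
            state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}' | uniq)
            [ "$state" = "FULL" ] && return 0
            sleep 5
            elapsed=$((elapsed + 5))
        done
        echo "can't put import for $param into FULL state after $maxtime sec, have ${state:-unknown}" >&2
        return 1
    }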
| Comments |
| Comment by Andreas Dilger [ 15/Oct/19 ] |
|
The OST thread stack traces with ptlrpc_wait_event() are normal - these are idle threads waiting for requests to process, so it looks like the OST is mounted but not getting requests.

AFAIK, the messages "lnet_peer_ni_add_to_recoveryq_locked() ... Health = 0" indicate that LNet identified errors sending to that node, but it seems that it never recovers from that problem. This might be a case where LNet Health is making a bad decision when there is only a single interface to a peer, and it can never recover because it isn't sending any messages over that interface that would increase the health value.

There were a few LNet-related patches landed recently to b2_12 that might be the source of this problem, in particular:

    https://review.whamcloud.com/34252 LU-11816 lnet: setup health timeout defaults
    https://review.whamcloud.com/34967 LU-12344 lnet: handle remote health error
    https://review.whamcloud.com/33304 LU-11478 lnet: misleading discovery seqno

The first patch (34252) enables LNet Health on b2_12 for the first time, which may not be prudent given that this hasn't been widely deployed in production yet (it originally landed as v2_12_53-66-g8632e94aeb, so 2.13 would be the first release to use it). Since we are trying to get 2.12.3 out the door ASAP, if we can't understand and resolve this quickly, it probably makes sense to revert 34252, or make a small patch that defaults to having health disabled until we can get more operational experience in this mode.
|
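The health values referred to here can be inspected from userspace with lnetctl. A hedged example follows; the exact fields and verbosity of the output vary between LNet versions, so treat this as a sketch rather than the canonical procedure:

    # Global LNet settings; on health-enabled builds this includes
    # health_sensitivity, retry_count and transaction_timeout.
    lnetctl global show

    # Verbose NI and peer output, which on health-enabled versions
    # includes the per-interface health value and statistics.
    lnetctl net show -v
    lnetctl peer show -v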
| Comment by Peter Jones [ 15/Oct/19 ] |
|
I would strongly suggest that we re-disable health by default to avoid this issue for users who are not intending to use this feature. |
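If health is to be off by default, a site can also turn the feature off explicitly. A sketch, assuming the lnet_health_sensitivity tunable of the LNet Health feature (a value of 0 disables the health penalty):

    # Disable the health penalty at runtime.
    lnetctl set health_sensitivity 0

    # Or make it the default at module load time, e.g. in
    # /etc/modprobe.d/lustre.conf:
    #   options lnet lnet_health_sensitivity=0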
| Comment by Andreas Dilger [ 21/Oct/21 ] |
|
In a recovery-mds-scale test failure I looked at, I see that the client import check is failing because the import is IDLE instead of FULL. That is normal if osc.*.idle_disconnect is enabled, which is the default on master:

    trevis-55vm7: rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0003-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have IDLE
|
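Since the IDLE state comes from the client-side idle disconnect feature, it can be confirmed and, if desired, disabled on the client. A sketch assuming the osc idle_timeout tunable used by that feature (defaults differ between versions):

    # Show the per-OSC idle disconnect timeout in seconds; a non-zero value
    # means an unused import may drop from FULL to IDLE.
    lctl get_param osc.*.idle_timeout

    # Show the current import state of every OSC device on the client.
    lctl get_param osc.*.import | grep state:

    # Setting the timeout to 0 disables idle disconnect, so imports stay
    # FULL instead of going IDLE between I/O bursts.
    lctl set_param osc.*.idle_timeout=0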
| Comment by Gerrit Updater [ 21/Oct/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45318 |
| Comment by Gerrit Updater [ 30/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45318/ |
| Comment by Peter Jones [ 30/Nov/21 ] |
|
Landed for 2.15 |