Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Versions: Lustre 2.13.0, Lustre 2.12.3
Description
recovery-mds-scale test_failover_ost fails with “import is not in FULL state”. In the suite_log, we see
trevis-50vm3: rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid into FULL state after 1475 sec, have DISCONN
trevis-50vm3: Trace dump:
trevis-50vm3: = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
…
recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5864:error()
= /usr/lib64/lustre/tests/test-framework.sh:7245:wait_clients_import_state()
= /usr/lib64/lustre/tests/recovery-mds-scale.sh:159:failover_target()
= /usr/lib64/lustre/tests/recovery-mds-scale.sh:242:test_failover_ost()
then the test suite hangs and times out.
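For context, the failing check comes from wait_clients_import_state() in test-framework.sh, which polls each client's import state until it reads FULL or a deadline (1475 seconds here) expires. Below is a minimal sketch of that polling pattern, not the actual helper: the function name and the parameterized reader command are ours, while the real code reads the state with `lctl get_param -n osc.<target>-osc-*.ost_server_uuid` on each client.

```shell
# Hypothetical sketch of the FULL-state wait (assumption: not the real
# test-framework.sh code). The reader command is a parameter so the
# loop itself can be exercised without a live Lustre client.
wait_import_full() {
    reader=$1            # command printing the current import state
    max=$2               # seconds to wait, e.g. 1475 in this test
    waited=0
    state=unknown
    while [ "$waited" -lt "$max" ]; do
        state=$($reader)
        [ "$state" = "FULL" ] && return 0
        sleep 1
        waited=$((waited + 1))
    done
    echo "can't put import into FULL state after ${max} sec, have $state" >&2
    return 1
}
```

In the failure above the loop times out with the import still in DISCONN, which is what the "have DISCONN" in the error message reports.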
In the case of https://testing.whamcloud.com/test_sets/e816cdac-eb87-11e9-be86-52540065bddc, we fail over ost4 from vm5 to vm6 and the failover looks successful. Checking the clients after the failover, we see
[86033.161862] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
[86033.416887] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
[86035.230364] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost4 has failed over 1 times, and counting...
[86035.476561] Lustre: DEBUG MARKER: ost4 has failed over 1 times, and counting...
[86079.403400] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
[86079.406508] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 8 previous similar messages
[86344.411364] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
[86344.413633] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
[86827.848075] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
[86828.072043] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=408 DURATION=86400 PERIOD=1200
[86829.772372] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost6 recovery complete before doing next failover...
[86830.015147] Lustre: DEBUG MARKER: Wait ost6 recovery complete before doing next failover...
We check recovery_status for each OST and find an issue
[86841.794013] Lustre: DEBUG MARKER: trevis-50vm6.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0006.recovery_status 1475
[86842.076196] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
[86842.297866] Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
[86859.446351] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
[86859.448604] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages
[87464.461375] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
[87464.463680] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
[88064.484412] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
[88064.486869] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
[88350.603831] Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
[88350.872125] Lustre: DEBUG MARKER: recovery-mds-scale test_failover_ost: @@@@@@ FAIL: import is not in FULL state
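The _wait_recovery_complete call in the marker above polls the target's recovery_status parameter (read on the server with `lctl get_param -n obdfilter.lustre-OST0006.recovery_status`) until it reports "status: COMPLETE". A hedged sketch of that status check, with the status text taken on stdin so it can be shown without a live filesystem (the helper name is ours, not test-framework.sh's):

```shell
# Hypothetical helper (assumption: not the real _wait_recovery_complete,
# which wraps a similar check in a timeout loop): succeed once a
# recovery_status dump reports "status: COMPLETE".
recovery_complete() {
    awk '$1 == "status:" { print $2; exit }' | grep -q '^COMPLETE$'
}
```

A recovery_status dump that still reads "status: RECOVERING" (or, as here, clients that never reconnect) keeps the wait loop spinning until its 1475-second deadline.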
We then try to exit the test but experience LNet errors
[88354.805689] Server failover period: 1200 seconds
[88354.805689] Exited after: 408 seconds
[88354.805689] Number of failovers before exit:
[88354.805689] mds1: 0 times
[88354.805689] ost1: 0 times
[88354.805689] ost2: 0 times
[88354.805689] ost3: 0 times
[88354.805689] ost4: 1 times
[88354.805689] ost5: 0 times
[88354.805689] ost6: 0 times
[88354.805689] o
[88355.027288] Lustre: DEBUG MARKER: Duration: 86400
[88355.647928] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_7-x86_64--failover--1_25__52___8d15cfb4-3473-4202-941b-c914ac734bd4/recovery-mds-scale.test_failover_ost.debug_log.$(hostname -s).1570642546.log;
[88355.647928] dmesg > /autotest/autotest2/2019
[88360.734604] Lustre: lustre-OST0001: Connection restored to d9d4bfb9-3fbd-d3f0-7667-bf145a641dfe (at 10.9.50.1@tcp)
[88360.736451] Lustre: Skipped 7 previous similar messages
[88669.517330] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
[88669.519558] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages
[89279.521338] LNetError: 24068:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.50.8@tcp added to recovery queue. Health = 0
Call traces are then dumped. We see ll_ost00_00*, ll_ost_io00_00* and ll_ost_create00 threads in similar traces on some OSTs. For example, we see
[172080.643164] ll_ost00_000 S ffff92afba003150 0 24093 2 0x00000080
[172080.644487] Call Trace:
[172080.644949] [<ffffffff9e77f229>] schedule+0x29/0x70
[172080.645903] [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
[172080.647064] [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
[172080.648125] [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
[172080.649264] [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
[172080.650356] [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
[172080.651428] [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[172080.652739] [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
[172080.653638] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.654737] [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
[172080.655906] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.682767] ll_ost_create00 S ffff92afbc14a0e0 0 24096 2 0x00000080
[172080.684064] Call Trace:
[172080.684525] [<ffffffff9e77f229>] schedule+0x29/0x70
[172080.685421] [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
[172080.686731] [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
[172080.687779] [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
[172080.688908] [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
[172080.689999] [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
[172080.691013] [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[172080.692314] [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
[172080.693223] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.694371] [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
[172080.695537] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.710767] ll_ost_io00_000 S ffff92afb8075230 0 24098 2 0x00000080
[172080.712129] Call Trace:
[172080.712586] [<ffffffff9e77f229>] schedule+0x29/0x70
[172080.713487] [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
[172080.714784] [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
[172080.715887] [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
[172080.717015] [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
[172080.718104] [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
[172080.719126] [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[172080.720443] [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
[172080.721352] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.722448] [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
[172080.723612] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.752571] ll_ost_seq00_00 S ffff92afb8071070 0 24101 2 0x00000080
[172080.753858] Call Trace:
[172080.754332] [<ffffffff9e77f229>] schedule+0x29/0x70
[172080.755326] [<ffffffffc0bdb355>] ptlrpc_wait_event+0x345/0x360 [ptlrpc]
[172080.756527] [<ffffffff9e0da0b0>] ? wake_up_state+0x20/0x20
[172080.757566] [<ffffffffc0be1ad2>] ptlrpc_main+0xa02/0x1460 [ptlrpc]
[172080.758745] [<ffffffff9e0d3efe>] ? finish_task_switch+0x4e/0x1c0
[172080.759848] [<ffffffff9e77ec88>] ? __schedule+0x448/0x9c0
[172080.760868] [<ffffffffc0be10d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[172080.762189] [<ffffffff9e0c50d1>] kthread+0xd1/0xe0
[172080.763085] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
[172080.764193] [<ffffffff9e78cd37>] ret_from_fork_nospec_begin+0x21/0x21
[172080.765353] [<ffffffff9e0c5000>] ? insert_kthread_work+0x40/0x40
…
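All of these stacks are parked in ptlrpc_wait_event(), the wait inside ptlrpc_main() where idle ptlrpc service threads sleep for work, which suggests the server-side threads are idle rather than hung. A small awk sketch (our own hypothetical helper, assuming console-log lines shaped like the dump above) that lists the ll_* threads sitting in that wait:

```shell
# Hypothetical helper: scan console-log text and print the names of
# ll_* service threads whose stack includes ptlrpc_wait_event, i.e.
# threads sleeping in ptlrpc_main's idle wait.
idle_ptlrpc_threads() {
    awk '
        $2 ~ /^ll_/ { thread = $2; next }       # task header line
        /ptlrpc_wait_event/ && thread != "" { print thread; thread = "" }
    ' "$@"
}
```

Run against the dump above, this would list ll_ost00_000, ll_ost_create00, ll_ost_io00_000 and ll_ost_seq00_00, consistent with the OSTs waiting on clients that never reconnect rather than being stuck themselves.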
Attachments
Issue Links
- is related to
  - LU-15342 recovery-mds-scale test_failover_mds: sh: IDLE: command not found (Resolved)
  - LU-13813 conf-sanity test_112: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN (Resolved)
- is related to
  - LU-7236 OST connect and disconnect on demand (Resolved)