[LU-13185] recovery-small test 101 hangs on OST failover/failback Created: 31/Jan/20  Updated: 05/Jun/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: ppc

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Looking at the client test_log for the hang at https://testing.whamcloud.com/test_sets/c518a288-428c-11ea-b083-52540065bddc, the last thing printed before the hang is

Started lustre-OST0000

Looking at the OSS (vm6) console log, we see

[31839.369248] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:57
[31839.371703] Lustre: Skipped 24 previous similar messages
[31860.379116] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:36
[31860.381510] Lustre: Skipped 2 previous similar messages
[31895.395237] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:01
[31895.397640] Lustre: Skipped 4 previous similar messages
[31896.600260] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[31896.601650] Lustre: lustre-OST0000: disconnecting 1 stale clients
[31896.952742] Lustre: lustre-OST0000: Recovery over after 1:06, of 3 clients 2 recovered and 1 was evicted.
[31896.955583] Lustre: lustre-OST0000: deleting orphan objects from 0x0:75598 to 0x0:75657
[31988.475076] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[31988.476672] LNetError: Skipped 9 previous similar messages
[31988.477615] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[31988.479254] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[31988.480974] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[31988.483511] LNetError: Skipped 9 previous similar messages
[32313.626053] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[32313.628238] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages
[32608.767216] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[32608.768897] LNetError: Skipped 9 previous similar messages
[32608.770267] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[32608.772194] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[32608.774014] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[32608.777084] LNetError: Skipped 9 previous similar messages
[32913.908804] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[32913.910952] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[33239.062905] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[33239.064514] LNetError: Skipped 9 previous similar messages
[33239.065469] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[33239.067127] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[33239.068739] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[33239.071405] LNetError: Skipped 9 previous similar messages
[33524.195275] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[33524.197461] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
[33849.350362] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[33849.351963] LNetError: Skipped 9 previous similar messages
[33849.352892] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[33849.354641] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[33849.356260] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[33849.358975] LNetError: Skipped 9 previous similar messages
[34134.482730] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[34134.484846] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
[34459.636819] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[34459.638458] LNetError: Skipped 9 previous similar messages
[34459.639402] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[34459.641150] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[34459.642761] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[34459.645348] LNetError: Skipped 9 previous similar messages
[34744.769144] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[34744.771342] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
[35069.924238] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[35069.925878] LNetError: Skipped 9 previous similar messages
[35069.926814] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[35069.928553] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[35069.930165] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[35069.932746] LNetError: Skipped 9 previous similar messages
[35355.056588] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[35355.058740] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages

This test started failing on 13 AUG 2019 and has failed 100% of the time for PPC client testing and started with Lustre 2.12.2.115 and, so far, has only been seen for 2.12.3 and 2.12.4:
https://testing.whamcloud.com/test_sets/d40a7fde-be73-11e9-a2b6-52540065bddc
https://testing.whamcloud.com/test_sets/1e950984-cfe0-11e9-a2b6-52540065bddc


Generated at Sat Feb 10 02:59:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.