Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.12.3, Lustre 2.14.0
-
3
-
9223372036854775807
Description
Looking at the client test_log for the hang at https://testing.whamcloud.com/test_sets/c518a288-428c-11ea-b083-52540065bddc, the last thing printed before the hang is
Started lustre-OST0000
Looking at the OSS (vm6) console log, we see
[31839.369248] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:57 [31839.371703] Lustre: Skipped 24 previous similar messages [31860.379116] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:36 [31860.381510] Lustre: Skipped 2 previous similar messages [31895.395237] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:01 [31895.397640] Lustre: Skipped 4 previous similar messages [31896.600260] Lustre: lustre-OST0000: recovery is timed out, evict stale exports [31896.601650] Lustre: lustre-OST0000: disconnecting 1 stale clients [31896.952742] Lustre: lustre-OST0000: Recovery over after 1:06, of 3 clients 2 recovered and 1 was evicted. [31896.955583] Lustre: lustre-OST0000: deleting orphan objects from 0x0:75598 to 0x0:75657 [31988.475076] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI [31988.476672] LNetError: Skipped 9 previous similar messages [31988.477615] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1 [31988.479254] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages [31988.480974] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs? [31988.483511] LNetError: Skipped 9 previous similar messages [32313.626053] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0 [32313.628238] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages [32608.767216] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI [32608.768897] LNetError: Skipped 9 previous similar messages [32608.770267] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1 [32608.772194] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages [32608.774014] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs? [32608.777084] LNetError: Skipped 9 previous similar messages [32913.908804] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0 [32913.910952] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages [33239.062905] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI [33239.064514] LNetError: Skipped 9 previous similar messages [33239.065469] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1 [33239.067127] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages [33239.068739] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs? [33239.071405] LNetError: Skipped 9 previous similar messages [33524.195275] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0 [33524.197461] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages [33849.350362] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI [33849.351963] LNetError: Skipped 9 previous similar messages [33849.352892] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1 [33849.354641] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages [33849.356260] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs? [33849.358975] LNetError: Skipped 9 previous similar messages [34134.482730] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0 [34134.484846] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages [34459.636819] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI [34459.638458] LNetError: Skipped 9 previous similar messages [34459.639402] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1 [34459.641150] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages [34459.642761] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs? [34459.645348] LNetError: Skipped 9 previous similar messages [34744.769144] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0 [34744.771342] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages [35069.924238] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI [35069.925878] LNetError: Skipped 9 previous similar messages [35069.926814] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1 [35069.928553] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages [35069.930165] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs? [35069.932746] LNetError: Skipped 9 previous similar messages [35355.056588] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0 [35355.058740] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
This test started failing on 13 AUG 2019 and has failed 100% of the time for PPC client testing and started with Lustre 2.12.2.115 and, so far, has only been seen for 2.12.3 and 2.12.4:
https://testing.whamcloud.com/test_sets/d40a7fde-be73-11e9-a2b6-52540065bddc
https://testing.whamcloud.com/test_sets/1e950984-cfe0-11e9-a2b6-52540065bddc