Details
Description
sanityn test_30 hangs for Ubuntu 18.04. This test has timed out only once this year; 2.12.3 RC1 https://testing.whamcloud.com/test_sets/c1fbbada-eb26-11e9-b62b-52540065bddc .
The last thing seen in the suite_log is
== sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339) 0 Resetting fail_loc on all nodes...CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null done. 22:39:03 (1570574343) waiting for trevis-63vm10 network 5 secs ... 22:39:03 (1570574343) network interface is UP CMD: trevis-63vm10 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc 22:39:03 (1570574343) waiting for trevis-63vm11 network 5 secs ... 22:39:03 (1570574343) network interface is UP CMD: trevis-63vm11 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc 22:39:03 (1570574343) waiting for trevis-63vm12 network 5 secs ... 22:39:03 (1570574343) network interface is UP CMD: trevis-63vm12 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc CMD: trevis-63vm9.trevis.whamcloud.com rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com dmesg
The only errors in the console logs are on the clients and they look like they are all LNet errors. On client 2 (vm9), we see:
[21579.034108] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339) [21579.207653] Lustre: lustre-OST0000-osc-ffff99dafaa33800: reconnect after 0s idle [21579.209132] Lustre: Skipped 13 previous similar messages [21582.208456] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null [21583.567520] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc [21583.660508] Lustre: DEBUG MARKER: dmesg [21608.924090] Lustre: 19092:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574362/real 1570574362] req@000000000acdde3c x1646864837598480/t0(0) o9->lustre-OST0000-osc-ffff99daf9073000@10.9.6.92@tcp:28/4 lens 224/224 e 0 to 1 dl 1570574369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21609.948076] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574362/real 1570574362] req@00000000dbd3ff3c x1646864837598400/t0(0) o400->MGC10.9.6.93@tcp@10.9.6.93@tcp:26/25 lens 224/224 e 0 to 1 dl 1570574370 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21609.953169] LustreError: 166-1: MGC10.9.6.93@tcp: Connection to MGS (at 10.9.6.93@tcp) was lost; in progress operations using this service will fail [21609.955855] LNetError: 19085:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21609.957754] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21609.959646] LNetError: 19085:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.90@tcp added to recovery queue. Health = 900 [21609.961852] Lustre: lustre-MDT0000-mdc-ffff99dafaa33800: Connection to lustre-MDT0000 (at 10.9.6.93@tcp) was lost; in progress operations using this service will wait for recovery to complete [21609.965217] LNetError: 19092:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 900 [21610.556188] LNetError: 19083:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.90@tcp added to recovery queue. Health = 900 [21610.558973] LNetError: 19083:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message [21610.561118] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21610.561255] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21610.562956] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages [21610.564851] LNetError: Skipped 1 previous similar message [21610.972062] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574362/real 1570574362] req@00000000e483a625 x1646864837598464/t0(0) o400->lustre-MDT0000-mdc-ffff99daf9073000@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574371 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21610.977514] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages [21615.068259] LNetError: 19085:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21615.070217] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21615.072093] LNetError: Skipped 1 previous similar message [21616.092048] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574367/real 1570574367] req@000000005bfdf00c x1646864837598592/t0(0) o400->lustre-MDT0000-mdc-ffff99daf9073000@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21616.097529] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message [21620.188262] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21620.190218] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message [21620.191971] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21623.868127] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 600 [21630.428280] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21630.428294] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21630.430415] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages [21630.432319] LNetError: Skipped 1 previous similar message [21634.108108] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 100 [21634.110565] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message [21644.348122] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 0
On client 1 (vm10), we see:
[21578.790437] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339) [21581.963262] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null [21582.452383] Lustre: DEBUG MARKER: rc=0; [21582.452383] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); [21582.452383] if [[ $? -eq 0 && $val -ne 0 ]]; then [21582.452383] echo $(hostname -s): $val; [21582.452383] rc=$val; [21582.452383] fi; [21582.452383] exit $rc [21583.410908] Lustre: DEBUG MARKER: dmesg [21611.392103] Lustre: 1726:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574365/real 1570574365] req@000000009cb44c0f x1646844218355296/t0(0) o400->lustre-MDT0000-mdc-ffff9828fa370800@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574372 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21611.397433] Lustre: lustre-MDT0000-mdc-ffff9828fa370800: Connection to lustre-MDT0000 (at 10.9.6.93@tcp) was lost; in progress operations using this service will wait for recovery to complete [21611.400523] LustreError: 166-1: MGC10.9.6.93@tcp: Connection to MGS (at 10.9.6.93@tcp) was lost; in progress operations using this service will fail [21611.403213] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21611.405061] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21611.406918] LNetError: 1716:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.91@tcp added to recovery queue. Health = 900 [21611.680210] LNetError: 1717:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 900 [21616.608235] Lustre: 1725:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574370/real 1570574370] req@00000000496d9b48 x1646844218355344/t0(0) o400->lustre-MDT0000-mdc-ffff9828fa370800@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574377 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21616.614144] Lustre: 1725:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages [21616.616516] LNetError: 1714:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21616.618344] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21621.728247] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21621.730188] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21626.848308] LNetError: 1713:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21626.850344] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21631.968250] LNetError: 1724:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 900 [21637.088290] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21637.090256] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21642.400121] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 100 [21642.403405] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message [21647.520107] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0 [21656.736212] LNetError: 1714:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21656.738286] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21661.856106] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0 [21672.096160] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0 [21672.099271] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message [21688.288267] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21688.290241] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988 [21701.792106] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 0 [21701.794528] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages [21741.728092] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0 [21741.730506] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 6 previous similar messages [21749.728302] LNetError: 1713:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99 [21749.730287] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
The OSS and MDS complain that they can’t reach the client. From the OSS (vm11)
[21488.713741] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339) [21491.899995] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null [21492.527026] Lustre: DEBUG MARKER: rc=0; [21492.527026] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); [21492.527026] if [[ $? -eq 0 && $val -ne 0 ]]; then [21492.527026] echo $(hostname -s): $val; [21492.527026] rc=$val; [21492.527026] fi; [21492.527026] exit $rc [21493.342219] Lustre: DEBUG MARKER: dmesg [21559.951678] Lustre: lustre-OST0000: haven't heard from client 5ca422e6-7792-498a-982a-c66d43e718d4 (at 10.9.6.90@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff95893b9a3400, cur 1570574410 expire 1570574380 last 1570574361 [21563.104144] Lustre: lustre-OST0006: haven't heard from client 5ca422e6-7792-498a-982a-c66d43e718d4 (at 10.9.6.90@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff95896937f400, cur 1570574413 expire 1570574383 last 1570574361
and the MDS (vm12)
[21489.572828] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339) [21492.755456] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null [21493.752176] Lustre: DEBUG MARKER: rc=0; [21493.752176] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); [21493.752176] if [[ $? -eq 0 && $val -ne 0 ]]; then [21493.752176] echo $(hostname -s): $val; [21493.752176] rc=$val; [21493.752176] fi; [21493.752176] exit $rc [21494.204307] Lustre: DEBUG MARKER: dmesg [21560.421771] Lustre: MGS: haven't heard from client d8d9c7ea-5064-8947-a11b-d3d0622e7d99 (at 10.9.6.91@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff93b0a1503800, cur 1570574409 expire 1570574379 last 1570574359