Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12849

sanityn test 30 hangs with client LNet errors

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.12.3, Lustre 2.12.6
    • Ubuntu 18.04
    • 3
    • 9223372036854775807

    Description

      sanityn test_30 hangs for Ubuntu 18.04. This test has timed out only once this year; 2.12.3 RC1 https://testing.whamcloud.com/test_sets/c1fbbada-eb26-11e9-b62b-52540065bddc .

      The last thing seen in the suite_log is

      == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339)
      0
      Resetting fail_loc on all nodes...CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      done.
      22:39:03 (1570574343) waiting for trevis-63vm10 network 5 secs ...
      22:39:03 (1570574343) network interface is UP
      CMD: trevis-63vm10 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      22:39:03 (1570574343) waiting for trevis-63vm11 network 5 secs ...
      22:39:03 (1570574343) network interface is UP
      CMD: trevis-63vm11 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      22:39:03 (1570574343) waiting for trevis-63vm12 network 5 secs ...
      22:39:03 (1570574343) network interface is UP
      CMD: trevis-63vm12 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      CMD: trevis-63vm9.trevis.whamcloud.com rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com dmesg
      

      The only errors in the console logs are on the clients and they look like they are all LNet errors. On client 2 (vm9), we see:

      [21579.034108] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339)
      [21579.207653] Lustre: lustre-OST0000-osc-ffff99dafaa33800: reconnect after 0s idle
      [21579.209132] Lustre: Skipped 13 previous similar messages
      [21582.208456] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [21583.567520] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
      [21583.660508] Lustre: DEBUG MARKER: dmesg
      [21608.924090] Lustre: 19092:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574362/real 1570574362]  req@000000000acdde3c x1646864837598480/t0(0) o9->lustre-OST0000-osc-ffff99daf9073000@10.9.6.92@tcp:28/4 lens 224/224 e 0 to 1 dl 1570574369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21609.948076] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574362/real 1570574362]  req@00000000dbd3ff3c x1646864837598400/t0(0) o400->MGC10.9.6.93@tcp@10.9.6.93@tcp:26/25 lens 224/224 e 0 to 1 dl 1570574370 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21609.953169] LustreError: 166-1: MGC10.9.6.93@tcp: Connection to MGS (at 10.9.6.93@tcp) was lost; in progress operations using this service will fail
      [21609.955855] LNetError: 19085:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21609.957754] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21609.959646] LNetError: 19085:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.90@tcp added to recovery queue. Health = 900
      [21609.961852] Lustre: lustre-MDT0000-mdc-ffff99dafaa33800: Connection to lustre-MDT0000 (at 10.9.6.93@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21609.965217] LNetError: 19092:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 900
      [21610.556188] LNetError: 19083:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.90@tcp added to recovery queue. Health = 900
      [21610.558973] LNetError: 19083:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
      [21610.561118] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21610.561255] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21610.562956] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages
      [21610.564851] LNetError: Skipped 1 previous similar message
      [21610.972062] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574362/real 1570574362]  req@00000000e483a625 x1646864837598464/t0(0) o400->lustre-MDT0000-mdc-ffff99daf9073000@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574371 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21610.977514] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      [21615.068259] LNetError: 19085:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21615.070217] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21615.072093] LNetError: Skipped 1 previous similar message
      [21616.092048] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574367/real 1570574367]  req@000000005bfdf00c x1646864837598592/t0(0) o400->lustre-MDT0000-mdc-ffff99daf9073000@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21616.097529] Lustre: 19094:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [21620.188262] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21620.190218] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
      [21620.191971] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21623.868127] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 600
      [21630.428280] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21630.428294] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21630.430415] LNetError: 19084:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages
      [21630.432319] LNetError: Skipped 1 previous similar message
      [21634.108108] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 100
      [21634.110565] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
      [21644.348122] LNetError: 19090:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 0
      

      On client 1 (vm10), we see:

      [21578.790437] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339)
      [21581.963262] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [21582.452383] Lustre: DEBUG MARKER: rc=0;
      [21582.452383] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [21582.452383] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [21582.452383] 				echo $(hostname -s): $val;
      [21582.452383] 				rc=$val;
      [21582.452383] 			fi;
      [21582.452383] 			exit $rc
      [21583.410908] Lustre: DEBUG MARKER: dmesg
      [21611.392103] Lustre: 1726:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574365/real 1570574365]  req@000000009cb44c0f x1646844218355296/t0(0) o400->lustre-MDT0000-mdc-ffff9828fa370800@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574372 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21611.397433] Lustre: lustre-MDT0000-mdc-ffff9828fa370800: Connection to lustre-MDT0000 (at 10.9.6.93@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21611.400523] LustreError: 166-1: MGC10.9.6.93@tcp: Connection to MGS (at 10.9.6.93@tcp) was lost; in progress operations using this service will fail
      [21611.403213] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21611.405061] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21611.406918] LNetError: 1716:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.91@tcp added to recovery queue. Health = 900
      [21611.680210] LNetError: 1717:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 900
      [21616.608235] Lustre: 1725:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1570574370/real 1570574370]  req@00000000496d9b48 x1646844218355344/t0(0) o400->lustre-MDT0000-mdc-ffff9828fa370800@10.9.6.93@tcp:12/10 lens 224/224 e 0 to 1 dl 1570574377 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21616.614144] Lustre: 1725:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [21616.616516] LNetError: 1714:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21616.618344] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21621.728247] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21621.730188] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21626.848308] LNetError: 1713:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21626.850344] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21631.968250] LNetError: 1724:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 900
      [21637.088290] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21637.090256] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21642.400121] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 100
      [21642.403405] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
      [21647.520107] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0
      [21656.736212] LNetError: 1714:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21656.738286] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21661.856106] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0
      [21672.096160] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0
      [21672.099271] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
      [21688.288267] LNetError: 1716:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21688.290241] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      [21701.792106] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.93@tcp added to recovery queue. Health = 0
      [21701.794528] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
      [21741.728092] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.6.91@tcp added to recovery queue. Health = 0
      [21741.730506] LNetError: 1722:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 6 previous similar messages
      [21749.728302] LNetError: 1713:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21749.730287] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.6.93@tcp at host 10.9.6.93 on port 7988
      

      The OSS and MDS complain that they can’t reach the client. From the OSS (vm11)

      [21488.713741] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339)
      [21491.899995] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [21492.527026] Lustre: DEBUG MARKER: rc=0;
      [21492.527026] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [21492.527026] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [21492.527026] 				echo $(hostname -s): $val;
      [21492.527026] 				rc=$val;
      [21492.527026] 			fi;
      [21492.527026] 			exit $rc
      [21493.342219] Lustre: DEBUG MARKER: dmesg
      [21559.951678] Lustre: lustre-OST0000: haven't heard from client 5ca422e6-7792-498a-982a-c66d43e718d4 (at 10.9.6.90@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff95893b9a3400, cur 1570574410 expire 1570574380 last 1570574361
      [21563.104144] Lustre: lustre-OST0006: haven't heard from client 5ca422e6-7792-498a-982a-c66d43e718d4 (at 10.9.6.90@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff95896937f400, cur 1570574413 expire 1570574383 last 1570574361
      

      and the MDS (vm12)

      [21489.572828] Lustre: DEBUG MARKER: == sanityn test 30: recreate file race =============================================================== 22:38:59 (1570574339)
      [21492.755456] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [21493.752176] Lustre: DEBUG MARKER: rc=0;
      [21493.752176] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [21493.752176] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [21493.752176] 				echo $(hostname -s): $val;
      [21493.752176] 				rc=$val;
      [21493.752176] 			fi;
      [21493.752176] 			exit $rc
      [21494.204307] Lustre: DEBUG MARKER: dmesg
      [21560.421771] Lustre: MGS: haven't heard from client d8d9c7ea-5064-8947-a11b-d3d0622e7d99 (at 10.9.6.91@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff93b0a1503800, cur 1570574409 expire 1570574379 last 1570574359
      

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: