Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11805

sanityn test 16a clients hang with ‘Error trying to bind to port 1023: -99’

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7
    • Ubuntu 18.04 clients
    • 3
    • 9223372036854775807

    Description

      sanityn test_16a clients hang with nfs server time out and lnet_sock_create errors.

      Looking at the client test_log from https://testing.whamcloud.com/test_sets/d6b02bd6-fdd4-11e8-a97c-52540065bddc , we see the test running as expected

      == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411)
      CMD: trevis-19vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 1.33143 s, 5.5 MB/s
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 4.56728 s, 1.6 MB/s
      Chance of close/open is 1 in 50
      Seed set to 4417
      fd 0: /mnt/lustre/f16a.sanityn
      fd 1: /mnt/lustre2/f16a.sanityn
      truncating to largest ever: 0x1dfe5a
      truncating to largest ever: 0x7c61ac
      truncating to largest ever: 0x94abe7
      truncating to largest ever: 0x9d6e31
      truncating to largest ever: 0x9f1422
      000500[0] 1544484425.495566 trunc from 0x9d1bf7  to  0x11cd53e	(0x7fb948 bytes)
      truncating to largest ever: 0x9fe783
      001000[0] 1544484432.787479 mapread    0x138279 thru 0x143e81	(0xbc09 bytes)
      001500[1] 1544484440.835660 mapread    0x28b349 thru 0x2913ee	(0x60a6 bytes)
      002000[0] 1544484448.190155 write      0x780704 thru 0x78dd5e	(0xd65b bytes)
      truncating to largest ever: 0x9ffdb6
      002500[1] 1544484455.762469 read       0x46c5ad thru 0x47a81f	(0xe273 bytes)
      003000[1] 1544484463.442172 trunc from 0xa00000  to  0xd0df7b	(0x30df7c bytes)
      003500[1] 1544484471.490671 trunc from 0x90cfda  to  0xbf7075	(0x2ea09c bytes)
      004000[1] 1544484478.753447 mapwrite   0x9bac71 thru 0x9bf0da	(0x446a bytes)
      004500[1] 1544484486.839306 write      0x91ee3d thru 0x92273d	(0x3901 bytes)
      005000[1] 1544484494.153333 mapwrite   0x68aacb thru 0x693f4f	(0x9485 bytes)
      005500[0] 1544484502.221259 trunc from 0x8ba894  to  0x9956b0	(0xdae1d bytes)
      006000[0] 1544484509.642964 trunc from 0x63a0f3  to  0x7482ff	(0x10e20d bytes)
      006500[0] 1544484517.988496 read       0x598d97 thru 0x59c2f5	(0x355f bytes)
      007000[0] 1544484525.569919 mapwrite   0x71c288 thru 0x71c9f7	(0x770 bytes)
      007500[0] 1544484534.173892 read       0x2caf46 thru 0x2d1ba2	(0x6c5d bytes)
      008000[0] 1544484541.613265 write      0x410bb8 thru 0x418b7e	(0x7fc7 bytes)
      008500[0] 1544484548.359786 read       0x3f4704 thru 0x403779	(0xf076 bytes)
      009000[1] 1544484556.654503 trunc from 0x70d8ef  to  0xf2ace2	(0x81d3f4 bytes)
      009500[1] 1544484565.396798 mapread    0x08576a thru 0x08af61	(0x57f8 bytes)
      010000[0] 1544484573.036642 mapread    0x071104 thru 0x07b937	(0xa834 bytes)
      010500[1] 1544484581.018787 read       0x7a1f9e thru 0x7af1e9	(0xd24c bytes)
      011000[0] 1544484588.977036 write      0x69fc53 thru 0x6a0f6b	(0x1319 bytes)
      011500[0] 1544484597.183378 write      0x7e280e thru 0x7f197b	(0xf16e bytes)
      012000[0] 1544484604.970948 write      0x754ee2 thru 0x75fb66	(0xac85 bytes)
      012500[0] 1544484612.966799 mapread    0x0cb21f thru 0x0d1823	(0x6605 bytes)
      All operations completed A-OK!
      Chance of close/open is 1 in 50
      Seed set to 4612
      fd 0: /mnt/lustre/f16a.sanityn
      fd 1: /mnt/lustre2/f16a.sanityn
      skipping zero size read
      truncating to largest ever: 0x2bff39
      truncating to largest ever: 0x732365
      truncating to largest ever: 0x96ec16
      

      but both clients have network problems. Looking at the console log for Client 1 (vm1), we see network errors until the clients time out

      [21386.685578] Lustre: DEBUG MARKER: == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411)
      [21607.180560] Lustre: 10756:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484625/real 1544484625]  req@00000000d0fa8236 x1619508903774448/t0(0) o4->lustre-OST0006-osc-0000000057afebbd@10.9.4.226@tcp:6/4 lens 488/448 e 0 to 1 dl 1544484632 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [21607.183647] Lustre: lustre-OST0006-osc-0000000057afebbd: Connection to lustre-OST0006 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21612.300526] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21612.301698] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21612.302806] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1544484637/real 1544484637]  req@00000000013babff x1619508903775008/t0(0) o400->MGC10.9.4.227@tcp@10.9.4.227@tcp:26/25 lens 224/224 e 0 to 1 dl 1544484681 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [21612.305632] LustreError: 166-1: MGC10.9.4.227@tcp: Connection to MGS (at 10.9.4.227@tcp) was lost; in progress operations using this service will fail
      [21612.307127] Lustre: lustre-OST0000-osc-00000000c6515b26: Connection to lustre-OST0000 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21613.964225] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630]  req@00000000cfba67eb x1619508903774608/t0(0) o400->lustre-MDT0000-mdc-0000000057afebbd@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484638 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21613.967218] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
      [21616.396101] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484632/real 1544484632]  req@00000000d57afc11 x1619508903774752/t0(0) o400->lustre-MDT0000-mdc-00000000c6515b26@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21616.399086] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [21617.420470] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21617.421604] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21617.422604] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21617.423647] LNetError: Skipped 1 previous similar message
      [21622.540157] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21622.541290] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21622.542280] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21622.543328] LNetError: Skipped 1 previous similar message
      [21627.660041] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21627.661169] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21627.662161] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21627.663208] LNetError: Skipped 1 previous similar message
      [21637.899461] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21637.899548] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
      [21637.899550] LNetError: Skipped 1 previous similar message
      [21637.902243] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 2 previous similar messages
      [21649.418594] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630]  req@00000000756c150f x1619508903774512/t0(0) o400->lustre-OST0001-osc-00000000c6515b26@10.9.4.226@tcp:28/4 lens 224/224 e 0 to 1 dl 1544484674 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21653.258745] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21653.258833] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
      [21653.258835] LNetError: Skipped 1 previous similar message
      [21653.261492] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21689.097080] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21689.098237] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21689.099278] LNetError: Skipped 1 previous similar message
      [21750.534239] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21750.535386] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21750.536380] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21750.537421] LNetError: Skipped 1 previous similar message
      [21784.324408] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
      [21798.339708] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      [21802.371510] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
      [21805.379406] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
      [21873.408580] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21873.409720] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 3 previous similar messages
      [21873.410703] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21873.411746] LNetError: Skipped 3 previous similar messages
      [21978.811391] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      [21983.995128] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      [21985.851057] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
      [21990.842881] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
      [22057.720105] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [22057.720179] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
      [22057.720181] LNetError: Skipped 5 previous similar messages
      [22057.722918] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 6 previous similar messages
      [22159.283076] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      

      I’ve only found one other failure
      https://testing.whamcloud.com/test_sets/63c6b2f6-fdd5-11e8-93ea-52540065bddc

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: