Lustre / LU-11805

sanityn test 16a clients hang with ‘Error trying to bind to port 1023: -99’

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7
    • Environment: Ubuntu 18.04 clients
    • Severity: 3

    Description

      sanityn test_16a clients hang with NFS server timeouts and lnet_sock_create errors.

      Looking at the client test_log from https://testing.whamcloud.com/test_sets/d6b02bd6-fdd4-11e8-a97c-52540065bddc , we see the test running as expected:

      == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411)
      CMD: trevis-19vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 1.33143 s, 5.5 MB/s
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 4.56728 s, 1.6 MB/s
      Chance of close/open is 1 in 50
      Seed set to 4417
      fd 0: /mnt/lustre/f16a.sanityn
      fd 1: /mnt/lustre2/f16a.sanityn
      truncating to largest ever: 0x1dfe5a
      truncating to largest ever: 0x7c61ac
      truncating to largest ever: 0x94abe7
      truncating to largest ever: 0x9d6e31
      truncating to largest ever: 0x9f1422
      000500[0] 1544484425.495566 trunc from 0x9d1bf7  to  0x11cd53e	(0x7fb948 bytes)
      truncating to largest ever: 0x9fe783
      001000[0] 1544484432.787479 mapread    0x138279 thru 0x143e81	(0xbc09 bytes)
      001500[1] 1544484440.835660 mapread    0x28b349 thru 0x2913ee	(0x60a6 bytes)
      002000[0] 1544484448.190155 write      0x780704 thru 0x78dd5e	(0xd65b bytes)
      truncating to largest ever: 0x9ffdb6
      002500[1] 1544484455.762469 read       0x46c5ad thru 0x47a81f	(0xe273 bytes)
      003000[1] 1544484463.442172 trunc from 0xa00000  to  0xd0df7b	(0x30df7c bytes)
      003500[1] 1544484471.490671 trunc from 0x90cfda  to  0xbf7075	(0x2ea09c bytes)
      004000[1] 1544484478.753447 mapwrite   0x9bac71 thru 0x9bf0da	(0x446a bytes)
      004500[1] 1544484486.839306 write      0x91ee3d thru 0x92273d	(0x3901 bytes)
      005000[1] 1544484494.153333 mapwrite   0x68aacb thru 0x693f4f	(0x9485 bytes)
      005500[0] 1544484502.221259 trunc from 0x8ba894  to  0x9956b0	(0xdae1d bytes)
      006000[0] 1544484509.642964 trunc from 0x63a0f3  to  0x7482ff	(0x10e20d bytes)
      006500[0] 1544484517.988496 read       0x598d97 thru 0x59c2f5	(0x355f bytes)
      007000[0] 1544484525.569919 mapwrite   0x71c288 thru 0x71c9f7	(0x770 bytes)
      007500[0] 1544484534.173892 read       0x2caf46 thru 0x2d1ba2	(0x6c5d bytes)
      008000[0] 1544484541.613265 write      0x410bb8 thru 0x418b7e	(0x7fc7 bytes)
      008500[0] 1544484548.359786 read       0x3f4704 thru 0x403779	(0xf076 bytes)
      009000[1] 1544484556.654503 trunc from 0x70d8ef  to  0xf2ace2	(0x81d3f4 bytes)
      009500[1] 1544484565.396798 mapread    0x08576a thru 0x08af61	(0x57f8 bytes)
      010000[0] 1544484573.036642 mapread    0x071104 thru 0x07b937	(0xa834 bytes)
      010500[1] 1544484581.018787 read       0x7a1f9e thru 0x7af1e9	(0xd24c bytes)
      011000[0] 1544484588.977036 write      0x69fc53 thru 0x6a0f6b	(0x1319 bytes)
      011500[0] 1544484597.183378 write      0x7e280e thru 0x7f197b	(0xf16e bytes)
      012000[0] 1544484604.970948 write      0x754ee2 thru 0x75fb66	(0xac85 bytes)
      012500[0] 1544484612.966799 mapread    0x0cb21f thru 0x0d1823	(0x6605 bytes)
      All operations completed A-OK!
      Chance of close/open is 1 in 50
      Seed set to 4612
      fd 0: /mnt/lustre/f16a.sanityn
      fd 1: /mnt/lustre2/f16a.sanityn
      skipping zero size read
      truncating to largest ever: 0x2bff39
      truncating to largest ever: 0x732365
      truncating to largest ever: 0x96ec16
      

      but both clients have network problems. Looking at the console log for client 1 (vm1), we see network errors until the clients time out:

      [21386.685578] Lustre: DEBUG MARKER: == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411)
      [21607.180560] Lustre: 10756:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484625/real 1544484625]  req@00000000d0fa8236 x1619508903774448/t0(0) o4->lustre-OST0006-osc-0000000057afebbd@10.9.4.226@tcp:6/4 lens 488/448 e 0 to 1 dl 1544484632 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [21607.183647] Lustre: lustre-OST0006-osc-0000000057afebbd: Connection to lustre-OST0006 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21612.300526] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21612.301698] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21612.302806] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1544484637/real 1544484637]  req@00000000013babff x1619508903775008/t0(0) o400->MGC10.9.4.227@tcp@10.9.4.227@tcp:26/25 lens 224/224 e 0 to 1 dl 1544484681 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [21612.305632] LustreError: 166-1: MGC10.9.4.227@tcp: Connection to MGS (at 10.9.4.227@tcp) was lost; in progress operations using this service will fail
      [21612.307127] Lustre: lustre-OST0000-osc-00000000c6515b26: Connection to lustre-OST0000 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21613.964225] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630]  req@00000000cfba67eb x1619508903774608/t0(0) o400->lustre-MDT0000-mdc-0000000057afebbd@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484638 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21613.967218] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
      [21616.396101] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484632/real 1544484632]  req@00000000d57afc11 x1619508903774752/t0(0) o400->lustre-MDT0000-mdc-00000000c6515b26@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21616.399086] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [21617.420470] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21617.421604] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21617.422604] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21617.423647] LNetError: Skipped 1 previous similar message
      [21622.540157] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21622.541290] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21622.542280] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21622.543328] LNetError: Skipped 1 previous similar message
      [21627.660041] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21627.661169] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21627.662161] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21627.663208] LNetError: Skipped 1 previous similar message
      [21637.899461] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21637.899548] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
      [21637.899550] LNetError: Skipped 1 previous similar message
      [21637.902243] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 2 previous similar messages
      [21649.418594] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630]  req@00000000756c150f x1619508903774512/t0(0) o400->lustre-OST0001-osc-00000000c6515b26@10.9.4.226@tcp:28/4 lens 224/224 e 0 to 1 dl 1544484674 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [21653.258745] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21653.258833] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
      [21653.258835] LNetError: Skipped 1 previous similar message
      [21653.261492] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21689.097080] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21689.098237] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21689.099278] LNetError: Skipped 1 previous similar message
      [21750.534239] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21750.535386] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
      [21750.536380] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21750.537421] LNetError: Skipped 1 previous similar message
      [21784.324408] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
      [21798.339708] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      [21802.371510] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
      [21805.379406] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
      [21873.408580] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [21873.409720] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 3 previous similar messages
      [21873.410703] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
      [21873.411746] LNetError: Skipped 3 previous similar messages
      [21978.811391] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      [21983.995128] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      [21985.851057] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
      [21990.842881] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
      [22057.720105] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
      [22057.720179] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
      [22057.720181] LNetError: Skipped 5 previous similar messages
      [22057.722918] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 6 previous similar messages
      [22159.283076] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
      

      I’ve found only one other failure:
      https://testing.whamcloud.com/test_sets/63c6b2f6-fdd5-11e8-93ea-52540065bddc
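
      For reference, the -99 in "Error trying to bind to port 1023: -99" is a negated Linux errno: 99 is EADDRNOTAVAIL. lnet_sock_create() is binding the outgoing connection to a privileged source port (1023 in these logs), and bind() returns EADDRNOTAVAIL when the requested local address/port cannot be assigned on the node. A minimal user-space sketch of the same errno (the 203.0.113.1 address below is a hypothetical non-local example, not taken from the logs):

      ```python
      import errno
      import socket

      # -99 from the LNet log is a negated Linux errno value:
      print(errno.errorcode[99])  # EADDRNOTAVAIL

      # bind() fails with EADDRNOTAVAIL when the requested local address is
      # not configured on any interface. 203.0.113.1 (TEST-NET-3) is a
      # hypothetical non-local address used here only to reproduce the errno.
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      try:
          s.bind(("203.0.113.1", 0))
      except OSError as e:
          print(e.errno == errno.EADDRNOTAVAIL)
      finally:
          s.close()
      ```

      One plausible reading, under that assumption: port exhaustion or a permission problem would surface as EADDRINUSE or EACCES instead, so EADDRNOTAVAIL here points at the client's local address transiently disappearing (e.g. a VM/network hiccup) rather than at the port-selection logic itself.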

Activity

            I haven't seen this with newer Lustre versions on newer Ubuntu. We can reopen this ticket if it comes back.

            simmonsja James A Simmons added a comment

            Saw something similar in test_7f of replay-vbr on 2.12.8: https://testing.whamcloud.com/test_sets/098c125f-9e10-43ac-b73c-baafe9bbe86f

            Ubuntu 18.04 clients hitting similar network issues:

            [21608.094576] LNetError: 2319:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21608.096826] LNetError: 11e-e: Unexpected error -99 connecting to 10.240.29.171@tcp at host 10.240.29.171 on port 7988
            [21613.214389] LNetError: 2322:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21613.216436] LNetError: 11e-e: Unexpected error -99 connecting to 10.240.29.171@tcp at host 10.240.29.171 on port 7988
            [21617.306228] LustreError: 2330:0:(client.c:2997:ptlrpc_replay_interpret()) @@@ request replay timed out.
            ...
            [22366.405833] nfs: server 10.240.16.204 not responding, timed out
            [22380.421440] nfs: server 10.240.16.204 not responding, timed out
            [22547.328926] nfs: server 10.240.16.204 not responding, timed out
            [22565.760422] nfs: server 10.240.16.204 not responding, timed out
            [22728.315978] nfs: server 10.240.16.204 not responding, timed out
            [22751.355430] nfs: server 10.240.16.204 not responding, timed out
            [22909.047116] nfs: server 10.240.16.204 not responding, timed out
            anikitenko Alena Nikitenko (Inactive) added a comment

            James - Here's a similar failure at https://testing.whamcloud.com/test_sets/4897bd60-d3a8-41fe-89db-1d13f25b2e92 . Do you think it is the same as reported here?

            jamesanunez James Nunez (Inactive) added a comment

            Is this still true?

            simmonsja James A Simmons added a comment

            I'm seeing the same errors during racer testing for 2.12.5 RC1 at https://testing.whamcloud.com/test_sets/0fc15871-31c3-40b1-aa09-694b1a36afb3 .

            The client 1 (vm7) console log shows:

            [20903.961754] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-14vm7.trevis.whamcloud.com,trevis-14vm8 DURATION=900 ======= 18:04:46 (1590861886)
            [20904.169542] Lustre: DEBUG MARKER: DURATION=900 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
            [20904.181545] Lustre: DEBUG MARKER: DURATION=900 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
            [20904.538841] Lustre: lustre-OST0000-osc-ffff8de3b4695800: disconnect after 20s idle
            [20904.618933] Lustre: lustre-OST0000-osc-ffff8de3b4695800: reconnect after 0s idle
            [20953.288826] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000402:0x264:0x0], use llapi_layout_get_by_path()
            [21068.113780] Lustre: 28140:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862007/real 1590862007]  req@0000000050b212d2 x1668139453448448/t0(0) o36->lustre-MDT0000-mdc-ffff8de3b4695800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862051 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21068.118869] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21068.125534] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21195.090530] Lustre: 14126:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862134/real 1590862134]  req@000000004caf3494 x1668139464521600/t0(0) o36->lustre-MDT0000-mdc-ffff8de3b4695800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862178 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21195.096279] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21195.103659] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21226.350128] 14[26627]: segfault at 1b70 ip 0000000000001b70 sp 00007ffc44a62040 error 14
            [21228.150915] 10[31931]: segfault at 1b70 ip 0000000000001b70 sp 00007fff744b4af0 error 14
            [21228.808660] 12[432]: segfault at 1b70 ip 0000000000001b70 sp 00007ffcca9c08d0 error 14
            [21366.088353] Lustre: 17508:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862304/real 1590862304]  req@000000009208147b x1668139477794368/t0(0) o36->lustre-MDT0000-mdc-ffff8de3faaae800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862349 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21366.093563] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21366.099720] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21426.506313] Lustre: 10727:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862365/real 1590862365]  req@0000000003372c20 x1668139483875584/t0(0) o36->lustre-MDT0000-mdc-ffff8de3faaae800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862409 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21426.511459] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21426.520323] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21427.877218] 6[7435]: segfault at 8 ip 00007f588ef337f9 sp 00007fffabb2b600 error 4 in ld-2.27.so[7f588ef27000+27000]
            [21477.700923] Lustre: lustre-OST0000-osc-ffff8de3b4695800: disconnect after 20s idle
            [21477.704381] Lustre: Skipped 6 previous similar messages
            [21482.820650] Lustre: lustre-OST0000-osc-ffff8de3faaae800: disconnect after 20s idle
            [21482.822363] Lustre: Skipped 1 previous similar message
            [21629.248858] INFO: task dir_create.sh:19858 blocked for more than 120 seconds.
            [21629.250432]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.251636] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.253042] dir_create.sh   D    0 19858  19803 0x00000000
            [21629.254205] Call Trace:
            [21629.254816]  __schedule+0x24e/0x880
            [21629.255521]  schedule+0x2c/0x80
            [21629.256162]  rwsem_down_read_failed+0xf0/0x160
            [21629.257115]  call_rwsem_down_read_failed+0x18/0x30
            [21629.258002]  ? call_rwsem_down_read_failed+0x18/0x30
            [21629.258911]  down_read+0x20/0x40
            [21629.259569]  lookup_slow+0x60/0x170
            [21629.260261]  ? lookup_fast+0xe8/0x320
            [21629.261063]  walk_component+0x1c3/0x470
            [21629.261829]  ? __inode_permission+0x5b/0x160
            [21629.262633]  link_path_walk+0x1bc/0x5a0
            [21629.263365]  ? path_init+0x180/0x310
            [21629.264063]  path_openat+0xe1/0x1960
            [21629.264935]  ? ptep_set_access_flags+0x27/0x30
            [21629.265773]  do_filp_open+0x9b/0x110
            [21629.266478]  ? __check_object_size+0xc8/0x1b0
            [21629.267302]  ? __alloc_fd+0xb2/0x170
            [21629.268007]  do_sys_open+0x1bb/0x2c0
            [21629.268826]  ? do_sys_open+0x1bb/0x2c0
            [21629.269549]  SyS_openat+0x14/0x20
            [21629.270216]  do_syscall_64+0x73/0x130
            [21629.270925]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.271867] RIP: 0033:0x7fbeb8a37c8e
            [21629.272571] RSP: 002b:00007ffdb3932d30 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.274046] RAX: ffffffffffffffda RBX: 0000560096f38080 RCX: 00007fbeb8a37c8e
            [21629.275310] RDX: 0000000000000241 RSI: 0000560096f3cd80 RDI: 00000000ffffff9c
            [21629.276570] RBP: 00007ffdb3932e30 R08: 0000000000000020 R09: 0000000000000000
            [21629.277920] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
            [21629.279173] R13: 0000000000000003 R14: 0000000000000001 R15: 0000560096f3cd80
            [21629.280580] INFO: task dir_create.sh:19920 blocked for more than 120 seconds.
            [21629.281911]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.283107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.284485] dir_create.sh   D    0 19920  19867 0x00000000
            [21629.285562] Call Trace:
            [21629.286081]  __schedule+0x24e/0x880
            [21629.286764]  schedule+0x2c/0x80
            [21629.287390]  rwsem_down_read_failed+0xf0/0x160
            [21629.288236]  call_rwsem_down_read_failed+0x18/0x30
            [21629.289247]  ? call_rwsem_down_read_failed+0x18/0x30
            [21629.290156]  down_read+0x20/0x40
            [21629.290790]  lookup_slow+0x60/0x170
            [21629.291469]  ? lookup_fast+0xe8/0x320
            [21629.292188]  walk_component+0x1c3/0x470
            [21629.293046]  ? __inode_permission+0x5b/0x160
            [21629.293877]  link_path_walk+0x1bc/0x5a0
            [21629.294618]  ? path_init+0x180/0x310
            [21629.295312]  path_openat+0xe1/0x1960
            [21629.296018]  ? ptep_set_access_flags+0x27/0x30
            [21629.296974]  do_filp_open+0x9b/0x110
            [21629.297675]  ? __check_object_size+0xc8/0x1b0
            [21629.298491]  ? __alloc_fd+0xb2/0x170
            [21629.299187]  do_sys_open+0x1bb/0x2c0
            [21629.299873]  ? do_sys_open+0x1bb/0x2c0
            [21629.300605]  SyS_openat+0x14/0x20
            [21629.301274]  do_syscall_64+0x73/0x130
            [21629.301981]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.302903] RIP: 0033:0x7ff250562c8e
            [21629.303596] RSP: 002b:00007ffc6e694450 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.304940] RAX: ffffffffffffffda RBX: 000055b89cc350c0 RCX: 00007ff250562c8e
            [21629.306215] RDX: 0000000000000241 RSI: 000055b89cc35100 RDI: 00000000ffffff9c
            [21629.307469] RBP: 00007ffc6e694550 R08: 0000000000000020 R09: 0000000000000000
            [21629.308730] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
            [21629.309977] R13: 0000000000000003 R14: 0000000000000001 R15: 000055b89cc35100
            [21629.311518] INFO: task lfs:25776 blocked for more than 120 seconds.
            [21629.312654]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.313858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.315226] lfs             D    0 25776  19857 0x00000000
            [21629.316239] Call Trace:
            [21629.316766]  __schedule+0x24e/0x880
            [21629.317653]  ? lprocfs_counter_add+0xeb/0x150 [obdclass]
            [21629.318626]  schedule+0x2c/0x80
            [21629.319250]  rwsem_down_write_failed+0x1ea/0x360
            [21629.320241]  ? ll_inode_size_unlock+0x15/0x20 [lustre]
            [21629.321193]  call_rwsem_down_write_failed+0x17/0x30
            [21629.322089]  ? call_rwsem_down_write_failed+0x17/0x30
            [21629.323015]  down_write+0x2d/0x40
            [21629.323672]  path_openat+0x356/0x1960
            [21629.324392]  do_filp_open+0x9b/0x110
            [21629.325092]  ? ll_rmdir_entry+0x3c0/0x3c0 [lustre]
            [21629.325986]  do_sys_open+0x1bb/0x2c0
            [21629.326680]  ? do_sys_open+0x1bb/0x2c0
            [21629.327400]  SyS_openat+0x14/0x20
            [21629.328061]  do_syscall_64+0x73/0x130
            [21629.328771]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.329698] RIP: 0033:0x7f72fc9eac8e
            [21629.330395] RSP: 002b:00007ffed4945110 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.331756] RAX: ffffffffffffffda RBX: 000055e739443280 RCX: 00007f72fc9eac8e
            [21629.333017] RDX: 0000000001002141 RSI: 00007ffed4953cd3 RDI: 00000000ffffff9c
            [21629.334270] RBP: 000055e739443280 R08: 0000000000000000 R09: 0000000000000000
            [21629.335517] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000041
            [21629.336781] R13: 00000000000001b6 R14: 0000000000000000 R15: 0000000000000000
            

            In the client 2 (vm8) console log, we see network errors:

            [20934.456415] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000403:0xb5:0x0], use llapi_layout_get_by_path()
            [20946.138560] LNetError: 22496:0:(socklnd_cb.c:2594:ksocknal_check_peer_timeouts()) Total 1 stale ZC_REQs for peer_ni 10.9.4.103@tcp detected; the oldest(0000000097af845a) timed out 0 secs ago, resid: 0, wmem: 2965805
            [20946.142034] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.143871] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.145706] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.147532] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.480440] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000146f27d1
            [20946.482364] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 000000007be56e5a
            [20946.484188] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000cdf465c0
            [20946.484211] Lustre: 22503:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1590861904/real 1590861904]  req@000000009667adf0 x1668132478057280/t0(0) o4->lustre-OST0000-osc-ffff8af4b4acf800@10.9.4.103@tcp:6/4 lens 488/448 e 0 to 1 dl 1590861948 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            [20946.491757] Lustre: lustre-OST0000-osc-ffff8af4b4acf800: Connection to lustre-OST0000 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20946.744044] LNetError: 22496:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.103@tcp[2], ip 10.9.4.103:7988, with error, wanted: 1026952, left: 1026952, last alive is 1 secs ago
            [20946.747968] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000b1fe4bb0
            [20949.172081] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1590861907/real 1590861907]  req@000000008d7b3614 x1668132478064448/t0(0) o3->lustre-OST0006-osc-ffff8af4b4acf800@10.9.4.103@tcp:6/4 lens 488/440 e 0 to 1 dl 1590861951 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            [20949.192238] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [20949.194407] Lustre: lustre-OST0006-osc-ffff8af4b4acf800: Connection to lustre-OST0006 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20949.197915] Lustre: Skipped 2 previous similar messages
            [20950.518986] Lustre: lustre-OST0006-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [20951.857168] Lustre: 22503:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590861908/real 1590861908]  req@0000000024c2c58e x1668132478064512/t0(0) o103->lustre-OST0002-osc-ffff8af4b4acf800@10.9.4.103@tcp:17/18 lens 328/224 e 0 to 1 dl 1590861915 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
            [20956.538413] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 000000007be56e5a
            [20956.540719] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1590861914/real 1590861914]  req@000000004081e96d x1668132478063296/t0(0) o4->lustre-OST0005-osc-ffff8af4baa22800@10.9.4.103@tcp:6/4 lens 488/448 e 0 to 1 dl 1590861961 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
            [20956.545560] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [20956.547235] Lustre: lustre-OST0005-osc-ffff8af4baa22800: Connection to lustre-OST0005 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20956.550051] Lustre: Skipped 1 previous similar message
            [20956.557091] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000cdf465c0
            [20956.558950] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000146f27d1
            [20956.561422] LNetError: 22496:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.103@tcp[2], ip 10.9.4.103:7988, with error, wanted: 832920, left: 832920, last alive is 0 secs ago
            [20956.564845] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000b1fe4bb0
            [20956.568052] Lustre: lustre-OST0002-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [20956.570544] Lustre: Skipped 3 previous similar messages
            [21000.655448] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590861909/real 1590861909]  req@00000000b19b17c4 x1668132478066944/t0(0) o2->lustre-OST0004-osc-ffff8af4b4acf800@10.9.4.103@tcp:28/4 lens 440/432 e 0 to 1 dl 1590861955 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
            [21000.660490] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
            [21000.662177] Lustre: lustre-OST0004-osc-ffff8af4b4acf800: Connection to lustre-OST0004 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21000.665033] Lustre: Skipped 8 previous similar messages
            [21000.673077] Lustre: lustre-OST0004-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [21000.674903] Lustre: Skipped 8 previous similar messages
            [21101.448283] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000e20e6431
            [21101.450476] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000e20e6431
            [21101.452515] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000e20e6431
            [21101.456569] Lustre: 22503:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1590862059/real 1590862059]  req@000000000cba5a20 x1668132491135168/t0(0) o4->lustre-OST0001-osc-ffff8af4b4acf800@10.9.4.103@tcp:6/4 lens 488/448 e 0 to 1 dl 1590862103 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            [21101.461822] Lustre: lustre-OST0001-osc-ffff8af4b4acf800: Connection to lustre-OST0001 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21101.847254] LNetError: 22496:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.103@tcp[2], ip 10.9.4.103:7988, with error, wanted: 915456, left: 915456, last alive is 2 secs ago
            [21101.850888] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000f1d32768
            [21104.722915] Lustre: lustre-OST0001-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [21147.975180] Lustre: 10638:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862063/real 1590862063]  req@0000000006a79285 x1668132491143808/t0(0) o101->lustre-OST0003-osc-ffff8af4b4acf800@10.9.4.103@tcp:28/4 lens 328/400 e 0 to 1 dl 1590862111 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21147.980104] Lustre: 10638:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [21147.981771] Lustre: lustre-OST0003-osc-ffff8af4b4acf800: Connection to lustre-OST0003 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21147.984604] Lustre: Skipped 1 previous similar message
            [21147.988318] Lustre: lustre-OST0003-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [21147.990128] Lustre: Skipped 1 previous similar message
            [21185.865976] Lustre: lustre-MDT0000-mdc-ffff8af4baa22800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21194.289475] LustreError: 11-0: lustre-OST0006-osc-ffff8af4b4acf800: operation ldlm_enqueue to node 10.9.4.103@tcp failed: rc = -107
            [21194.298675] LustreError: 167-0: lustre-OST0006-osc-ffff8af4b4acf800: This client was evicted by lustre-OST0006; in progress operations using this service will fail.
            [21194.304663] Lustre: 22504:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.3.4@tcp:/lustre/fid: [0x200000402:0x12ec:0x0]// may get corrupted (rc -5)
            [21194.311015] LustreError: 25271:0:(ldlm_resource.c:1147:ldlm_resource_complain()) lustre-OST0006-osc-ffff8af4b4acf800: namespace resource [0x10e:0x0:0x0].0x0 (000000004b27c11c) refcount nonzero (1) after lock cleanup; forcing cleanup.
            [21212.020274] Lustre: lustre-OST0006-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [21212.022138] Lustre: Skipped 1 previous similar message
            [21215.826206] 16[27360]: segfault at 8 ip 00007f14b5f397f9 sp 00007ffe059b9b50 error 4 in ld-2.27.so[7f14b5f2d000+27000]
            [21250.933293] 12[6132]: segfault at 8 ip 00007f5a327d47f9 sp 00007ffedbcb0580 error 4 in ld-2.27.so[7f5a327c8000+27000]
            [21324.097711] Lustre: 6330:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862243/real 1590862243]  req@000000001e186cb4 x1668132507391872/t0(0) o36->lustre-MDT0000-mdc-ffff8af4baa22800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862287 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21324.102694] Lustre: 6330:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [21324.104607] Lustre: lustre-MDT0000-mdc-ffff8af4baa22800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21324.107419] Lustre: Skipped 3 previous similar messages
            [21324.113625] Lustre: lustre-MDT0000-mdc-ffff8af4baa22800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21400.897858] Lustre: lustre-MDT0000-mdc-ffff8af4baa22800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21464.636949] Lustre: 19665:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862411/real 1590862411]  req@000000009ac917e6 x1668132520702080/t0(0) o36->lustre-MDT0000-mdc-ffff8af4b4acf800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862428 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21464.641922] Lustre: 19665:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            [21464.643588] Lustre: lustre-MDT0000-mdc-ffff8af4b4acf800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21464.646984] Lustre: Skipped 2 previous similar messages
            [21618.232901] LNetError: 22493:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21618.234795] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.3.4@tcp at host 10.9.3.4 on port 7988
            [21618.237666] LustreError: 166-1: MGC10.9.3.4@tcp: Connection to MGS (at 10.9.3.4@tcp) was lost; in progress operations using this service will fail
            [21618.904052] LNetError: 22494:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21618.904165] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.103@tcp at host 10.9.4.103 on port 7988
            [21618.905805] LNetError: 22494:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages
            [21618.907518] LNetError: Skipped 1 previous similar message
            [21623.352099] LNetError: 22495:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21623.353922] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.3.4@tcp at host 10.9.3.4 on port 7988
            [21628.472251] LNetError: 22494:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21628.472300] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.103@tcp at host 10.9.4.103 on port 7988
            [21628.474057] LNetError: 22494:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages
            [21628.475818] LNetError: Skipped 1 previous similar message
            [21629.240579] INFO: task dir_create.sh:11445 blocked for more than 120 seconds.
            [21629.241918]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.243111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.244474] dir_create.sh   D    0 11445  11349 0x00000000
            [21629.245474] Call Trace:
            [21629.246074]  __schedule+0x24e/0x880
            [21629.246758]  ? __switch_to_asm+0x35/0x70
            [21629.247497]  ? __switch_to_asm+0x41/0x70
            [21629.248243]  schedule+0x2c/0x80
            [21629.248864]  rwsem_down_read_failed+0xf0/0x160
            [21629.249708]  ? __switch_to+0x123/0x4e0
            [21629.250435]  call_rwsem_down_read_failed+0x18/0x30
            [21629.251306]  ? call_rwsem_down_read_failed+0x18/0x30
            [21629.252222]  down_read+0x20/0x40
            [21629.252882]  lookup_slow+0x60/0x170
            [21629.253550]  ? lookup_fast+0xe8/0x320
            [21629.254244]  walk_component+0x1c3/0x470
            [21629.254962]  ? __inode_permission+0x5b/0x160
            [21629.255767]  link_path_walk+0x1bc/0x5a0
            [21629.256487]  ? path_init+0x180/0x310
            [21629.257167]  path_openat+0xe1/0x1960
            [21629.257865]  ? ptep_set_access_flags+0x27/0x30
            [21629.258676]  do_filp_open+0x9b/0x110
            [21629.259371]  ? __check_object_size+0xc8/0x1b0
            [21629.260186]  ? __alloc_fd+0xb2/0x170
            [21629.260869]  do_sys_open+0x1bb/0x2c0
            [21629.261543]  ? do_sys_open+0x1bb/0x2c0
            [21629.262247]  SyS_openat+0x14/0x20
            [21629.262893]  do_syscall_64+0x73/0x130
            [21629.263602]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.264525] RIP: 0033:0x7f3817c80c8e
            [21629.265208] RSP: 002b:00007ffc9a42f820 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.266521] RAX: ffffffffffffffda RBX: 00005577ca65c080 RCX: 00007f3817c80c8e
            [21629.267771] RDX: 0000000000000241 RSI: 00005577ca65c4d0 RDI: 00000000ffffff9c
            [21629.269001] RBP: 00007ffc9a42f920 R08: 0000000000000020 R09: 0000000000000000
            [21629.270230] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
            [21629.271461] R13: 0000000000000003 R14: 0000000000000001 R15: 00005577ca65c4d0
            [21629.272909] INFO: task ls:9059 blocked for more than 120 seconds.
            [21629.273981]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.275172] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.276519] ls              D    0  9059  11367 0x00000000
            [21629.277498] Call Trace:
            [21629.278059]  __schedule+0x24e/0x880
            [21629.278816]  schedule+0x2c/0x80
            [21629.279431]  rwsem_down_read_failed+0xf0/0x160
            [21629.280261]  call_rwsem_down_read_failed+0x18/0x30
            [21629.281130]  ? call_rwsem_down_read_failed+0x18/0x30
            [21629.282025]  down_read+0x20/0x40
            [21629.282650]  lookup_slow+0x60/0x170
            [21629.283318]  ? lookup_fast+0xe8/0x320
            [21629.284020]  walk_component+0x1c3/0x470
            [21629.284744]  ? __inode_permission+0x5b/0x160
            [21629.285525]  link_path_walk+0x1bc/0x5a0
            [21629.286245]  ? path_init+0x180/0x310
            [21629.286927]  path_openat+0xe1/0x1960
            [21629.287634]  ? __slab_free+0x14d/0x2c0
            [21629.288350]  ? ___slab_alloc+0x20a/0x4b0
            [21629.289083]  ? ___slab_alloc+0x20a/0x4b0
            [21629.289814]  ? __mutex_unlock_slowpath.isra.11+0xcc/0x120
            [21629.290771]  do_filp_open+0x9b/0x110
            [21629.291451]  ? __check_object_size+0xc8/0x1b0
            [21629.292260]  ? __alloc_fd+0xb2/0x170
            [21629.292942]  do_sys_open+0x1bb/0x2c0
            [21629.293626]  ? do_sys_open+0x1bb/0x2c0
            [21629.294337]  SyS_openat+0x14/0x20
            [21629.294983]  do_syscall_64+0x73/0x130
            [21629.295689]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.296591] RIP: 0033:0x7fe6e6f97db1
            [21629.297270] RSP: 002b:00007ffd9c06d600 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
            [21629.298568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe6e6f97db1
            [21629.299807] RDX: 0000000000090800 RSI: 000055d903b6bc40 RDI: 00000000ffffff9c
            [21629.301034] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
            [21629.302261] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fe6e76bced8
            [21629.303486] R13: 0000000000000003 R14: 000055d903b6bc40 R15: 000055d9026e3c4a
            
            jamesanunez James Nunez (Inactive) added a comment - I'm seeing the same errors during racer testing for 2.12.5 RC1 at https://testing.whamcloud.com/test_sets/0fc15871-31c3-40b1-aa09-694b1a36afb3 . The client 1 (vm7) console log shows:
            [20903.961754] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-14vm7.trevis.whamcloud.com,trevis-14vm8 DURATION=900 ======= 18:04:46 (1590861886)
            [20904.169542] Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 OSTCOUNT=7 RACER_ENABLE_REMOTE_DIRS=false RACER_ENABLE_STRIPED_DIRS=false RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs /usr/lib64/lustre/t
            [20904.181545] Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 OSTCOUNT=7 RACER_ENABLE_REMOTE_DIRS=false RACER_ENABLE_STRIPED_DIRS=false RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs /usr/lib64/lustre/t
            [20904.538841] Lustre: lustre-OST0000-osc-ffff8de3b4695800: disconnect after 20s idle
            [20904.618933] Lustre: lustre-OST0000-osc-ffff8de3b4695800: reconnect after 0s idle
            [20953.288826] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000402:0x264:0x0], use llapi_layout_get_by_path()
            [21068.113780] Lustre: 28140:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862007/real 1590862007] req@0000000050b212d2 x1668139453448448/t0(0) o36->lustre-MDT0000-mdc-ffff8de3b4695800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862051 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21068.118869] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21068.125534] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21195.090530] Lustre: 14126:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862134/real 1590862134] req@000000004caf3494 x1668139464521600/t0(0) o36->lustre-MDT0000-mdc-ffff8de3b4695800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862178 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21195.096279] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21195.103659] Lustre: lustre-MDT0000-mdc-ffff8de3b4695800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21226.350128] 14[26627]: segfault at 1b70 ip 0000000000001b70 sp 00007ffc44a62040 error 14
            [21228.150915] 10[31931]: segfault at 1b70 ip 0000000000001b70 sp 00007fff744b4af0 error 14
            [21228.808660] 12[432]: segfault at 1b70 ip 0000000000001b70 sp 00007ffcca9c08d0 error 14
            [21366.088353] Lustre: 17508:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862304/real 1590862304] req@000000009208147b x1668139477794368/t0(0) o36->lustre-MDT0000-mdc-ffff8de3faaae800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862349 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21366.093563] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21366.099720] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21426.506313] Lustre: 10727:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590862365/real 1590862365] req@0000000003372c20 x1668139483875584/t0(0) o36->lustre-MDT0000-mdc-ffff8de3faaae800@10.9.3.4@tcp:12/10 lens 488/4528 e 0 to 1 dl 1590862409 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [21426.511459] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection to lustre-MDT0000 (at 10.9.3.4@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21426.520323] Lustre: lustre-MDT0000-mdc-ffff8de3faaae800: Connection restored to 10.9.3.4@tcp (at 10.9.3.4@tcp)
            [21427.877218] 6[7435]: segfault at 8 ip 00007f588ef337f9 sp 00007fffabb2b600 error 4 in ld-2.27.so[7f588ef27000+27000]
            [21477.700923] Lustre: lustre-OST0000-osc-ffff8de3b4695800: disconnect after 20s idle
            [21477.704381] Lustre: Skipped 6 previous similar messages
            [21482.820650] Lustre: lustre-OST0000-osc-ffff8de3faaae800: disconnect after 20s idle
            [21482.822363] Lustre: Skipped 1 previous similar message
            [21629.248858] INFO: task dir_create.sh:19858 blocked for more than 120 seconds.
            [21629.250432]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.251636] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.253042] dir_create.sh   D    0 19858  19803 0x00000000
            [21629.254205] Call Trace:
            [21629.254816]  __schedule+0x24e/0x880
            [21629.255521]  schedule+0x2c/0x80
            [21629.256162]  rwsem_down_read_failed+0xf0/0x160
            [21629.257115]  call_rwsem_down_read_failed+0x18/0x30
            [21629.258002]  ? call_rwsem_down_read_failed+0x18/0x30
            [21629.258911]  down_read+0x20/0x40
            [21629.259569]  lookup_slow+0x60/0x170
            [21629.260261]  ? lookup_fast+0xe8/0x320
            [21629.261063]  walk_component+0x1c3/0x470
            [21629.261829]  ? __inode_permission+0x5b/0x160
            [21629.262633]  link_path_walk+0x1bc/0x5a0
            [21629.263365]  ? path_init+0x180/0x310
            [21629.264063]  path_openat+0xe1/0x1960
            [21629.264935]  ? ptep_set_access_flags+0x27/0x30
            [21629.265773]  do_filp_open+0x9b/0x110
            [21629.266478]  ? __check_object_size+0xc8/0x1b0
            [21629.267302]  ? __alloc_fd+0xb2/0x170
            [21629.268007]  do_sys_open+0x1bb/0x2c0
            [21629.268826]  ? do_sys_open+0x1bb/0x2c0
            [21629.269549]  SyS_openat+0x14/0x20
            [21629.270216]  do_syscall_64+0x73/0x130
            [21629.270925]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.271867] RIP: 0033:0x7fbeb8a37c8e
            [21629.272571] RSP: 002b:00007ffdb3932d30 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.274046] RAX: ffffffffffffffda RBX: 0000560096f38080 RCX: 00007fbeb8a37c8e
            [21629.275310] RDX: 0000000000000241 RSI: 0000560096f3cd80 RDI: 00000000ffffff9c
            [21629.276570] RBP: 00007ffdb3932e30 R08: 0000000000000020 R09: 0000000000000000
            [21629.277920] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
            [21629.279173] R13: 0000000000000003 R14: 0000000000000001 R15: 0000560096f3cd80
            [21629.280580] INFO: task dir_create.sh:19920 blocked for more than 120 seconds.
            [21629.281911]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.283107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.284485] dir_create.sh   D    0 19920  19867 0x00000000
            [21629.285562] Call Trace:
            [21629.286081]  __schedule+0x24e/0x880
            [21629.286764]  schedule+0x2c/0x80
            [21629.287390]  rwsem_down_read_failed+0xf0/0x160
            [21629.288236]  call_rwsem_down_read_failed+0x18/0x30
            [21629.289247]  ? call_rwsem_down_read_failed+0x18/0x30
            [21629.290156]  down_read+0x20/0x40
            [21629.290790]  lookup_slow+0x60/0x170
            [21629.291469]  ? lookup_fast+0xe8/0x320
            [21629.292188]  walk_component+0x1c3/0x470
            [21629.293046]  ? __inode_permission+0x5b/0x160
            [21629.293877]  link_path_walk+0x1bc/0x5a0
            [21629.294618]  ? path_init+0x180/0x310
            [21629.295312]  path_openat+0xe1/0x1960
            [21629.296018]  ? ptep_set_access_flags+0x27/0x30
            [21629.296974]  do_filp_open+0x9b/0x110
            [21629.297675]  ? __check_object_size+0xc8/0x1b0
            [21629.298491]  ? __alloc_fd+0xb2/0x170
            [21629.299187]  do_sys_open+0x1bb/0x2c0
            [21629.299873]  ? do_sys_open+0x1bb/0x2c0
            [21629.300605]  SyS_openat+0x14/0x20
            [21629.301274]  do_syscall_64+0x73/0x130
            [21629.301981]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.302903] RIP: 0033:0x7ff250562c8e
            [21629.303596] RSP: 002b:00007ffc6e694450 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.304940] RAX: ffffffffffffffda RBX: 000055b89cc350c0 RCX: 00007ff250562c8e
            [21629.306215] RDX: 0000000000000241 RSI: 000055b89cc35100 RDI: 00000000ffffff9c
            [21629.307469] RBP: 00007ffc6e694550 R08: 0000000000000020 R09: 0000000000000000
            [21629.308730] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
            [21629.309977] R13: 0000000000000003 R14: 0000000000000001 R15: 000055b89cc35100
            [21629.311518] INFO: task lfs:25776 blocked for more than 120 seconds.
            [21629.312654]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
            [21629.313858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [21629.315226] lfs             D    0 25776  19857 0x00000000
            [21629.316239] Call Trace:
            [21629.316766]  __schedule+0x24e/0x880
            [21629.317653]  ? lprocfs_counter_add+0xeb/0x150 [obdclass]
            [21629.318626]  schedule+0x2c/0x80
            [21629.319250]  rwsem_down_write_failed+0x1ea/0x360
            [21629.320241]  ? ll_inode_size_unlock+0x15/0x20 [lustre]
            [21629.321193]  call_rwsem_down_write_failed+0x17/0x30
            [21629.322089]  ? call_rwsem_down_write_failed+0x17/0x30
            [21629.323015]  down_write+0x2d/0x40
            [21629.323672]  path_openat+0x356/0x1960
            [21629.324392]  do_filp_open+0x9b/0x110
            [21629.325092]  ? ll_rmdir_entry+0x3c0/0x3c0 [lustre]
            [21629.325986]  do_sys_open+0x1bb/0x2c0
            [21629.326680]  ? do_sys_open+0x1bb/0x2c0
            [21629.327400]  SyS_openat+0x14/0x20
            [21629.328061]  do_syscall_64+0x73/0x130
            [21629.328771]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
            [21629.329698] RIP: 0033:0x7f72fc9eac8e
            [21629.330395] RSP: 002b:00007ffed4945110 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
            [21629.331756] RAX: ffffffffffffffda RBX: 000055e739443280 RCX: 00007f72fc9eac8e
            [21629.333017] RDX: 0000000001002141 RSI: 00007ffed4953cd3 RDI: 00000000ffffff9c
            [21629.334270] RBP: 000055e739443280 R08: 0000000000000000 R09: 0000000000000000
            [21629.335517] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000041
            [21629.336781] R13: 00000000000001b6 R14: 0000000000000000 R15: 0000000000000000
            In the client 2 (vm8) console log, we see network errors:
            [20934.456415] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000403:0xb5:0x0], use llapi_layout_get_by_path()
            [20946.138560] LNetError: 22496:0:(socklnd_cb.c:2594:ksocknal_check_peer_timeouts()) Total 1 stale ZC_REQs for peer_ni 10.9.4.103@tcp detected; the oldest(0000000097af845a) timed out 0 secs ago, resid: 0, wmem: 2965805
            [20946.142034] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.143871] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.145706] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.147532] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 0000000092b04732
            [20946.480440] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000146f27d1
            [20946.482364] LustreError: 22498:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 000000007be56e5a
            [20946.484188] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 00000000cdf465c0
            [20946.484211] Lustre: 22503:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1590861904/real 1590861904] req@000000009667adf0 x1668132478057280/t0(0) o4->lustre-OST0000-osc-ffff8af4b4acf800@10.9.4.103@tcp:6/4 lens 488/448 e 0 to 1 dl 1590861948 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            [20946.491757] Lustre: lustre-OST0000-osc-ffff8af4b4acf800: Connection to lustre-OST0000 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20946.744044] LNetError: 22496:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.103@tcp[2], ip 10.9.4.103:7988, with error, wanted: 1026952, left: 1026952, last alive is 1 secs ago
            [20946.747968] LustreError: 22496:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000b1fe4bb0
            [20949.172081] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1590861907/real 1590861907] req@000000008d7b3614 x1668132478064448/t0(0) o3->lustre-OST0006-osc-ffff8af4b4acf800@10.9.4.103@tcp:6/4 lens 488/440 e 0 to 1 dl 1590861951 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            [20949.192238] Lustre: 22504:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [20949.194407] Lustre: lustre-OST0006-osc-ffff8af4b4acf800: Connection to lustre-OST0006 (at 10.9.4.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20949.197915] Lustre: Skipped 2 previous similar messages
            [20950.518986] Lustre: lustre-OST0006-osc-ffff8af4b4acf800: Connection restored to 10.9.4.103@tcp (at 10.9.4.103@tcp)
            [20951.857168] Lustre: 22503:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1590861908/real 1590861908] req@0000000024c2c58e x1668132478064512/t0(0) o103->lustre-OST0002-osc-ffff8af4b4acf800@10.9.4.103@tcp:17/18 lens 328/224 e 0 to 1 dl 1590861915 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
lookup_fast+0xe8/0x320 [21629.284020] walk_component+0x1c3/0x470 [21629.284744] ? __inode_permission+0x5b/0x160 [21629.285525] link_path_walk+0x1bc/0x5a0 [21629.286245] ? path_init+0x180/0x310 [21629.286927] path_openat+0xe1/0x1960 [21629.287634] ? __slab_free+0x14d/0x2c0 [21629.288350] ? ___slab_alloc+0x20a/0x4b0 [21629.289083] ? ___slab_alloc+0x20a/0x4b0 [21629.289814] ? __mutex_unlock_slowpath.isra.11+0xcc/0x120 [21629.290771] do_filp_open+0x9b/0x110 [21629.291451] ? __check_object_size+0xc8/0x1b0 [21629.292260] ? __alloc_fd+0xb2/0x170 [21629.292942] do_sys_open+0x1bb/0x2c0 [21629.293626] ? do_sys_open+0x1bb/0x2c0 [21629.294337] SyS_openat+0x14/0x20 [21629.294983] do_syscall_64+0x73/0x130 [21629.295689] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [21629.296591] RIP: 0033:0x7fe6e6f97db1 [21629.297270] RSP: 002b:00007ffd9c06d600 EFLAGS: 00000287 ORIG_RAX: 0000000000000101 [21629.298568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe6e6f97db1 [21629.299807] RDX: 0000000000090800 RSI: 000055d903b6bc40 RDI: 00000000ffffff9c [21629.301034] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [21629.302261] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fe6e76bced8 [21629.303486] R13: 0000000000000003 R14: 000055d903b6bc40 R15: 000055d9026e3c4a
            jamesanunez James Nunez (Inactive) added a comment - - edited

            Another hang, this time in conf-sanity test 123ab, which looks like the hang described here. See https://testing.whamcloud.com/test_sets/ca9b61c0-42c6-11ea-a84d-52540065bddc
            and sanity-quota test 7c at https://testing.whamcloud.com/test_sets/d540e712-42c6-11ea-a84d-52540065bddc


            jamesanunez James Nunez (Inactive) added a comment -

            There’s a new sanityn test 16b timeout with the same bind error described here, but in this run 16a also logged errors; 16b then hit errors and hung. The stack trace looks like it could be related to LU-12096 or LU-11644; see https://testing.whamcloud.com/test_sets/4bfd1d92-3414-11ea-b1e8-52540065bddc .

            In the client1 (vm9) console log, we see the bind error

            [21432.883724] LustreError: 22845:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000b5145529
            [21465.728594] LNetError: 22845:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.65@tcp[2], ip 10.9.4.65:1022, with error, wanted: 239240, left: 239240, last alive is 2 secs ago
            [21465.732759] LNetError: 22845:0:(socklnd.c:1665:ksocknal_destroy_conn()) Skipped 1 previous similar message
            [21465.734834] LustreError: 22845:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000723c0fc1
            [21501.848952] Lustre: 22854:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1578668549/real 1578668549]  req@000000008ce8400b x1655352734416640/t0(0) o3->lustre-OST0005-osc-ffff8a4ebc33c800@10.9.4.65@tcp:6/4 lens 504/440 e 0 to 1 dl 1578668594 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            [21501.856614] Lustre: 22854:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
            [21501.858438] Lustre: lustre-OST0005-osc-ffff8a4ebc33c800: Connection to lustre-OST0005 (at 10.9.4.65@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21501.861445] Lustre: Skipped 18 previous similar messages
            [21501.863516] Lustre: lustre-OST0005-osc-ffff8a4ebc33c800: Connection restored to 10.9.4.65@tcp (at 10.9.4.65@tcp)
            [21501.865478] Lustre: Skipped 18 previous similar messages
            [21522.211547] LNetError: 22845:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.65@tcp[2], ip 10.9.4.65:7988, with error, wanted: 879256, left: 879256, last alive is 1 secs ago
            [21522.215414] LustreError: 22845:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000e7a9497f
            [21577.838123] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
            [21579.169621] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
            [21579.256992] Lustre: DEBUG MARKER: dmesg
            [21579.948700] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 15:04:30 \(1578668670\)
            [21580.151697] Lustre: DEBUG MARKER: == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 15:04:30 (1578668670)
            [21580.327367] Lustre: lustre-OST0002-osc-ffff8a4ebc33c800: reconnect after 658s idle
            [21580.328913] Lustre: Skipped 4 previous similar messages
            [21604.197657] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21604.199563] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.66@tcp at host 10.9.4.66 on port 7988
            [21604.201337] LNetError: 22842:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.4.63@tcp added to recovery queue. Health = 900
            [21604.203389] Lustre: 22852:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1578668695/real 1578668695]  req@000000005e133c6a x1655352736793728/t0(0) o400->lustre-MDT0000-mdc-ffff8a4ebc33c800@10.9.4.66@tcp:12/10 lens 224/224 e 0 to 1 dl 1578668703 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [21604.203400] LNetError: 22844:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 900
            [21604.208327] Lustre: 22852:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [21604.208349] Lustre: lustre-MDT0000-mdc-ffff8a4ebc33c800: Connection to lustre-MDT0000 (at 10.9.4.66@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [21604.215094] Lustre: Skipped 1 previous similar message
            [21604.216570] LustreError: 166-1: MGC10.9.4.66@tcp: Connection to MGS (at 10.9.4.66@tcp) was lost; in progress operations using this service will fail
            [21604.869062] LNetError: 22843:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 900
            [21604.871333] LNetError: 22843:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
            [21604.873280] LNetError: 22840:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21604.874992] LNetError: 22840:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
            [21604.876633] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.66@tcp at host 10.9.4.66 on port 7988
            [21604.878382] LNetError: Skipped 1 previous similar message
            [21609.315431] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21609.319993] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
            [21609.325958] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
            [21614.432546] LNetError: 22844:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21614.432616] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
            [21614.434629] LNetError: 22844:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages
            [21614.436474] LNetError: Skipped 1 previous similar message
            [21616.127415] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 0
            [21624.667470] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21624.667552] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
            [21624.669375] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
            [21624.671184] LNetError: Skipped 1 previous similar message
            [21626.362304] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 0
            [21626.364592] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
            [21636.597164] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 0
            [21636.599721] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
            [21640.019759] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21640.019846] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
            [21640.021612] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
            [21646.832033] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
            [21646.834279] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
            [21671.395744] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
            [21671.398011] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 4 previous similar messages
            [21675.841850] LNetError: 22843:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21675.841929] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
            [21675.843806] LNetError: 22843:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
            [21675.845642] LNetError: Skipped 1 previous similar message
            [21712.335275] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
            [21712.337808] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 10 previous similar messages
            [21737.251123] LNetError: 22844:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
            [21737.252988] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.66@tcp at host 10.9.4.66 on port 7988
            [21777.838542] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
            [21777.840809] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 16 previous similar messages
            [21782.284271] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
            [21794.566181] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
            

            In the OSS (vm11) console log, we see

            [21394.252859] Lustre: DEBUG MARKER: == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 15:04:30 (1578668670)
            [21418.232657] Lustre: 26039:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1578668688/real 1578668688]  req@ffff8d25d6c79200 x1655331512173248/t0(0) o106->lustre-OST0000@10.9.4.63@tcp:15/16 lens 296/280 e 0 to 1 dl 1578668695 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
            [21418.237168] Lustre: 26039:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [21423.127118] LNetError: 18170:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.4.65@tcp added to recovery queue. Health = 900
            [21451.276843] LNet: Service thread pid 26039 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [21451.280168] Pid: 26039, comm: ll_ost00_002 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [21451.281965] Call Trace:
            [21451.282523]  [<ffffffffc0da5f50>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
            [21451.284388]  [<ffffffffc0d63115>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [21451.285726]  [<ffffffffc0d8454b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
            [21451.286997]  [<ffffffffc1144a6b>] ofd_intent_policy+0x69b/0x920 [ofd]
            [21451.288323]  [<ffffffffc0d63e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
            [21451.289629]  [<ffffffffc0d8c506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
            [21451.291059]  [<ffffffffc0e14cf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [21451.292304]  [<ffffffffc0e1bb0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [21451.293727]  [<ffffffffc0dc046b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [21451.295166]  [<ffffffffc0dc3dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [21451.296403]  [<ffffffff9f0c50d1>] kthread+0xd1/0xe0
            [21451.297490]  [<ffffffff9f78cd37>] ret_from_fork_nospec_end+0x0/0x39
            [21451.298750]  [<ffffffffffffffff>] 0xffffffffffffffff
            [21451.299939] LustreError: dumping log to /tmp/lustre-log.1578668728.26039
            [21459.501026] Lustre: lustre-OST0005: haven't heard from client 71ae0a4f-625a-5ffd-4f51-bf7ad010a244 (at 10.9.4.63@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff8d25e3e53c00, cur 1578668736 expire 1578668706 last 1578668687
            [21459.504814] Lustre: Skipped 1 previous similar message
            [21461.068032] Lustre: lustre-OST0000: haven't heard from client 71ae0a4f-625a-5ffd-4f51-bf7ad010a244 (at 10.9.4.63@tcp) in 51 seconds. I think it's dead, and I am evicting it. exp ffff8d25ef7bc000, cur 1578668738 expire 1578668708 last 1578668687
            
            jamesanunez James Nunez (Inactive) added a comment - - edited

            We're seeing the same error in sanity-quota test 8 on Ubuntu 18.04 with 2.12.1 RC1; https://testing.whamcloud.com/test_sets/3f576074-6661-11e9-aeec-52540065bddc

            and in parallel-scale test connectathon: https://testing.whamcloud.com/test_sets/47d9e6ea-6661-11e9-aeec-52540065bddc


            adilger Andreas Dilger added a comment -

            This looks like it might be an environment problem. Port 1023 is the first reserved port, which should be available to all root/kernel threads. If there is a problem using this port, then either some other thread grabbed it, or something like portreserve is running and consuming all of the reserved ports below 1024.
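            For reference, the -99 in the lnet_sock_create() messages is a negated Linux errno, i.e. EADDRNOTAVAIL (99): no local reserved port could be bound. A quick way to check the "something is eating the reserved ports" theory on an affected client is to walk the privileged port range and attempt a bind on each, roughly the way the socklnd connect path walks down from port 1023. The sketch below is a standalone diagnostic, not Lustre code; run it as root, since binding ports below 1024 without CAP_NET_BIND_SERVICE reports EACCES for every port.

            ```python
            import errno
            import socket

            # Sanity check: -99 in the LNet log corresponds to EADDRNOTAVAIL on Linux.
            assert errno.errorcode[99] == "EADDRNOTAVAIL"

            def scan_reserved_ports(low=512, high=1023, addr="0.0.0.0"):
                """Attempt to bind each reserved port in [low, high], highest first
                (mimicking the descending walk in the LNet TCP connect path).
                Returns a dict mapping each port that could NOT be bound to the
                name of the errno it failed with (e.g. EADDRINUSE, EACCES)."""
                unavailable = {}
                for port in range(high, low - 1, -1):
                    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                    try:
                        s.bind((addr, port))
                    except OSError as e:
                        unavailable[port] = errno.errorcode.get(e.errno, str(e.errno))
                    finally:
                        s.close()
                return unavailable

            if __name__ == "__main__":
                busy = scan_reserved_ports()
                total = 1023 - 512 + 1
                print(f"{len(busy)} of {total} reserved ports unavailable")
                for port, name in sorted(busy.items()):
                    print(f"  port {port}: {name}")
            ```

            If every port in the range comes back EADDRINUSE, a port-reservation daemon or a mass of stuck connections has exhausted the reserved range, which would produce exactly the bind failures seen above.
            
            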


            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)