Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7
-
Ubuntu 18.04 clients
-
3
-
9223372036854775807
Description
sanityn test_16a clients hang with nfs server time out and lnet_sock_create errors.
Looking at the client test_log from https://testing.whamcloud.com/test_sets/d6b02bd6-fdd4-11e8-a97c-52540065bddc , we see the test running as expected
== sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411) CMD: trevis-19vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize 7+0 records in 7+0 records out 7340032 bytes (7.3 MB, 7.0 MiB) copied, 1.33143 s, 5.5 MB/s 7+0 records in 7+0 records out 7340032 bytes (7.3 MB, 7.0 MiB) copied, 4.56728 s, 1.6 MB/s Chance of close/open is 1 in 50 Seed set to 4417 fd 0: /mnt/lustre/f16a.sanityn fd 1: /mnt/lustre2/f16a.sanityn truncating to largest ever: 0x1dfe5a truncating to largest ever: 0x7c61ac truncating to largest ever: 0x94abe7 truncating to largest ever: 0x9d6e31 truncating to largest ever: 0x9f1422 000500[0] 1544484425.495566 trunc from 0x9d1bf7 to 0x11cd53e (0x7fb948 bytes) truncating to largest ever: 0x9fe783 001000[0] 1544484432.787479 mapread 0x138279 thru 0x143e81 (0xbc09 bytes) 001500[1] 1544484440.835660 mapread 0x28b349 thru 0x2913ee (0x60a6 bytes) 002000[0] 1544484448.190155 write 0x780704 thru 0x78dd5e (0xd65b bytes) truncating to largest ever: 0x9ffdb6 002500[1] 1544484455.762469 read 0x46c5ad thru 0x47a81f (0xe273 bytes) 003000[1] 1544484463.442172 trunc from 0xa00000 to 0xd0df7b (0x30df7c bytes) 003500[1] 1544484471.490671 trunc from 0x90cfda to 0xbf7075 (0x2ea09c bytes) 004000[1] 1544484478.753447 mapwrite 0x9bac71 thru 0x9bf0da (0x446a bytes) 004500[1] 1544484486.839306 write 0x91ee3d thru 0x92273d (0x3901 bytes) 005000[1] 1544484494.153333 mapwrite 0x68aacb thru 0x693f4f (0x9485 bytes) 005500[0] 1544484502.221259 trunc from 0x8ba894 to 0x9956b0 (0xdae1d bytes) 006000[0] 1544484509.642964 trunc from 0x63a0f3 to 0x7482ff (0x10e20d bytes) 006500[0] 1544484517.988496 read 0x598d97 thru 0x59c2f5 (0x355f bytes) 007000[0] 1544484525.569919 mapwrite 0x71c288 thru 0x71c9f7 (0x770 bytes) 007500[0] 1544484534.173892 read 0x2caf46 thru 0x2d1ba2 (0x6c5d bytes) 008000[0] 1544484541.613265 write 0x410bb8 thru 0x418b7e (0x7fc7 bytes) 008500[0] 1544484548.359786 read 0x3f4704 thru 0x403779 (0xf076 bytes) 009000[1] 1544484556.654503 trunc from 0x70d8ef to 0xf2ace2 (0x81d3f4 bytes) 009500[1] 1544484565.396798 mapread 0x08576a thru 0x08af61 (0x57f8 bytes) 010000[0] 1544484573.036642 mapread 0x071104 thru 0x07b937 (0xa834 bytes) 010500[1] 1544484581.018787 read 0x7a1f9e thru 0x7af1e9 (0xd24c bytes) 011000[0] 1544484588.977036 write 0x69fc53 thru 0x6a0f6b (0x1319 bytes) 011500[0] 1544484597.183378 write 0x7e280e thru 0x7f197b (0xf16e bytes) 012000[0] 1544484604.970948 write 0x754ee2 thru 0x75fb66 (0xac85 bytes) 012500[0] 1544484612.966799 mapread 0x0cb21f thru 0x0d1823 (0x6605 bytes) All operations completed A-OK! Chance of close/open is 1 in 50 Seed set to 4612 fd 0: /mnt/lustre/f16a.sanityn fd 1: /mnt/lustre2/f16a.sanityn skipping zero size read truncating to largest ever: 0x2bff39 truncating to largest ever: 0x732365 truncating to largest ever: 0x96ec16
but both clients have network problems. Looking at the console log for Client 1 (vm1), we see network errors until the clients time out
[21386.685578] Lustre: DEBUG MARKER: == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411) [21607.180560] Lustre: 10756:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484625/real 1544484625] req@00000000d0fa8236 x1619508903774448/t0(0) o4->lustre-OST0006-osc-0000000057afebbd@10.9.4.226@tcp:6/4 lens 488/448 e 0 to 1 dl 1544484632 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [21607.183647] Lustre: lustre-OST0006-osc-0000000057afebbd: Connection to lustre-OST0006 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete [21612.300526] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21612.301698] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21612.302806] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1544484637/real 1544484637] req@00000000013babff x1619508903775008/t0(0) o400->MGC10.9.4.227@tcp@10.9.4.227@tcp:26/25 lens 224/224 e 0 to 1 dl 1544484681 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 [21612.305632] LustreError: 166-1: MGC10.9.4.227@tcp: Connection to MGS (at 10.9.4.227@tcp) was lost; in progress operations using this service will fail [21612.307127] Lustre: lustre-OST0000-osc-00000000c6515b26: Connection to lustre-OST0000 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete [21613.964225] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630] req@00000000cfba67eb x1619508903774608/t0(0) o400->lustre-MDT0000-mdc-0000000057afebbd@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484638 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21613.967218] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 15 previous similar messages [21616.396101] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484632/real 1544484632] req@00000000d57afc11 x1619508903774752/t0(0) o400->lustre-MDT0000-mdc-00000000c6515b26@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21616.399086] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages [21617.420470] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21617.421604] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message [21617.422604] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21617.423647] LNetError: Skipped 1 previous similar message [21622.540157] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21622.541290] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message [21622.542280] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21622.543328] LNetError: Skipped 1 previous similar message [21627.660041] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21627.661169] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message [21627.662161] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21627.663208] LNetError: Skipped 1 previous similar message [21637.899461] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21637.899548] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988 [21637.899550] LNetError: Skipped 1 previous similar message [21637.902243] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 2 previous similar messages [21649.418594] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630] req@00000000756c150f x1619508903774512/t0(0) o400->lustre-OST0001-osc-00000000c6515b26@10.9.4.226@tcp:28/4 lens 224/224 e 0 to 1 dl 1544484674 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [21653.258745] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21653.258833] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988 [21653.258835] LNetError: Skipped 1 previous similar message [21653.261492] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message [21689.097080] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21689.098237] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21689.099278] LNetError: Skipped 1 previous similar message [21750.534239] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21750.535386] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message [21750.536380] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21750.537421] LNetError: Skipped 1 previous similar message [21784.324408] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying [21798.339708] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out [21802.371510] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying [21805.379406] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out [21873.408580] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [21873.409720] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 3 previous similar messages [21873.410703] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988 [21873.411746] LNetError: Skipped 3 previous similar messages [21978.811391] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out [21983.995128] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out [21985.851057] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out [21990.842881] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out [22057.720105] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99 [22057.720179] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988 [22057.720181] LNetError: Skipped 5 previous similar messages [22057.722918] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 6 previous similar messages [22159.283076] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
I’ve only found one other failure
https://testing.whamcloud.com/test_sets/63c6b2f6-fdd5-11e8-93ea-52540065bddc
Attachments
Issue Links
- is related to
-
LU-12511 Prepare lustre for adoption into the linux kernel
- Open
-
LU-15781 Ubuntu 22.04 LTS release support
- Resolved
- is related to
-
LU-12849 sanityn test 30 hangs with client LNet errors
- Closed
- mentioned in
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...