== sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411)
CMD: trevis-19vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 1.33143 s, 5.5 MB/s
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 4.56728 s, 1.6 MB/s
Chance of close/open is 1 in 50
Seed set to 4417
fd 0: /mnt/lustre/f16a.sanityn
fd 1: /mnt/lustre2/f16a.sanityn
truncating to largest ever: 0x1dfe5a
truncating to largest ever: 0x7c61ac
truncating to largest ever: 0x94abe7
truncating to largest ever: 0x9d6e31
truncating to largest ever: 0x9f1422
000500[0] 1544484425.495566 trunc from 0x9d1bf7 to 0x11cd53e (0x7fb948 bytes)
truncating to largest ever: 0x9fe783
001000[0] 1544484432.787479 mapread 0x138279 thru 0x143e81 (0xbc09 bytes)
001500[1] 1544484440.835660 mapread 0x28b349 thru 0x2913ee (0x60a6 bytes)
002000[0] 1544484448.190155 write 0x780704 thru 0x78dd5e (0xd65b bytes)
truncating to largest ever: 0x9ffdb6
002500[1] 1544484455.762469 read 0x46c5ad thru 0x47a81f (0xe273 bytes)
003000[1] 1544484463.442172 trunc from 0xa00000 to 0xd0df7b (0x30df7c bytes)
003500[1] 1544484471.490671 trunc from 0x90cfda to 0xbf7075 (0x2ea09c bytes)
004000[1] 1544484478.753447 mapwrite 0x9bac71 thru 0x9bf0da (0x446a bytes)
004500[1] 1544484486.839306 write 0x91ee3d thru 0x92273d (0x3901 bytes)
005000[1] 1544484494.153333 mapwrite 0x68aacb thru 0x693f4f (0x9485 bytes)
005500[0] 1544484502.221259 trunc from 0x8ba894 to 0x9956b0 (0xdae1d bytes)
006000[0] 1544484509.642964 trunc from 0x63a0f3 to 0x7482ff (0x10e20d bytes)
006500[0] 1544484517.988496 read 0x598d97 thru 0x59c2f5 (0x355f bytes)
007000[0] 1544484525.569919 mapwrite 0x71c288 thru 0x71c9f7 (0x770 bytes)
007500[0] 1544484534.173892 read 0x2caf46 thru 0x2d1ba2 (0x6c5d bytes)
008000[0] 1544484541.613265 write 0x410bb8 thru 0x418b7e (0x7fc7 bytes)
008500[0] 1544484548.359786 read 0x3f4704 thru 0x403779 (0xf076 bytes)
009000[1] 1544484556.654503 trunc from 0x70d8ef to 0xf2ace2 (0x81d3f4 bytes)
009500[1] 1544484565.396798 mapread 0x08576a thru 0x08af61 (0x57f8 bytes)
010000[0] 1544484573.036642 mapread 0x071104 thru 0x07b937 (0xa834 bytes)
010500[1] 1544484581.018787 read 0x7a1f9e thru 0x7af1e9 (0xd24c bytes)
011000[0] 1544484588.977036 write 0x69fc53 thru 0x6a0f6b (0x1319 bytes)
011500[0] 1544484597.183378 write 0x7e280e thru 0x7f197b (0xf16e bytes)
012000[0] 1544484604.970948 write 0x754ee2 thru 0x75fb66 (0xac85 bytes)
012500[0] 1544484612.966799 mapread 0x0cb21f thru 0x0d1823 (0x6605 bytes)
All operations completed A-OK!
Chance of close/open is 1 in 50
Seed set to 4612
fd 0: /mnt/lustre/f16a.sanityn
fd 1: /mnt/lustre2/f16a.sanityn
skipping zero size read
truncating to largest ever: 0x2bff39
truncating to largest ever: 0x732365
truncating to largest ever: 0x96ec16
The second fsx run starts, but both clients then have network problems. Looking at the console log for Client 1 (vm1), we see network errors until the clients time out:
[21386.685578] Lustre: DEBUG MARKER: == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 23:26:51 (1544484411)
[21607.180560] Lustre: 10756:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484625/real 1544484625] req@00000000d0fa8236 x1619508903774448/t0(0) o4->lustre-OST0006-osc-0000000057afebbd@10.9.4.226@tcp:6/4 lens 488/448 e 0 to 1 dl 1544484632 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[21607.183647] Lustre: lustre-OST0006-osc-0000000057afebbd: Connection to lustre-OST0006 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21612.300526] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21612.301698] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21612.302806] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1544484637/real 1544484637] req@00000000013babff x1619508903775008/t0(0) o400->MGC10.9.4.227@tcp@10.9.4.227@tcp:26/25 lens 224/224 e 0 to 1 dl 1544484681 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[21612.305632] LustreError: 166-1: MGC10.9.4.227@tcp: Connection to MGS (at 10.9.4.227@tcp) was lost; in progress operations using this service will fail
[21612.307127] Lustre: lustre-OST0000-osc-00000000c6515b26: Connection to lustre-OST0000 (at 10.9.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21613.964225] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630] req@00000000cfba67eb x1619508903774608/t0(0) o400->lustre-MDT0000-mdc-0000000057afebbd@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484638 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[21613.967218] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
[21616.396101] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484632/real 1544484632] req@00000000d57afc11 x1619508903774752/t0(0) o400->lustre-MDT0000-mdc-00000000c6515b26@10.9.4.227@tcp:12/10 lens 224/224 e 0 to 1 dl 1544484641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[21616.399086] Lustre: 10755:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[21617.420470] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21617.421604] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
[21617.422604] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21617.423647] LNetError: Skipped 1 previous similar message
[21622.540157] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21622.541290] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
[21622.542280] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21622.543328] LNetError: Skipped 1 previous similar message
[21627.660041] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21627.661169] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
[21627.662161] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21627.663208] LNetError: Skipped 1 previous similar message
[21637.899461] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21637.899548] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
[21637.899550] LNetError: Skipped 1 previous similar message
[21637.902243] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 2 previous similar messages
[21649.418594] Lustre: 10753:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544484630/real 1544484630] req@00000000756c150f x1619508903774512/t0(0) o400->lustre-OST0001-osc-00000000c6515b26@10.9.4.226@tcp:28/4 lens 224/224 e 0 to 1 dl 1544484674 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[21653.258745] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21653.258833] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
[21653.258835] LNetError: Skipped 1 previous similar message
[21653.261492] LNetError: 10740:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
[21689.097080] LNetError: 10744:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21689.098237] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21689.099278] LNetError: Skipped 1 previous similar message
[21750.534239] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21750.535386] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 1 previous similar message
[21750.536380] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21750.537421] LNetError: Skipped 1 previous similar message
[21784.324408] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21798.339708] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
[21802.371510] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21805.379406] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
[21873.408580] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[21873.409720] LNetError: 10743:0:(lib-socket.c:410:lnet_sock_create()) Skipped 3 previous similar messages
[21873.410703] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.227@tcp at host 10.9.4.227 on port 7988
[21873.411746] LNetError: Skipped 3 previous similar messages
[21978.811391] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
[21983.995128] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
[21985.851057] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
[21990.842881] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
[22057.720105] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Error trying to bind to port 1023: -99
[22057.720179] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.226@tcp at host 10.9.4.226 on port 7988
[22057.720181] LNetError: Skipped 5 previous similar messages
[22057.722918] LNetError: 10745:0:(lib-socket.c:410:lnet_sock_create()) Skipped 6 previous similar messages
[22159.283076] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
Alena Nikitenko (Inactive)
added a comment - Saw something similar in test_7f of replay-vbr on 2.12.8: https://testing.whamcloud.com/test_sets/098c125f-9e10-43ac-b73c-baafe9bbe86f
Ubuntu 18.04, clients having similar network issues:
[21608.094576] LNetError: 2319:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21608.096826] LNetError: 11e-e: Unexpected error -99 connecting to 10.240.29.171@tcp at host 10.240.29.171 on port 7988
[21613.214389] LNetError: 2322:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21613.216436] LNetError: 11e-e: Unexpected error -99 connecting to 10.240.29.171@tcp at host 10.240.29.171 on port 7988
[21617.306228] LustreError: 2330:0:(client.c:2997:ptlrpc_replay_interpret()) @@@ request replay timed out.
...
[22366.405833] nfs: server 10.240.16.204 not responding, timed out
[22380.421440] nfs: server 10.240.16.204 not responding, timed out
[22547.328926] nfs: server 10.240.16.204 not responding, timed out
[22565.760422] nfs: server 10.240.16.204 not responding, timed out
[22728.315978] nfs: server 10.240.16.204 not responding, timed out
[22751.355430] nfs: server 10.240.16.204 not responding, timed out
[22909.047116] nfs: server 10.240.16.204 not responding, timed out
James Nunez (Inactive)
added a comment - James - Here's a similar failure at https://testing.whamcloud.com/test_sets/4897bd60-d3a8-41fe-89db-1d13f25b2e92. Do you think it is the same as reported here?
James Nunez (Inactive)
added a comment - Another hang, this time for conf-sanity test 123ab, that looks like the hang described here. See https://testing.whamcloud.com/test_sets/ca9b61c0-42c6-11ea-a84d-52540065bddc
and sanity-quota test 7c at https://testing.whamcloud.com/test_sets/d540e712-42c6-11ea-a84d-52540065bddc
James Nunez (Inactive)
added a comment - There's a new sanityn test 16b timeout with the same bind error as described here, but in this case test 16a hits errors first, and then 16b hits errors and hangs. The stack trace here looks like it could be related to LU-12096 or LU-11644; https://testing.whamcloud.com/test_sets/4bfd1d92-3414-11ea-b1e8-52540065bddc
In the client1 (vm9) console log, we see the bind error
[21432.883724] LustreError: 22845:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000b5145529
[21465.728594] LNetError: 22845:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.65@tcp[2], ip 10.9.4.65:1022, with error, wanted: 239240, left: 239240, last alive is 2 secs ago
[21465.732759] LNetError: 22845:0:(socklnd.c:1665:ksocknal_destroy_conn()) Skipped 1 previous similar message
[21465.734834] LustreError: 22845:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000723c0fc1
[21501.848952] Lustre: 22854:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1578668549/real 1578668549] req@000000008ce8400b x1655352734416640/t0(0) o3->lustre-OST0005-osc-ffff8a4ebc33c800@10.9.4.65@tcp:6/4 lens 504/440 e 0 to 1 dl 1578668594 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[21501.856614] Lustre: 22854:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
[21501.858438] Lustre: lustre-OST0005-osc-ffff8a4ebc33c800: Connection to lustre-OST0005 (at 10.9.4.65@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21501.861445] Lustre: Skipped 18 previous similar messages
[21501.863516] Lustre: lustre-OST0005-osc-ffff8a4ebc33c800: Connection restored to 10.9.4.65@tcp (at 10.9.4.65@tcp)
[21501.865478] Lustre: Skipped 18 previous similar messages
[21522.211547] LNetError: 22845:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.65@tcp[2], ip 10.9.4.65:7988, with error, wanted: 879256, left: 879256, last alive is 1 secs ago
[21522.215414] LustreError: 22845:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 00000000e7a9497f
[21577.838123] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[21579.169621] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[21579.256992] Lustre: DEBUG MARKER: dmesg
[21579.948700] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 15:04:30 \(1578668670\)
[21580.151697] Lustre: DEBUG MARKER: == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 15:04:30 (1578668670)
[21580.327367] Lustre: lustre-OST0002-osc-ffff8a4ebc33c800: reconnect after 658s idle
[21580.328913] Lustre: Skipped 4 previous similar messages
[21604.197657] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21604.199563] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.66@tcp at host 10.9.4.66 on port 7988
[21604.201337] LNetError: 22842:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.4.63@tcp added to recovery queue. Health = 900
[21604.203389] Lustre: 22852:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1578668695/real 1578668695] req@000000005e133c6a x1655352736793728/t0(0) o400->lustre-MDT0000-mdc-ffff8a4ebc33c800@10.9.4.66@tcp:12/10 lens 224/224 e 0 to 1 dl 1578668703 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[21604.203400] LNetError: 22844:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 900
[21604.208327] Lustre: 22852:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[21604.208349] Lustre: lustre-MDT0000-mdc-ffff8a4ebc33c800: Connection to lustre-MDT0000 (at 10.9.4.66@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21604.215094] Lustre: Skipped 1 previous similar message
[21604.216570] LustreError: 166-1: MGC10.9.4.66@tcp: Connection to MGS (at 10.9.4.66@tcp) was lost; in progress operations using this service will fail
[21604.869062] LNetError: 22843:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 900
[21604.871333] LNetError: 22843:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[21604.873280] LNetError: 22840:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21604.874992] LNetError: 22840:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
[21604.876633] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.66@tcp at host 10.9.4.66 on port 7988
[21604.878382] LNetError: Skipped 1 previous similar message
[21609.315431] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21609.319993] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
[21609.325958] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
[21614.432546] LNetError: 22844:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21614.432616] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
[21614.434629] LNetError: 22844:0:(lib-socket.c:218:lnet_sock_create()) Skipped 2 previous similar messages
[21614.436474] LNetError: Skipped 1 previous similar message
[21616.127415] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 0
[21624.667470] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21624.667552] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
[21624.669375] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
[21624.671184] LNetError: Skipped 1 previous similar message
[21626.362304] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 0
[21626.364592] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[21636.597164] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.65@tcp added to recovery queue. Health = 0
[21636.599721] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[21640.019759] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21640.019846] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
[21640.021612] LNetError: 22842:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
[21646.832033] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
[21646.834279] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[21671.395744] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
[21671.398011] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 4 previous similar messages
[21675.841850] LNetError: 22843:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21675.841929] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.65@tcp at host 10.9.4.65 on port 7988
[21675.843806] LNetError: 22843:0:(lib-socket.c:218:lnet_sock_create()) Skipped 1 previous similar message
[21675.845642] LNetError: Skipped 1 previous similar message
[21712.335275] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
[21712.337808] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 10 previous similar messages
[21737.251123] LNetError: 22844:0:(lib-socket.c:218:lnet_sock_create()) Error trying to bind to port 1023: -99
[21737.252988] LNetError: 11e-e: Unexpected error -99 connecting to 10.9.4.66@tcp at host 10.9.4.66 on port 7988
[21777.838542] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.63@tcp added to recovery queue. Health = 0
[21777.840809] LNetError: 22849:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 16 previous similar messages
[21782.284271] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21794.566181] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
In the OSS (vm11) console log, we see
[21394.252859] Lustre: DEBUG MARKER: == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 15:04:30 (1578668670)
[21418.232657] Lustre: 26039:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1578668688/real 1578668688] req@ffff8d25d6c79200 x1655331512173248/t0(0) o106->lustre-OST0000@10.9.4.63@tcp:15/16 lens 296/280 e 0 to 1 dl 1578668695 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[21418.237168] Lustre: 26039:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[21423.127118] LNetError: 18170:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.4.65@tcp added to recovery queue. Health = 900
[21451.276843] LNet: Service thread pid 26039 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[21451.280168] Pid: 26039, comm: ll_ost00_002 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[21451.281965] Call Trace:
[21451.282523] [<ffffffffc0da5f50>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
[21451.284388] [<ffffffffc0d63115>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[21451.285726] [<ffffffffc0d8454b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
[21451.286997] [<ffffffffc1144a6b>] ofd_intent_policy+0x69b/0x920 [ofd]
[21451.288323] [<ffffffffc0d63e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[21451.289629] [<ffffffffc0d8c506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[21451.291059] [<ffffffffc0e14cf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[21451.292304] [<ffffffffc0e1bb0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[21451.293727] [<ffffffffc0dc046b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[21451.295166] [<ffffffffc0dc3dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[21451.296403] [<ffffffff9f0c50d1>] kthread+0xd1/0xe0
[21451.297490] [<ffffffff9f78cd37>] ret_from_fork_nospec_end+0x0/0x39
[21451.298750] [<ffffffffffffffff>] 0xffffffffffffffff
[21451.299939] LustreError: dumping log to /tmp/lustre-log.1578668728.26039
[21459.501026] Lustre: lustre-OST0005: haven't heard from client 71ae0a4f-625a-5ffd-4f51-bf7ad010a244 (at 10.9.4.63@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff8d25e3e53c00, cur 1578668736 expire 1578668706 last 1578668687
[21459.504814] Lustre: Skipped 1 previous similar message
[21461.068032] Lustre: lustre-OST0000: haven't heard from client 71ae0a4f-625a-5ffd-4f51-bf7ad010a244 (at 10.9.4.63@tcp) in 51 seconds. I think it's dead, and I am evicting it. exp ffff8d25ef7bc000, cur 1578668738 expire 1578668708 last 1578668687
James Nunez (Inactive)
added a comment - We're seeing the same error for sanity-quota test 8 on Ubuntu 18.04 with 2.12.1 RC1; https://testing.whamcloud.com/test_sets/3f576074-6661-11e9-aeec-52540065bddc
For parallel-scale test connectathon: https://testing.whamcloud.com/test_sets/47d9e6ea-6661-11e9-aeec-52540065bddc
Andreas Dilger
added a comment - This looks like it might be an environment problem? Port 1023 is the first reserved port, which should be available to all root/kernel threads. If there is a problem using this port then either some other thread grabbed that port, or there is something like portreserve running that is consuming all of the reserved ports below 1024.
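A quick way to check this theory on an affected client is to look at what is already bound to the reserved ports and whether a port-reservation service is active. This is only a diagnostic sketch; the portreserve service name and config path are assumptions about the test images, not something confirmed in these logs:
# errno 99 on Linux is EADDRNOTAVAIL, the -99 reported by lnet_sock_create().
# List sockets already bound to reserved source ports (< 1024); these are the
# ports LNet tries to bind to (the logs show it failing at 1023) when making
# outgoing connections to the acceptor port 7988.
ss -tanp 'sport < :1024'
# Check whether portreserve (or a similar service) is holding reserved ports.
systemctl status portreserve 2>/dev/null
ls /etc/portreserve/ 2>/dev/null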
I haven't seen this with newer Lustre versions on newer Ubuntu. We can reopen this ticket if it comes back.