[LU-11334] LNet health check failing with ksocknal_tx_done()) tx failure rc = -113, hstatus = 2 Created: 05/Sep/18  Updated: 19/Dec/18  Resolved: 06/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Latest tip of lustre-release using tcp for LNet


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While attempting to run the latest Lustre I see the following errors:

[66851.686185] LNet: Added LNI 172.30.224.9@tcp [8/256/0/180]

[66851.686267] LNet: Accept secure, port 988

[66851.760204] LNetError: 50758:0:(socklnd_cb.c:414:ksocknal_tx_done()) tx failure rc = -113, hstatus = 2

[66851.760533] LNetError: 50789:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-5, 0)

[66912.155561] LustreError: 15f-b: lustre-MDT0000: cannot register this server with the MGS: rc = -110. Is the MGS running?

[66912.190968] LustreError: 50624:0:(obd_mount_server.c:1939:server_fill_super()) Unable to start targets: -110

[66912.191191] LustreError: 50624:0:(obd_mount_server.c:1589:server_put_super()) no obd lustre-MDT0000

[66912.191277] LustreError: 50624:0:(obd_mount_server.c:132:server_deregister_mount()) lustre-MDT0000 not registered

[66912.193716] Lustre: server umount lustre-MDT0000 complete



 Comments   
Comment by Amir Shehata (Inactive) [ 05/Sep/18 ]

I believe this patch should resolve the logging you're seeing:

https://review.whamcloud.com/#/c/33096/

However, there does seem to be a legitimate problem that's causing the connection to fail with -EHOSTUNREACH.
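
For readers mapping the numeric return codes in the logs above to their names, here is a minimal sketch. The table is hard-coded with the Linux errno values for the three codes that appear in this ticket (-5, -110, -113); the helper name decode_rc and the regular expression are illustrative assumptions, not part of Lustre.

```python
import re

# Linux errno numbers for the codes seen in this ticket.
# Hard-coded on purpose: errno numbering differs between operating systems.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def decode_rc(line):
    """Return (rc, name, description) for every 'rc = -N' found in a log line.

    Hypothetical helper for illustration only.
    """
    decoded = []
    for match in re.finditer(r"rc = -(\d+)", line):
        number = int(match.group(1))
        name, description = LINUX_ERRNO.get(number, ("UNKNOWN", "not in table"))
        decoded.append((-number, name, description))
    return decoded


if __name__ == "__main__":
    sample = "(socklnd_cb.c:414:ksocknal_tx_done()) tx failure rc = -113, hstatus = 2"
    for rc, name, description in decode_rc(sample):
        print(f"rc = {rc}: {name} ({description})")
```

Under this table, the -113 from ksocknal_tx_done() is EHOSTUNREACH and the -110 from the MGS registration failure is ETIMEDOUT.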

Comment by James A Simmons [ 05/Sep/18 ]

A normal ping works, so the network is reachable. You are right that lnetctl ping does not, so LNet is unable to reach the MGS server.

00000400:00000080:76.0F:1536169212.948392:1584:88313:0:(module.c:120:libcfs_ioctl()) libcfs ioctl cmd 3221775678

00000400:00000100:55.0:1536169212.948714:1744:50759:0:(lib-socket.c:600:lnet_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 172.30.224.8/988

00000400:00000100:55.0:1536169212.948724:1744:50759:0:(acceptor.c:112:lnet_connect_console_error()) Connection to 172.30.224.8@tcp at host 172.30.224.8 was unreachable: the network or that node may be down, or Lustre may be misconfigured.

00000800:00000100:55.0:1536169212.948729:1680:50759:0:(socklnd_cb.c:435:ksocknal_txlist_done()) Deleting packet type 2 len 0 172.30.224.9@tcp->172.30.224.8@tcp

00000800:00020000:55.0:1536169212.948731:1872:50759:0:(socklnd_cb.c:414:ksocknal_tx_done()) tx failure rc = -113, hstatus = 2

00000400:00000100:55.0:1536169212.948734:2352:50759:0:(lib-msg.c:719:lnet_health_check()) msg 0@<0:0>->172.30.224.8@tcp exceeded retry count 0

Comment by Amir Shehata (Inactive) [ 05/Sep/18 ]

Health is off by default, so it won't try to resend. That's what the "exceeded retry count 0" means.

It looks like maybe the 988 port is blocked?
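
A blocked port can be checked independently of LNet with a plain TCP connect, since an ICMP ping succeeding says nothing about whether TCP port 988 is open. The following is a rough sketch using only the Python standard library. The function name tcp_port_open is hypothetical, and the address and port are taken from the debug log above. Note that it connects from an ordinary ephemeral source port, not the 1023 source port shown in the log, so it only tests basic reachability of the destination port.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Try a TCP connect and report (success, detail).

    Hypothetical helper for illustration; it does not speak the LNet protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Includes timeouts, refused connections and "No route to host".
        return False, str(exc)


if __name__ == "__main__":
    # MGS address and acceptor port as they appear in the log excerpt.
    ok, detail = tcp_port_open("172.30.224.8", 988)
    print("port 988 reachable" if ok else f"port 988 not reachable: {detail}")
```

If the connect fails with "No route to host" while ping succeeds, a firewall rule rejecting the port is one plausible cause, which would match the -113 seen in lnet_sock_connect().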

Comment by James A Simmons [ 06/Sep/18 ]

Duplicate of LU-11309.

The port was also blocked.
