[LU-11334] LNet health check failing with ksocknal_tx_done()) tx failure rc = -113, hstatus = 2 Created: 05/Sep/18 Updated: 19/Dec/18 Resolved: 06/Sep/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Latest tip of lustre-release using tcp for LNet |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While attempting to run the latest Lustre I see the following errors:
[66851.686185] LNet: Added LNI 172.30.224.9@tcp [8/256/0/180]
[66851.686267] LNet: Accept secure, port 988
[66851.760204] LNetError: 50758:0:(socklnd_cb.c:414:ksocknal_tx_done()) tx failure rc = -113, hstatus = 2
[66851.760533] LNetError: 50789:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-5, 0)
[66912.155561] LustreError: 15f-b: lustre-MDT0000: cannot register this server with the MGS: rc = -110. Is the MGS running?
[66912.190968] LustreError: 50624:0:(obd_mount_server.c:1939:server_fill_super()) Unable to start targets: -110
[66912.191191] LustreError: 50624:0:(obd_mount_server.c:1589:server_put_super()) no obd lustre-MDT0000
[66912.191277] LustreError: 50624:0:(obd_mount_server.c:132:server_deregister_mount()) lustre-MDT0000 not registered
[66912.193716] Lustre: server umount lustre-MDT0000 complete |
| Comments |
| Comment by Amir Shehata (Inactive) [ 05/Sep/18 ] |
|
I believe this patch should resolve the logging you're seeing: https://review.whamcloud.com/#/c/33096/ However, there does seem to be a legitimate problem that is causing the connection to fail with -EHOSTUNREACH. |
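For reference, the return codes in the description can be decoded by hand. The following is a minimal sketch, not part of Lustre or of the patch above. The `decode_rc` helper and the `LINUX_ERRNO` table are hypothetical names introduced for illustration, and the table hard-codes the Linux errno values rather than reading them from the running platform.

```python
import re

# Linux errno values, hard-coded (an assumption of this sketch) so the
# result does not depend on the platform running the script.
LINUX_ERRNO = {
    110: ("ETIMEDOUT", "Connection timed out"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches the "rc = -N" form used in the LNet/Lustre messages above.
RC_PATTERN = re.compile(r"rc = (-\d+)")


def decode_rc(line):
    """Return (rc, name, description) for the first 'rc = -N' in a log line.

    Returns None when the line carries no 'rc = -N' field.
    """
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in table"))
    return rc, name, text
```

Applied to the messages in the description, `rc = -113` decodes to EHOSTUNREACH (the connection failure) and `rc = -110` to ETIMEDOUT (the later failure to register with the MGS).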
| Comment by James A Simmons [ 05/Sep/18 ] |
|
A normal ping works, so the network is reachable. You are right that lnetctl ping does not, so LNet is unable to reach the MGS server.
00000400:00000080:76.0F:1536169212.948392:1584:88313:0:(module.c:120:libcfs_ioctl()) libcfs ioctl cmd 3221775678
00000400:00000100:55.0:1536169212.948714:1744:50759:0:(lib-socket.c:600:lnet_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 172.30.224.8/988
00000400:00000100:55.0:1536169212.948724:1744:50759:0:(acceptor.c:112:lnet_connect_console_error()) Connection to 172.30.224.8@tcp at host 172.30.224.8 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
00000800:00000100:55.0:1536169212.948729:1680:50759:0:(socklnd_cb.c:435:ksocknal_txlist_done()) Deleting packet type 2 len 0 172.30.224.9@tcp->172.30.224.8@tcp
00000800:00020000:55.0:1536169212.948731:1872:50759:0:(socklnd_cb.c:414:ksocknal_tx_done()) tx failure rc = -113, hstatus = 2
00000400:00000100:55.0:1536169212.948734:2352:50759:0:(lib-msg.c:719:lnet_health_check()) msg 0@<0:0>->172.30.224.8@tcp exceeded retry count 0 |
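Debug log lines like the ones in this comment are easier to filter once split into fields. The sketch below is an illustration only. The field names (`subsys`, `mask`, `cpu`, `stack`, `pid`, `extpid`) are assumptions about what each colon-separated column holds, and `parse_debug_line` is a hypothetical helper, not a Lustre tool.

```python
import re
from datetime import datetime, timezone

# Column names are assumptions of this sketch; only the overall shape
# (colon-separated prefix, then "(file:line:func())", then the message)
# is taken from the lines quoted in this ticket.
DEBUG_LINE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<stamp>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)


def parse_debug_line(text):
    """Split one debug log line into a dict of fields, or return None."""
    match = DEBUG_LINE.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["line"] = int(record["line"])
    # The fourth column looks like seconds since the Unix epoch.
    record["utc"] = datetime.fromtimestamp(float(record["stamp"]), tz=timezone.utc)
    return record
```

With this, the failing call can be picked out by function name (for example `record["func"] == "lnet_sock_connect"`) rather than by reading the raw text.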
| Comment by Amir Shehata (Inactive) [ 05/Sep/18 ] |
|
Health is off by default, so it won't try to resend; that's what the "exceeded retry count 0" means. It looks like port 988 may be blocked? |
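One way to test the blocked-port theory independently of LNet is a plain TCP connect to the acceptor port. The following is a minimal sketch under stated assumptions: `probe_tcp` is a hypothetical helper, it connects from an ordinary unprivileged source port, and it only shows whether the TCP handshake completes. It says nothing about whether the LNet acceptor would then accept the connection.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connect; return 0 on success or an errno value on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return 0
    except socket.timeout:
        # No reply at all within the timeout, e.g. packets silently dropped.
        return errno.ETIMEDOUT
    except OSError as exc:
        # e.g. ECONNREFUSED or EHOSTUNREACH, as reported by the kernel.
        return exc.errno
    finally:
        sock.close()
```

For the addresses in this ticket the call would be `probe_tcp("172.30.224.8", 988)`. A result of `errno.EHOSTUNREACH` while ICMP ping succeeds would be consistent with a firewall rejecting the port with an ICMP error, though that reading is an inference and not something the logs confirm.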
| Comment by James A Simmons [ 06/Sep/18 ] |
|
Duplicate of The port was also blocked. |