lnet: improve error msg in lnet_sock_create()

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0

    Description

      The kernel_bind() call in lnet_sock_create() may fail due to a problem with either the local port or the local IP address, but the error message currently includes only the port. It would be helpful if the message included both items when reporting a fatal error.

      Background: We've encountered an issue where LNET had picked a virtual IP address (used for non-Lustre services) as its local_ip, and lnet_sock_create() would fail once the IP address was migrated to another node. The error message included only the port, not the IP address, so it took a while to correlate the events. Why LNET chose this particular source address is a separate question we need to investigate, but for starters, improving the error message to include all relevant information seems like a good idea to me.
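
A minimal userspace sketch of the kind of message the description asks for, assuming a hypothetical helper named format_bind_error() and a made-up message layout. It is not the patch that landed and does not reproduce its exact wording; it only shows that printing the address next to the port is cheap:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the Lustre tree): format a bind failure so
 * that both the local address and the local port are visible.
 * local_ip is in host byte order, as in lnet_sock_create().
 * Returns the snprintf() result, or -1 if the address cannot be formatted.
 */
static int format_bind_error(char *buf, size_t len, uint32_t local_ip,
                             int local_port, int rc)
{
        struct in_addr addr = { .s_addr = htonl(local_ip) };
        char ip[INET_ADDRSTRLEN];

        if (inet_ntop(AF_INET, &addr, ip, sizeof(ip)) == NULL)
                return -1;

        /* The "ip:port" layout is an assumption made for illustration. */
        return snprintf(buf, len, "Error trying to bind to %s:%d: %d",
                        ip, local_port, rc);
}
```

Called with local_ip 0xC0A80A05 (an invented address), port 1023 and rc -99, the helper produces "Error trying to bind to 192.168.10.5:1023: -99", which points at the address as well as the port.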

      Attachments

        Activity

          [LU-11112] lnet: improve error msg in lnet_sock_create()
          pjones Peter Jones added a comment -

          Landed for 2.16

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/32758/
          Subject: LU-11112 lnet: improve error msg in lnet_sock_create()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: cc85942a6d501bababfb34b275f1b5613086d118

          gerrit Gerrit Updater added a comment -

          Daniel Kobras (d.kobras@science-computing.de) uploaded a new patch: https://review.whamcloud.com/32758
          Subject: LU-11112 lnet: improve error msg in lnet_sock_create()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a0b382caf10e590f112030cc528e1c5fdd470390

          kobras Daniel Kobras (Inactive) added a comment -

          Hi Sonia!

          For the scope of this LU, my line of argument is simply:

          • lnet_sock_connect() can fail due to either of two arguments (local_ip and local_port);
          • the resulting error message includes only one of the arguments (local_port);
          • there exists at least one real-world case where it fails due to the other argument (local_ip);
          • hence the error message should be improved to include both arguments.

          The case given was only meant as an example to show that having the full information in the error output occasionally really matters.

          Why LNET chose to open a connection with a fixed source IP address rather than just using INADDR_ANY isn't clear to me yet. One should be able to reproduce it as follows:

          • assign an IP alias to an interface;
          • start LNET/Lustre;
          • remove the IP alias from the interface;

          but that topic is probably better suited to a separate LU (once we've collected more information about it).

          sharmaso Sonia Sharma (Inactive) added a comment -

          Hi Daniel

          In the lnet_sock_connect function, I see "INADDR_ANY" is assigned if local_ip == 0. With "INADDR_ANY" in the bind call, the socket will be bound to all the local interfaces.

                  if (local_ip != 0 || local_port != 0) {
                          memset(&locaddr, 0, sizeof(locaddr));
                          locaddr.sin_family = AF_INET;
                          locaddr.sin_port = htons(local_port);
                          locaddr.sin_addr.s_addr = (local_ip == 0) ?
                                                    INADDR_ANY : htonl(local_ip);

          Was this virtual address assigned to one of the interfaces on the node? It would help to know which particular action or command results in this error.

          Thanks
          Sonia
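
The address setup quoted in the comment above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: build_local_addr() is a hypothetical name, and local_ip and local_port are taken in host byte order. It makes the two cases explicit: a local_ip of 0 leaves the choice to the kernel, while any non-zero local_ip pins the bind to that one address, so on Linux the bind normally fails with EADDRNOTAVAIL once the address is no longer configured on the node.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace sketch of the local address setup. build_local_addr() is a
 * hypothetical name; local_ip and local_port are in host byte order.
 */
static void build_local_addr(struct sockaddr_in *locaddr,
                             uint32_t local_ip, uint16_t local_port)
{
        memset(locaddr, 0, sizeof(*locaddr));
        locaddr->sin_family = AF_INET;
        locaddr->sin_port = htons(local_port);
        /* 0 means "no preference": the socket may use any local address. */
        locaddr->sin_addr.s_addr = (local_ip == 0) ?
                                   htonl(INADDR_ANY) : htonl(local_ip);
}
```

The resulting structure can be passed to bind(); both fields end up in the error path, which is why a message that prints only the port can hide the actual cause.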

          knweiss Karsten Weiss added a comment -

          FWIW: This is the original lnet error message:

          LNetError: 2099:0:(lib-socket.c:455:lnet_sock_create()) Error trying to bind to port 1023: -99
          LNetError: 2099:0:(lib-socket.c:455:lnet_sock_create()) Skipped 8 previous similar messages
          LNetError: 11e-e: Unexpected error -99 connecting to 192.168.10.6@tcp at host 192.168.10.6 on port 988
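
For readers decoding the -99 in the log lines above: with the common Linux errno numbering (for example on x86), 99 is EADDRNOTAVAIL ("Cannot assign requested address"), which points at the local address rather than the port. The sketch below assumes a hypothetical helper, bind_rc_name(), that maps a negative kernel-style return code to a symbolic name for a few bind-related errors:

```c
#include <errno.h>

/*
 * Hypothetical helper: map a negative kernel-style return code to a
 * symbolic errno name for a few values relevant to bind().
 */
static const char *bind_rc_name(int rc)
{
        switch (-rc) {
        case EADDRNOTAVAIL:
                return "EADDRNOTAVAIL"; /* address is not local */
        case EADDRINUSE:
                return "EADDRINUSE";    /* address/port already in use */
        case EACCES:
                return "EACCES";        /* e.g. privileged port */
        default:
                return "unknown";
        }
}
```

Using the symbolic constants instead of literal numbers keeps the helper correct on architectures whose errno values differ.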

          kobras Daniel Kobras (Inactive) added a comment -

          It seems Gerrit has moved to a different IP address, and I cannot access it due to local firewall restrictions. Attaching the patch here while I try to sort things out.

          pjones Peter Jones added a comment -

          Daniel

          Could you please push your proposed patch into Gerrit so it can be reviewed/landed?

          Peter

          pjones Peter Jones added a comment -

          Sonia

          Could you please investigate?

          Thanks

          Peter


          People

            sharmaso Sonia Sharma (Inactive)
            kobras Daniel Kobras (Inactive)