Lustre / LU-153

Clients cannot connect to servers with 2 IB cards until "lctl ping" is done from server to clients

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • Lustre 2.0.0
    • None
    • RHEL 6.0 GA, OFED 1.5.2, Lustre 2.0.0.1, Mellanox QDR IB cards

    Description

      Clients cannot connect to the server interfaces when the servers have two IB cards (and two LNets) configured. Our workaround is to run "lctl ping" from the servers to both LNets on every client; after that, the clients are able to connect to the servers.

      Once the clients have mounted the file system, the problem shows up when we run "df -h /lustre" on a client (as expected, since this command requires the client to contact the OSSs).

      First we try to ping every server interface from a client:

      client> lctl ping 10.50.0.7@o2ib0 => No response
      client> lctl ping 10.50.1.7@o2ib1 => No response

      client> dmesg
      00000400:00000100:3.0F:1297255885.268873:0:2998:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-10.50.0.7@o2ib1: peer not alive
      00000400:00020000:3.0:1297255885.279758:0:2998:0:(lib-move.c:2628:LNetGet()) error sending GET to 12345-10.50.0.7@o2ib1: -113
      00000800:00000100:0.0F:1297255885.284181:0:2435:0:(o2iblnd_cb.c:462:kiblnd_rx_complete()) Rx from 10.50.0.7@o2ib1 failed: 5

      Then we ping the client's interface (the client has only one interface) on both LNets from a server:

      server> lctl ping 10.50.0.50@o2ib0 => OK
      server> lctl ping 10.50.0.50@o2ib1 => OK

      After that the problem is solved: "df -h /lustre" runs correctly and every "lctl ping" from the client to the server interfaces works.
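
      The workaround could be scripted roughly as follows. This is only a sketch: the ping_clients helper and the LCTL override variable are names made up for illustration, and the client addresses have to be supplied by whoever runs it. The only lctl subcommand it relies on is "lctl ping <nid>", as used in this report.

```shell
#!/bin/sh
# Sketch of the workaround: from a server, "lctl ping" every client on every LNet.
# LCTL may be overridden for a dry run (for example LCTL=true).
LCTL="${LCTL:-lctl}"

# ping_clients "<space-separated nets>" <client-ip>...
# Hypothetical helper, not part of Lustre. Prints each NID before pinging it
# and returns non-zero if any ping fails.
ping_clients() {
    nets="$1"
    shift
    failed=0
    for ip in "$@"; do
        for net in $nets; do
            echo "pinging ${ip}@${net}"
            if ! $LCTL ping "${ip}@${net}" >/dev/null 2>&1; then
                echo "no response from ${ip}@${net}" >&2
                failed=1
            fi
        done
    done
    return "$failed"
}

# Example, run on a server, for the client used in this report:
#   ping_clients "o2ib0 o2ib1" 10.50.0.50
```

      Running it from each server after the clients come up reproduces the manual step above; it does not address the underlying cause.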

      IPoIB ping works fine, we do not have DDR InfiniBand drivers running on our machines, and we have already tried a network configuration using ip2nets.

      Here is our ip2nets configuration (all machines in the [7-10] range are servers with two IB cards and one LNet on each card; all the other machines are clients with a single IB interface carrying both LNets):

      [root@berlin5 ~]# cat /sys/module/lnet/parameters/ip2nets
      o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.. ; o2ib1(ib0) 10.50..
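
      One way to confirm that a node actually came up with both LNets is to compare the output of "lctl list_nids" against the expected networks. The sketch below assumes list_nids prints one NID per line and that LNet prints network number 0 without the digit (o2ib0 appears as o2ib); normalise_net and missing_nets are hypothetical helper names, not Lustre commands.

```shell
#!/bin/sh
# normalise_net <net>: strip a trailing network number 0 ("o2ib0" -> "o2ib"),
# on the assumption that LNet omits it when printing NIDs.
normalise_net() {
    case "$1" in
        *[a-z]0) echo "${1%0}" ;;
        *) echo "$1" ;;
    esac
}

# missing_nets "<space-separated nets>": reads NIDs (one per line) on stdin and
# prints every expected net that has no NID. Returns non-zero if any is missing.
missing_nets() {
    nids=$(cat)
    rc=0
    for net in $1; do
        want=$(normalise_net "$net")
        found=0
        for nid in $nids; do
            # "${nid#*@}" keeps only the network part after the "@"
            if [ "$(normalise_net "${nid#*@}")" = "$want" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "$net"
            rc=1
        fi
    done
    return "$rc"
}

# Example, on a server or a client:
#   lctl list_nids | missing_nets "o2ib0 o2ib1"
```

      An empty result on both servers and clients would suggest the ip2nets rules matched as intended, which points away from a simple configuration mismatch.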

      So it seems that the clients cannot choose between the server interfaces, but once a server has pinged a client, that client is able to choose the right interface.

      Do you think this could be an OFED bug? Maybe an lnet bug?

            People

              liang Liang Zhen (Inactive)
              dmoreno Diego Moreno (Inactive)