
Client Connectivity Issues in Complex Lustre Environment

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      Connectivity Issues:
      Although the login nodes are able to mount both production filesystems, mounting the second filesystem takes several minutes:

      Client fe2 - Mount Test:

      [root@fe2 ~]# date
      Mon Dec 5 17:31:17 UTC 2011
      [root@fe2 ~]# logger "Start Testing"
      [root@fe2 ~]# date;mount /mnt/lustre1;date
      Mon Dec 5 17:31:50 UTC 2011
      Mon Dec 5 17:31:51 UTC 2011
      [root@fe2 ~]# date;mount /mnt/lustre2;date
      Mon Dec 5 17:32:09 UTC 2011
      Mon Dec 5 17:34:24 UTC 2011
      [root@fe2 ~]# logger "End Testing"
      Log file attached - fe2.log
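      As an aside, one quick way to pull just that window out of the client's syslog, assuming the default /var/log/messages location, is to print everything between the two logger markers:

      # Print the syslog lines between the "Start Testing" and "End Testing" markers
      # (log path is an assumption; adjust for the local syslog configuration)
      sed -n '/Start Testing/,/End Testing/p' /var/log/messages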

      Client fe2:
      ib0: inet addr:10.174.0.38 Bcast:10.255.255.255 Mask:255.255.224.0
      ib1: inet addr:10.175.0.38 Bcast:10.255.255.255 Mask:255.255.224.0
      ib2: inet addr:10.174.81.11 Bcast:10.174.95.255 Mask:255.255.240.0

      [root@fe2 ~]# cat /etc/modprobe.d/lustre.conf

      # Lustre module configuration file
      options lnet networks="o2ib0(ib0), o2ib1(ib1), o2ib2(ib2)"

      [root@fe2 ~]# lctl list_nids
      10.174.0.38@o2ib
      10.175.0.38@o2ib1
      10.174.81.11@o2ib2

      [root@fe2 ~]# cat /etc/fstab | grep lustre
      10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0
      10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /mnt/lustre2 lustre defaults,flock 0 0

      [root@fe2 ~]# df -h | grep lustre
      2.5P 4.7T 2.5P 1% /mnt/lustre1
      3.1P 3.0T 3.1P 1% /mnt/lustre2

      The data transfer nodes differ in that they have only one active IB port, whereas the login nodes have three. Even so, both use the same IB fabric to reach the production filesystems. The DTN nodes can mount the scratch2 filesystem without issue but cannot mount scratch1.

      dtn1:
      ib0: inet addr:10.174.81.1 Bcast:10.174.95.255 Mask:255.255.240.0

      [root@dtn1 ~]# cat /etc/modprobe.d/lustre.conf

      # Lustre module configuration file
      options lnet networks="o2ib2(ib0)"

      [root@dtn1 ~]# lctl list_nids
      10.174.81.1@o2ib2

      [root@dtn1 ~]# lctl ping 10.174.80.40@o2ib2
      12345-0@lo
      12345-10.174.31.241@o2ib
      12345-10.174.79.241@o2ib1
      12345-10.174.80.40@o2ib2
      [root@dtn1 ~]# lctl ping 10.174.80.41@o2ib2
      12345-0@lo
      12345-10.174.31.251@o2ib
      12345-10.174.79.251@o2ib1
      12345-10.174.80.41@o2ib2
      [root@dtn1 ~]# lctl ping 10.174.80.42@o2ib2
      12345-0@lo
      12345-10.175.31.242@o2ib
      12345-10.174.79.242@o2ib1
      12345-10.174.80.42@o2ib2
      [root@dtn1 ~]# lctl ping 10.174.80.43@o2ib2
      12345-0@lo
      12345-10.175.31.252@o2ib
      12345-10.174.79.252@o2ib1
      12345-10.174.80.43@o2ib2
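      As a side note, a simple shell loop can sweep a list of server NIDs from a single-rail node like dtn1 and flag any that do not answer (the NID list below is just the four server NIDs pinged above; substitute as needed):

      # Ping each candidate server NID; lctl ping prints the peer's NIDs on success
      for nid in 10.174.80.40@o2ib2 10.174.80.41@o2ib2 \
                 10.174.80.42@o2ib2 10.174.80.43@o2ib2; do
          echo "== $nid =="
          lctl ping $nid || echo "unreachable: $nid"
      done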

      [root@dtn1 ~]# mount /mnt/lustre2
      [root@dtn1 ~]# df -h
      Filesystem Size Used Avail Use% Mounted on
      /dev/mapper/vg_dtn1-lv_root
      50G 9.4G 38G 20% /
      tmpfs 24G 88K 24G 1% /dev/shm
      /dev/sda1 485M 52M 408M 12% /boot
      10.181.1.2:/contrib 132G 2.9G 129G 3% /contrib
      10.181.1.2:/apps/v1 482G 38G 444G 8% /apps
      10.181.1.2:/home 4.1T 404G 3.7T 10% /home
      10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2
      3.1P 3.0T 3.1P 1% /mnt/lustre2

      [root@dtn1 ~]# mount /mnt/lustre1
      mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /mnt/lustre1 failed: No such file or directory
      Is the MGS specification correct?
      Is the filesystem name correct?
      If upgrading, is the copied client log valid? (see upgrade docs)

      [root@dtn1 ~]# cat /etc/fstab | grep lustre
      10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0
      10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /mnt/lustre2 lustre defaults,flock 0 0

      Finally, the TDS compute nodes cannot access the production filesystems. They have the TDS filesystems mounted (lustre1 and lustre2).
      This may be a simple networking issue. Still investigating.
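      A first check from one of the TDS compute nodes would be to compare its local NIDs against the production MGS NIDs and confirm LNet reachability (the NIDs below are the production MGS pair taken from the fstab entries above; the TDS nodes may well be configured on a different LNet network, which by itself would explain the failure):

      # On a TDS compute node: list local NIDs, then try to reach the production MGS NIDs
      lctl list_nids
      lctl ping 10.174.80.40@o2ib2
      lctl ping 10.174.80.41@o2ib2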

      Attachments

        1. fe2.log
          9 kB
        2. log.client
          243 kB
        3. log1
          88 kB
        4. log2
          5.75 MB
        5. lustre1_uuids.txt
          139 kB
        6. lustre2_uuids.txt
          347 kB
        7. lustre-scratch1
          826 kB
        8. scratch1.log
          243 kB
        9. scratch2.log
          612 kB

        Activity

          [LU-899] Client Connectivity Issues in Complex Lustre Environment
          pjones Peter Jones added a comment -

          Great - thanks Dennis!


          dnelson@ddn.com Dennis Nelson added a comment -

          Yes. I already suggested that LU-890 be closed and it was closed by Cliff. This one can be also.
          pjones Peter Jones added a comment -

          Dennis

          Thanks for the update. So can we close both this ticket and LU-890?

          Peter


          dnelson@ddn.com Dennis Nelson added a comment -

          I made the change on the TDS servers and had to perform a writeconf in order to get it mounted up again. Everything seems to be working now.

          Thank you very much for all of your help!

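          For reference, the writeconf referred to here is presumably the standard procedure from the Lustre manual; the device paths below are placeholders, not the actual TDS devices:

          # With all clients and all TDS targets unmounted, regenerate the
          # configuration logs so they pick up the new o2ib3 NIDs
          tunefs.lustre --writeconf /dev/mdtdev    # on the MDS/MGS, MDT first
          tunefs.lustre --writeconf /dev/ostdev    # on each OSS, for every OST
          # Then remount the MGS/MDT first, followed by the OSTs, then the clients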

          dnelson@ddn.com Dennis Nelson added a comment -

          Ah, no. I will have to schedule some time with the customer to do that. I have one node that is not currently in the job queue that I can use for testing. To take the whole filesystem down, I will have to schedule it.

          I will get that scheduled today.


          liang Liang Zhen (Inactive) added a comment -

          Have you also changed the MDS/MGS and the other servers in the TDS filesystem to o2ib3 as well (e.g., mds01)? Since you are using o2ib3 as the TDS network number, all clients and servers on the TDS network should use that network number (o2ib3).
          Also, verifying that the network is reachable with "lctl ping" is always a good idea.


          dnelson@ddn.com Dennis Nelson added a comment -

          OK, I tried the following:

          [root@r1i3n15 ~]# cat /etc/modprobe.d/lustre.conf

          # Lustre module configuration file
          options lnet networks="o2ib3(ib1), o2ib1(ib0)"

          [root@r1i3n15 ~]# lctl list_nids
          10.174.96.65@o2ib3
          10.174.64.65@o2ib1

          [root@r1i3n15 ~]# cat /etc/fstab
          ...
          10.174.96.138@o2ib3:/lustre1 /mnt/tds_lustre1 lustre defaults,flock 0 0
          10.174.96.138@o2ib3:/lustre2 /mnt/tds_lustre2 lustre defaults,flock 0 0
          10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1 /mnt/lsc_lustre1 lustre defaults,flock 0 0
          10.174.79.242@o2ib1:10.174.79.252@o2ib1:/scratch2 /mnt/lsc_lustre2 lustre defaults,flock 0 0

          Now, the production filesystems (scratch1, scratch2) mount and the TDS filesystems fail to mount.

          [root@r1i3n15 ~]# mount -at lustre
          mount.lustre: mount 10.174.96.138@o2ib3:/lustre1 at /mnt/tds_lustre1 failed: Cannot send after transport endpoint shutdown
          mount.lustre: mount 10.174.96.138@o2ib3:/lustre2 at /mnt/tds_lustre2 failed: File exists
          [root@r1i3n15 ~]# df
          Filesystem 1K-blocks Used Available Use% Mounted on
          tmpfs 153600 1708 151892 2% /tmp
          10.181.1.2:/contrib 137625600 3002528 134623072 3% /contrib
          10.181.1.2:/testapps/v1
          45875200 35991488 9883712 79% /apps
          10.181.1.2:/testhome 550764544 166799968 383964576 31% /home
          10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1
          2688660012544 29627611556 2632114228424 2% /mnt/lsc_lustre1
          10.174.79.242@o2ib1:10.174.79.252@o2ib1:/scratch2
          3360825015680 785492156 3326396150596 1% /mnt/lsc_lustre2
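           Two quick checks that may help narrow this down (suggested here rather than taken from the log above): confirm the TDS MGS is reachable on its new network number, and make sure a stale lustre2 mount is not behind the "File exists" error:

           # Verify LNet reachability of the TDS MGS after the renumbering
           lctl ping 10.174.96.138@o2ib3
           # List existing lustre mounts; a leftover mount would explain "File exists"
           mount -t lustre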

          liang Liang Zhen (Inactive) added a comment - - edited

          Here is my understanding of your setup; please correct me if I am wrong:

           
          client                      TDS MDS                    Production MDS
          ---------                   ---------                  -------
          rli3n15                     mds01                      lfs-mds-1-1 (scratch1)
          10.174.96.64@o2ib0(ib1)     10.174.96.138@o2ib0 [y]    10.174.31.241@o2ib0 [n]
          10.174.64.65@o2ib1(ib0)                                10.174.79.241@o2ib1 [y]
          
          [y] == [yes], means we can reach that NID via "lctl ping" from rli3n15
          [n] == [no],  means we can not reach that NID via "lctl ping" from rli3n15
          
          

          So between rli3n15 and lfs-mds-1-1:

           • 10.174.64.65@o2ib1(ib0) and 10.174.79.241@o2ib1 are on the same LNet network, and they are physically reachable to each other
           • 10.174.96.64@o2ib0(ib1) and 10.174.31.241@o2ib0 are on the same LNet network, but they are physically unreachable to each other

           I think when you try to mount scratch1 from rli3n15, it will first look at all NIDs of lfs-mds-1-1. It finds that both itself and lfs-mds-1-1 have local NIDs on o2ib0 and o2ib1 (although they cannot reach each other on o2ib0). Since the LNet hop counts of these two NIDs are the same and both interfaces are healthy, ptlrpc will choose the first NID of lfs-mds-1-1, 10.174.31.241@o2ib0, which is actually unreachable from rli3n15.

           I would suggest trying this on rli3n15:
           options lnet networks="o2ib1(ib0)"

           and then try to mount scratch1 and scratch2. If that works, I would suggest using a configuration like this:

           
          client                      TDS MDS                    Production MDS
          ---------                   ---------                  -------
          rli3n15                     mds01                      lfs-mds-1-1 (scratch1)
          10.174.96.64@o2ib3(ib1)     10.174.96.138@o2ib3 [y]    
          10.174.64.65@o2ib1(ib0)                                10.174.79.241@o2ib1 [y]
                                                                 10.174.31.241@o2ib0 [y]
          
          

           The only change made here is that o2ib0 on rli3n15 and mds01 is replaced by o2ib3. Of course, if this works you will have to change all nodes on the TDS to o2ib3...
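           On the TDS server side that would mean something like the following in /etc/modprobe.d/lustre.conf on mds01 and the TDS OSS nodes (the ib1 interface name is an assumption; use whichever port carries the TDS fabric), followed by reloading LNet and a writeconf so the configuration logs pick up the renumbered NIDs:

           # Hypothetical TDS server LNet configuration after renumbering to o2ib3
           options lnet networks="o2ib3(ib1)"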


          People

            Assignee: cliffw Cliff White (Inactive)
            Reporter: dnelson@ddn.com Dennis Nelson
            Votes: 0
            Watchers: 5
