Lustre / LU-899

Client Connectivity Issues in Complex Lustre Environment

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      Connectivity Issues:
      Although the login nodes can mount both production filesystems, mounting the second filesystem takes several minutes:

      Client fe2 - mount test:

      [root@fe2 ~]# date
      Mon Dec 5 17:31:17 UTC 2011
      [root@fe2 ~]# logger "Start Testing"
      [root@fe2 ~]# date;mount /mnt/lustre1;date
      Mon Dec 5 17:31:50 UTC 2011
      Mon Dec 5 17:31:51 UTC 2011
      [root@fe2 ~]# date;mount /mnt/lustre2;date
      Mon Dec 5 17:32:09 UTC 2011
      Mon Dec 5 17:34:24 UTC 2011
      [root@fe2 ~]# logger "End Testing"
      Log file attached - fe2.log
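      The `date;mount;date` bracketing above can be wrapped in a small helper so each attempt is timed and reported in one step; a minimal sketch (the `sleep 2` stands in for the slow `mount /mnt/lustre2` so the sketch runs anywhere):

```shell
#!/bin/sh
# Time a command and report elapsed seconds and exit status;
# a scripted form of the manual date;mount;date measurement above.
timed() {
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "'$*' took $(( end - start ))s (rc=$rc)"
    return $rc
}

timed sleep 2   # in the report this would be: timed mount /mnt/lustre2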

      Client fe2:
      ib0: inet addr:10.174.0.38 Bcast:10.255.255.255 Mask:255.255.224.0
      ib1: inet addr:10.175.0.38 Bcast:10.255.255.255 Mask:255.255.224.0
      ib2: inet addr:10.174.81.11 Bcast:10.174.95.255 Mask:255.255.240.0

      [root@fe2 ~]# cat /etc/modprobe.d/lustre.conf

      # Lustre module configuration file
      options lnet networks="o2ib0(ib0), o2ib1(ib1), o2ib2(ib2)"

      [root@fe2 ~]# lctl list_nids
      10.174.0.38@o2ib
      10.175.0.38@o2ib1
      10.174.81.11@o2ib2
      (Note: the first network, o2ib0, is printed without its numeric suffix.)

      [root@fe2 ~]# cat /etc/fstab | grep lustre
      10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0
      10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /mnt/lustre2 lustre defaults,flock 0 0

      [root@fe2 ~]# df -h | grep lustre
      2.5P 4.7T 2.5P 1% /mnt/lustre1
      3.1P 3.0T 3.1P 1% /mnt/lustre2

      The data transfer nodes differ from the login nodes in that they have only one active IB port instead of three. Even so, both use the same IB fabric to reach the production filesystems. The dtn nodes mount the scratch2 filesystem without issue, but cannot mount scratch1.

      dtn1:
      ib0: inet addr:10.174.81.1 Bcast:10.174.95.255 Mask:255.255.240.0
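      Since dtn1 reaches the servers only over o2ib2, one quick sanity check is whether its ib0 address and the server NIDs share an IP subnet under the /20 netmask shown above; a minimal sketch in plain shell arithmetic (addresses copied from this report, no Lustre tools needed):

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer, then
# compare the network parts of two addresses under a netmask.
ip_to_int() {
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

same_subnet() {  # same_subnet <ip1> <ip2> <netmask>
    m=$(ip_to_int "$3")
    [ $(( $(ip_to_int "$1") & m )) -eq $(( $(ip_to_int "$2") & m )) ]
}

# dtn1 ib0 vs. a scratch1 MGS NID, mask 255.255.240.0 (from ifconfig above)
same_subnet 10.174.81.1 10.174.80.40 255.255.240.0 && echo "same subnet"
```

      For these addresses the network parts match (both reduce to 10.174.80.0/20), which is consistent with the successful lctl pings below; the scratch1 mount failure is therefore unlikely to be a plain subnet mismatch.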

      [root@dtn1 ~]# cat /etc/modprobe.d/lustre.conf

      # Lustre module configuration file
      options lnet networks="o2ib2(ib0)"

      [root@dtn1 ~]# lctl list_nids
      10.174.81.1@o2ib2

      [root@dtn1 ~]# lctl ping 10.174.80.40@o2ib2
      12345-0@lo
      12345-10.174.31.241@o2ib
      12345-10.174.79.241@o2ib1
      12345-10.174.80.40@o2ib2
      [root@dtn1 ~]# lctl ping 10.174.80.41@o2ib2
      12345-0@lo
      12345-10.174.31.251@o2ib
      12345-10.174.79.251@o2ib1
      12345-10.174.80.41@o2ib2
      [root@dtn1 ~]# lctl ping 10.174.80.42@o2ib2
      12345-0@lo
      12345-10.175.31.242@o2ib
      12345-10.174.79.242@o2ib1
      12345-10.174.80.42@o2ib2
      [root@dtn1 ~]# lctl ping 10.174.80.43@o2ib2
      12345-0@lo
      12345-10.175.31.252@o2ib
      12345-10.174.79.252@o2ib1
      12345-10.174.80.43@o2ib2

      [root@dtn1 ~]# mount /mnt/lustre2
      [root@dtn1 ~]# df -h
      Filesystem Size Used Avail Use% Mounted on
      /dev/mapper/vg_dtn1-lv_root
      50G 9.4G 38G 20% /
      tmpfs 24G 88K 24G 1% /dev/shm
      /dev/sda1 485M 52M 408M 12% /boot
      10.181.1.2:/contrib 132G 2.9G 129G 3% /contrib
      10.181.1.2:/apps/v1 482G 38G 444G 8% /apps
      10.181.1.2:/home 4.1T 404G 3.7T 10% /home
      10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2
      3.1P 3.0T 3.1P 1% /mnt/lustre2

      [root@dtn1 ~]# mount /mnt/lustre1
      mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /mnt/lustre1 failed: No such file or directory
      Is the MGS specification correct?
      Is the filesystem name correct?
      If upgrading, is the copied client log valid? (see upgrade docs)

      [root@dtn1 ~]# cat /etc/fstab | grep lustre
      10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0
      10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /mnt/lustre2 lustre defaults,flock 0 0

      Finally, the TDS compute nodes cannot access the production filesystems. They have the TDS filesystems mounted (lustre1 and lustre2).
      This may be a simple networking issue. Still investigating.
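      If the TDS compute nodes sit on a separate LNet network from the production servers, they would need either an interface on o2ib2 or an LNet route to it. A hypothetical routing stanza for a TDS client, purely illustrative (the o2ib3 network name and the router NID 10.174.81.254@o2ib3 are placeholders, not taken from this site):

```
# /etc/modprobe.d/lustre.conf on a TDS compute node (hypothetical)
options lnet networks="o2ib3(ib0)"
options lnet routes="o2ib2 10.174.81.254@o2ib3"
```

      With a matching routes entry on a router node, an lctl ping from the TDS node to one of the 10.174.80.x@o2ib2 NIDs would confirm the path before retrying the mount.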

      Attachments

        1. fe2.log
          9 kB
          Dennis Nelson
        2. log.client
          243 kB
          Dennis Nelson
        3. log1
          88 kB
          Dennis Nelson
        4. log2
          5.75 MB
          Dennis Nelson
        5. lustre1_uuids.txt
          139 kB
          Dennis Nelson
        6. lustre2_uuids.txt
          347 kB
          Dennis Nelson
        7. lustre-scratch1
          826 kB
          Dennis Nelson
        8. scratch1.log
          243 kB
          Dennis Nelson
        9. scratch2.log
          612 kB
          Dennis Nelson

        People

          Assignee: Cliff White (cliffw)
          Reporter: Dennis Nelson (dnelson@ddn.com)