[LU-899] Client Connectivity Issues in Complex Lustre Environment Created: 05/Dec/11  Updated: 14/Dec/11  Resolved: 14/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Dennis Nelson Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

The cluster configuration is as follows:

scratch1 Lustre filesystem - 2 MDS, 16 OSS, 4 DDN SFA 10K arrays
scratch2 Lustre filesystem - 2 MDS, 20 OSS, 5 DDN SFA 10K arrays

The scratch1 and scratch2 servers each have 4 IB ports. The ports are used for client connectivity as follows:

Production compute clients access scratch1 via the ib0 port.
Production compute clients access scratch2 via the ib1 port.
Test and Development systems (TDS) access both filesystems through the ib2 port.
Login nodes and data transfer nodes (DTN) access both filesystems through the ib3 port.

The servers are running CentOS 5.5 (2.6.18-238.12.1.el5)
Lustre 1.8.6 with the LU-530 patch installed
The clients are currently running RHEL 6.0.

Server Configuration:
lfs-mds-1-1
ib0 inet addr:10.174.31.241 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.31.241 Bcast:10.175.31.255 Mask:255.255.224.0
ib2 inet addr:10.174.79.241 Bcast:10.174.79.255 Mask:255.255.240.0
ib3 inet addr:10.174.80.40 Bcast:10.174.111.255 Mask:255.255.240.0

[root@lfs-mds-1-1 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib2), o2ib2(ib3)"
[root@lfs-mds-1-1 config]# lctl list_nids
10.174.31.241@o2ib
10.174.79.241@o2ib1
10.174.80.40@o2ib2
lfs-mds-1-2:
ib0 inet addr:10.174.31.251 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.31.251 Bcast:10.175.31.255 Mask:255.255.224.0
ib2 inet addr:10.174.79.251 Bcast:10.174.79.255 Mask:255.255.240.0
ib3 inet addr:10.174.80.41 Bcast:10.174.111.255 Mask:255.255.240.0
[root@lfs-mds-1-2 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib2), o2ib2(ib3)"
[root@lfs-mds-1-2 ~]# lctl list_nids
10.174.31.251@o2ib
10.174.79.251@o2ib1
10.174.80.41@o2ib2
lfs-mds-2-1:
ib0 inet addr:10.174.31.242 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.31.242 Bcast:10.175.31.255 Mask:255.255.224.0
ib2 inet addr:10.174.79.242 Bcast:10.174.79.255 Mask:255.255.240.0
ib3 inet addr:10.174.80.42 Bcast:10.174.111.255 Mask:255.255.240.0

[root@lfs-mds-2-1 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib2), o2ib2(ib3)"

[root@lfs-mds-2-1 ~]# lctl list_nids
10.175.31.242@o2ib
10.174.79.242@o2ib1
10.174.80.42@o2ib2

lfs-mds-2-2:
ib0 inet addr:10.174.31.252 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.31.252 Bcast:10.175.31.255 Mask:255.255.224.0
ib2 inet addr:10.174.79.252 Bcast:10.174.79.255 Mask:255.255.240.0
ib3 inet addr:10.174.80.43 Bcast:10.174.111.255 Mask:255.255.240.0

[root@lfs-mds-2-2 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib2), o2ib2(ib3)"

[root@lfs-mds-2-2 ~]# lctl list_nids
10.175.31.252@o2ib
10.174.79.252@o2ib1
10.174.80.43@o2ib2


Attachments: Text File fe2.log     File log.client     File log1     File log2     File lustre-scratch1     Text File lustre1_uuids.txt     Text File lustre2_uuids.txt     Text File scratch1.log     Text File scratch2.log    
Severity: 3
Rank (Obsolete): 6508

 Description   

Connectivity Issues:
Although the login nodes are able to mount both production systems, mounting of the second filesystem takes several minutes:

client fe2:
Client fe2 - Mount Test:

[root@fe2 ~]# date
Mon Dec 5 17:31:17 UTC 2011
[root@fe2 ~]# logger "Start Testing"
[root@fe2 ~]# date;mount /mnt/lustre1;date
Mon Dec 5 17:31:50 UTC 2011
Mon Dec 5 17:31:51 UTC 2011
[root@fe2 ~]# date;mount /mnt/lustre2;date
Mon Dec 5 17:32:09 UTC 2011
Mon Dec 5 17:34:24 UTC 2011
[root@fe2 ~]# logger "End Testing"
Log file attached - fe2.log

Client fe2:
ib0: inet addr:10.174.0.38 Bcast:10.255.255.255 Mask:255.255.224.0
ib1: inet addr:10.175.0.38 Bcast:10.255.255.255 Mask:255.255.224.0
ib2: inet addr:10.174.81.11 Bcast:10.174.95.255 Mask:255.255.240.0

[root@fe2 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib0(ib0), o2ib1(ib1), o2ib2(ib2)"

[root@fe2 ~]# lctl list_nids
10.174.0.38@o2ib
10.175.0.38@o2ib1
10.174.81.11@o2ib2

[root@fe2 ~]# cat /etc/fstab | grep lustre
10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0
10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /mnt/lustre2 lustre defaults,flock 0 0

[root@fe2 ~]# df -h | grep lustre
2.5P 4.7T 2.5P 1% /mnt/lustre1
3.1P 3.0T 3.1P 1% /mnt/lustre2

The configuration of the data transfer nodes differs in that they only have 1 active ib port where the login nodes have 3. Even so, they both use the same ib fabric to connect to the production filesystems. The dtn nodes are able to mount the scratch2 filesystem without issue, but cannot mount the scratch1 filesystem.

dtn1:
ib0: inet addr:10.174.81.1 Bcast:10.174.95.255 Mask:255.255.240.0

[root@dtn1 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib2(ib0)"

[root@dtn1 ~]# lctl list_nids
10.174.81.1@o2ib2

[root@dtn1 ~]# lctl ping 10.174.80.40@o2ib2
12345-0@lo
12345-10.174.31.241@o2ib
12345-10.174.79.241@o2ib1
12345-10.174.80.40@o2ib2
[root@dtn1 ~]# lctl ping 10.174.80.41@o2ib2
12345-0@lo
12345-10.174.31.251@o2ib
12345-10.174.79.251@o2ib1
12345-10.174.80.41@o2ib2
[root@dtn1 ~]# lctl ping 10.174.80.42@o2ib2
12345-0@lo
12345-10.175.31.242@o2ib
12345-10.174.79.242@o2ib1
12345-10.174.80.42@o2ib2
[root@dtn1 ~]# lctl ping 10.174.80.43@o2ib2
12345-0@lo
12345-10.175.31.252@o2ib
12345-10.174.79.252@o2ib1
12345-10.174.80.43@o2ib2

[root@dtn1 ~]# mount /mnt/lustre2
[root@dtn1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_dtn1-lv_root
50G 9.4G 38G 20% /
tmpfs 24G 88K 24G 1% /dev/shm
/dev/sda1 485M 52M 408M 12% /boot
10.181.1.2:/contrib 132G 2.9G 129G 3% /contrib
10.181.1.2:/apps/v1 482G 38G 444G 8% /apps
10.181.1.2:/home 4.1T 404G 3.7T 10% /home
10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2
3.1P 3.0T 3.1P 1% /mnt/lustre2

[root@dtn1 ~]# mount /mnt/lustre1
mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /mnt/lustre1 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

[root@dtn1 ~]# cat /etc/fstab | grep lustre
10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0
10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /mnt/lustre2 lustre defaults,flock 0 0

Finally, the TDS compute nodes cannot access the production filesystems. They have the TDS filesystems mounted (lustre1 and lustre2).
This may be a simple networking issue. Still investigating.



 Comments   
Comment by Cliff White (Inactive) [ 05/Dec/11 ]

I do not see any information about your MGS; are you running the MGS co-located with the MDS? It might be better for this configuration to have one separate MGS for all the filesystems. If lctl ping works, it is odd that the mount would fail, so it may indeed be a network issue. Also, you might check the MDS disk (tunefs.lustre --print) to see if the failover NIDs are correct on the disk.

Comment by Dennis Nelson [ 05/Dec/11 ]

The MGS is co-located with the MDS. I did neglect to include the MDT nid information:

[root@lfs-mds-1-1 ~]# tunefs.lustre --dryrun /dev/vg_scratch1/mdt
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

Permanent disk data:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

exiting before disk write.

The MGT does not have any NID information. It is my understanding that the client mount command specifies the nids of the systems where the MGT will be mounted.

Comment by Dennis Nelson [ 05/Dec/11 ]

The other thing I would point out is that the dtn nodes are able to mount the scratch2 filesystem. They are attempting to mount the scratch1 filesystem over the same ib subnet (10.174.80.xx). Additionally, the login nodes are able to mount both filesystems using that same subnet. I don't understand how the cause could be a network issue when both the client and the server seem to be able to communicate over the subnet without issues.

Comment by Cliff White (Inactive) [ 05/Dec/11 ]

You didn't include the scratch2 info, but on scratch1 I notice you have everything listed twice:

Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

I think what you want would be everything listed once:
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug
You do not need the double listing.

Comment by Dennis Nelson [ 05/Dec/11 ]

I do not know what you mean. I have mgsnode listed twice: once for one MDS, listing the 3 NIDs that are used for scratch1, and once for the other MDS, again listing the 3 NIDs that are used for scratch1. None of the NIDs are repeated.

Comment by Cliff White (Inactive) [ 06/Dec/11 ]

You have the same NIDs listed as 'mgsnode' and as 'failnode' mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 ... failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 ... mdt.quota_type=ug

That is not necessary, and may have something to do with the delay in mounting. A NID should be listed either as mgsnode, or failnode, not both as you have here. A NID only needs to be listed once, as mgsnode and failnode are both used when finding a server. The NIDs in the 'failover.node' list do NOT have to be in the 'mgsnode' list and should not be duplicated in this fashion.

Comment by Dennis Nelson [ 06/Dec/11 ]

So, would you suggest using mgsnode or failover.node? Are they identical? It appears that the DDN tools add both of these.

Comment by Ashley Pittman (Inactive) [ 06/Dec/11 ]

This is the output of tunefs.lustre --print for the MDT on scratch2. It's from a snapshot taken last week, so it may not be up to date. Dennis, can you check this and update it if it's wrong?

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: scratch2-MDT0000
Index: 0
Lustre FS: scratch2
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 mgsnode=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 failover.node=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 failover.node=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 mdt.quota_type=ug

Permanent disk data:
Target: scratch2-MDT0000
Index: 0
Lustre FS: scratch2
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 mgsnode=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 failover.node=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 failover.node=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 mdt.quota_type=ug

exiting before disk write.

Comment by Dennis Nelson [ 06/Dec/11 ]

Yes, it is the same. It has not been modified since it was initially installed.

Comment by Ashley Pittman (Inactive) [ 06/Dec/11 ]

To be clear, when Dennis says "The MGS is co-located with the MDS", what we mean is that it's running from a different partition on the same hardware. The NIDs used to access it should be the same, but it is a different partition, external to the MDT.

Comment by Cliff White (Inactive) [ 06/Dec/11 ]

The primary NIDs for the node do not need to be listed as 'failnode'. When the MDT is registered with the MGS (which should always be done from the primary), the primary NIDs will be added to the config log automatically. AFAIK, all the duplication does is increase the length of the list searched when the primary isn't reachable.

Comment by Cliff White (Inactive) [ 06/Dec/11 ]

You have multiple issues going on here. First, can you attach syslogs from a mount attempt from dtn1?
Second, the fe2 log would seem to indicate some network issues; I notice this sequence:
-------

Dec 5 17:33:15 fe2 kernel: Lustre: scratch1-MDT0000-mdc-ffff8817f3d6e400: Connection to service scratch1-MDT0000 via nid 10.174.31.241@o2ib was lost; in progress operations using this service will wait for recovery to complete.
...

Dec 5 17:33:29 fe2 kernel: Lustre: 20703:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff8817f3d6e400: tried all connections, increasing latency to 2s
...

Dec 5 17:34:24 fe2 kernel: Lustre: scratch1-MDT0000-mdc-ffff8817f3d6e400: Connection restored to service scratch1-MDT0000 using nid 10.174.31.241@o2ib.
--------
This looks like what we see with network issues.

Comment by Dennis Nelson [ 06/Dec/11 ]

Yes, I believe there are multiple issues. Different systems have different symptoms even across the same subnet. I have not seen any indication of a network issue, although I certainly will not rule one out.

Comment by Dennis Nelson [ 07/Dec/11 ]

Sorry, I sent the trace for the other case but did not send the trace for dtn1. Here it is.

[root@dtn1 ~]# lustre_rmmod
[root@dtn1 ~]# modprobe lustre
[root@dtn1 ~]# lctl set_param debug=+trace lnet.debug=+trace
[root@dtn1 ~]# lctl get_param debug
lnet.debug=trace ioctl neterror warning error emerg ha config console
[root@dtn1 ~]# mount /mnt/lustre1
mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /mnt/lustre1 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@dtn1 ~]# lctl dk > /tmp/lustre-scratch1

Comment by Cliff White (Inactive) [ 07/Dec/11 ]

Ah, the Lustre debug dumps were requested on LU-890 - what I need here is the syslog (/var/log/messages or dmesg) from the dtn1 mount attempt. Thanks again

Comment by Cliff White (Inactive) [ 07/Dec/11 ]

Hmm. I am starting to suspect there may be config log issues. I see this: you attempt to mount using the ib3 address (.40), but when the mount starts to reach the MDT, it only tries the primary address:
....
00000020:00000080:13:1323273617.875570:0:5786:0:(obd_config.c:857:class_process_config()) marker 4 (0x1) scratch1-MDT0000 add mdc
....
00000020:00000080:13:1323273617.875579:0:5786:0:(obd_config.c:799:class_process_config()) adding mapping from uuid 10.174.31.241@o2ib to nid 0x500000aae1ff1 (10.174.31.241@o2ib)
...
00000100:00000100:13:1323273617.875666:0:5786:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer 10.174.31.241@o2ib!
00010000:00080000:13:1323273617.875667:0:5786:0:(ldlm_lib.c:71:import_set_conn()) can't find connection 10.174.31.241@o2ib
00010000:00000001:13:1323273617.875667:0:5786:0:(ldlm_lib.c:72:import_set_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00010000:00020000:13:1323273617.875668:0:5786:0:(ldlm_lib.c:331:client_obd_setup()) can't add initial connection

and I don't see it trying any of the alternate addresses.
So it is possible the on-disk configuration (config log) has an issue.

Comment by Dennis Nelson [ 07/Dec/11 ]

So, would a --writeconf for all devices be in order? I was thinking of doing that but I have not had any downtime to do it. The filesystem is not in production, but others are using it to prepare for acceptance. I can certainly schedule time if you suggest that is the right plan of action. That might also fix the issue that I am seeing in LU-890 (clients do not reconnect after MDS failover).

Comment by Dennis Nelson [ 07/Dec/11 ]

Sorry, trying to do many things at once. Here are the syslogs from dtn1:

[root@dtn1 ~]# umount -at lustre
[root@dtn1 ~]# lustre_rmmod
[root@dtn1 ~]# logger "Start Test"
[root@dtn1 ~]# mount -at lustre
mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /mnt/lustre1 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@dtn1 ~]# logger "stop Test"

Dec 7 14:51:49 dtn1 root: Start Test
Dec 7 14:51:56 dtn1 kernel: Lustre: OBD class driver, http://wiki.whamcloud.com/
Dec 7 14:51:56 dtn1 kernel: Lustre: Lustre Version: 1.8.6
Dec 7 14:51:56 dtn1 kernel: Lustre: Build Version: jenkins-wc1-ga73a0cf-PRISTINE-2.6.32-71.el6.x86_64
Dec 7 14:51:56 dtn1 kernel: Lustre: Listener bound to ib0:10.174.81.1:987:mlx4_0
Dec 7 14:51:56 dtn1 kernel: Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Dec 7 14:51:56 dtn1 kernel: Lustre: Added LNI 10.174.81.1@o2ib2 [8/64/0/180]
Dec 7 14:51:56 dtn1 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Dec 7 14:51:56 dtn1 kernel: Lustre: MGC10.174.80.40@o2ib2: Reactivating import
Dec 7 14:51:56 dtn1 kernel: LustreError: 7420:0:(ldlm_lib.c:331:client_obd_setup()) can't add initial connection
Dec 7 14:51:56 dtn1 kernel: LustreError: 7420:0:(obd_config.c:372:class_setup()) setup scratch1-MDT0000-mdc-ffff88060fc9f800 failed (-2)
Dec 7 14:51:56 dtn1 kernel: LustreError: 7420:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Dec 7 14:51:56 dtn1 kernel: Lustre: cmd=cf003 0:scratch1-MDT0000-mdc 1:scratch1-MDT0000_UUID 2:10.174.31.241@o2ib
Dec 7 14:51:56 dtn1 kernel: LustreError: 15c-8: MGC10.174.80.40@o2ib2: The configuration from log 'scratch1-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Dec 7 14:51:56 dtn1 kernel: LustreError: 7369:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -2
Dec 7 14:51:56 dtn1 kernel: LustreError: 7369:0:(obd_config.c:443:class_cleanup()) Device 2 not setup
Dec 7 14:51:56 dtn1 kernel: LustreError: 7369:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 7 14:51:56 dtn1 kernel: LustreError: 7369:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Dec 7 14:51:56 dtn1 kernel: Lustre: client ffff88060fc9f800 umount complete
Dec 7 14:51:56 dtn1 kernel: LustreError: 7369:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-2)
Dec 7 14:51:56 dtn1 kernel: Lustre: MGC10.174.80.42@o2ib2: Reactivating import
Dec 7 14:51:56 dtn1 kernel: Lustre: Client scratch2-client has started
Dec 7 14:52:09 dtn1 root: stop Test

Comment by Cliff White (Inactive) [ 07/Dec/11 ]

We'd really like to know what is wrong before doing the writeconf, and it would be good to know what the current config is.
Please do this for both scratch1 and scratch2; my example uses 'test2' where you would use 'scratch1' and 'scratch2', and replace /dev/hdb with your MGT partition.
Using the MGS partition (MGT):

  1. umount /mnt/mgs
  2. mount -t ldiskfs /dev/hdb /mnt/mgs
  3. /usr/sbin/llog_reader /mnt/mgs/CONFIGS/test2-client | grep uuid

The result should be something like this:

Target uuid : config_uuid
uuid=test2-clilov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
uuid=test2-clilmv_UUID stripe:cnt=0 size=0 offset=0 pattern=0
#10 (080)add_uuid nid=10.67.73.72@tcp(0x200000a434948) 0: 1:10.67.73.72@tcp
#19 (080)add_uuid nid=10.67.73.71@tcp(0x200000a434947) 0: 1:10.67.73.71@tcp
#25 (080)add_uuid nid=10.67.73.71@tcp(0x200000a434947) 0: 1:10.67.73.71@tcp
#31 (080)add_uuid nid=10.67.73.82@tcp(0x200000a434952) 0: 1:10.67.73.82@tcp

In my config, 10.67.73.82 is the MDS failover NID; the other NIDs are the OSS/MDS primary NIDs. Check your results and verify the proper NIDs are being given to the clients. If all the NIDs aren't in the config log, a writeconf is needed. If the clients are getting a correct NID list from these config logs, then the issue is most likely networking.

Comment by Dennis Nelson [ 08/Dec/11 ]

Sorry for the delay. I captured the info last night and before I could upload the files my laptop died. I had to reschedule time to get the system again.

Comment by Dennis Nelson [ 08/Dec/11 ]

So, it is my understanding that I should find the client NIDs in the output. I cannot find the dtn1 NID in either file, although dtn1 has scratch2 mounted. In fact, I cannot find any of the 10.174.81.xx NIDs in the lustre1 output. I assume that is part of our problem, but what would cause that?

Comment by Cliff White (Inactive) [ 08/Dec/11 ]

No, there are no client NIDs in the config log. As I mentioned previously, there are only server NIDs; we wanted to see if all the server NIDs were being given to the clients. The error from the dtn1 mount would appear to indicate a possibly corrupt config log.

Comment by Cliff White (Inactive) [ 08/Dec/11 ]

This is being escalated; please attach the full config logs from both systems. Do the same thing as before, but instead of the 'grep uuid', just redirect the whole output to a file and attach it.

Comment by Dennis Nelson [ 08/Dec/11 ]

OK, here they are.

Comment by Johann Lombardi (Inactive) [ 08/Dec/11 ]

It seems that the config logs of scratch1 do not properly set up the 3 nids of lfs-mds-1-1. Only 10.174.31.241@o2ib is added to the niduuid before attach/setup. Could you please run the following command on a login node which has the 2 filesystems mounted:
$ lctl get_param mdc.*.import

The only way to fix this would be to regenerate the config logs with writeconf.

Comment by Dennis Nelson [ 08/Dec/11 ]

[root@fe2 ~]# lctl get_param mdc.*.import
mdc.scratch1-MDT0000-mdc-ffff88109cc8a000.import=
import:
name: scratch1-MDT0000-mdc-ffff88109cc8a000
target: scratch1-MDT0000_UUID
state: FULL
connect_flags: [version, inode_bit_locks, join_file, getattr_by_fid, no_oh_for_devices, early_lock_cancel, adaptive_timeouts, version_recovery, pools]
import_flags: [replayable, pingable]
connection:
failover_nids: [10.174.31.241@o2ib, 10.174.80.41@o2ib]
current_connection: 10.174.31.241@o2ib
connection_attempts: 10
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 0
unregistering: 0
timeouts: 9
avg_waittime: 1402 usec
service_estimates:
services: 1 sec
network: 1 sec
transactions:
last_replay: 0
peer_committed: 0
last_checked: 0
mdc.scratch2-MDT0000-mdc-ffff8817b83d0800.import=
import:
name: scratch2-MDT0000-mdc-ffff8817b83d0800
target: scratch2-MDT0000_UUID
state: FULL
connect_flags: [version, inode_bit_locks, join_file, getattr_by_fid, no_oh_for_devices, early_lock_cancel, adaptive_timeouts, version_recovery, pools]
import_flags: [replayable, pingable]
connection:
failover_nids: [10.175.31.242@o2ib, 10.174.63.242@o2ib, 10.174.63.252@o2ib, 10.175.31.252@o2ib]
current_connection: 10.175.31.242@o2ib
connection_attempts: 16
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 0
unregistering: 0
timeouts: 15
avg_waittime: 1643 usec
service_estimates:
services: 1 sec
network: 1 sec
transactions:
last_replay: 0
peer_committed: 0
last_checked: 0
[root@fe2 ~]#

Any ideas why scratch2 would not show the 10.174.80.[42,43] addresses listed? The way this was designed by the customer was that the login nodes would use the .80 subnet. The login nodes really should only have the single NID; I added the others as a workaround.

Comment by Cliff White (Inactive) [ 08/Dec/11 ]

writeconf should fix the issue

Comment by Dennis Nelson [ 08/Dec/11 ]

OK, so I understand we are going to need to do a writeconf. I actually asked about doing that already. My customer is going to be asking a lot of questions tomorrow morning, so let me ask a few now.

1. Any idea why this did not work in the first place?
2. Is there a limit to the numbers of nids that a server has? Are we reaching some limit in LNET?
3. What makes us think it will work after doing a writeconf?
4. Is there any point of trying to troubleshoot any longer before we do the writeconf or should I go ahead and do it tomorrow?
5. The DDN tools set --mgsnode and --servicenode. Based on previous info from Cliff, he suggested that we do not need both. So, should I just use the --mgsnode flags and drop the --servicenode flags?

Thanks,

Comment by Johann Lombardi (Inactive) [ 09/Dec/11 ]

> Any ideas why scratch2 would not show the 10.174.80.[42,43] addresses listed?

The first time a target is mounted, it registers all its configured nids to the MGS.
Then this list of nids is passed to the client which selects the most suitable one (it discards the others) to be used according to its local network configuration. Failover will only be done across nids registered with the --failnode option.
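As a rough illustration of that flow (hypothetical NIDs and device names, not this site's values):

# Hypothetical format command: each --failnode entry carries one NID per LNet
# network that the failover server is reachable on.
mkfs.lustre --fsname=example --mdt \
    --mgsnode=192.168.1.10@o2ib0,192.168.2.10@o2ib2 \
    --failnode=192.168.1.11@o2ib0,192.168.2.11@o2ib2 \
    /dev/vg_example/mdt

# On first mount the target registers all of these NIDs with the MGS. A client
# that is only configured on o2ib2, e.g. with
#   options lnet networks="o2ib2(ib0)"
# in /etc/modprobe.d/lustre.conf, keeps only the @o2ib2 NIDs from the registered
# list and discards the @o2ib0 ones, since it has no interface on that network.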

> 1. Any idea why this did not work in the first place?

Could you please tell us the exact commands you ran when you formatted the MDTs?
Could you please also run tunefs.lustre --print against the MDT devices?

> 2. Is there a limit to the numbers of nids that a server has? Are we reaching some limit in LNET?

Not as far as I know. However, configuring multiple failover NIDs can increase the recovery time significantly.

> 3. What makes us think it will work after doing a writeconf?
> 4. Is there any point of trying to troubleshoot any longer before we do the writeconf or should I go ahead and do it tomorrow?
> 5. The DDN tools set --mgsnode and --servicenode. Based on previous info from Cliff, he suggested that we do not need both. So, should I just use the --mgsnode flags and drop the --servicenode flags?

Let's first check how you configured the filesystem in the first place before going down this path.

Comment by Dennis Nelson [ 09/Dec/11 ]

I used the DDN tools to format the MDTs. It is done in two steps. First, the formatting command uses generic placeholders for the NIDs. Then, a tunefs step is performed:

Step 1:
mkfs.lustre --mgs /dev/vg_scratch1/mgs
tune2fs -O MMP /dev/vg_scratch1/mgs
tune2fs -p 5 /dev/vg_scratch1/mgs
mkfs.lustre --mgsnode=127.0.0.2@tcp --failnode=127.0.0.2@tcp --fsname=scratch1 --mdt /dev/vg_scratch1/mdt
tune2fs -p 5 /dev/vg_scratch1/mdt

Step 2:
tunefs.lustre --erase-params /dev/vg_scratch1/mgs
tunefs.lustre --erase-params --mgsnode=10.174.31.241@o2ib0,10.174.79.241@o2ib1,10.174.80.40@o2ib2 --mgsnode=10.174.31.251@o2ib0,10.174.79.251@o2ib1,10.174.80.41@o2ib2 --servicenode=10.174.31.241@o2ib0,10.174.79.241@o2ib1,10.174.80.40@o2ib2 --servicenode=10.174.31.251@o2ib0,10.174.79.251@o2ib1,10.174.80.41@o2ib2 --param mdt.quota_type=ug /dev/vg_scratch1/mdt

[root@lfs-mds-1-1 scratch1]# tunefs.lustre --print /dev/vg_scratch1/mdt
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

Permanent disk data:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

exiting before disk write.

Comment by Johann Lombardi (Inactive) [ 09/Dec/11 ]

Could you please also run tunefs.lustre on the MDT of scracth2?

Comment by Ashley Pittman (Inactive) [ 09/Dec/11 ]

What options do you recommend for the writeconf? As well as the configuration data, I assume the options --erase-params and --writeconf should both be set?

Comment by Dennis Nelson [ 09/Dec/11 ]

Commands used:

mkfs.lustre --mgsnode=127.0.0.2@tcp --failnode=127.0.0.2@tcp --fsname=scratch2 --mdt /dev/vg_scratch2/mdt
tune2fs -p 5 /dev/vg_scratch2/mdt

tunefs.lustre --erase-params /dev/vg_scratch2/mgs
tunefs.lustre --erase-params --mgsnode=10.175.31.242@o2ib0,10.174.79.242@o2ib1,10.174.80.42@o2ib2 --mgsnode=10.175.31.252@o2ib0,10.174.79.252@o2ib1,10.174.80.43@o2ib2 --servicenode=10.175.31.242@o2ib0,10.174.79.242@o2ib1,10.174.80.42@o2ib2 --servicenode=10.175.31.252@o2ib0,10.174.79.252@o2ib1,10.174.80.43@o2ib2 --param mdt.quota_type=ug /dev/vg_scratch2/mdt

[root@lfs-mds-2-1 ~]# tunefs.lustre --print /dev/vg_scratch2/mdt
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: scratch2-MDT0000
Index: 0
Lustre FS: scratch2
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 mgsnode=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 failover.node=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 failover.node=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 mdt.quota_type=ug

Permanent disk data:
Target: scratch2-MDT0000
Index: 0
Lustre FS: scratch2
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 mgsnode=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 failover.node=10.175.31.242@o2ib,10.174.79.242@o2ib1,10.174.80.42@o2ib2 failover.node=10.175.31.252@o2ib,10.174.79.252@o2ib1,10.174.80.43@o2ib2 mdt.quota_type=ug

exiting before disk write.

Comment by Johann Lombardi (Inactive) [ 09/Dec/11 ]

> What options do you recommend for the writeconf?

writeconf can now be passed as a mount option. So I would unmount all clients, the MDT, and the OSTs of scratch1 (not the shared MGS). Then mount the MDT again with -o writeconf, and then the OSTs with the same mount option. Once all targets are up and running, you can mount the clients again.
Before proceeding with writeconf, I would advise backing up the CONFIGS directory of the MGS.
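A minimal sketch of that sequence for scratch1, assuming the device paths and mount points shown elsewhere in this ticket (verify against the actual layout before running anything):

# Back up the MGS config logs first (with the MGT mounted as ldiskfs, not lustre):
#   mount -t ldiskfs /dev/vg_scratch1/mgs /mnt/mgs
#   cp -a /mnt/mgs/CONFIGS /root/scratch1-CONFIGS.bak
#   umount /mnt/mgs

# 1. Unmount all scratch1 clients, then the MDT and all OSTs.
umount -at lustre                    # on each client
umount /lustre/scratch1/mdt          # on the MDS
# ... unmount every scratch1 OST on the OSS nodes

# 2. Remount the MDT first, then each OST, with -o writeconf so the targets
#    re-register their NIDs and the config logs are regenerated.
mount -t lustre -o writeconf /dev/vg_scratch1/mdt /lustre/scratch1/mdt
# ... then: mount -t lustre -o writeconf <ost_device> <ost_mountpoint>   (one per OST)

# 3. Once all targets are up and running, mount the clients again.
mount -at lustre                     # on each client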

I also looked at the commands you used to format the MDS and everything looks good. It is still unclear why the resulting logs are bogus.

Comment by Dennis Nelson [ 09/Dec/11 ]

Let me clarify something. The customer specified that each filesystem needed to be fully independent. Given that, there is not a common MDS; there is an MDS/MGT pair for each filesystem. The MGS service runs co-located on the MDS system. The MGT is a separate logical volume from the MDT, although they both reside in a single volume group.

Comment by Dennis Nelson [ 09/Dec/11 ]

Another question: after looking at the tunefs.lustre output, Cliff suggested that the mgsnode and failover.node definitions were duplicates of each other and did not both need to be there. Now you are saying that the commands are correct. We use the --servicenode syntax in our tunefs commands, yet tunefs.lustre displays failover.node. Just to confirm: the mgsnode and failover.node (or servicenode) options must both be present, and the servicenode/failover.node entries are different ways of setting the same parameter? In our case, where the MGS and the MDS are always on the same server, mgsnode and servicenode are identical, but that would not always be the case.

Comment by Johann Lombardi (Inactive) [ 09/Dec/11 ]

> Let me clarify something. The customer specified that each filesystem needed to be fully independent ...

Understood. This should not change the procedure in any case.

> Just to confirm, the mgsnode and failover.node (or servicenode) options must both be present

Yes. The mgsnode(s) is the NID(s) that will be used by the MDT to connect to the MGS, while failover.node is what will be registered by the MDS with the MGS and used by client nodes to reach the MDS.

> the servicenode/failover.node entries are different ways of setting the same parameter?

Correct.

> In our case where the mgs and the mds are always on the same server, mgsnode and servicenodes are identical but that would not always be the case

Exactly. It can (must) only be skipped if you run a combo MGT/MDT on the same device.

Comment by Cliff White (Inactive) [ 09/Dec/11 ]

I was unaware you were using --servicenode instead of --failnode; that explains the discrepancy.

Comment by Dennis Nelson [ 09/Dec/11 ]

Thanks. I need to give an update to my customer. At what point are we going to decide to do the writeconf and how confident are we that it will work when we do it?

Comment by Cliff White (Inactive) [ 09/Dec/11 ]

Per Johann, go ahead and do the writeconf. We will attempt to replicate the issue in our lab with multiple nids.

Comment by Johann Lombardi (Inactive) [ 09/Dec/11 ]

I think it would be interesting to collect debug logs of the MGS during the re-registration.
Before remounting the MDS, please run the following commands on the MGS node:

$ lctl set_param subsystem_debug=mgs # only collect debug messages from the MGS
$ lctl set_param debug=+config+info # enable config & info in the debug mask
$ lctl set_param debug_mb=500 # set debug buffer to 500MB

And then run lctl dk > /tmp/logs once the MDS has been successfully mounted.
Please also collect the new output of llog_reader so that we can compare.

We will try to reproduce on our side too.

Comment by Dennis Nelson [ 09/Dec/11 ]

OK, one more question. You mentioned that the writeconf could be done as a mount option. Should I do it that way or should I use the script previously provided? Does it make any difference? Do you have a preference of how we do it? If we do it as a mount option, I think I would need more info on how that works. How do the mgsnode and servicenode options get added when doing it as a mount option?

Comment by Johann Lombardi (Inactive) [ 09/Dec/11 ]

> OK, one more question. You mentioned that the writeconf could be done as a mount option.

Right.

> Should I do it that way or should I use the script previously provided?

It is really as you prefer. If your tool supports writeconf, then you can just use it. On my side, I just find it very convenient to pass -o writeconf as a mount option.

> Does it make any difference?

With tunefs.lustre, you can also erase the parameters & restore them. However, I don't think we need to do this here since we don't intend to change anything (like a NID).

> If we do it as a mount option, I think I would need more info on how that works.

It really works as I mentioned in my earlier comment: unmount everything, then mount the MDT with -o writeconf, and then the OSTs with the same mount option.

> How do the mgsnode and servicenode options get added doing it with as a mount option?

Those parameters are only removed if you use the --erase-params option (e.g. when you want to change one of the parameters). In this case, I don't think we need to do this; we just want to run a simple writeconf.

BTW, if you use OST pools, be aware that running writeconf erases all pools information on the MGS (as well as any other parameters set via lctl conf_param).

Comment by Dennis Nelson [ 10/Dec/11 ]

OK, I finally got time scheduled to do the writeconf:

[root@lfs-mds-1-1 ~]# pdsh -a modprobe lustre
[root@lfs-mds-1-1 ~]# lctl set_param subsystem_debug=mgs
lnet.subsystem_debug=mgs
[root@lfs-mds-1-1 ~]# lctl set_param debug=+config+info
lnet.debug=+config+info
error: set_param: writing to file /proc/sys/lnet/debug: Invalid argument
[root@lfs-mds-1-1 ~]# lctl set_param debug=+config
lnet.debug=+config
[root@lfs-mds-1-1 ~]# lctl set_param debug=+info
lnet.debug=+info
[root@lfs-mds-1-1 ~]# lctl get_param debug
lnet.debug=info ioctl neterror warning error emerg ha config console
[root@lfs-mds-1-1 ~]# lctl set_param debug_mb=500
lnet.debug_mb=500
[root@lfs-mds-1-1 ~]# vgchange -ay vg_scratch1
2 logical volume(s) in volume group "vg_scratch1" now active
[root@lfs-mds-1-1 ~]# mount -t lustre -o writeconf /dev/vg_scratch1/mgs /lustre/mgs
[root@lfs-mds-1-1 ~]# mount -t lustre -o writeconf /dev/vg_scratch1/mdt /lustre/scratch1/mdt
[root@lfs-mds-1-1 ~]# lctl dk > /tmp/log1
Ran script to mount all 176 OSTs with -o writeconf
[root@lfs-mds-1-1 ~]# lctl dk > /tmp/log2
[root@lfs-mds-1-1 ~]# pdsh -a umount -at lustre
[root@lfs-mds-1-1 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 16246428 3437164 11970676 23% /
/dev/mapper/VolGroup00-LogVol05
1857406056 200344 1761333100 1% /scratch
/dev/mapper/VolGroup00-LogVol04
4062912 139396 3713804 4% /home
/dev/mapper/VolGroup00-LogVol03
4062912 196308 3656892 6% /tmp
/dev/mapper/VolGroup00-LogVol02
8125880 1155932 6550520 15% /var
tmpfs 37020324 0 37020324 0% /dev/shm
[root@lfs-mds-1-1 ~]# mount -t ldiskfs /dev/vg_scratch1/mgs /lustre/mgs
[root@lfs-mds-1-1 ~]# llog_reader /lustre/mgs/CONFIGS/scratch1-client > log.client

After the writeconf, failover now works (LU-890), but I still cannot mount on the fe2 client over ib2. I am working on getting the log files; transferring files through the customer's bastion host is difficult.

Comment by Dennis Nelson [ 10/Dec/11 ]

[root@fe1 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib0(ib2)"
[root@fe1 ~]# modprobe lustre
[root@fe1 ~]# lctl ping 10.174.80.40@o2ib2
failed to ping 10.174.80.40@o2ib2: Input/output error
[root@fe1 ~]# ping 10.174.80.40
PING 10.174.80.40 (10.174.80.40) 56(84) bytes of data.
64 bytes from 10.174.80.40: icmp_seq=1 ttl=64 time=3.53 ms
64 bytes from 10.174.80.40: icmp_seq=2 ttl=64 time=0.162 ms
^C
--- 10.174.80.40 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1766ms
rtt min/avg/max/mdev = 0.162/1.850/3.538/1.688 ms
[root@fe1 ~]# vi /etc/modprobe.d/lustre.conf
[root@fe1 ~]# lustre_rmmod
[root@fe1 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib0(ib0)"
[root@fe1 ~]# modprobe lustre
[root@fe1 ~]# mount -t lustre 10.174.31.241@o2ib0:10.174.31.251@o2ib0:/scratch1 /mnt/lustre1
[root@fe1 ~]# ifconfig ib2
Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib2 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.174.81.10 Bcast:10.174.95.255 Mask:255.255.240.0
inet6 addr: fe80::202:c903:f:aa67/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:502712 errors:0 dropped:0 overruns:0 frame:0
TX packets:7348 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:41049681 (39.1 MiB) TX bytes:1246560 (1.1 MiB)

[root@fe1 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda6 972826992 401657664 522399748 44% /
tmpfs 49532484 0 49532484 0% /dev/shm
/dev/sda5 142335 45735 89251 34% /boot
10.181.1.2:/apps/v1 504889344 40024672 464864672 8% /apps
10.181.1.2:/contrib 137625600 2996448 134629152 3% /contrib
10.181.1.2:/home 4315676672 458082464 3857594208 11% /home
10.174.31.241@o2ib0:10.174.31.251@o2ib0:/scratch1
2688660012544 14582672096 2647162648864 1% /mnt/lustre1

Comment by Dennis Nelson [ 12/Dec/11 ]

After some more testing, I think we have also resolved the issue mounting the filesystems on the login clients.

I had tried with the following in /etc/modprobe.d/lustre.conf:
options lnet networks="o2ib0(ib2)"

What I have found is that if I set it to this:
options lnet networks="o2ib2(ib2)"

The mounts succeed.
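For reference, a minimal sketch of why the network name matters, using the values already shown for this login node: the o2ibN label in the client's lnet 'networks' option has to match the network portion of the NIDs in the mount specification.

# /etc/modprobe.d/lustre.conf on the login node:
options lnet networks="o2ib2(ib2)"      # interface ib2 joins LNet network o2ib2

# /etc/fstab entry; the MGS NIDs are on that same o2ib2 network:
10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /mnt/lustre1 lustre defaults,flock 0 0

# With networks="o2ib0(ib2)" the client's NID would be placed on o2ib0 instead, which
# does not match the @o2ib2 NIDs above, so the mount fails even though plain IP
# connectivity over ib2 works.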

It appears that I have 1 problem remaining:

I cannot mount the production filesystems on the TDS (Test and Development System) clients:

The TDS clients mount the TDS Lustre filesystems over the client ib1 ports. They are supposed to mount the production filesystems over ib0. The ib2 ports of the production lustre servers are connected into the ib0 fabric of the TDS cluster.

TDS Client (r1i3n15):
ib0 inet addr:10.174.64.65 Bcast:10.174.79.255 Mask:255.255.240.0
ib1 inet addr:10.174.96.65 Bcast:10.174.111.255 Mask:255.255.240.0

[root@r1i3n15 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib0(ib1), o2ib1(ib0)"

[root@r1i3n15 ~]# lctl list_nids
10.174.96.65@o2ib
10.174.64.65@o2ib1
[root@r1i3n15 ~]# lctl ping 10.174.79.241@o2ib1
12345-0@lo
12345-10.174.31.241@o2ib
12345-10.174.79.241@o2ib1
12345-10.174.80.40@o2ib2

[root@r1i3n15 ~]# cat /etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
...
10.174.96.138@o2ib:/lustre1 /mnt/tds_lustre1 lustre defaults,flock 0 0
10.174.96.138@o2ib:/lustre2 /mnt/tds_lustre2 lustre defaults,flock 0 0
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1 /mnt/lsc_lustre1 lustre defaults,flock 0 0
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch2 /mnt/lsc_lustre2 lustre defaults,flock 0 0

When I attempt to mount, the mount command simply hangs, with very little in the client log file:

Dec 12 16:26:28 r1i3n15 kernel: Lustre: MGC10.174.79.241@o2ib1: Reactivating import
Dec 12 16:29:35 r1i3n15 kernel: Lustre: 4374:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880338c74c00: tried all connections, increasing latency to 7s
Dec 12 16:29:35 r1i3n15 kernel: Lustre: 4374:0:(import.c:517:import_select_connection()) Skipped 7 previous similar messages

It appears to be communicating with the server, because when I initially (inadvertently) used the wrong filesystem name, I got this message:

Dec 12 15:52:43 r1i3n15 kernel: Lustre: MGC10.174.79.241@o2ib1: Reactivating import
Dec 12 15:52:43 r1i3n15 kernel: LustreError: 156-2: The client profile 'lustre1-client' could not be read from the MGS. Does that filesystem exist?

It correctly responded that the lustre1-client profile does not exist.

Comment by Cliff White (Inactive) [ 12/Dec/11 ]

Since you have resolved the initial issue and this new problem is on a different set of servers, please close this bug and open a new bug for the new issue.
We'll need to see the LNet config for the servers on the lustre1 and lustre2 filesystems.

Comment by Dennis Nelson [ 12/Dec/11 ]

I'll be glad to open a new ticket for this but it is the same set of servers and it was referenced in my initial post.

Comment by Cliff White (Inactive) [ 12/Dec/11 ]

I am sorry I am a bit confused as to which servers are which.
These messages are telling you two different things:
---------
Dec 12 16:26:28 r1i3n15 kernel: Lustre: MGC10.174.79.241@o2ib1: Reactivating import
Dec 12 16:29:35 r1i3n15 kernel: Lustre: 4374:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880338c74c00: tried all connections, increasing latency to 7s
Dec 12 16:29:35 r1i3n15 kernel: Lustre: 4374:0:(import.c:517:import_select_connection()) Skipped 7 previous similar messages
------------
This is telling you the MDS cannot be located

Dec 12 15:52:43 r1i3n15 kernel: Lustre: MGC10.174.79.241@o2ib1: Reactivating import
Dec 12 15:52:43 r1i3n15 kernel: LustreError: 156-2: The client profile 'lustre1-client' could not be read from the MGS.
This is referring to the MGS. Since the lustre1-client config is missing, this client isn't trying to find the MDS. Same server, but not the same service.
-------------
Can we get a new tunefs.lustre --print for scratch1 MGT and MDT, now that you've done the writeconf?

Comment by Dennis Nelson [ 12/Dec/11 ]

I understand the confusion. As I said, it is a very complex configuration; I think it is going to be a nightmare to support.

I threw in the error message about lustre1-client simply because I had made a mistake and put lustre1 instead of scratch1. To me, this indicates that it is communicating with the MGS, since it was able to tell me that the lustre1-client profile does not exist. When I use the right filesystem name, the mount just hangs.

Here is the tunefs.lustre --print information after the writeconf.

[root@lfs-mds-1-2 ~]# tunefs.lustre --print /dev/vg_scratch1/mdt
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

Permanent disk data:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

exiting before disk write.

Comment by Johann Lombardi (Inactive) [ 12/Dec/11 ]

I looked at the new configuration log (i.e. file log.client) and the nid setup now looks fine:

#06 (088)add_uuid nid=10.174.31.241@o2ib(0x500000aae1ff1) 0: 1:10.174.31.241@o2ib
#07 (088)add_uuid nid=10.174.79.241@o2ib1(0x500010aae4ff1) 0: 1:10.174.31.241@o2ib
#08 (088)add_uuid nid=10.174.80.40@o2ib2(0x500020aae5028) 0: 1:10.174.31.241@o2ib
#09 (136)attach 0:scratch1-MDT0000-mdc 1:mdc 2:scratch1-MDT0000-mdc_UUID
#10 (144)setup 0:scratch1-MDT0000-mdc 1:scratch1-MDT0000_UUID 2:10.174.31.241@o2ib

while in the previous file:

#06 (088)add_uuid nid=10.174.31.241@o2ib(0x500000aae1ff1) 0: 1:10.174.31.241@o2ib
#07 (136)attach 0:scratch1-MDT0000-mdc 1:mdc 2:scratch1-MDT0000-mdc_UUID
#08 (144)setup 0:scratch1-MDT0000-mdc 1:scratch1-MDT0000_UUID 2:10.174.31.241@o2ib
#09 (088)add_uuid nid=10.174.31.241@o2ib(0x500000aae1ff1) 0: 1:10.174.31.241@o2ib
#10 (088)add_uuid nid=10.174.79.241@o2ib(0x500000aae4ff1) 0: 1:10.174.31.241@o2ib
#11 (088)add_uuid nid=10.174.80.40@o2ib(0x500000aae5028) 0: 1:10.174.31.241@o2ib

While the mount hangs, could you please try to collect import information (lctl get_param mdc.*.import) to check what nid we try to access?

Comment by Dennis Nelson [ 12/Dec/11 ]

Note: The filesystems listed as lustre1 and lustre2 are the TDS Lustre filesystems, not production. They use different servers. The problem filesystems are scratch1 and scratch2.

[root@r1i3n15 ~]# lctl get_param mdc.*.import
mdc.lustre1-MDT0000-mdc-ffff88033cea4000.import=
import:
name: lustre1-MDT0000-mdc-ffff88033cea4000
target: lustre1-MDT0000_UUID
state: FULL
connect_flags: [version, inode_bit_locks, join_file, getattr_by_fid, no_oh_for_devices, early_lock_cancel, adaptive_timeouts, version_recovery, pools]
import_flags: [replayable, pingable]
connection:
failover_nids: [10.174.96.138@o2ib]
current_connection: 10.174.96.138@o2ib
connection_attempts: 1
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 0
unregistering: 0
timeouts: 0
avg_waittime: 235 usec
service_estimates:
services: 1 sec
network: 1 sec
transactions:
last_replay: 0
peer_committed: 0
last_checked: 0
mdc.lustre2-MDT0000-mdc-ffff8803390ee800.import=
import:
name: lustre2-MDT0000-mdc-ffff8803390ee800
target: lustre2-MDT0000_UUID
state: FULL
connect_flags: [version, inode_bit_locks, join_file, getattr_by_fid, no_oh_for_devices, early_lock_cancel, adaptive_timeouts, version_recovery, pools]
import_flags: [replayable, pingable]
connection:
failover_nids: [10.174.96.138@o2ib]
current_connection: 10.174.96.138@o2ib
connection_attempts: 1
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 0
unregistering: 0
timeouts: 0
avg_waittime: 220 usec
service_estimates:
services: 1 sec
network: 1 sec
transactions:
last_replay: 0
peer_committed: 0
last_checked: 0
mdc.scratch1-MDT0000-mdc-ffff88033b4f1400.import=
import:
name: scratch1-MDT0000-mdc-ffff88033b4f1400
target: scratch1-MDT0000_UUID
state: DISCONN
connect_flags: [version, acl, inode_bit_locks, join_file, getattr_by_fid, no_oh_for_devices, early_lock_cancel, adaptive_timeouts, lru_resize, fid_is_enabled, version_recovery, pools, 64bithash]
import_flags: [replayable]
connection:
failover_nids: [10.174.31.241@o2ib, 10.174.31.251@o2ib]
current_connection: 10.174.31.241@o2ib
connection_attempts: 1
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 1
unregistering: 0
timeouts: 1
avg_waittime: 0 usec
service_estimates:
services: 5 sec
network: 0 sec
transactions:
last_replay: 0
peer_committed: 0
last_checked: 0

I notice that it says that it is using 10.174.31.241@o2ib. Currently, the MGS and the MDT are mounted on the other node (10.174.31.251). Also, it just says o2ib, not o2ib1? Previously I would not have worried about that, but it seemed to make a difference on the production login clients (on the production clients, I tried defining the NID as o2ib0 and they would not mount, yet they did mount when the NID was defined as o2ib2).

Comment by Johann Lombardi (Inactive) [ 12/Dec/11 ]

So the failover_nids list looks good. The client tries to reach the MDS via 10.174.31.241@o2ib, and it should then try through 10.174.31.251@o2ib.
As for o2ib/o2ib2, I think it only matters in the lnet module configuration, but I will have to ask Liang to confirm.

Can you successfully ping 10.174.31.251@o2ib from r1i3n15?

Comment by Dennis Nelson [ 12/Dec/11 ]

Sorry, I made a mistake. That is the problem. It is trying to connect through the wrong nid. Copying the original data from above:

[root@r1i3n15 ~]# lctl list_nids
10.174.96.65@o2ib
10.174.64.65@o2ib1
[root@r1i3n15 ~]# lctl ping 10.174.79.241@o2ib1
12345-0@lo
12345-10.174.31.241@o2ib
12345-10.174.79.241@o2ib1
12345-10.174.80.40@o2ib2

[root@r1i3n15 ~]# cat /etc/fstab

<file system> <mount point> <type> <options> <dump> <pass>
...
10.174.96.138@o2ib:/lustre1 /mnt/tds_lustre1 lustre defaults,flock 0 0
10.174.96.138@o2ib:/lustre2 /mnt/tds_lustre2 lustre defaults,flock 0 0
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1 /mnt/lsc_lustre1 lustre defaults,flock 0 0
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch2 /mnt/lsc_lustre2 lustre defaults,flock 0 0

The only path to these servers from this client is through the 10.174.79.xxx addresses.

Why is it trying 10.174.31.xxx? There is no route for that subnet on these clients:

[root@r1i3n15 ~]# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.181.1.0 10.174.64.67 255.255.255.0 UG 0 0 0 ib0
192.168.159.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
10.174.64.0 0.0.0.0 255.255.240.0 U 0 0 0 ib0
10.174.96.0 0.0.0.0 255.255.240.0 U 0 0 0 ib1
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ib1
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0

Comment by Johann Lombardi (Inactive) [ 13/Dec/11 ]

I have no idea why lnet selected 10.174.31.241@o2ib/10.174.31.251@o2ib instead of 10.174.79.241@o2ib/10.174.79.251@o2ib.
Liang, any idea?

Comment by Liang Zhen (Inactive) [ 13/Dec/11 ]

Sorry I'm a little confused about this setting, and have a few questions:

  • r1i3n15 has two IB networks: 10.174.96.65@o2ib and 10.174.64.65@o2ib1, I think 10.174.96.65@o2ib is supposed to connect to production filesystem and 10.174.64.65@o2ib1 is supposed to connect to TDS filesystem, right?
  • 12345-10.174.31.241@o2ib and 12345-10.174.79.241@o2ib1 are NIDs of TDS MGS right?
  • I know r1i3n15 can lctl ping 10.174.79.241@o2ib1 (second NID of TDS MGS), my question is: are you able to lctl ping 10.174.31.241@o2ib (first NID of TDS MGS), also, can you use regular ping to reach 10.174.31.241 from rli3n15?
  • can you mount TDS filesystem if you disable this NID (10.174.96.65@o2ib) on rli3n15?

Thanks
Liang

Comment by Dennis Nelson [ 14/Dec/11 ]

No, these clients cannot lctl ping, or ping, the 10.174.31.241 address. That NID exists on the servers to support the scratch1 filesystem for the production clients.

Yes, r1i3n15 can mount the TDS filesystem.

Comment by Dennis Nelson [ 14/Dec/11 ]

I realized that I did not answer one question. There is only one MDS on the TDS filesystem and it has only one nid:

[root@mds01 ~]# lctl list_nids
10.174.96.138@o2ib

[root@r1i3n15 ~]# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.181.1.0 10.174.64.67 255.255.255.0 UG 0 0 0 ib0
192.168.159.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
10.174.64.0 0.0.0.0 255.255.240.0 U 0 0 0 ib0
10.174.96.0 0.0.0.0 255.255.240.0 U 0 0 0 ib1
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ib1
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0

As you can see, there is no route to 10.174.31.241.

[root@r1i3n15 ~]# ping 10.174.31.241
connect: Network is unreachable

[root@r1i3n15 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib(ib1)"

[root@r1i3n15 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 153600 60 153540 1% /tmp
...
10.174.96.138@o2ib:/lustre1
30523086656 2401014176 27816223672 8% /mnt/tds_lustre1
10.174.96.138@o2ib:/lustre2
30523086656 268760500 29948475720 1% /mnt/tds_lustre2

If I unmount the TDS filesystems and change the modprobe.d/lustre.conf file to only include the ib0 port:

[root@r1i3n15 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre module configuration file
options lnet networks="o2ib(ib0)"

I cannot communicate with the MDS. I get this error:
[root@r1i3n15 ~]# lctl list_nids
10.174.64.65@o2ib
[root@r1i3n15 ~]# cat /etc/fstab
...
10.174.79.241@o2ib:10.174.79.251@o2ib:/scratch1 /mnt/lsc_lustre1 lustre defaults,flock 0 0
10.174.79.242@o2ib:10.174.79.252@o2ib:/scratch2 /mnt/lsc_lustre2 lustre defaults,flock 0 0

[root@r1i3n15 ~]# mount -at lustre
mount.lustre: mount 10.174.79.241@o2ib:10.174.79.251@o2ib:/scratch1 at /mnt/lsc_lustre1 failed: Cannot send after transport endpoint shutdown
mount.lustre: mount 10.174.79.242@o2ib:10.174.79.252@o2ib:/scratch2 at /mnt/lsc_lustre2 failed: Cannot send after transport endpoint shutdown

Dec 14 12:17:23 r1i3n15 kernel: Lustre: 27084:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1388172991791113 sent from MGC10.174.79.241@o2ib to NID 10.174.79.241@o2ib 0s ago has failed due to network error (5s prior to deadline).
Dec 14 12:17:23 r1i3n15 kernel: req@ffff880639bf3400 x1388172991791113/t0 o250->MGS@MGC10.174.79.241@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1323865048 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 14 12:17:23 r1i3n15 kernel: LustreError: 1280:0:(o2iblnd_cb.c:2532:kiblnd_rejected()) 10.174.79.241@o2ib rejected: o2iblnd fatal error
Dec 14 12:17:48 r1i3n15 kernel: Lustre: 27084:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1388172991791115 sent from MGC10.174.79.241@o2ib to NID 10.174.79.251@o2ib 0s ago has failed due to network error (5s prior to deadline).
Dec 14 12:17:48 r1i3n15 kernel: req@ffff880639347c00 x1388172991791115/t0 o250->MGS@MGC10.174.79.241@o2ib_1:26/25 lens 368/584 e 0 to 1 dl 1323865073 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 14 12:17:48 r1i3n15 kernel: LustreError: 27292:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff880639bf3000 x1388172991791116/t0 o501->MGS@MGC10.174.79.241@o2ib_1:26/25 lens 264/432 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Dec 14 12:17:48 r1i3n15 kernel: LustreError: 1280:0:(o2iblnd_cb.c:2532:kiblnd_rejected()) 10.174.79.251@o2ib rejected: o2iblnd fatal error
Dec 14 12:17:48 r1i3n15 kernel: LustreError: 15c-8: MGC10.174.79.241@o2ib: The configuration from log 'scratch1-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Dec 14 12:17:48 r1i3n15 kernel: LustreError: 27292:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -108
Dec 14 12:17:48 r1i3n15 kernel: Lustre: client ffff8803386abc00 umount complete
Dec 14 12:17:48 r1i3n15 kernel: LustreError: 27292:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-108)
Dec 14 12:18:13 r1i3n15 kernel: Lustre: 27084:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1388172991791119 sent from MGC10.174.79.242@o2ib to NID 10.174.79.252@o2ib 0s ago has failed due to network error (5s prior to deadline).
Dec 14 12:18:13 r1i3n15 kernel: req@ffff88063be32400 x1388172991791119/t0 o250->MGS@MGC10.174.79.242@o2ib_1:26/25 lens 368/584 e 0 to 1 dl 1323865098 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 14 12:18:13 r1i3n15 kernel: Lustre: 27084:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Dec 14 12:18:13 r1i3n15 kernel: LustreError: 27343:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8803247b2000 x1388172991791120/t0 o501->MGS@MGC10.174.79.242@o2ib_1:26/25 lens 264/432 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Dec 14 12:18:13 r1i3n15 kernel: LustreError: 15c-8: MGC10.174.79.242@o2ib: The configuration from log 'scratch2-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Dec 14 12:18:13 r1i3n15 kernel: LustreError: 27343:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -108
Dec 14 12:18:13 r1i3n15 kernel: LustreError: 1293:0:(o2iblnd_cb.c:2532:kiblnd_rejected()) 10.174.79.252@o2ib rejected: o2iblnd fatal error
Dec 14 12:18:13 r1i3n15 kernel: Lustre: client ffff880322a0c400 umount complete
Dec 14 12:18:13 r1i3n15 kernel: LustreError: 27343:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-108)
Dec 14 12:18:13 r1i3n15 kernel: LustreError: 1293:0:(o2iblnd_cb.c:2532:kiblnd_rejected()) Skipped 1 previous similar message

From what I can see, there is no indication of a network problem:

[root@lfs-mds-1-1 ~]# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x0002c9030010c5f4
System image GUID: 0x0002c9030010c5f7
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 152
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030010c5f5
Link layer: IB
Port 2:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 5232
LMC: 0
SM lid: 2
Capability mask: 0x02510868
Port GUID: 0x0002c9030010c5f6
Link layer: IB
CA 'mlx4_1'
CA type: MT26428
Number of ports: 2
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x0002c9030010c6b0
System image GUID: 0x0002c9030010c6b3
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 104
LMC: 0
SM lid: 106
Capability mask: 0x02510868
Port GUID: 0x0002c9030010c6b1
Link layer: IB
Port 2:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 64
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030010c6b2
Link layer: IB
[root@lfs-mds-1-1 ~]# ibping -C mlx4_1 -P 1 -S

[root@r1i3n15 ~]# ibping -G 0x0002c9030010c6b1
Pong from lfs-mds-1-1.(none) (Lid 104): time 0.156 ms
Pong from lfs-mds-1-1.(none) (Lid 104): time 0.154 ms
Pong from lfs-mds-1-1.(none) (Lid 104): time 0.137 ms
Pong from lfs-mds-1-1.(none) (Lid 104): time 0.134 ms
Pong from lfs-mds-1-1.(none) (Lid 104): time 0.056 ms
Pong from lfs-mds-1-1.(none) (Lid 104): time 0.131 ms
^C
--- lfs-mds-1-1.(none) (Lid 104) ibping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5123 ms
rtt min/avg/max = 0.056/0.128/0.156 ms

Yet, lctl ping fails:

[root@r1i3n15 ~]# lctl ping 10.174.79.241@o2ib
failed to ping 10.174.79.241@o2ib: Input/output error

If I go back to the original configuration:

[root@r1i3n15 ~]# cat /etc/modprobe.d/lustre.conf

# Lustre module configuration file
options lnet networks="o2ib(ib1), o2ib1(ib0)"

[root@r1i3n15 ~]# lctl list_nids
10.174.96.65@o2ib
10.174.64.65@o2ib1

[root@r1i3n15 ~]# cat /etc/fstab

# <file system> <mount point> <type> <options> <dump> <pass>
...
10.174.96.138@o2ib:/lustre1 /mnt/tds_lustre1 lustre defaults,flock 0 0
10.174.96.138@o2ib:/lustre2 /mnt/tds_lustre2 lustre defaults,flock 0 0
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1 /mnt/lsc_lustre1 lustre defaults,flock 0 0
10.174.79.242@o2ib1:10.174.79.252@o2ib1:/scratch2 /mnt/lsc_lustre2 lustre defaults,flock 0 0

The TDS filesystems (lustre1, lustre2) mount, but mounting the production filesystems (scratch1, scratch2) just hangs.

Comment by Liang Zhen (Inactive) [ 14/Dec/11 ]

Here is my understanding of your setup; please correct me if I am wrong:

 
client                      TDS MDS                    Production MDS
---------                   ---------                  -------
r1i3n15                     mds01                      lfs-mds-1-1 (scratch1)
10.174.96.65@o2ib0(ib1)     10.174.96.138@o2ib0 [y]    10.174.31.241@o2ib0 [n]
10.174.64.65@o2ib1(ib0)                                10.174.79.241@o2ib1 [y]

[y] == [yes], means we can reach that NID via "lctl ping" from r1i3n15
[n] == [no],  means we cannot reach that NID via "lctl ping" from r1i3n15

So between r1i3n15 and lfs-mds-1-1:

  • 10.174.64.65@o2ib1(ib0) and 10.174.79.241@o2ib1 are on the same LNet network, and they are physically reachable to each other
  • 10.174.96.65@o2ib0(ib1) and 10.174.31.241@o2ib0 are on the same LNet network, but they are physically unreachable to each other

I think that when you try to mount scratch1 from r1i3n15, the client first looks at all the NIDs of lfs-mds-1-1. It finds that both it and lfs-mds-1-1 have two local NIDs, on o2ib0 and o2ib1 (even though they cannot reach each other on o2ib0); since the LNet hop count of the two NIDs is the same and both interfaces are healthy, ptlrpc chooses the first NID of lfs-mds-1-1, 10.174.31.241@o2ib0, which is actually unreachable from r1i3n15.
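
This reachability asymmetry can be checked from the client with "lctl ping" against each of the server's NIDs; a hedged sketch, with expected results following the [y]/[n] notes above:

# from r1i3n15, with the original networks="o2ib(ib1), o2ib1(ib0)" configuration loaded
lctl ping 10.174.31.241@o2ib     # expected to fail: no physical path to that fabric
lctl ping 10.174.79.241@o2ib1    # expected to succeed via ib0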

I would suggest trying this on r1i3n15 first:
options lnet networks="o2ib1(ib0)"

and then trying to mount scratch1 and scratch2. If that works, I would suggest using a configuration like this:

 
client                      TDS MDS                    Production MDS
---------                   ---------                  -------
r1i3n15                     mds01                      lfs-mds-1-1 (scratch1)
10.174.96.65@o2ib3(ib1)     10.174.96.138@o2ib3 [y]
10.174.64.65@o2ib1(ib0)                                10.174.79.241@o2ib1 [y]
                                                       10.174.31.241@o2ib0 [y]

The only change here is that o2ib0 on r1i3n15 and mds01 is replaced by o2ib3. Of course, if it works you will have to change all nodes on the TDS network to o2ib3...
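
For what it's worth, a networks= change like this only takes effect after the LNet/Lustre modules are reloaded; one possible sequence on the client (a sketch, the exact steps may vary by site):

umount -at lustre               # unmount all Lustre filesystems first
lustre_rmmod                    # unload the lustre/lnet modules so lustre.conf is reread
modprobe lustre
lctl network up                 # bring LNet back up with the new network numbers
lctl list_nids                  # confirm the local NIDs
lctl ping 10.174.79.241@o2ib1   # verify the server is reachable before mounting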

Comment by Dennis Nelson [ 14/Dec/11 ]

OK, I tried the following:

[root@r1i3n15 ~]# cat /etc/modprobe.d/lustre.conf

# Lustre module configuration file
options lnet networks="o2ib3(ib1), o2ib1(ib0)"

[root@r1i3n15 ~]# lctl list_nids
10.174.96.65@o2ib3
10.174.64.65@o2ib1

[root@r1i3n15 ~]# cat /etc/fstab
...
10.174.96.138@o2ib3:/lustre1 /mnt/tds_lustre1 lustre defaults,flock 0 0
10.174.96.138@o2ib3:/lustre2 /mnt/tds_lustre2 lustre defaults,flock 0 0
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1 /mnt/lsc_lustre1 lustre defaults,flock 0 0
10.174.79.242@o2ib1:10.174.79.252@o2ib1:/scratch2 /mnt/lsc_lustre2 lustre defaults,flock 0 0

Now, the production filesystems (scratch1, scratch2) mount and the TDS filesystems fail to mount.

[root@r1i3n15 ~]# mount -at lustre
mount.lustre: mount 10.174.96.138@o2ib3:/lustre1 at /mnt/tds_lustre1 failed: Cannot send after transport endpoint shutdown
mount.lustre: mount 10.174.96.138@o2ib3:/lustre2 at /mnt/tds_lustre2 failed: File exists
[root@r1i3n15 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 153600 1708 151892 2% /tmp
10.181.1.2:/contrib 137625600 3002528 134623072 3% /contrib
10.181.1.2:/testapps/v1
45875200 35991488 9883712 79% /apps
10.181.1.2:/testhome 550764544 166799968 383964576 31% /home
10.174.79.241@o2ib1:10.174.79.251@o2ib1:/scratch1
2688660012544 29627611556 2632114228424 2% /mnt/lsc_lustre1
10.174.79.242@o2ib1:10.174.79.252@o2ib1:/scratch2
3360825015680 785492156 3326396150596 1% /mnt/lsc_lustre2

Comment by Liang Zhen (Inactive) [ 14/Dec/11 ]

Have you also changed the MDS/MGS and the other servers in the TDS filesystem to o2ib3 as well (i.e. mds01)? Since you are using o2ib3 as the TDS network number, all clients and servers on the TDS network need to use that network number (o2ib3).
Also, using "lctl ping" to verify that the network is reachable is always a good idea.
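
For illustration, the matching change on the TDS servers would renumber their LNet network the same way; a sketch for mds01, assuming its single NID lives on ib0 (the interface name is not shown in this ticket):

# /etc/modprobe.d/lustre.conf on mds01 (ib0 is an assumption)
options lnet networks="o2ib3(ib0)"

After renumbering, the server NIDs recorded in the configuration logs no longer match, which is typically why a writeconf is needed before the targets will mount again (see the following comments).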

Comment by Dennis Nelson [ 14/Dec/11 ]

Ah, no. I will have to schedule some time with the customer to do that. I have one node that is not currently in the job queue that I can use for testing. To take the whole filesystem down, I will have to schedule it.

I will get that scheduled today.

Comment by Dennis Nelson [ 14/Dec/11 ]

I made the change on the TDS servers and had to perform a writeconf in order to get it mounted up again. Everything seems to be working now.
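
For reference, a writeconf regenerates the filesystem's configuration logs so they record the new NIDs; a rough sketch of the 1.8-era procedure (device paths are placeholders, and all clients and targets must be unmounted first):

# on the MDS, with everything unmounted (device path is a placeholder)
tunefs.lustre --writeconf /dev/<mdt_device>
# on each OSS, for every OST (device path is a placeholder)
tunefs.lustre --writeconf /dev/<ost_device>
# then remount the MDT first, followed by the OSTs, then the clients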

Thank you very much for all of your help!

Comment by Peter Jones [ 14/Dec/11 ]

Dennis

Thanks for the update. So can we close both this ticket and LU-890?

Peter

Comment by Dennis Nelson [ 14/Dec/11 ]

Yes. I already suggested that LU-890 be closed, and Cliff closed it. This one can be closed as well.

Comment by Peter Jones [ 14/Dec/11 ]

Great - thanks Dennis!
