[LU-890] MDS Failover Issue - Clients not reconnecting after MGT/MDT fail over to other MDS. Created: 02/Dec/11 Updated: 12/Dec/11 Resolved: 12/Dec/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5) |
| Type: | Bug | Priority: | Major |
| Reporter: | Dennis Nelson | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
This is a rather complex configuration. It consists of two large Lustre filesystems. scratch1 is comprised of 2 MDS, 16 OSS, and 4 DDN SFA 10K storage arrays. scratch2 is comprised of 2 MDS, 20 OSS, and 5 DDN SFA 10K storage arrays. The Lustre servers all have 4 IB ports for client access to the filesystems. The compute nodes access scratch1 via their ib0 port (ib0 on the Lustre servers). They access scratch2 via ib1 (also ib1 on the servers). The various login nodes of the cluster access both scratch1 and scratch2 through their ib2 port (also ib2 on the servers). Finally, ib3 is for access to the production filesystems from clients in a test cluster. The servers are running CentOS 5.5 (2.6.18-238.12.1.el5). |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6512 |
| Description |
|
The production compute nodes and login nodes can access both filesystems when the MGT/MDT is running on the primary MDS of scratch1. When the MGT and MDT are failed over to the backup MDS, the clients fail to reconnect.

The basic configuration information is as follows. The primary MDS for scratch1 is named lfs-mds-1-1 and the secondary MDS is named lfs-mds-1-2.

lfs-mds-1-1:
[root@lfs-mds-1-1 config]# lctl list_nids

lfs-mds-1-2:
[root@lfs-mds-1-2 ~]# lctl list_nids

r1i0n0 (compute node) config: /etc/modprobe.d/lustre.conf
[root@r1i0n0 ~]# lctl list_nids
[root@r1i0n0 ~]# lctl ping 10.174.31.241@o2ib

fe1 (login node): /etc/modprobe.d/lustre.conf
[root@fe1 ~]# lctl list_nids
[root@fe1 ~]# lctl ping 10.174.80.40@o2ib2

[root@lfs-mds-1-1 ~]# tunefs.lustre --dryrun /dev/vg_scratch1/mdt
Read previous values:
Permanent disk data:
exiting before disk write.

After failing over the MGT and MDT to the backup MDS (lfs-mds-1-2), it appears to have never started recovery:
[root@lfs-mds-1-2 lustre]# cat

Once I moved the MGT and MDT back to the original system, the client reconnected again in less than a minute:
[root@lfs-mds-1-1 ~]# cat

The log file on fe1 showed this:

The log files on lfs-mds-1-1 and lfs-mds-1-2 are void of any useful data. |
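For context, on Lustre 1.8 the failover NIDs stored on the target and the post-failover recovery state can be checked roughly as follows; the device path is the one quoted in this report, while the /proc path is the usual 1.8 location and is an assumption about the truncated cat command above:

    # read-only dump of the parameters written on the MDT (mgsnode, failover.node, etc.)
    tunefs.lustre --dryrun /dev/vg_scratch1/mdt

    # after failing over, check whether the MDT has entered or completed recovery
    cat /proc/fs/lustre/mds/scratch1-MDT0000/recovery_status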
| Comments |
| Comment by Peter Jones [ 02/Dec/11 ] |
|
Hongchao, could you please comment on this one? Thanks, Peter |
| Comment by Dennis Nelson [ 05/Dec/11 ] |
|
Peter, any word on this? I have not heard back from the assigned engineer. Thanks. |
| Comment by Dennis Nelson [ 05/Dec/11 ] |
|
Another data point: I just tested MDS failover on scratch2 and it worked as expected. I cannot see any differences between the scratch1 configuration and the scratch2 configuration:

[root@lfs-mds-2-1 ~]# cat /etc/modprobe.d/lustre.conf
[root@lfs-mds-2-2 ~]# lctl list_nids
[root@lfs-mds-2-1 ~]# tunefs.lustre --dryrun /dev/vg_scratch2/mdt
Read previous values:
Permanent disk data:
exiting before disk write. |
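One quick way to compare the two filesystems from a client is to confirm that both the primary and the backup MDS NIDs are reachable on the expected o2ib network; a minimal sketch with placeholder NIDs, since the real outputs are not shown above:

    # on a client, list the local NIDs and ping both MDS nodes of scratch1
    lctl list_nids
    lctl ping <primary-mds-nid>@o2ib     # placeholder NID
    lctl ping <backup-mds-nid>@o2ib      # placeholder NID; must succeed for failover to work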
| Comment by Hongchao Zhang [ 06/Dec/11 ] |
|
The problem could be related to the client config log. I am investigating it currently and will ask you to provide more information if needed. |
| Comment by Hongchao Zhang [ 06/Dec/11 ] |
|
According to the log, the clients don't know there is a failover node for the MDT, since they never try to connect to lfs-mds-1-2. Could you please remount one client and attach its debug logs? Thanks in advance. |
| Comment by Dennis Nelson [ 06/Dec/11 ] |
|
OK, I need some help with how to gather the debug logs. I assume that I have to change /proc/sys/lnet/debug to a different value, but I am not sure what that should be to capture the right information. Thanks. |
| Comment by Cliff White (Inactive) [ 06/Dec/11 ] |
|
Adding +trace should be enough, as in:
Then do the mount test, and
and attach.

I am confused, however, by one thing: you say the compute nodes connect via ib1 on the servers, but you do not have ib1 in your lnet config:
options lnet networks="o2ib0(ib0), o2ib1(ib2), o2ib2(ib3)"
And thus you don't show a 10.175.xx.xx address for your servers: |
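A minimal sketch of capturing the debug log for the mount attempt on 1.8, assuming the default debug mask plus trace (the exact commands intended in the comment above are not shown):

    echo +trace > /proc/sys/lnet/debug    # add trace to the current debug mask
    lctl clear                            # empty the kernel debug buffer
    mount /mnt/lustre1                    # the mount attempt under test
    lctl dk /tmp/lustre-mount-debug.log   # dump the buffer to a file for attaching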
| Comment by Dennis Nelson [ 06/Dec/11 ] |
|
I did not design this. The design specified that the compute nodes access scratch1 through ib0 and scratch2 through ib1. So you will see that the scratch1 servers use ib0, ib2 and ib3, while the scratch2 servers use ib1, ib2, and ib3. |
| Comment by Cliff White (Inactive) [ 06/Dec/11 ] |
|
Okay, understood. We will need the debug logs for the mount attempt. |
| Comment by Dennis Nelson [ 06/Dec/11 ] |
|
[root@fe2 ~]# modprobe lustre |
| Comment by Dennis Nelson [ 06/Dec/11 ] |
|
I added the attachment. The trace is only from mounting /mnt/lustre1. I performed another trace where I did a mount -at, mounting both filesystems, but it is a large file: 32 MB, and the JIRA interface says it has a 10 MB limit. I'll be glad to forward the larger file if you give me a way to send it. |
| Comment by Hongchao Zhang [ 07/Dec/11 ] |
|
The log is a little strange: there is no attach & setup info for the obd_device scratch1-MDT0000-mdc-ffff88180455bc00. Could you please retry getting the debug log on a clean node (meaning one that has never mounted Lustre) and without preloading the "lustre" module, then mount /mnt/lustre1? Thanks! |
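A sketch of the retest being asked for here, on a node that has never mounted Lustre and without preloading the lustre module; note that /proc/sys/lnet/debug only exists once the modules are loaded, so on a truly clean node the default debug mask is what gets captured:

    lustre_rmmod                              # make sure no Lustre/LNET modules are loaded
    mount /mnt/lustre1                        # the mount itself loads the modules
    lctl dk /tmp/clean-node-mount-debug.log   # dump the kernel debug buffer afterwards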
| Comment by Dennis Nelson [ 07/Dec/11 ] |
|
Sorry, I sent the trace for the other case but did not send the trace for dtn1. Here it is.
[root@dtn1 ~]# lustre_rmmod |
| Comment by Dennis Nelson [ 07/Dec/11 ] |
|
Please ignore the last entry and attachment. It was intended for |
| Comment by Dennis Nelson [ 07/Dec/11 ] |
|
I realized that I did not run the test exactly as you had asked. I did preload the Lustre modules. I performed the test again:
[root@fe2 ~]# lustre_rmmod |
| Comment by Hongchao Zhang [ 07/Dec/11 ] |
|
In the last debug log, the connection to the failover node of the MDT, 10.174.31.251, is added to the MDC, but it wasn't shown in the logs. |
| Comment by Hongchao Zhang [ 08/Dec/11 ] |
|
Just like the comment in the earlier update, after umount /mnt/mgs the config files are in the directory /mnt/mgs/CONFIGS/. Thanks. |
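A sketch of pulling the config logs off the MGS device by mounting it as ldiskfs; the device path below is an assumption based on this report, and the target must not be running as Lustre while it is mounted this way:

    # on the MDS currently owning the MGT, with the Lustre target stopped
    mount -t ldiskfs /dev/vg_scratch1/mdt /mnt/mgs   # device path assumed from this report
    ls -l /mnt/mgs/CONFIGS/                          # per-target and client config logs live here
    cp /mnt/mgs/CONFIGS/scratch1-client /tmp/        # e.g. copy the client log out for inspection
    umount /mnt/mgs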
| Comment by Dennis Nelson [ 08/Dec/11 ] |
|
Sorry for the delay. I had laptop issues. Here are the UUIDs. |
| Comment by Cliff White (Inactive) [ 08/Dec/11 ] |
|
I have been reviewing this again and wanted to add some clarification: server recovery will NOT start until the first client attempts a connection. |
| Comment by Dennis Nelson [ 08/Dec/11 ] |
|
OK, here is the server info once again:

lfs-mds-1-2:
Although configured with an IP address, the scratch1 filesystem does not use the ib1 fabric. On lfs-mds-2-x, the ib0 fabric is not used.
[root@lfs-mds-1-2 ~]# cat /etc/modprobe.d/lustre.conf
[root@lfs-mds-1-2 ~]# lctl list_nids

Client fe1 (login node):
Although the login nodes have a connection to the ib0 and ib1 fabrics (same as ib0 and ib1 on the Lustre servers), the design of the system was such that the login nodes should use the ib2 port (same fabric as ib3 on the Lustre servers) for mounting the Lustre filesystems. Having all three entries in the file might be an issue. I had difficulties making the mount work with just ib2 defined in modprobe.d/lustre.conf. This configuration allows the mounts to work, although scratch2 does take a while to mount (about 2.5 minutes).
[root@fe1 ~]# cat /etc/modprobe.d/lustre.conf
[root@fe1 ~]# lctl list_nids
[root@fe1 ~]# mount
[root@fe1 ~]# lctl ping 10.174.80.40@o2ib2

Client dtn1 (data transfer node):
ib0 inet addr:10.174.81.1 Bcast:10.174.95.255 Mask:255.255.240.0
[root@dtn1 ~]# cat /etc/modprobe.d/lustre.conf
[root@dtn1 ~]# lctl list_nids
[root@dtn1 ~]# lctl ping 10.174.80.40@o2ib2
[root@dtn1 ~]# mount /mnt/lustre1
[root@dtn1 ~]# cat /etc/fstab
[root@dtn1 ~]# mount

Did I miss anything that you wanted to see? |
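For comparison, a minimal client configuration for a login node that should reach the servers only over the ib2 fabric might look like the following; the network name and NIDs are placeholders based on the description above, not the actual site files:

    # /etc/modprobe.d/lustre.conf (hypothetical single-network variant)
    options lnet networks="o2ib2(ib2)"

    # /etc/fstab entry listing both MGS NIDs so the client knows the failover partner
    <primary-mgs-nid>@o2ib2:<backup-mgs-nid>@o2ib2:/scratch1  /mnt/lustre1  lustre  defaults,_netdev  0  0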
| Comment by Dennis Nelson [ 08/Dec/11 ] |
|
Hold on. I just noticed that we have a broadcast address problem. |
| Comment by Dennis Nelson [ 08/Dec/11 ] |
|
I am fixing the incorrect broadcast addresses. I'm not sure that will fix the issue, but they are wrong and need to be fixed. I'll report back with new info after that is completed. |
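For reference, a sketch of how an incorrect IB broadcast address would typically be corrected on these CentOS 5 nodes; the interface name is taken from the dtn1 output above and the steps are generic, not specific to this site:

    ifconfig ib0                                  # check the current address, broadcast and mask
    vi /etc/sysconfig/network-scripts/ifcfg-ib0   # set NETMASK/BROADCAST to match the subnet
    ifdown ib0 && ifup ib0                        # restart the interface with the corrected values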
| Comment by Dennis Nelson [ 12/Dec/11 ] |
|
I believe this can be declared resolved by the writeconf. I am curious, though, whether anyone has any insight into what might have gone wrong. We did not change any parameters, yet after the writeconf it now works. I still have a client connectivity issue being worked in |
| Comment by Cliff White (Inactive) [ 12/Dec/11 ] |
|
As Johann said, we think there was an issue with the initial creation of the config log; recreating it fixed the problem. We are also working on reproducing the issue in the lab. |
| Comment by Cliff White (Inactive) [ 12/Dec/11 ] |
|
Recreating the config logs with writeconf fixed the failover issue. |
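For completeness, a sketch of the writeconf procedure that regenerates the config logs on 1.8; the MDT device path is the one from this report, while the OST step is the standard procedure and is assumed here since the ticket does not show it. With all clients unmounted and all targets stopped:

    # on the MDS holding the combined MGS/MDT
    tunefs.lustre --writeconf /dev/vg_scratch1/mdt

    # on every OSS, for each OST device (path illustrative)
    tunefs.lustre --writeconf /dev/<ost-device>

    # then restart the MGS/MDT first, the OSTs next, and finally remount the clients;
    # the config logs are regenerated as each target re-registers with the MGS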