[LU-890] MDS Failover Issue - Clients not reconnecting after MGT/MDT fail over to other MDS. - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
Affects Version/s: None
Labels: None
Environment:

This is a rather complex configuration consisting of two large Lustre filesystems. scratch1 comprises 2 MDS, 16 OSS, and 4 DDN SFA 10K storage arrays. scratch2 comprises 2 MDS, 20 OSS, and 5 DDN SFA 10K storage arrays. The Lustre servers all have 4 IB ports for client access to the filesystems. The compute nodes access scratch1 via their ib0 port (ib0 on the Lustre servers) and scratch2 via ib1 (also ib1 on the servers). The login nodes of the cluster access both scratch1 and scratch2 through their ib2 port (also ib2 on the servers). Finally, ib3 is for access to the production filesystems from clients in a test cluster.

The servers are running CentOS 5.5 (2.6.18-238.12.1.el5)
Lustre 1.8.6 with the LU-530 patch installed
The clients are currently running RHEL 6.0.

Severity: 3
Rank (Obsolete): 6512

Description

The production compute nodes and login nodes can access both filesystems when the MGT/MDT is running on the primary MDS of scratch1. When the MGT and MDT are failed over to the backup MDS, the clients fail to reconnect.

The basic configuration information is as follows:

The primary MDS for scratch1 is named lfs-mds-1-1 and the secondary MDS is named lfs-mds-1-2.
/etc/modprobe.d/lustre.conf:
options lnet networks="o2ib0(ib0), o2ib1(ib2), o2ib2(ib3)"

lfs-mds-1-1:
ib0 inet addr:10.174.31.241 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.31.241 Bcast:10.175.31.255 Mask:255.255.224.0
ib2 inet addr:10.174.79.241 Bcast:10.174.79.255 Mask:255.255.240.0
ib3 inet addr:10.174.80.40 Bcast:10.174.111.255 Mask:255.255.240.0

[root@lfs-mds-1-1 config]# lctl list_nids
10.174.31.241@o2ib
10.174.79.241@o2ib1
10.174.80.40@o2ib2

lfs-mds-1-2:
ib0 inet addr:10.174.31.251 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.31.251 Bcast:10.175.31.255 Mask:255.255.224.0
ib2 inet addr:10.174.79.251 Bcast:10.174.79.255 Mask:255.255.240.0
ib3 inet addr:10.174.80.41 Bcast:10.174.111.255 Mask:255.255.240.0

[root@lfs-mds-1-2 ~]# lctl list_nids
10.174.31.251@o2ib
10.174.79.251@o2ib1
10.174.80.41@o2ib2

r1i0n0 config (compute node):
ib0 inet addr:10.174.0.55 Bcast:10.174.31.255 Mask:255.255.224.0
ib1 inet addr:10.175.0.55 Bcast:10.175.31.255 Mask:255.255.224.0

/etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib1)"

[root@r1i0n0 ~]# lctl list_nids
10.174.0.55@o2ib
10.175.0.55@o2ib1

[root@r1i0n0 ~]# lctl ping 10.174.31.241@o2ib
12345-0@lo
12345-10.174.31.241@o2ib
12345-10.174.79.241@o2ib1
12345-10.174.80.40@o2ib2
[root@r1i0n0 ~]# lctl ping 10.174.31.251@o2ib
12345-0@lo
12345-10.174.31.251@o2ib
12345-10.174.79.251@o2ib1
12345-10.174.80.41@o2ib2

fe1 (login node):
inet addr:10.174.0.37 Bcast:10.255.255.255 Mask:255.255.224.0
inet addr:10.175.0.37 Bcast:10.255.255.255 Mask:255.255.224.0
inet addr:10.174.81.1 Bcast:10.174.95.255 Mask:255.255.240.0

/etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib1), o2ib2(ib2)"

[root@fe1 ~]# lctl list_nids
10.174.0.37@o2ib
10.175.0.37@o2ib1
10.174.81.10@o2ib2

[root@fe1 ~]# lctl ping 10.174.80.40@o2ib2
12345-0@lo
12345-10.174.31.241@o2ib
12345-10.174.79.241@o2ib1
12345-10.174.80.40@o2ib2
[root@fe1 ~]# lctl ping 10.174.80.41@o2ib2
12345-0@lo
12345-10.174.31.251@o2ib
12345-10.174.79.251@o2ib1
12345-10.174.80.41@o2ib2

[root@lfs-mds-1-1 ~]# tunefs.lustre --dryrun /dev/vg_scratch1/mdt
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

Permanent disk data:
Target: scratch1-MDT0000
Index: 0
Lustre FS: scratch1
Mount type: ldiskfs
Flags: 0x1401
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 mdt.quota_type=ug

exiting before disk write.
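
The Parameters line above can be cross-checked against the `lctl list_nids` output of the backup MDS. The following is a minimal Python sketch and not part of the original report: the function name `parse_target_params` is hypothetical, and both sample values are copied from the output shown in this description. It only checks that the NIDs are recorded on the target; it says nothing about whether clients can reach them.

```python
from collections import defaultdict


def parse_target_params(params_line):
    """Group space-separated key=value target parameters by key.

    Values are split on commas, so each mgsnode= or failover.node=
    entry becomes one list of NIDs (one list per server node).
    """
    grouped = defaultdict(list)
    for token in params_line.split():
        key, sep, value = token.partition("=")
        if sep:
            grouped[key].append(value.split(","))
    return dict(grouped)


# Copied from the tunefs.lustre --dryrun output in this report.
PARAMS = (
    "mgsnode=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 "
    "mgsnode=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 "
    "failover.node=10.174.31.241@o2ib,10.174.79.241@o2ib1,10.174.80.40@o2ib2 "
    "failover.node=10.174.31.251@o2ib,10.174.79.251@o2ib1,10.174.80.41@o2ib2 "
    "mdt.quota_type=ug"
)

# Copied from `lctl list_nids` on lfs-mds-1-2 (the backup MDS).
BACKUP_MDS_NIDS = [
    "10.174.31.251@o2ib",
    "10.174.79.251@o2ib1",
    "10.174.80.41@o2ib2",
]

params = parse_target_params(PARAMS)
# True when the backup MDS NIDs appear as one failover.node entry.
backup_is_listed = BACKUP_MDS_NIDS in params["failover.node"]
print("backup MDS NIDs recorded as a failover node:", backup_is_listed)
```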

After failing over the MGT and MDT to the backup MDS (lfs-mds-1-2), recovery appears never to have started:

[root@lfs-mds-1-2 lustre]# cat /proc/fs/lustre/mds/scratch1-MDT0000/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/2275
delayed_clients: 0/2275
completed_clients: 0/2275
replayed_requests: 0/??
queued_requests: 0
next_transno: 55834575147

Once I moved the MGT and MDT back to the original system, the clients reconnected in less than a minute:

[root@lfs-mds-1-1 ~]# cat /proc/fs/lustre/mds/scratch1-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1322752821
time_remaining: 267
connected_clients: 1896/2275
delayed_clients: 0/2275
completed_clients: 1896/2275
replayed_requests: 0/??
queued_requests: 0
next_transno: 55834575147
[root@lfs-mds-1-1 ~]# cat /proc/fs/lustre/mds/scratch1-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1322752821
recovery_duration: 56
delayed_clients: 0/2275
completed_clients: 2275/2275
replayed_requests: 0
last_transno: 55834575146
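
The difference between the two servers can be checked mechanically. The sketch below is illustrative only and not part of the original report: the helper names are hypothetical, and the rule it applies (status RECOVERING, recovery_start of 0, and zero connected clients) is simply the pattern seen in the lfs-mds-1-2 output above, not a documented Lustre interface. The sample strings are copied from the outputs in this description.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery_status dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def recovery_never_started(fields):
    """True when the target reports RECOVERING but no client has connected."""
    connected = fields.get("connected_clients", "0/0").split("/")[0]
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("recovery_start") == "0"
        and connected == "0"
    )


# Copied from lfs-mds-1-2 (backup MDS) after the failover.
BACKUP_MDS = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/2275
delayed_clients: 0/2275
completed_clients: 0/2275
replayed_requests: 0/??
queued_requests: 0
next_transno: 55834575147
"""

# Copied from lfs-mds-1-1 (primary MDS) after failing back.
PRIMARY_MDS = """\
status: RECOVERING
recovery_start: 1322752821
time_remaining: 267
connected_clients: 1896/2275
delayed_clients: 0/2275
completed_clients: 1896/2275
replayed_requests: 0/??
queued_requests: 0
next_transno: 55834575147
"""

for name, dump in (("lfs-mds-1-2", BACKUP_MDS), ("lfs-mds-1-1", PRIMARY_MDS)):
    stalled = recovery_never_started(parse_recovery_status(dump))
    print(f"{name}: recovery never started = {stalled}")
```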

The log file on fe1 showed this:
Dec 1 15:08:21 fe1 kernel: Lustre: 7508:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264150314 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 7s ago has timed out (7s prior to deadline).
Dec 1 15:08:21 fe1 kernel: req@ffff880bee44fc00 x1386944264150314/t0 o35->scratch1-MDT0000_UUID@10.174.31.241@o2ib:23/10 lens 408/9864 e 0 to 1 dl 1322752101 ref 1 fl Rpc:/0/0 rc 0/0
Dec 1 15:08:21 fe1 kernel: Lustre: 7508:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Dec 1 15:08:21 fe1 kernel: Lustre: scratch1-MDT0000-mdc-ffff880be72aec00: Connection to service scratch1-MDT0000 via nid 10.174.31.241@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 1 15:08:36 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 2s
Dec 1 15:08:38 fe1 kernel: Lustre: 5585:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264150337 sent from MGC10.174.80.40@o2ib2 to NID 10.174.80.40@o2ib2 17s ago has timed out (17s prior to deadline).
Dec 1 15:08:38 fe1 kernel: req@ffff880becc30000 x1386944264150337/t0 o400->MGS@MGC10.174.80.40@o2ib2_0:26/25 lens 192/384 e 0 to 1 dl 1322752117 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 1 15:08:38 fe1 kernel: Lustre: 5585:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Dec 1 15:08:38 fe1 kernel: LustreError: 166-1: MGC10.174.80.40@o2ib2: Connection to service MGS via nid 10.174.80.40@o2ib2 was lost; in progress operations using this service will fail.
Dec 1 15:08:52 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 3s
Dec 1 15:08:59 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264151143 sent from MGC10.174.80.40@o2ib2 to NID 10.174.80.40@o2ib2 6s ago has timed out (6s prior to deadline).
Dec 1 15:08:59 fe1 kernel: req@ffff880bed70c400 x1386944264151143/t0 o250->MGS@MGC10.174.80.40@o2ib2_0:26/25 lens 368/584 e 0 to 1 dl 1322752139 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 1 15:08:59 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Dec 1 15:09:00 fe1 kernel: Lustre: 5586:0:(import.c:855:ptlrpc_connect_interpret()) MGS@MGC10.174.80.40@o2ib2_1 changed server handle from 0x242210f6584197b7 to 0xa6cae1b09294c1a2
Dec 1 15:09:00 fe1 kernel: Lustre: MGC10.174.80.40@o2ib2: Reactivating import
Dec 1 15:09:00 fe1 kernel: Lustre: MGC10.174.80.40@o2ib2: Connection restored to service MGS using nid 10.174.80.41@o2ib2.
Dec 1 15:09:11 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 4s
Dec 1 15:09:31 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 5s
Dec 1 15:09:41 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264151550 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 10s ago has timed out (10s prior to deadline).
Dec 1 15:09:41 fe1 kernel: req@ffff8817eea7d000 x1386944264151550/t0 o38->scratch1-MDT0000_UUID@10.174.31.241@o2ib:12/10 lens 368/584 e 0 to 1 dl 1322752181 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 1 15:09:41 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Dec 1 15:10:17 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 7s
Dec 1 15:10:17 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) Skipped 1 previous similar message
Dec 1 15:10:56 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264152753 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 13s ago has timed out (13s prior to deadline).
Dec 1 15:10:56 fe1 kernel: req@ffff881808992800 x1386944264152753/t0 o38->scratch1-MDT0000_UUID@10.174.31.241@o2ib:12/10 lens 368/584 e 0 to 1 dl 1322752256 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 1 15:10:56 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Dec 1 15:11:41 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 10s
Dec 1 15:11:41 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) Skipped 2 previous similar messages
Dec 1 15:13:41 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264155556 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 18s ago has timed out (18s prior to deadline).
Dec 1 15:13:41 fe1 kernel: req@ffff880be9f5e000 x1386944264155556/t0 o38->scratch1-MDT0000_UUID@10.174.31.241@o2ib:12/10 lens 368/584 e 0 to 1 dl 1322752421 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 1 15:13:41 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Dec 1 15:14:41 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 15s
Dec 1 15:14:41 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages
Dec 1 15:18:56 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264160358 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 25s ago has timed out (25s prior to deadline).
Dec 1 15:18:56 fe1 kernel: req@ffff880beb4a4000 x1386944264160358/t0 o38->scratch1-MDT0000_UUID@10.174.31.241@o2ib:12/10 lens 368/584 e 0 to 1 dl 1322752736 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 1 15:18:56 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
Dec 1 15:20:17 fe1 kernel: LustreError: 166-1: MGC10.174.80.40@o2ib2: Connection to service MGS via nid 10.174.80.41@o2ib2 was lost; in progress operations using this service will fail.
Dec 1 15:20:18 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) scratch1-MDT0000-mdc-ffff880be72aec00: tried all connections, increasing latency to 22s
Dec 1 15:20:18 fe1 kernel: Lustre: 5587:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages
Dec 1 15:20:24 fe1 kernel: Lustre: 5586:0:(import.c:855:ptlrpc_connect_interpret()) MGS@MGC10.174.80.40@o2ib2_0 changed server handle from 0xa6cae1b09294c1a2 to 0x242210f65845423c
Dec 1 15:20:24 fe1 kernel: Lustre: MGC10.174.80.40@o2ib2: Reactivating import
Dec 1 15:20:24 fe1 kernel: Lustre: MGC10.174.80.40@o2ib2: Connection restored to service MGS using nid 10.174.80.40@o2ib2.
Dec 1 15:21:14 fe1 kernel: LustreError: 5586:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff880be96a8000 x1386944264100092/t55834575126 o101->scratch1-MDT0000_UUID@10.174.31.241@o2ib:12/10 lens 512/4880 e 0 to 1 dl 1322752939 ref 2 fl Interpret:RP/4/0 rc 301/301
Dec 1 15:21:17 fe1 kernel: Lustre: scratch1-MDT0000-mdc-ffff880be72aec00: Connection restored to service scratch1-MDT0000 using nid 10.174.31.241@o2ib.
Dec 1 15:21:17 fe1 kernel: LustreError: 11-0: an error occurred while communicating with 10.174.31.241@o2ib. The mds_close operation failed with -116
Dec 1 15:21:17 fe1 kernel: LustreError: Skipped 7 previous similar messages
Dec 1 15:21:17 fe1 kernel: LustreError: 7508:0:(file.c:116:ll_close_inode_openhandle()) inode 1905262791 mdc close failed: rc = -116
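
To see which server NIDs each client import actually sent requests to, the timed-out request lines can be grouped by import. This is a rough Python sketch, not part of the original report: the function name `nids_tried` and the regular expression are assumptions based only on the message format visible in the log above. Because the kernel suppresses repeats ("Skipped N previous similar messages"), the result is a lower bound on what was tried, not a complete record.

```python
import re
from collections import defaultdict

# Matches the "sent from <import> to NID <nid>" part of a timed-out request line.
SENT_RE = re.compile(r"sent from (\S+) to NID (\S+)")


def nids_tried(log_text):
    """Map each import name to the set of NIDs its expired requests targeted."""
    tried = defaultdict(set)
    for line in log_text.splitlines():
        match = SENT_RE.search(line)
        if match:
            tried[match.group(1)].add(match.group(2))
    return dict(tried)


# Three lines copied from the fe1 log above (15:08:21, 15:08:38, 15:09:41).
SAMPLE = """\
Dec 1 15:08:21 fe1 kernel: Lustre: 7508:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264150314 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 7s ago has timed out (7s prior to deadline).
Dec 1 15:08:38 fe1 kernel: Lustre: 5585:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264150337 sent from MGC10.174.80.40@o2ib2 to NID 10.174.80.40@o2ib2 17s ago has timed out (17s prior to deadline).
Dec 1 15:09:41 fe1 kernel: Lustre: 5586:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1386944264151550 sent from scratch1-MDT0000-mdc-ffff880be72aec00 to NID 10.174.31.241@o2ib 10s ago has timed out (10s prior to deadline).
"""

tried = nids_tried(SAMPLE)
for import_name, nids in sorted(tried.items()):
    print(import_name, "->", ", ".join(sorted(nids)))
```

Run over the full fe1 log, a grouping like this makes it easy to check whether the MDC import ever logged a request to a 10.174.31.251 NID.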

The log files on lfs-mds-1-1 and lfs-mds-1-2 contain no useful data.

Attachments

lustre1_uuids.txt (139 kB, 08/Dec/11 10:45 AM)
lustre2_uuids.txt (347 kB, 08/Dec/11 10:45 AM)
lustre-scratch1 (1.44 MB, 07/Dec/11 11:16 AM)
lustre-scratch1 (826 kB, 07/Dec/11 11:03 AM)
lustre-scratch1 (1.44 MB, 07/Dec/11 8:56 AM)
lustre-scratch1 (9.71 MB, 06/Dec/11 2:28 PM)

Activity

Cliff White (Inactive) added a comment - 08/Dec/11 3:16 PM

I have been reviewing this again and want to add some clarification. Server recovery will NOT start until the first client attempts a connection.
We do this so that a node with a dead network won't end up with a failed recovery: the server waits for the network to be restored, and for a client connection attempt, before starting recovery. In your case, I think this tells us that clients cannot find the backup MDS, since we do not see any connection attempts. Are you certain that all network routing, masks, etc. are correct for clients to reach lfs-mds-1-2? It might be worth a re-check and another round of lctl pings.
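
As a starting point for the mask re-check suggested in this comment, the addresses and netmasks listed in the description can be compared with Python's standard `ipaddress` module. This is only a sketch under stated assumptions: the helper name `same_subnet` and the choice of address pairs are illustrative, and the values are copied from the ifconfig and `lctl list_nids` output in the description. A match only shows that the IP addressing is consistent; it does not test the IB fabric or the LNet network names, so `lctl ping` is still needed.

```python
import ipaddress


def same_subnet(addr_a, addr_b, netmask):
    """Return True if both IPv4 addresses fall in the same network for netmask."""
    net_a = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{addr_b}/{netmask}", strict=False)
    return net_a == net_b


# (label, client address, backup MDS address, netmask), all from the description.
CHECKS = [
    ("r1i0n0 ib0 -> lfs-mds-1-2 ib0", "10.174.0.55", "10.174.31.251", "255.255.224.0"),
    ("fe1 o2ib2 -> lfs-mds-1-2 ib3", "10.174.81.10", "10.174.80.41", "255.255.240.0"),
]

results = {label: same_subnet(a, b, mask) for label, a, b, mask in CHECKS}
for label, ok in results.items():
    print(f"{label}: {'same subnet' if ok else 'DIFFERENT subnets'}")
```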

Dennis Nelson added a comment - 08/Dec/11 10:45 AM

Sorry for the delay. I had laptop issues. Here are the uuids.

Hongchao Zhang added a comment - 08/Dec/11 5:15 AM

As in the comment on LU-899, could you please run the following commands and attach the config files to this JIRA ticket:

umount /mnt/mgs
mount -t ldiskfs /dev/your_mgs_device /mnt/mgs

The config files are in the directory /mnt/mgs/CONFIGS/.

thanks

Hongchao Zhang added a comment - 07/Dec/11 9:18 PM

In the last debug log, the connection to the MDT failover node 10.174.31.251 is added to the MDC, but it does not appear in the logs in the description section of this ticket, which only used the primary MDT node 10.174.31.241. Has anything changed on the system? Could you please retest whether this node (fe2) can fail over to 10.174.31.151? Thanks!

Dennis Nelson added a comment - 07/Dec/11 11:16 AM

I realized that I did not run the test exactly as you had asked: I had preloaded the Lustre modules. I performed the test again:

[root@fe2 ~]# lustre_rmmod
[root@fe2 ~]# mount /mnt/lustre1
[root@fe2 ~]# lctl dk > /tmp/lustre-scratch1

Dennis Nelson added a comment - 07/Dec/11 11:05 AM

Please ignore the last entry and attachment. It was intended for LU-899.

Dennis Nelson added a comment - 07/Dec/11 11:02 AM

Sorry, I sent the trace for the other case but did not send the trace for dtn1. Here it is.

[root@dtn1 ~]# lustre_rmmod
[root@dtn1 ~]# modprobe lustre
[root@dtn1 ~]# lctl set_param debug=+trace lnet.debug=+trace
[root@dtn1 ~]# lctl get_param debug
lnet.debug=trace ioctl neterror warning error emerg ha config console
[root@dtn1 ~]# mount /mnt/lustre1
mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /mnt/lustre1 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@dtn1 ~]# lctl dk > /tmp/lustre-scratch1

Hongchao Zhang added a comment - 07/Dec/11 5:06 AM - edited

The log is a little strange: there is no attach and setup info for the obd_device scratch1-MDT0000-mdc-ffff88180455bc00 or for some OSC devices (e.g. scratch1-OST0001-osc-ffff88180455bc00), yet these devices are used by the newly mounted Lustre.

Could you please collect the debug log again on a clean node (one that has never mounted Lustre), without preloading the "lustre" module? The default "debug" setting is sufficient, so there is no need to change it. Just dump the log after mounting Lustre.

mount /mnt/lustre1
lctl dk > /tmp/lustre-scratch1

thanks!

Dennis Nelson added a comment - 06/Dec/11 2:38 PM

I added the attachment. The trace is only from mounting /mnt/lustre1. I also captured a trace of a mount -at that mounted both filesystems, but that file is 32 MB and the JIRA interface says it has a 10 MB limit. I'll be glad to forward the larger file if you give me a way to send it.

Dennis Nelson added a comment - 06/Dec/11 2:28 PM

[root@fe2 ~]# modprobe lustre
[root@fe2 ~]# lctl get_param debug
lnet.debug=ioctl neterror warning error emerg ha config console
[root@fe2 ~]# lctl set_param debug=+trace
lnet.debug=+trace
[root@fe2 ~]# lctl get_param debug
lnet.debug=trace ioctl neterror warning error emerg ha config console
[root@fe2 ~]# mount /mnt/lustre1
[root@fe2 ~]# lctl dk > /tmp/lustre-scratch1

Cliff White (Inactive) added a comment - 06/Dec/11 2:10 PM

Okay, understood. We will need the debug logs for the mount attempt.

People

Assignee: Hongchao Zhang

Reporter: Dennis Nelson

Votes: 0

Watchers: 2

Dates

Created: 02/Dec/11 3:40 PM

Updated: 12/Dec/11 1:06 PM

Resolved: 12/Dec/11 1:06 PM