[LU-4722] IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.4.2
Labels:
- mn4
Environment:
SLES 11 SP2
Lustre 2.4.2

Severity:
3
Rank (Obsolete):
12978

Description

We have applied the patch provided in teh ~~LU-3645~~. And still the customer complains that the issue is can be reproduced.

Attaching the latest set of logs.

The issue re-occured on 18th Feb.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

2014-02-19-client_messages_20140219(1).tgz
782 kB
06/Mar/14 4:09 PM
2014-02-19-es_lustre_showall_2014-02-19_161417.tar.bz2
0.3 kB
06/Mar/14 4:09 PM
2014-02-19-es_lustre_showall_2014-02-19_161417(1).tar.bz2
0.3 kB
06/Mar/14 4:09 PM
2014-02-19-server_lctl_dk_20140218.tgz
0.3 kB
06/Mar/14 4:09 PM
check_SLES11SP3_20140210.txt
16 kB
13/Mar/14 2:49 PM
check_SLES11SP3_20140320-1.txt
11 kB
20/Mar/14 11:24 PM

Activity

[LU-4722] IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2

Hongchao Zhang added a comment - 08/May/14 3:01 PM

there is a bug in obd_str2uuid,

 static inline void obd_str2uuid(struct obd_uuid *uuid, const char *tmp)
 {
        strncpy((char *)uuid->uuid, tmp, sizeof(*uuid));
        uuid->uuid[sizeof(*uuid) - 1] = '\0';
 }

it take "tmp" also as a implicit "obd_uuid" type, but it isn't in all cases, such as in "class_add_uuid", the "tmp" is
"lustre_cfg_string(lcfg, 1)", and obd_str2uuid will copy some undefined data beyond the "tmp" to "uuid" and could cause two same
"uuid" in config were thought to be different.

the patch against b2_4 is tracked at http://review.whamcloud.com/#/c/10269/

Hi Rajesh,
Could you please try the patch in your site?
Thanks!

Hongchao Zhang added a comment - 08/May/14 3:01 PM there is a bug in obd_str2uuid, static inline void obd_str2uuid(struct obd_uuid *uuid, const char *tmp) { strncpy(( char *)uuid->uuid, tmp, sizeof(*uuid)); uuid->uuid[sizeof(*uuid) - 1] = '\0' ; } it take "tmp" also as a implicit "obd_uuid" type, but it isn't in all cases, such as in "class_add_uuid", the "tmp" is "lustre_cfg_string(lcfg, 1)", and obd_str2uuid will copy some undefined data beyond the "tmp" to "uuid" and could cause two same "uuid" in config were thought to be different. the patch against b2_4 is tracked at http://review.whamcloud.com/#/c/10269/ Hi Rajesh, Could you please try the patch in your site? Thanks!

Rajeshwaran Ganesan added a comment - 08/May/14 9:36 AM

Hello Hongchao,

I have uploaded the requested log files into ftp.whamcloud.com:/uploads/~~LU-4722~~

2014-05-08-SR30502_pfs2n17.llog.gz

Thanks,
Rajesh

Rajeshwaran Ganesan added a comment - 08/May/14 9:36 AM Hello Hongchao, I have uploaded the requested log files into ftp.whamcloud.com:/uploads/ LU-4722 2014-05-08-SR30502_pfs2n17.llog.gz Thanks, Rajesh

Hongchao Zhang added a comment - 06/May/14 3:02 PM

there is no error in these configs.

Could you please collect the debug logs(lctl dk >XXX.log) at the problematic node just after mounting the client (make sure the "ha" is contained in "/proc/sys/lnet/debug")?
Thanks very much!

Hongchao Zhang added a comment - 06/May/14 3:02 PM there is no error in these configs. Could you please collect the debug logs(lctl dk >XXX.log) at the problematic node just after mounting the client (make sure the "ha" is contained in "/proc/sys/lnet/debug")? Thanks very much!

Rajeshwaran Ganesan added a comment - 06/May/14 11:49 AM

There seems to be some confusion about our systems. The system with
prefix pfsc and IP addresses 172.26.4.x is our test system and the
system with prefix pfs2 and IP addresses 172.26.17.x is our production
system. We had seen the issue on both systems and therefore provided
logs from both systems. The configuration of both systems should be
very similar and also the config was newly generated on both systems.

After the config was newly generated the remaining issue is that
duplicate IP addresses appear as failover_nids on the servers only.
This appears for /proc/fs/lustre/osp/*/import on the MDS but
astonishingly only on pfsc and not on pfs2. It also appears for
/proc/fs/lustre/osc/*/import if we mount the file systems on servers
but astonishingly only for some OSTs. This appears if we mount the file
system on an OSS, on an MDS which is currently MDT for another file
system or on an currently unused (failover) MDS.

and also, uploading the logs into ftp site

Rajeshwaran Ganesan added a comment - 06/May/14 11:49 AM There seems to be some confusion about our systems. The system with prefix pfsc and IP addresses 172.26.4.x is our test system and the system with prefix pfs2 and IP addresses 172.26.17.x is our production system. We had seen the issue on both systems and therefore provided logs from both systems. The configuration of both systems should be very similar and also the config was newly generated on both systems. After the config was newly generated the remaining issue is that duplicate IP addresses appear as failover_nids on the servers only. This appears for /proc/fs/lustre/osp/*/import on the MDS but astonishingly only on pfsc and not on pfs2. It also appears for /proc/fs/lustre/osc/*/import if we mount the file systems on servers but astonishingly only for some OSTs. This appears if we mount the file system on an OSS, on an MDS which is currently MDT for another file system or on an currently unused (failover) MDS. and also, uploading the logs into ftp site

Hongchao Zhang added a comment - 05/May/14 11:00 AM - edited

/proc/fs/lustre/osc/pfs2dat2-OST0012-osc-ffff881033429400/import:
failover_nids: [172.26.8.15@o2ib, 172.26.8.15@o2ib, 172.26.8.14@o2ib]

Do you mount Lustre client at MDT? normally, the OSC(is OSP actually, and there is a symlink in /proc/fs/lustre/osc/
for each entry in /proc/fs/lustre/osp/) name for MDT is (fsname)-OSTXXXX-osc-MDT0000,
the name for client is (fsname)~~OSTXXXX-osc~~(address of superblock).

Hongchao Zhang added a comment - 05/May/14 11:00 AM - edited /proc/fs/lustre/osc/pfs2dat2-OST0012-osc-ffff881033429400/import: failover_nids: [172.26.8.15@o2ib, 172.26.8.15@o2ib, 172.26.8.14@o2ib] Do you mount Lustre client at MDT? normally, the OSC(is OSP actually, and there is a symlink in /proc/fs/lustre/osc/ for each entry in /proc/fs/lustre/osp/) name for MDT is (fsname)-OSTXXXX-osc-MDT0000, the name for client is (fsname) OSTXXXX-osc (address of superblock).

Hongchao Zhang added a comment - 04/May/14 11:14 AM - edited

the configs seems fine.
Do you also regenerate the config of pfscdat2? it adds "172.26.17.4@o2ib" and "172.26.17.3@o2ib" as target node, and it was "172.26.4.4@o2ib" and "172.26.4.3@o2ib" previously.

cmd=cf010 marker=10(0x1)pfscdat2-OST0000 'add osc'
cmd=cf005 nid=172.26.17.4@o2ib(0x50000ac1a1104) 0:(null)  1:172.26.17.4@o2ib
cmd=cf001 0:pfscdat2-OST0000-osc  1:osc  2:pfscdat2-clilov_UUID
cmd=cf003 0:pfscdat2-OST0000-osc  1:pfscdat2-OST0000_UUID  2:172.26.17.4@o2ib
cmd=cf005 nid=172.26.17.4@o2ib(0x50000ac1a1104) 0:(null)  1:172.26.17.4@o2ib
cmd=cf00b 0:pfscdat2-OST0000-osc  1:172.26.17.4@o2ib
cmd=cf005 nid=172.26.17.3@o2ib(0x50000ac1a1103) 0:(null)  1:172.26.17.3@o2ib
cmd=cf00b 0:pfscdat2-OST0000-osc  1:172.26.17.3@o2ib 
cmd=cf00d 0:pfscdat2-clilov  1:pfscdat2-OST0000_UUID  2:0  3:1
cmd=cf010 marker=10(0x2)pfscdat2-OST0000 'add osc'

Does the issue occur again in the new generated config?

btw, since only the OSC(OSP) on the MDT is affected, could you please dump the config of MDT (the config-uuid-name is fsname-MDT0000, say, pfscdat2-MDT0000).
Thanks!

Hongchao Zhang added a comment - 04/May/14 11:14 AM - edited the configs seems fine. Do you also regenerate the config of pfscdat2? it adds "172.26.17.4@o2ib" and "172.26.17.3@o2ib" as target node, and it was "172.26.4.4@o2ib" and "172.26.4.3@o2ib" previously. cmd=cf010 marker=10(0x1)pfscdat2-OST0000 'add osc' cmd=cf005 nid=172.26.17.4@o2ib(0x50000ac1a1104) 0:(null) 1:172.26.17.4@o2ib cmd=cf001 0:pfscdat2-OST0000-osc 1:osc 2:pfscdat2-clilov_UUID cmd=cf003 0:pfscdat2-OST0000-osc 1:pfscdat2-OST0000_UUID 2:172.26.17.4@o2ib cmd=cf005 nid=172.26.17.4@o2ib(0x50000ac1a1104) 0:(null) 1:172.26.17.4@o2ib cmd=cf00b 0:pfscdat2-OST0000-osc 1:172.26.17.4@o2ib cmd=cf005 nid=172.26.17.3@o2ib(0x50000ac1a1103) 0:(null) 1:172.26.17.3@o2ib cmd=cf00b 0:pfscdat2-OST0000-osc 1:172.26.17.3@o2ib cmd=cf00d 0:pfscdat2-clilov 1:pfscdat2-OST0000_UUID 2:0 3:1 cmd=cf010 marker=10(0x2)pfscdat2-OST0000 'add osc' Does the issue occur again in the new generated config? btw, since only the OSC(OSP) on the MDT is affected, could you please dump the config of MDT (the config-uuid-name is fsname-MDT0000, say, pfscdat2-MDT0000). Thanks!

Rajeshwaran Ganesan added a comment - 01/May/14 12:47 PM

tunefs.lustre --erase-params --mgsnode=172.26.8.12@o2ib
--mgsnode=172.26.8.13@o2ib --servicenode=172.26.8.14@o2ib
--servicenode=172.26.8.15@o2ib /dev/mapper/ost_pfs2dat2_0
tunefs.lustre --erase-params --mgsnode=172.26.8.12@o2ib
--mgsnode=172.26.8.13@o2ib --servicenode=172.26.8.14@o2ib
--servicenode=172.26.8.15@o2ib /dev/mapper/ost_pfs2dat2_1

2. I have uploaded the files in the FTP server

3. only the OSC on the MDT is affected, having the duplicate entries. Anyway, we should find the reason for the wrong behaviour
of the servers.

Rajeshwaran Ganesan added a comment - 01/May/14 12:47 PM tunefs.lustre --erase-params --mgsnode=172.26.8.12@o2ib --mgsnode=172.26.8.13@o2ib --servicenode=172.26.8.14@o2ib --servicenode=172.26.8.15@o2ib /dev/mapper/ost_pfs2dat2_0 tunefs.lustre --erase-params --mgsnode=172.26.8.12@o2ib --mgsnode=172.26.8.13@o2ib --servicenode=172.26.8.14@o2ib --servicenode=172.26.8.15@o2ib /dev/mapper/ost_pfs2dat2_1 2. I have uploaded the files in the FTP server 3. only the OSC on the MDT is affected, having the duplicate entries. Anyway, we should find the reason for the wrong behaviour of the servers.

Hongchao Zhang added a comment - 29/Apr/14 2:13 PM

what is the command line to create the failover nid?

Could you please dump the config file of your system,
at MGS node:
lctl>dl (will show the device list, it will show the device number at the first column)
lctl>device #MGS (MGS index, say, 1)
lctl>dump_cfg pfs2dat2-client (it will dump the config to syslog)

normally one nid (Node Id) will have only one UUID to represent it (one UUID could have more nid)
Having duplicated failover nid could cause some recovery problem, for it will use these nids one by one and it is possible to miss the recovery window.

Thanks

Hongchao Zhang added a comment - 29/Apr/14 2:13 PM what is the command line to create the failover nid? Could you please dump the config file of your system, at MGS node: lctl>dl (will show the device list, it will show the device number at the first column) lctl>device #MGS (MGS index, say, 1) lctl>dump_cfg pfs2dat2-client (it will dump the config to syslog) normally one nid (Node Id) will have only one UUID to represent it (one UUID could have more nid) Having duplicated failover nid could cause some recovery problem, for it will use these nids one by one and it is possible to miss the recovery window. Thanks

Rajeshwaran Ganesan added a comment - 25/Apr/14 12:37 PM

We did the erase config and added the failover IP using tunefs,

1. We would like to know why some of the OSTs are having the failover IP multiple times, is there any known bug in 2.4.1.
2. Having multiple time will it cause any negative impact?

/proc/fs/lustre/osc/pfs2dat2-OST0012-osc-ffff881033429400/import:
failover_nids: [172.26.8.15@o2ib, 172.26.8.15@o2ib, 172.26.8.14@o2ib]

/proc/fs/lustre/osc/pfs2dat2-OST0013-osc-ffff881033429400/import:
failover_nids: [172.26.8.15@o2ib, 172.26.8.14@o2ib]

Rajeshwaran Ganesan added a comment - 25/Apr/14 12:37 PM We did the erase config and added the failover IP using tunefs, 1. We would like to know why some of the OSTs are having the failover IP multiple times, is there any known bug in 2.4.1. 2. Having multiple time will it cause any negative impact? /proc/fs/lustre/osc/pfs2dat2-OST0012-osc-ffff881033429400/import: failover_nids: [172.26.8.15@o2ib, 172.26.8.15@o2ib, 172.26.8.14@o2ib] /proc/fs/lustre/osc/pfs2dat2-OST0013-osc-ffff881033429400/import: failover_nids: [172.26.8.15@o2ib, 172.26.8.14@o2ib]

Hongchao Zhang added a comment - 18/Apr/14 4:35 AM

Hi Rajesh,

Could you please check the content in "/proc/fs/lustre/osc/pfscdat2-OST0000-osc-XXXX/import" at one of your client nodes to verify whether
the failover_nids contains "172.26.4.3@o2ib" and "172.26.4.4@o2ib"?

Thanks!

Hongchao Zhang added a comment - 18/Apr/14 4:35 AM Hi Rajesh, Could you please check the content in "/proc/fs/lustre/osc/pfscdat2-OST0000-osc-XXXX/import" at one of your client nodes to verify whether the failover_nids contains "172.26.4.3@o2ib" and "172.26.4.4@o2ib"? Thanks!

Hongchao Zhang added a comment - 17/Apr/14 11:33 AM

In pfscn3,

Apr 14 16:20:04 pfscn3 kernel: : LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts:
Apr 14 16:20:05 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Apr 14 16:20:14 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 27 clients reconnect
Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: recovery is timed out, evict stale exports
Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: disconnecting 26 stale clients
Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 2:30, of 27 clients 1 recovered and 26 were evicted.
Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29711772 to 0x0:29714801
Apr 14 16:25:28 pfscn3 kernel: : Lustre: Failing over pfscdat2-OST0000
Apr 14 16:25:29 pfscn3 kernel: : Lustre: server umount pfscdat2-OST0000 complete
Apr 14 16:25:29 pfscn3 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.17.2@o2ib (no target)

In pfscn4

Apr 11 08:40:02 pfscn4 kernel: : LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. quota=on. Opts:
Apr 11 08:40:02 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.16.28@o2ib (no target)
Apr 11 08:40:07 pfscn4 kernel: : Lustre: 3897:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent Apr 11 08:41:17 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 0:36, of 36 clients 36 recovered and 0 were evicted.
Apr 11 08:41:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29380160 to 0x0:29380273

...

Apr 14 16:19:52 pfscn4 kernel: : Lustre: Failing over pfscdat2-OST0000
Apr 14 16:19:53 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Not available for connect from 172.26.20.7@o2ib (stopping)
Apr 14 16:19:53 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Not available for connect from 172.26.20.8@o2ib (stopping)
Apr 14 16:19:54 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.20.3@o2ib (no target)
Apr 14 16:19:57 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.17.2@o2ib (no target)
Apr 14 16:19:57 pfscn4 kernel: : LustreError: Skipped 1 previous similar message
Apr 14 16:19:58 pfscn4 kernel: : Lustre: server umount pfscdat2-OST0000 complete

...

Apr 14 16:25:43 pfscn4 kernel: : LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. quota=on. Opts:
Apr 14 16:25:44 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Apr 14 16:25:47 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
Apr 14 16:25:47 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 90192127-1d80-d056-258f-193df5a6691b (at 172.26.4.4@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:30
Apr 14 16:25:49 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 9b4ef354-d4a2-79a9-196f-2666496727d6 (at 172.26.20.9@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:28
Apr 14 16:25:49 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:27
Apr 14 16:25:49 pfscn4 kernel: : Lustre: Skipped 2 previous similar messages
Apr 14 16:25:51 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:25
Apr 14 16:25:54 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 42dd99f1-05b3-2c90-ca73-e27b00e04746 (at 172.26.4.8@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:23
Apr 14 16:25:54 pfscn4 kernel: : Lustre: Skipped 10 previous similar messages
Apr 14 16:26:14 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:02
Apr 14 16:26:14 pfscn4 kernel: : Lustre: Skipped 1 previous similar message
Apr 14 16:26:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 0:33, of 1 clients 1 recovered and 0 were evicted.
Apr 14 16:26:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29711772 to 0x0:29714833

In pfscn3
the device "dm-9" was mounted at 16:20:04 as pfscdat2-OST0000, during recovery, Lustre indeed found there were 27 clients (26 normal clients,
1 client from MDT), but it seems these 26 normal clients didn't recover with pfscn3 (the eviction condition after recovery timeout is either the client
didn't need recovery or there was no queued replay request). then these clients were deleted and pfscdat2-OST0000 was unmounted at 16:25:29.

In pfscn4
the device "dm-11" was mounted at 16:25:43 as pfscdat2-OST0000, but it didn't contain client records, then these clients thought it were evicted.

then the problem could be why these clients didn't connect to pfscn3 to recover?

Hongchao Zhang added a comment - 17/Apr/14 11:33 AM In pfscn3, Apr 14 16:20:04 pfscn3 kernel: : LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts: Apr 14 16:20:05 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 Apr 14 16:20:14 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 27 clients reconnect Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: recovery is timed out, evict stale exports Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: disconnecting 26 stale clients Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 2:30, of 27 clients 1 recovered and 26 were evicted. Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29711772 to 0x0:29714801 Apr 14 16:25:28 pfscn3 kernel: : Lustre: Failing over pfscdat2-OST0000 Apr 14 16:25:29 pfscn3 kernel: : Lustre: server umount pfscdat2-OST0000 complete Apr 14 16:25:29 pfscn3 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.17.2@o2ib (no target) In pfscn4 Apr 11 08:40:02 pfscn4 kernel: : LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. quota=on. Opts: Apr 11 08:40:02 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.16.28@o2ib (no target) Apr 11 08:40:07 pfscn4 kernel: : Lustre: 3897:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent Apr 11 08:41:17 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 0:36, of 36 clients 36 recovered and 0 were evicted. Apr 11 08:41:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29380160 to 0x0:29380273 ... Apr 14 16:19:52 pfscn4 kernel: : Lustre: Failing over pfscdat2-OST0000 Apr 14 16:19:53 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Not available for connect from 172.26.20.7@o2ib (stopping) Apr 14 16:19:53 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Not available for connect from 172.26.20.8@o2ib (stopping) Apr 14 16:19:54 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.20.3@o2ib (no target) Apr 14 16:19:57 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.17.2@o2ib (no target) Apr 14 16:19:57 pfscn4 kernel: : LustreError: Skipped 1 previous similar message Apr 14 16:19:58 pfscn4 kernel: : Lustre: server umount pfscdat2-OST0000 complete ... Apr 14 16:25:43 pfscn4 kernel: : LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. quota=on. Opts: Apr 14 16:25:44 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 Apr 14 16:25:47 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects Apr 14 16:25:47 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 90192127-1d80-d056-258f-193df5a6691b (at 172.26.4.4@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:30 Apr 14 16:25:49 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 9b4ef354-d4a2-79a9-196f-2666496727d6 (at 172.26.20.9@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:28 Apr 14 16:25:49 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:27 Apr 14 16:25:49 pfscn4 kernel: : Lustre: Skipped 2 previous similar messages Apr 14 16:25:51 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:25 Apr 14 16:25:54 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 42dd99f1-05b3-2c90-ca73-e27b00e04746 (at 172.26.4.8@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:23 Apr 14 16:25:54 pfscn4 kernel: : Lustre: Skipped 10 previous similar messages Apr 14 16:26:14 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:02 Apr 14 16:26:14 pfscn4 kernel: : Lustre: Skipped 1 previous similar message Apr 14 16:26:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 0:33, of 1 clients 1 recovered and 0 were evicted. Apr 14 16:26:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29711772 to 0x0:29714833 In pfscn3 the device "dm-9" was mounted at 16:20:04 as pfscdat2-OST0000, during recovery, Lustre indeed found there were 27 clients (26 normal clients, 1 client from MDT), but it seems these 26 normal clients didn't recover with pfscn3 (the eviction condition after recovery timeout is either the client didn't need recovery or there was no queued replay request). then these clients were deleted and pfscdat2-OST0000 was unmounted at 16:25:29. In pfscn4 the device "dm-11" was mounted at 16:25:43 as pfscdat2-OST0000, but it didn't contain client records, then these clients thought it were evicted. then the problem could be why these clients didn't connect to pfscn3 to recover?

People

Assignee:: Hongchao Zhang

Reporter:: Rajeshwaran Ganesan

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 06/Mar/14 4:04 PM

Updated:: 12/Dec/14 1:10 PM

Resolved:: 12/Dec/14 1:10 PM