[LU-4226] MDS unable to locate swabbed FID SEQ in FLDB Created: 07/Nov/13 Updated: 13/Dec/16 Resolved: 13/Dec/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Morrone | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, ppc |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 11501 |
| Description |
|
Our sysadmins updated one of our Lustre 2.1 filesystems to Lustre 2.4.0-19chaos. Note that this filesystem was likely originally formatted under 1.8. It looks like oi_scrub ran automatically this time, but failed to make any updates:
> cat osd-ldiskfs/lsd-MDT0000/oi_scrub
name: OI_scrub
magic: 0x4c5fd252
oi_files: 1
status: completed
flags:
param:
time_since_last_completed: 505891 seconds
time_since_latest_start: 521998 seconds
time_since_last_checkpoint: 505891 seconds
latest_start_position: 12
last_checkpoint_position: 991133697
first_failure_position: N/A
checked: 200636112
updated: 0
failed: 0
prior_updated: 0
noscrub: 3090
igif: 15492100
success_count: 2
run_time: 16107 seconds
average_speed: 12456 objects/sec
real-time_speed: N/A
current_position: N/A
You'll recall that we had OI scrub problems when we tried to upgrade the first ldiskfs filesystem to 2.4. We are seeing similar symptoms as last time. For example, directory listings show ????????? for the permission flags of some subdirectories, and we are seeing errors on the MDS console like this:
Nov 7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(fld_handler.c:169:fld_server_lookup()) srv-lsd-MDT0000: Cannot find sequence 0x607000002000000: rc = -5
Nov 7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(fld_handler.c:169:fld_server_lookup()) Skipped 20 previous similar messages
Nov 7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:2125:osd_fld_lookup()) lsd-MDT0000-osd: cannot find FLD range for [0x607000002000000:0x8a0:0x0]: rc = -5
Nov 7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:2125:osd_fld_lookup()) Skipped 14 previous similar messages
Nov 7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:3317:osd_remote_fid()) lsd-MDT0000-osd: Can not lookup fld for [0x607000002000000:0x8a0:0x0]
Nov 7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:3317:osd_remote_fid()) Skipped 14 previous similar messages
The filesystem is unusable for many of our users. |
| Comments |
| Comment by Peter Jones [ 07/Nov/13 ] |
|
Di is looking into this issue |
| Comment by Andreas Dilger [ 07/Nov/13 ] |
|
It looks like the FID sequences for your objects are very strange, and the node cannot figure out on which server those objects are located. The FID sequence of [0x607000002000000:0x8a0:0x0] is way outside the range of IGIF FIDs (0x0000000c-0xffffffff) reserved for 1.8 objects, and also way outside the range of objects that would normally be allocated for 2.x MDT objects (starting at 0x200000400). It almost looks like some kind of endian bug? If that was swabbed it would be 0x200000706, which would be a very reasonable FID sequence. What does the FLDB (FID->server location mapping table) look like on your MDS? The following will dump out the FID sequence allocation tables on the node:
mds# lctl get_param seq.*.*
In particular, the "fldb" entry will show the global mapping table of sequence numbers (the first part of the FID) to servers:
seq.ctl-testfs-MDT0000.fldb=
[0x000000000000000c-0x0000000100000000):0:mdt
[0x0000000200000002-0x0000000200000003):0:mdt
[0x0000000200000007-0x0000000200000008):0:mdt
[0x0000000200000400-0x0000000240000400):0:mdt
In this case (a single-MDT filesystem) all of the allocated sequences map to MDT0000. It also shows that the rest of the unallocated "space" is reserved by the sequence controller (always on MDT0) for future assignment:
seq.ctl-testfs-MDT0000.space=[0x240000400 - 0xffffffffffffffff]:0:mdt |
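The swab hypothesis mentioned above is easy to check numerically. A minimal sketch (illustrative only, not Lustre code), reversing the byte order of the 64-bit sequence from the console errors:

```python
def bswap64(x: int) -> int:
    """Reverse the byte order of a 64-bit integer."""
    return int.from_bytes(x.to_bytes(8, "little"), "big")

# Sequence from the fld_server_lookup() errors on the MDS console.
bad_seq = 0x0607000002000000
print(hex(bswap64(bad_seq)))  # -> 0x200000706, a very plausible 2.x MDT sequence
```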
| Comment by Andreas Dilger [ 07/Nov/13 ] |
|
I don't necessarily suspect LFSCK as the culprit here. There are only about 15M inodes that were created with a 1.x MDS, while the remaining 185M were created with a 2.x MDS (by either 1.8 or 2.1 clients). It seems more likely that the DNE code is refusing to process these strange FIDs because it thinks they belong to a remote MDT. That probably wasn't being checked in the 2.1 code, since it could only handle objects on MDT0. Could you please provide a sample of FIDs that are reporting errors (e.g. grep for "Cannot find sequence" in syslog) and attach it here? I suspect that they are all byte-swapped FID sequences. The other important question is whether this problem is due to legacy objects (e.g. files created with 1.8 clients on PPC nodes), or whether they are still actively being created. Could you please try on a PPC client node and on an x86 client node:
client$ touch /path/to/lustre/testfile
client$ ls -li /path/to/lustre/testfile
client$ lfs getstripe -v /path/to/lustre/testfile
If this shows that 2.4 PPC clients (or 2.1 PPC clients, if you still have them) are still creating these objects, then this would be the first problem to find and fix. It might be possible to make a workaround for this by adding an FLDB entry to cover the sequence range [0x0000N0002000000-0xffffN0002000000):0:mdt, but I'm not sure how confused the FLDB code would get if the unallocated "space" didn't extend to 0xffffffffffffffff. It also depends on how many FID sequences were allocated in swabbed order. If a large number of these sequences were allocated (i.e. "0x0000002000N0000" is very large, and hence "0x0000N0002000000" is too), then this could encroach on the unallocated sequence space and potentially cause problems in the future. That said, with the FLDB workaround entry in place, and if no new ones were being created, it would be possible to find and migrate those inodes to have "normal" FIDs, and then remove the FLDB workaround entry to avoid issues in the future. |
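The sequence ranges described above can be turned into a small triage helper. This is a hypothetical sketch, not Lustre code: `classify_seq` and its `alloc_end` parameter are invented names, and the 0x8c800000400 bound is taken from the seq.ctl-lsd-MDT0000.space output quoted later in this ticket.

```python
def classify_seq(seq: int, alloc_end: int) -> str:
    """Bucket a FID sequence using the ranges described above.

    alloc_end is where the unallocated "space" begins on the sequence
    controller (0x8c800000400 on this filesystem).
    """
    if 0x0000000c <= seq <= 0xffffffff:
        return "igif"      # 1.8 inode-number FID
    if 0x200000400 <= seq < alloc_end:
        return "normal"    # normally allocated 2.x MDT sequence
    return "suspect"       # outside any legitimate range

print(classify_seq(0x607000002000000, 0x8c800000400))  # suspect
print(classify_seq(0x200000706, 0x8c800000400))        # normal
```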
| Comment by Andreas Dilger [ 07/Nov/13 ] |
|
It might also be useful to get the FLDB and sequence allocation information from a PPC client (both 2.4 and 2.1, if you are still running both):
client$ lctl get_param seq.*.* | grep ffff
seq.cli-cli-testfs-MDT0000-mdc-ffff8800464e6800.fid=[0x200000401:0x1:0x0]
seq.cli-cli-testfs-MDT0000-mdc-ffff8800464e6800.server=testfs-MDT0000_UUID
seq.cli-cli-testfs-MDT0000-mdc-ffff8800464e6800.space=[0x200000402 - 0x200000402]:0:mdt
seq.cli-cli-testfs-MDT0000-mdc-ffff8800464e6800.width=131072 |
| Comment by Christopher Morrone [ 07/Nov/13 ] |
|
Andreas, I am not aware of any PPC systems mounting this filesystem. Perhaps at some time in the past, I don't really know, but not now to the best of my knowledge. We are also having trouble with top-level directories in Lustre, which were certainly never created from PPC nodes. Here are some of the problem FIDs reported on the MDS console:
0x105a949837030000 0x22f78a0102000000 0x22f78a0102000003 0x2390f96017010000 0x2570c30002000000 0x2d67c37f37030000 0x2d67c37f37030001 0x2d67c37f37030002 0x2d67c37f37030003 0x2f5e2f0102000000 0x607000002000000 0x6897260102000000 0x716c574202000000 0x82d7a31e0e070013 0x8ba9660102000000 0x969f00d6c7080009 0x978f48ef0d070002 0x978f48ef0d070007 0xaf97d14acf070009 0xaf97d14acf070017 0xaf97d14acf070018 0xaf97d14acf070022 0xb7539d4402000000 0xf104000002000000
EDIT: fixed sequence list |
| Comment by Christopher Morrone [ 07/Nov/13 ] |
|
More requested info:
seq.cli-cli-lsd-OST0000-osc-MDT0000.fid=[0x0:0x0:0x0]
seq.cli-cli-lsd-OST0000-osc-MDT0000.server=lsd-OST0000_UUID
seq.cli-cli-lsd-OST0000-osc-MDT0000.space=[0x0 - 0x0]:0:mdt
seq.cli-cli-lsd-OST0000-osc-MDT0000.width=4294967295
seq.cli-cli-lsd-OST0001-osc-MDT0000.fid=[0x0:0x0:0x0]
seq.cli-cli-lsd-OST0001-osc-MDT0000.server=lsd-OST0001_UUID
seq.cli-cli-lsd-OST0001-osc-MDT0000.space=[0x0 - 0x0]:0:mdt
seq.cli-cli-lsd-OST0001-osc-MDT0000.width=4294967295
[cut, they all look the same]
seq.cli-cli-lsd-OST0257-osc-MDT0000.fid=[0x0:0x0:0x0]
seq.cli-cli-lsd-OST0257-osc-MDT0000.server=lsd-OST0257_UUID
seq.cli-cli-lsd-OST0257-osc-MDT0000.space=[0x0 - 0x0]:0:mdt
seq.cli-cli-lsd-OST0257-osc-MDT0000.width=4294967295
seq.cli-ctl-lsd-MDT0000.fid=[0x0:0x0:0x0]
seq.cli-ctl-lsd-MDT0000.server=ctl-lsd-MDT0000
seq.cli-ctl-lsd-MDT0000.space=[0x0 - 0x0]:0:mdt
seq.cli-ctl-lsd-MDT0000.width=131072
seq.ctl-lsd-MDT0000.server=<none>
seq.ctl-lsd-MDT0000.space=[0x8c800000400 - 0xffffffffffffffff]:0:mdt
seq.ctl-lsd-MDT0000.width=1073741824
seq.srv-lsd-MDT0000.server=ctl-lsd-MDT0000
seq.srv-lsd-MDT0000.space=[0x8c7da139f01 - 0x8c800000400]:0:mdt
seq.srv-lsd-MDT0000.width=1 |
| Comment by Christopher Morrone [ 07/Nov/13 ] |
|
Andreas, the only fldb file is under the fld tree, not seq as you demonstrated above. Does that mean anything?
fld.srv-lsd-MDT0000.fldb=
[0x0000000000000001-0x0000000100000000):0:mdt
[0x0000000200000002-0x0000000200000003):0:mdt
[0x0000000200000007-0x0000000200000008):0:mdt
[0x0000000200000400-0x000008c800000400):0:mdt |
| Comment by Di Wang [ 07/Nov/13 ] |
|
Ah, so [0x607000002000000:0x8a0:0x0] is outside the allocated sequence space, but I do not understand how top-level directories could end up with wrong-sequence FIDs if no PPC client was involved. I do not recall any bug that could trigger this problem, but I might be missing something. |
| Comment by Andreas Dilger [ 07/Nov/13 ] |
|
The location of the fldb file isn't important. The "fldb" entry looks consistent with the "space" entry for MDT0. It does show that the MDS thinks FID sequences are being allocated somewhat normally, from 0x200000400 onward on MDT0. It is somewhat unusual in that it appears to have allocated 0x8c7da139f01 - 0x200000400 = 9645860297473 (~= 2^43, or about 9 trillion) sequences for MDT0000. Each sequence would typically be used by one client per mount, though I recall you had a bug related to that. My theory about byte-swabbed FID sequences is out the window, I guess. The values look to be all over the place, and I can't see any obvious pattern in what would be causing this. |
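As a sanity check on the arithmetic above (illustrative only):

```python
# Allocated sequence count: controller "space" start minus first normal sequence.
allocated = 0x8c7da139f01 - 0x200000400
print(allocated)           # 9645860297473
print(allocated / 2**43)   # roughly 1.1, i.e. ~2^43 sequences
```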
| Comment by Di Wang [ 07/Nov/13 ] |
|
Another possibility might be that OI scrub screwed up the FIDs during the upgrade, but that is just a wild guess. Fan Yong, please comment here. Chris: if you create a new file on this system right now, does the file show a reasonable FID? You can get the FID simply with "lfs path2fid xxxx". Thanks. |
| Comment by Christopher Morrone [ 07/Nov/13 ] |
|
I have some more information about the PPC situation. We did have PPC clients mount this filesystem in the past. The PPC clients were running a 1.8 flavor while the servers were at 2.1. It is possible that some top-level user directories were created from a PPC login node. We are not mounting the filesystem from any PPC nodes today. |
| Comment by Christopher Morrone [ 07/Nov/13 ] |
|
$ touch bar
$ lfs path2fid bar
[0x8c7da1389ec:0x4:0x0]
Yes, that is working. |
| Comment by Andreas Dilger [ 07/Nov/13 ] |
|
Chris, would you say that the broken files are a large fraction of the files in the filesystem, or are they isolated to specific files/directories? If you know of a specific directory suffering this problem, could you please check on the MDS with a relatively new version of e2fsprogs:
mds# debugfs -c -R "stat /ROOT/path/to/bad/file" /dev/mdsdev
debugfs 1.42.7.wc1 (12-Apr-2013)
/dev/vg_sookie/lvmdt1: catastrophic mode - not reading inode or group bitmaps
Inode: 117 Type: regular Mode: 0644 Flags: 0x0
Generation: 2384158001 Version: 0x00000002:00000008
User: 0 Group: 0 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x527c1e95:00000000 -- Thu Nov 7 16:13:25 2013
atime: 0x527c1e95:00000000 -- Thu Nov 7 16:13:25 2013
mtime: 0x527c1e95:00000000 -- Thu Nov 7 16:13:25 2013
crtime: 0x527c1e95:d604a370 -- Thu Nov 7 16:13:25 2013
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 00 04 00 00 02 00 00 00 04 00 00 00 00 00 00 00
" (24)
lma: fid=[0x0000000200000400:0x4:0x0] compat=0 incompat=0
lov = "d0 0b d1 0b 01 00 00 00 04 00 00 00 00 00 00 00 00 04 00 00 02 00 00 00
00 00 10 00 01 00 00 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
0 00 01 00 00 00 " (56)
link = "df f1 ea 11 01 00 00 00 2d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
0 00 15 00 00 00 02 00 00 04 00 00 00 00 03 00 00 00 00 66 6f 6f " (45)
BLOCKS:
mds# debugfs -c -R "ls -lD /ROOT/path/to/bad" /dev/mdsdev
debugfs 1.42.7.wc1 (12-Apr-2013)
/dev/vg_sookie/lvmdt1: catastrophic mode - not reading inode or group bitmaps
116 40755 (2) 0 0 4096 7-Nov-2013 16:13 .
229388 40755 (18) 0 0 4096 7-Nov-2013 16:13 [0x200000007:0x1:0x0] ..
117 100644 (17) 0 0 0 7-Nov-2013 16:13 [0x200000400:0x4:0x0] foo
118 100644 (17) 0 0 0 7-Nov-2013 16:13 [0x200000400:0x5:0x0] bar
This will tell us whether the FIDs in the LMA xattr, the LOV xattr, and the directory entry data agree. It will also tell us whether the file was recently created or is old. It would also be useful to use the "stat" command to check the crtime of several of the files with bad FIDs, to see if there is any consistency in new/old creation dates. What Lustre version(s) are the clients on this system? |
| Comment by Andreas Dilger [ 07/Nov/13 ] |
|
Ah, ignore the path2fid request in my previous comment, I see you already did that for Di. |
| Comment by Di Wang [ 07/Nov/13 ] |
|
I just made a temporary patch http://review.whamcloud.com/8213 to make the FLDB "work", i.e. all non-existent sequences will point to MDT0. Since you are using a single MDT, this can make the FLDB work. But since we do not know the real problem yet, I am not sure whether this temporary patch will make the FS "work" again (i.e. I do not know whether the OI table is good right now). Still, it will help us understand the problem. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
The problematic top-level directories are easiest to find:
# > debugfs.ldiskfs -c -R "stat /ROOT/dkp" /dev/sda
debugfs.ldiskfs 1.42.7.wc1.1chaos (12-Apr-2013)
/dev/sda: catastrophic mode - not reading inode or group bitmaps
Inode: 66529648 Type: directory Mode: 0700 Flags: 0x80000
Generation: 607576858 Version: 0x0000002e:0ff37cd7
User: 41679 Group: 41679 Size: 4096
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4f31ae1d:00000000 -- Tue Feb 7 15:05:01 2012
atime: 0x524106c8:00000000 -- Mon Sep 23 20:28:08 2013
mtime: 0x4f31ae1d:00000000 -- Tue Feb 7 15:05:01 2012
crtime: 0x4f31ae1d:e22ed204 -- Tue Feb 7 15:05:01 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 00 00 00 02 00 00 07 06 4f 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (64)
lma: fid=[0x200000706:0x4f090000:0x0] compat=0 incompat=0
link = "df f1 ea 11 01 00 00 00 2d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 15 00 00 00 00 01 e6 00 01 fb 01 ac dd 00 00 00 00 64 6b 70 " (45)
lov = "d0 0b d1 0b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 ff ff " (32)
EXTENTS:
(0):33264747
# > debugfs.ldiskfs -c -R "ls -lD /ROOT" /dev/sda | grep dkp
debugfs.ldiskfs 1.42.7.wc1.1chaos (12-Apr-2013)
/dev/sda: catastrophic mode - not reading inode or group bitmaps
66529648 40700 (2) 41679 41679 4096 7-Feb-2012 15:05 dkp
This directory on a Lustre client looks like:
d?????????? ? ? ? ? ? dkp
Clients are mostly 2.1 flavor still, although there may be some 2.4 out there. We're in the (much longer than hoped) process of upgrading the servers before we do the client upgrades. |
| Comment by Andreas Dilger [ 08/Nov/13 ] |
|
In this case "fid=[0x200000706:0x4f090000:0x0]" has a sequence that looks reasonable ("0x200000706"), though the OID looks strange ("0x4f090000"); it should normally be below 128k. In any case, this in itself shouldn't be causing the "can't find sequence" errors, since this particular sequence is valid and should map to MDT0 properly. The FID in the LMA does not match the one in the LOV EA (which appears to be all zero), but since this is a directory and not a regular file, that is not a problem; I just wanted to see whether this data was inconsistent. Could you do this same step with a regular file? It also looks like your version of debugfs.ldiskfs either does not implement the "-D" option, or the top-level directory does not have dirdata holding the FID. Since I also have an MDT that has existed since Lustre 1.6 or earlier (currently running a 2.1.3 server with 2.4 clients), I'm going to try upgrading it to 2.4.1 to see what happens. |
| Comment by Di Wang [ 08/Nov/13 ] |
|
Hmm, the sequence seems correct, so Andreas's swabbing theory might be right. But the OID seems too big here (the maximum OID of an MDT FID should be 0x20000ULL):
lma: fid=[0x200000706:0x4f090000:0x0] compat=0 incompat=0
Chris, could you please tell me the client version (2.1.6?), and whether there are any tags? What is the output of "lfs path2fid dkp"? Could you please provide a -1 debug log of the client/MDT when doing "stat dkp"? (Please clear the client cache first with "lctl set_param ldlm.*.MDT-mdc*.lru_size=0".) Thanks. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
The client I am on happens to be running the same Lustre version, 2.4.0-19chaos. There are probably Lustre 2.1.4-[45]chaos clients mounting the filesystem as well.
> lfs path2fid dkp
can't get fid for dkp: Input/output error
I'll work on getting logs. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
I spot-checked some working top-level directories, and large OID numbers look pretty common. Here are some FID EAs from working directories:
lma: fid=[0xd6e35b0102000000:0xa3b20000:0x0] compat=0 incompat=0
lma: fid=[0x7c83e80100000000:0x716125c4:0x0] compat=0 incompat=0
Using path2fid, those two are reported on the client, respectively, as:
[0x2015be3d6:0xb2a3:0x0]
[0x1e8837c:0xc4256171:0x0]
So it would appear that debugfs is printing the FID info in the wrong byte order. (Both client and server are x86_64, running the same 2.4.0-19chaos version of Lustre.) |
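The wrong-byte-order behavior above is easy to reproduce numerically. A minimal sketch (illustrative only, not Lustre or e2fsprogs code): pack a FID little-endian, as the LMA stores it on x86_64, then decode it big-endian, as the buggy debugfs does.

```python
import struct

def debugfs_print_fid(seq: int, oid: int, ver: int = 0) -> str:
    """Mimic the buggy path: pack a FID little-endian (as the LMA stores
    it on x86_64), then decode it as big-endian (as this debugfs does)."""
    raw = struct.pack("<QII", seq, oid, ver)
    bseq, boid, bver = struct.unpack(">QII", raw)
    return f"[{bseq:#x}:{boid:#x}:{bver:#x}]"

# FID that lfs path2fid reports for one of the working directories:
print(debugfs_print_fid(0x2015be3d6, 0xb2a3))
# -> [0xd6e35b0102000000:0xa3b20000:0x0], matching the debugfs lma output
```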
| Comment by Di Wang [ 08/Nov/13 ] |
|
Interesting. Hmm, what is your debugfs version? debugfs is supposed to print things as little-endian on x86_64.
static inline void lfsck_swab_fid(struct lu_fid *fid)
{
fid->f_seq = ext2fs_le64_to_cpu(fid->f_seq);
fid->f_oid = ext2fs_le32_to_cpu(fid->f_oid);
fid->f_ver = ext2fs_le32_to_cpu(fid->f_ver);
}
static void print_lmastr(FILE *out, ext2_ino_t inode_num, void *data, int len)
{
struct lustre_mdt_attrs *lma = data;
if (len < sizeof(*lma)) {
fprintf(stderr, "%s: error: lma for inode %u smaller than "
"expected (%d bytes).\n",
debug_prog_name, inode_num, len);
return;
}
lfsck_swab_fid(&lma->lma_self_fid);
fprintf(out, " lma: fid="DFID"\n", PFID(&lma->lma_self_fid));
}
Hmm, if debugfs is wrong here, then the FID we got in the previous comment is wrong: fid=[0x200000706:0x4f090000:0x0]. That would mean the FID ([0x200000706:0x4f090000:0x0]) stored in the LMA is wrong. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
See attached client_log.txt, and serveR_log.txt.bz2. Note that client nid is 192.168.115.67@o2ib10, and the server nid is 172.16.64.141@tcp. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
There is too much MDS traffic to reasonably catch that one RPC with -1 debugging. I backed it off to our defaults + rpctrace on the server side. Let me know if there is some other conservative debug setting that would work for you. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
debugfs version is 1.42.7.wc1.1chaos |
| Comment by Di Wang [ 08/Nov/13 ] |
|
Unfortunately, the server log here is not very helpful:
00000100:00100000:9.0:1383878901.757322:0:7277:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1451068745273592
00000100:00100000:9.0:1383878901.757330:0:7277:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-192.168.115.67@o2ib10, seq: 610971421
00000100:00100000:9.0:1383878901.757333:0:7277:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt02_044:249dc688-8ad3-1d27-4460-e739bfdc22f5+5:68652:x1451068745273592:12345-192.168.115.67@o2ib10:101
80000000:00020000:9.0:1383878901.757376:0:7277:0:(fld_handler.c:169:fld_server_lookup()) srv-lsd-MDT0000: Cannot find sequence 0x607000002000000: rc = -5
00000004:00020000:9.0:1383878901.757380:0:7277:0:(osd_handler.c:2125:osd_fld_lookup()) lsd-MDT0000-osd: cannot find FLD range for [0x607000002000000:0x94f:0x0]: rc = -5
00000004:00020000:9.0:1383878901.757382:0:7277:0:(osd_handler.c:3317:osd_remote_fid()) lsd-MDT0000-osd: Can not lookup fld for [0x607000002000000:0x94f:0x0]
80000000:00020000:9.0:1383878901.757394:0:7277:0:(fld_handler.c:169:fld_server_lookup()) srv-lsd-MDT0000: Cannot find sequence 0x607000002000000: rc = -5
00000100:00100000:9.0:1383878901.757418:0:7277:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt02_044:249dc688-8ad3-1d27-4460-e739bfdc22f5+5:68652:x1451068745273592:12345-192.168.115.67@o2ib10:101 Request procesed in 85us (105us total) trans 0 rc 301/301
This is the log I found, almost the same as what we got from the console. Could you please also add "+info +inode" on the server side ("lctl set_param debug "+info +inode"")? Hopefully that will not be too much. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
What is your version of e2fsprogs? The code at the 1.42.7.wc1 tag is:
fid_be_to_cpu(&lma->lma_self_fid, &lma->lma_self_fid);
fprintf(out, " lma: fid="DFID" compat=%x incompat=%x\n",
PFID(&lma->lma_self_fid), ext2fs_le32_to_cpu(lma->lma_compat),
ext2fs_le32_to_cpu(lma->lma_incompat));
That does the opposite swab of the one you showed:
static inline void fid_be_to_cpu(struct lu_fid *dst, struct lu_fid *src)
{
dst->f_seq = ext2fs_be64_to_cpu(src->f_seq);
dst->f_oid = ext2fs_be32_to_cpu(src->f_oid);
dst->f_ver = ext2fs_be32_to_cpu(src->f_ver);
}
|
| Comment by Di Wang [ 08/Nov/13 ] |
|
Ah, in 1.42.7 debugfs is somewhat wrong.
static void print_lmastr(FILE *out, ext2_ino_t inode_num, void *data, int len)
{
struct lustre_mdt_attrs *lma = data;
if (len < offsetof(typeof(*lma), lma_self_fid) +
sizeof(lma->lma_self_fid)) {
fprintf(stderr, "%s: error: LMA for inode %u smaller than "
"expected (%d bytes).\n",
debug_prog_name, inode_num, len);
return;
}
fid_be_to_cpu(&lma->lma_self_fid, &lma->lma_self_fid);
fprintf(out, " lma: fid="DFID" compat=%x incompat=%x\n",
PFID(&lma->lma_self_fid), ext2fs_le32_to_cpu(lma->lma_compat),
ext2fs_le32_to_cpu(lma->lma_incompat));
}
It tries to convert the FID from big-endian, but the LMA actually stores the FID as little-endian. |
| Comment by Di Wang [ 08/Nov/13 ] |
|
Chris: do you still have PPC clients (2.1) attached to this server? Could you create a directory from a PPC client and then use debugfs to see whether the FID in the LMA is correct? |
| Comment by Di Wang [ 08/Nov/13 ] |
|
Oh, my debugfs version is 1.42.3wc3. I created a ticket. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
No, we do not have PPC clients attached to this filesystem. |
| Comment by nasf (Inactive) [ 08/Nov/13 ] |
|
It seems that the "oid" in the FID is correct, but the "seq" is in the wrong byte order. It is quite possible that the FIDs in the LMA, in the directory entry, and in the OI mapping are all consistent with one another, but all with the wrong "seq" order. The issue has likely been there since Lustre 2.1, but because the Lustre 2.1 MDS did not verify against the FLDB, it worked before. On the other hand, not all of the FIDs have the wrong "seq" order: the FIDs [0x2015be3d6:0xb2a3:0x0] and [0x1e8837c:0xc4256171:0x0] are valid. I suspect that only some special client, such as a PPC client, ever generated invalid FIDs and sent them to the Lustre 2.1 MDT on create (such as for "/ROOT/dkp"). So if possible, we could downgrade the MDS to Lustre 2.1 and test whether that is the case. |
| Comment by Di Wang [ 08/Nov/13 ] |
|
During the upgrade (2.1 to 2.4), OI scrub is unlikely to touch the FIDs (either in the LMA or in the name entry). But the FID (in the LMA) of "dkp" is clearly wrong (note: the FID in the name entry must be wrong too; we can see that from the debug message). So it is probably not a problem of 2.4 or the upgrade; it is more likely a bug that already existed in 2.1, though we suspect it is related to a PPC client. As Fan Yong said, we do not do the FLD lookup (check) in 2.1, so it "works" on 2.1. |
| Comment by Di Wang [ 08/Nov/13 ] |
|
One possible solution here might be (probably there are better ones). But we still need to find out why these "big-endian" FIDs were stored in the LMA and name entries. Is it related to PPC clients? Does it still exist on 2.1? Andreas, do you recall any related bugs? |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
And the MDT just happily wrote the bogus sequence to disk...that sounds like at least two bugs. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
I have logs if you still want them. Where can I upload large files these days? Is ftp.whamcloud.com still the place? |
| Comment by Di Wang [ 08/Nov/13 ] |
|
According to the timestamps of dkp:
ctime: 0x4f31ae1d:00000000 -- Tue Feb 7 15:05:01 2012
atime: 0x524106c8:00000000 -- Mon Sep 23 20:28:08 2013
mtime: 0x4f31ae1d:00000000 -- Tue Feb 7 15:05:01 2012
crtime: 0x4f31ae1d:e22ed204 -- Tue Feb 7 15:05:01 2012
What Lustre version was running in Feb 2012? |
| Comment by Di Wang [ 08/Nov/13 ] |
|
Clearly, b2_1 did not do a good job here. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
In Dec 2012 it was running 2.1.2-3chaos. So either that or something earlier. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
I just uploaded |
| Comment by Di Wang [ 08/Nov/13 ] |
|
Chris: thanks for the debug log. It is now clear that the FID (of dkp) in the LMA and the name entry are both wrong:
LU-4226_server_log2.txt:00000020:00000040:8.0:1383880710.013496:0:7320:0:(lustre_handles.c:114:class_handle_hash()) added object ffff88051775fc00 with handle 0x57ab4f35e04c8dd2 to hash
LU-4226_server_log2.txt:00010000:00000040:8.0:1383880710.013497:0:7320:0:(ldlm_resource.c:1423:ldlm_resource_dump()) --- Resource: ffff8805ece16d00 (31850497/4211191005/0/2365253) (rc: 1)
LU-4226_server_log2.txt:80000000:00020000:8.0:1383880710.013533:0:7320:0:(fld_handler.c:169:fld_server_lookup()) srv-lsd-MDT0000: Cannot find sequence 0x607000002000000: rc = -5
LU-4226_server_log2.txt:00000004:00020000:8.0:1383880710.046488:0:7320:0:(osd_handler.c:2125:osd_fld_lookup()) lsd-MDT0000-osd: cannot find FLD range for [0x607000002000000:0x94f:0x0]: rc = -5
LU-4226_server_log2.txt:00000004:00020000:8.0:1383880710.086690:0:7320:0:(osd_handler.c:3317:osd_remote_fid()) lsd-MDT0000-osd: Can not lookup fld for [0x607000002000000:0x94f:0x0]
LU-4226_server_log2.txt:00000004:00000040:8.0:1383880710.119477:0:7320:0:(mdt_handler.c:2384:mdt_object_find()) Find object for [0x607000002000000:0x94f:0x0]
LU-4226_server_log2.txt:00000004:00000040:8.0:1383880710.119485:0:7320:0:(mdt_handler.c:5018:mdt_object_init()) object init, fid = [0x607000002000000:0x94f:0x0]
LU-4226_server_log2.txt:80000000:00020000:8.0:1383880710.119493:0:7320:0:(fld_handler.c:169:fld_server_lookup()) srv-lsd-MDT0000: Cannot find sequence 0x607000002000000: rc = -5
LU-4226_server_log2.txt:00000004:00000040:8.0:1383880710.119500:0:7320:0:(mdt_handler.c:5038:mdt_object_free()) object free, fid = [0x607000002000000:0x94f:0x0]
LU-4226_server_log2.txt:00010000:00000040:8.0:1383880710.119505:0:7320:0:(ldlm_lock.c:888:ldlm_lock_decref_internal()) forcing cancel of local lock
It seems the OID is correct, but the sequence 0x607000002000000 is wrong (likely wrongly swapped). So the error probably happened during the sequence allocation process; otherwise both the OID and the sequence would be wrong, since they are always swapped at the same time except during sequence allocation. I checked the 2.1.2 code and did not find anything wrong there. I will keep digging through the history. Thanks. |
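Consistent with that analysis, undoing the debugfs byte-reversal on dkp's printed LMA FID recovers exactly the FID in the server log: the sequence comes back swabbed, while the OID (0x94f) is perfectly ordinary. A minimal sketch (illustrative only, not Lustre code):

```python
def undo_debugfs_swab(printed_seq: int, printed_oid: int):
    """This debugfs byte-reverses each FID field; reversing again recovers
    what is actually stored in the LMA (little-endian on x86_64)."""
    seq = int.from_bytes(printed_seq.to_bytes(8, "big"), "little")
    oid = int.from_bytes(printed_oid.to_bytes(4, "big"), "little")
    return seq, oid

# dkp's LMA as printed by debugfs: fid=[0x200000706:0x4f090000:0x0]
seq, oid = undo_debugfs_swab(0x200000706, 0x4f090000)
print(f"[{seq:#x}:{oid:#x}:0x0]")  # -> [0x607000002000000:0x94f:0x0]
```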
| Comment by Andreas Dilger [ 08/Nov/13 ] |
|
I upgraded my home system from 2.1.3 to 2.4.1 (RHEL6 with RPMs from the b2_4 "last_successful_build" yum repo on build.whamcloud.com) to include Fan Yong's scrub fix. It would be useful to know whether these files with crazy FIDs are still being created, or whether they stopped at some point. If there are a limited number of them, it may be possible to "lfs_migrate" them to new files with Di's patch applied on the MDS. Alternately, it may also be possible to mount the filesystem as ldiskfs and delete the trusted.lma and trusted.link xattrs from each file (essentially turning the file into a 1.8 upgrade object with an IGIF FID) and then re-run LFSCK to clean up the dirdata entries and re-add the files into the OI. I haven't tested that yet, so I'm not 100% sure what effect it will have. Clients would probably need to be remounted, or at a minimum have their caches flushed. |
| Comment by Andreas Dilger [ 08/Nov/13 ] |
|
Chris, have you considered downgrading to 2.1 again to get the system back up and usable? |
| Comment by Di Wang [ 08/Nov/13 ] |
|
I checked the code since 2.1.0 and did not find anything unusual that could trigger the problem. Unfortunately, I do not have a big-endian machine to try this on 2.1.0 to see whether the problem is still there. If the system still works on 2.1, and if you can find some PPC clients to attach to it, that might help us understand the issue. Thanks. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
I think I like the idea of removing the trusted.lma and trusted.link xattrs best. Walking the filesystem through a direct ldiskfs mount won't be too time-consuming. I'll give that a try on some files in a test filesystem. |
| Comment by Christopher Morrone [ 08/Nov/13 ] |
|
I fear the downgrade option. Upgrades don't work well, and those are at least somewhat tested. Downgrades aren't even tested. |
| Comment by Peter Jones [ 08/Nov/13 ] |
|
Chris, we do test downgrades, but certainly nothing as complex as this situation. Peter |
| Comment by Andreas Dilger [ 08/Nov/13 ] |
|
Chris, Di's patch in http://review.whamcloud.com/8213 may also be a quick workaround. It weakens the "FID sanity check" that was introduced in 2.4 so that invalid FIDs are assumed to be on MDT0000. Lookups will still fail if the object doesn't exist, but it might allow the bad files to become accessible again, and it would not involve any permanent change to the filesystem, unlike deleting the LMA xattr. It would also make sense to do a "dd" backup of the MDT if possible. That should be relatively fast even if the target is just a single 4TB SATA drive (est. 5.5h for a 100MB/s source/target, extrapolating a 2TB MDT from the ~1B inodes that LFSCK scanned), and could be done with the filesystem live if necessary (a backup that needs an e2fsck is better than none). |
| Comment by Ned Bass [ 09/Nov/13 ] |
|
Andreas, we may try your method of deleting the trusted.lma and trusted.link xattrs. One thing I'm unsure about is the semantics of the lfsck_start command with regard to the supported repair types. Can we simultaneously repair the OI and the namespace (i.e. FID-in-dirent and linkEA)? That is, should we run
lctl lfsck_start -M <mdt>
lctl lfsck_start -M <mdt> -t namespace
in immediate succession, or should we wait until the OI scrub completes before starting the namespace repair, or can both types be started with one command? |
| Comment by nasf (Inactive) [ 10/Nov/13 ] |
|
"lctl lfsck_start -M <MDT>" will only trigger the OI_scrub check/repair in the background. The command returns immediately; how long the check/repair lasts depends on how many files are on the MDT. "lctl lfsck_start -M <MDT> -t namespace" will trigger OI_scrub and the namespace check/repair simultaneously; the two components run in parallel in the background. As above, the command returns immediately, and the scanning time depends on the file count on the MDT. Something to be clarified: 2) namespace scanning also trusts the FID-in-LMA. If the FID-in-LMA does not exist, it will append an IGIF-mode FID after the name entry in the directory block, and also use the IGIF for the linkEA. 3) Because the files' FIDs change on the server side, you have to re-mount the clients to purge all related cached but stale information. To be safe, it is better to first verify on a test system whether this works as expected. |
| Comment by Andreas Dilger [ 14/Nov/13 ] |
|
This is a simple script to check a filesystem for strange-looking FIDs. Usage: checkfid.sh [-v] /path/to/lustre. It will print tons of errors on the filesystem discussed in this bug, but it would be useful to run it on other filesystems (preferably before upgrading to 2.4) to see if they suffer from the same problems. |
| Comment by Ned Bass [ 15/Nov/13 ] |
|
Thanks Andreas, that's quite helpful (though I don't think this line does what you intended):
[[ -n "$LAST" && "$F" == "$LAST" ]] && LAST="" && echo "found" || continue
Incidentally, we carried out the "remove trusted.{lma,link} from bad files + lfsck" recovery procedure, and it worked pretty much as expected. |
| Comment by Andreas Dilger [ 15/Nov/13 ] |
|
Updated version of checkfid.sh program. The "restart" mechanism was added at the last minute and looked like it was working, but wasn't. |
| Comment by Andreas Dilger [ 18/Nov/13 ] |
|
Any chance to run the checkfid.sh script on any of your other filesystems? |
| Comment by Christopher Morrone [ 25/Nov/13 ] |
|
Yes, it was run (and may still be running) on four of our ldiskfs systems on the SCF. Of the four, only one had bad FIDs, and that filesystem was the one that BG/P used exclusively. That filesystem has in excess of 1 million files/directories with bad FIDs. So that would appear to be another strong correlation pointing to PPC clients and the lack of checking on the servers. |
| Comment by Ned Bass [ 05/Dec/13 ] |
|
Andreas, in case checkfid.sh is needed again, it needs to handle sequence numbers that compare as negative integers: - [[ ${SFID[1]} -ge $MAXFID ]] && echo "$F: bad SEQ $FFID" && continue
+ if [[ ${SFID[1]} -ge $MAXFID -o ${FID[1]} -lt 0 ]] ; then
+ echo "$F: bad SEQ $FFID"
+ continue
+ fi
|
| Comment by Di Wang [ 15/Apr/14 ] |
|
Ned, Chris: could you please tell me whether OI scrub fixed these bad FIDs? Is there anything else I should do for this ticket? Thanks. |
| Comment by Di Wang [ 15/Apr/14 ] |
|
Btw: we will do more FID validation on the server side in https://jira.hpdd.intel.com/browse/LU-4232. Ah, I have already attached that ticket to the sub-tasks. |
| Comment by Andreas Dilger [ 19/Apr/14 ] |
|
Di, I don't think there was any way for LFSCK to fix the bad FIDs directly. My understanding is that the LMA xattr was removed from the inodes, and then LFSCK treated this as an upgraded 1.8 filesystem with IGIF FIDs and recreated the LMA. |
| Comment by Christopher Morrone [ 21/Apr/14 ] |
|
The problem was handled as Andreas explained. If the servers now have code to prevent this problem in the first place, then the ticket is complete. |
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Time to close this out. |