[LU-7867] OI scrubber causing performance issues Created: 12/Mar/16 Updated: 15/Jun/16 Resolved: 09/May/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Major |
| Reporter: | Dustin Leverman | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
OI scrubber causing performance issues.
|
| Comments |
| Comment by Jian Yu [ 12/Mar/16 ] |
|
Hi Dustin, I checked the question with nasf, who is the expert on lfsck and OI scrub. He had some more questions:
Could you please check whether the second one is the case you hit? Thank you. |
| Comment by Andreas Dilger [ 12/Mar/16 ] |
|
Please provide the specific error messages from the console log, otherwise it is difficult to know what actual problem is being hit. |
| Comment by Dustin Leverman [ 14/Mar/16 ] |
|
Jian,

The OI scrub is being triggered automatically with messages like these every 5 minutes or so:

    Mar 14 12:48:25 f1-oss1d8 kernel: [1379308.219915] Lustre: f1-OST00bf-osd: trigger OI scrub by RPC for [0x100000000:0x33d61e8:0x0], rc = 0 [1]

We have had to fix this a few times before by manually removing those objects; however, we were wondering if we could use lfsck to fix this problem so we don't keep running into it. Is either online or offline lfsck supported in our version of Lustre?

Thanks, |
| Comment by Dustin Leverman [ 14/Mar/16 ] |
|
To address the performance impact of having the OI scrubber start every 5 minutes, is it safe to disable the OI scrubber for that OST until we come up with a solution?

    lctl set_param osd-ldiskfs.f1-OST00bf.auto_scrub=0

Thanks, |
| Comment by nasf (Inactive) [ 14/Mar/16 ] |
|
Disabling auto_scrub (setting it to 0) will cause the related RPC handler to return -78 directly to the sender if it detects an inconsistent OI mapping. That avoids the slow performance, but the related application may then fail (because of the -78 error). According to the logs you pasted, it was destroying an OST-object that triggered the OI scrub, and the failure to destroy the OST-object left an orphan. In fact, I suspect that the OI mapping is valid and that an improper check misguided the OI scrub. I will try to make a patch to fix that. |
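To make the trade-off concrete, here is a minimal stand-alone sketch of the decision described above. It is not the actual osd_fid_lookup() code; the helper names and the mapping of -78 to EREMCHG and -115 to EINPROGRESS are assumptions for illustration:

    /* sketch_oi_lookup.c - illustration only, not Lustre source */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct sketch_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

    static void start_oi_scrub(void)
    {
            /* placeholder: the real OSD kicks off a background OI rebuild */
            printf("trigger OI scrub by RPC\n");
    }

    static int sketch_fid_lookup(const struct sketch_fid *rpc_fid,
                                 const struct sketch_fid *lma_fid,
                                 bool auto_scrub)
    {
            /* consistent mapping: the FID from the RPC matches the FID
             * stored in the object's LMA EA */
            if (rpc_fid->f_seq == lma_fid->f_seq &&
                rpc_fid->f_oid == lma_fid->f_oid)
                    return 0;

            if (!auto_scrub)
                    return -EREMCHG;        /* -78: the sender fails immediately */

            start_oi_scrub();
            return -EINPROGRESS;            /* -115: retry after the scrub */
    }

    int main(void)
    {
            /* the two FIDs from the f1-OST00bf log above */
            struct sketch_fid rpc = { 0x100000000ULL, 0x33d61e8, 0 };
            struct sketch_fid lma = { 0x100000000ULL, 0x320e401, 0 };

            printf("auto_scrub=1 -> %d\n", sketch_fid_lookup(&rpc, &lma, true));
            printf("auto_scrub=0 -> %d\n", sketch_fid_lookup(&rpc, &lma, false));
            return 0;
    }

Either way the destroy RPC does not succeed until the mapping is repaired; the auto_scrub setting only chooses between paying the scrub cost (-115 and retry) and failing fast (-78).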
| Comment by Gerrit Updater [ 15/Mar/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18912 |
| Comment by nasf (Inactive) [ 15/Mar/16 ] |
|
Dustin,

Patch 18912 fixes the false detection of an inconsistent OI mapping that triggers the OI scrub. With it you do NOT need to disable auto_scrub, and performance should recover. However, this patch cannot help with the orphan OST-objects that already exist. Since Lustre 2.5.5 does not support the new online layout LFSCK, you have to use the old lfsck (for details, please check the "27.2 lfsck" section of the old Lustre manual) to find them. If you do not care about a few orphan OST-objects (which waste some OST space, depending on the orphans' size), then just apply this patch directly. |
| Comment by Matt Ezell [ 15/Mar/16 ] |
|
Hi Nasf,

Thanks for the patch. I think I'm missing something... This file system was formatted under Lustre 2.4 and upgraded to 2.5. We have never run anything newer here*, so I would expect all of our objects to be old-style IDIF (OST index set to 0). As I understand it, the patch will now equate old-style and new-style IDIF objects to prevent the scrubber from starting because of this - that is good. But where are the new-style objids coming from? We have had ldiskfs issues on this OST before; might some version of e2fsck have "fixed" the LMA for us? Is this confusion between old-style and new-style IDIFs also causing the "error destroying object" with -115 message, or is that caused by something else?

I understand that this patch cannot recover objects that have already been orphaned, but I want to make sure we don't keep accumulating orphans. Thanks!

*Note: The servers have only ever been 2.4.x or 2.5.x. Clients have been 1.8.x, 2.4.x, 2.5.x, and 2.7.x. It's possible that a test client may have used a 2.8 pre-release at some point. |
| Comment by nasf (Inactive) [ 15/Mar/16 ] |
|
The IDIF for an OST-object stored in the LMA EA on disk is in the old format, meaning the OST index is always 0, but the index in the IDIF supplied by the upper layer to look up the OST-object may be non-zero. Currently I cannot be 100% sure that this difference caused your trouble, but it is a likely suspect, and when it does happen it will cause exactly this kind of trouble. This patch verifies and prevents such a case. The patch does not change any on-disk data, so applying it is harmless for your system; the data on your disk keeps the old format, nothing is changed. When the OI scrub was triggered on your system, it tried to update the FID-in-LMA EA, but the newly generated value was still zero-indexed, the same as the original one, so there was nothing to update. That is why you see the OI scrub being triggered repeatedly and periodically by the same IDIF.

The "error destroying object" with -115 message was returned by osd_fid_lookup() when locating the OST-object to destroy. Before the OI scrub completed, the OSD could not know which local inode was mapped to the given IDIF, so the destroy operation was blocked or failed; that is how the orphan OST-objects were generated. This patch filters out the index difference and the unnecessary OI scrub, and therefore avoids more OST-object orphans. |
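The index-masking idea can be shown with a small stand-alone sketch. This is not the code from patch 18912; the exact bit layout (OST index assumed to sit in bits 16-31 of the IDIF sequence) and the helper names are illustrative assumptions:

    /* idif_sketch.c - illustration only */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct sketch_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

    #define IDIF_BASE      0x100000000ULL        /* start of the IDIF sequence range */
    #define IDIF_IDX_MASK  0x00000000ffff0000ULL /* assumed OST-index field in the seq */

    /* treat two IDIFs as the same object if they differ only in the
     * embedded OST index */
    static bool idif_match_ignoring_index(const struct sketch_fid *a,
                                          const struct sketch_fid *b)
    {
            return (a->f_seq & ~IDIF_IDX_MASK) == (b->f_seq & ~IDIF_IDX_MASK) &&
                   a->f_oid == b->f_oid;
    }

    int main(void)
    {
            /* object 0x33d61e8 on OST index 0xbf (f1-OST00bf):
             * old-style IDIF keeps index 0, new-style packs 0xbf into the seq */
            struct sketch_fid old_style = { IDIF_BASE,                   0x33d61e8, 0 };
            struct sketch_fid new_style = { IDIF_BASE | (0xbfULL << 16), 0x33d61e8, 0 };

            printf("match ignoring index: %d\n",
                   idif_match_ignoring_index(&old_style, &new_style)); /* prints 1 */
            return 0;
    }

With a check like this, the lookup no longer treats the zero-indexed on-disk IDIF and an indexed IDIF from the RPC as an inconsistent OI mapping, so the scrub is not re-triggered for them.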
| Comment by Matt Ezell [ 15/Mar/16 ] |
|
Thanks for the update. I noticed from your patch that there's already a CDEBUG for D_INODE in osd_check_lma, so I added the inode to the debug string and waited for this to trigger again. I dumped the log and found the following:

    00000010:00000002:5.0:1458059877.008993:0:21045:0:(ost_handler.c:2398:ost_handle()) destroy
    00002000:00080000:5.0:1458059877.008998:0:21045:0:(ofd_obd.c:1075:ofd_destroy()) f1-OST00bf: Destroy object 0x0:54354408 count 1
    00000004:00000002:5.0:1458059877.009023:0:21045:0:(osd_handler.c:491:osd_check_lma()) f1-OST00bf: FID [0x100000000:0x33d61e8:0x0] != self_fid [0x100000000:0x320e401:0x0]
    00000004:02000400:5.0:1458059877.009110:0:21045:0:(osd_handler.c:589:osd_fid_lookup()) f1-OST00bf-osd: trigger OI scrub by RPC for [0x100000000:0x33d61e8:0x0], rc = 0 [1]
    00000010:00000002:5.0:1458059878.032100:0:21052:0:(ost_handler.c:2398:ost_handle()) destroy
    00002000:00080000:5.0:1458059878.032104:0:21052:0:(ofd_obd.c:1075:ofd_destroy()) f1-OST00bf: Destroy object 0x0:53495634 count 1
    00000004:00000002:5.0:1458059878.032126:0:21052:0:(osd_handler.c:491:osd_check_lma()) f1-OST00bf: FID [0x100000000:0x3304752:0x0] != self_fid [0x100000000:0x320e400:0x0]
    00002000:00020000:5.0:1458059878.032129:0:21052:0:(ofd_obd.c:1096:ofd_destroy()) f1-OST00bf: error destroying object [0x100000000:0x3304752:0x0]: -115

Do those look like the same IDIF just with different indexes? I don't know how that information is packed in there, but they don't look like they match to me.

Here's some information about that first object:

    [root@f1-oss1d8 ~]# debugfs -c -R 'stat O/0/d8/54354408' /dev/mapper/f1-ddn1d-l53
    debugfs 1.42.13.wc4 (28-Nov-2015)
    /dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
    Inode: 562889   Type: regular   Mode: 0666   Flags: 0x80000
    Generation: 833716884   Version: 0x0000000e:00c7ec1c
    User: 800   Group: 502   Size: 1048576
    File ACL: 0   Directory ACL: 0
    Links: 1   Blockcount: 2048
    Fragment:  Address: 0   Number: 0   Size: 0
     ctime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
     atime: 0x566c289a:c40af440 -- Sat Dec 12 14:00:58 2015
     mtime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
    crtime: 0x552a3831:79d980f4 -- Sun Apr 12 09:17:37 2015
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      invalid EA entry in inode
    EXTENTS:
    (0-254):17259-17513, (255):17514

    [root@f1-oss1d8 ~]# rpm -qf /sbin/debugfs
    e2fsprogs-1.42.13.wc4-7.el6.x86_64

What does "invalid EA entry in inode" mean? Is this object damaged? |
| Comment by Matt Ezell [ 15/Mar/16 ] |
|
Err...

    [root@f1-oss1d8 ~]# debugfs -c -R 'stat O/0/d8/54354408' /dev/mapper/f1-ddn1d-l53
    debugfs 1.42.13.wc4 (28-Nov-2015)
    /dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
    Inode: 562889   Type: regular   Mode: 0666   Flags: 0x80000
    Generation: 833716884   Version: 0x0000000e:00c7ec1c
    User: 800   Group: 502   Size: 1048576
    File ACL: 0   Directory ACL: 0
    Links: 1   Blockcount: 2048
    Fragment:  Address: 0   Number: 0   Size: 0
     ctime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
     atime: 0x566c289a:c40af440 -- Sat Dec 12 14:00:58 2015
     mtime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
    crtime: 0x552a3831:79d980f4 -- Sun Apr 12 09:17:37 2015
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      invalid EA entry in inode
    EXTENTS:
    (0-254):17259-17513, (255):17514

    [root@f1-oss1d8 ~]# debugfs -c -R 'stat O/0/d1/52487169' /dev/mapper/f1-ddn1d-l53
    debugfs 1.42.13.wc4 (28-Nov-2015)
    /dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
    Inode: 562889   Type: regular   Mode: 0666   Flags: 0x80000
    Generation: 833716884   Version: 0x0000000e:00c7ec1c
    User: 800   Group: 502   Size: 1048576
    File ACL: 0   Directory ACL: 0
    Links: 1   Blockcount: 2048
    Fragment:  Address: 0   Number: 0   Size: 0
     ctime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
     atime: 0x566c289a:c40af440 -- Sat Dec 12 14:00:58 2015
     mtime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
    crtime: 0x552a3831:79d980f4 -- Sun Apr 12 09:17:37 2015
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      invalid EA entry in inode
    EXTENTS:
    (0-254):17259-17513, (255):17514

Note both objects claim to be inode 562889 but the link count is 1. |
| Comment by Andreas Dilger [ 15/Mar/16 ] |
|
I was going to suggest checking object O/0/d0/52487168 for the same reason. At this point, I would recommend taking this OST offline and running "e2fsck -fn" to see if there is corruption. I suspect that there will be at least some shared blocks, as it appears that at least one inode table block has been written to the wrong location, or there is some corruption in the directory entries. It is also unfortunate that the xattr data is reported as corrupted, since it could otherwise be used to determine which inodes are the right ones and which are being referenced incorrectly. This can at least partially be checked from the kernel logs. |
| Comment by Andreas Dilger [ 15/Mar/16 ] |
|
Depending on what kind of corruption is detected by e2fsck, it might be best to just let "e2fsck -fy" run and fix the problems, then run ll_recover_lost_found_objs to recover anything left in lost+found. Have there been any corruption or other disk errors reported on this system? |
| Comment by Dustin Leverman [ 15/Mar/16 ] |
|
Andreas, I don't know if this is telling or not, but $(debugfs -c -R 'stats' <LUN>) is telling us that the LUN is clean. Also, we did not open a Jira ticket on the last corruption issue; it was handled with DDN. Thanks, |
| Comment by nasf (Inactive) [ 16/Mar/16 ] |
That means the OI mapping for FID [0x100000000:0x33d61e8:0x0] points to inode 562889 via the name entry "/O/0/d8/54354408", but that name entry is wrong. The inode is correctly referenced by the name entry "/O/0/d1/52487169", which has been verified via the FID-in-LMA. The "invalid EA entry in inode" message does not mean the inode is corrupted, because the OSD could read the LMA EA. I would suggest updating e2fsprogs, which may help; and if we can get the PFID EA from inode 562889, then we can know whether the related MDT-object is still there or not. A Lustre OST-object is always singly referenced, so the unrecognised mapping entry "/O/0/d8/54354408" is wrong. I am not sure how the corruption was generated, but your site is not the first one to hit such trouble. In any case, running e2fsck to fix the disk-level corruption is the first step. |
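A toy model may make the single-reference rule concrete. Using only the values from this ticket, the FID-in-LMA of inode 562889 picks out which of the two conflicting name entries is the real one; the structures below are simplified stand-ins, not Lustre types:

    /* oi_name_check.c - which O/... name entry matches the FID-in-LMA? */
    #include <stdint.h>
    #include <stdio.h>

    struct obj_entry {
            const char *name;       /* O/<seq>/d<N>/<objid> directory entry */
            uint32_t    objid;      /* objid encoded in the name            */
            uint64_t    ino;        /* inode the entry resolves to          */
    };

    int main(void)
    {
            /* both entries resolve to inode 562889 */
            struct obj_entry entries[] = {
                    { "O/0/d8/54354408", 0x33d61e8, 562889 },
                    { "O/0/d1/52487169", 0x320e401, 562889 },
            };
            uint32_t lma_self_oid = 0x320e401;  /* FID-in-LMA of inode 562889 */

            for (unsigned i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
                    printf("%s: %s\n", entries[i].name,
                           entries[i].objid == lma_self_oid ?
                           "matches FID-in-LMA (valid)" :
                           "does not match FID-in-LMA (bogus entry)");
            return 0;
    }

Running it reports /O/0/d1/52487169 as the valid reference and /O/0/d8/54354408 as the bogus one, which is exactly the situation described above.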
| Comment by Matt Ezell [ 16/Mar/16 ] |
|
The check that is causing debugfs to mark the EA data as invalid is (!ea_inode && value + entry->e_value_size >= end), because for the LMA, entry->e_value_size = 24 and value + entry->e_value_size = end (a small stand-alone sketch of this boundary condition is at the end of this comment). I ran debugfs under gdb and manually set entry->e_value_size = 23. This resulted in the following output:

    Inode: 562889   Type: regular   Mode: 0666   Flags: 0x80000
    Generation: 833716884   Version: 0x0000000e:00c7ec1c
    User: 800   Group: 502   Size: 1048576
    File ACL: 0   Directory ACL: 0
    Links: 1   Blockcount: 2048
    Fragment:  Address: 0   Number: 0   Size: 0
     ctime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
     atime: 0x566c289a:c40af440 -- Sat Dec 12 14:00:58 2015
     mtime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
    crtime: 0x552a3831:79d980f4 -- Sun Apr 12 09:17:37 2015
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 e4 20 03 00 00 00 " (23)
      fid = "f8 95 00 22 29 03 00 02 00 00 00 00 02 00 00 00 " (16)
      fid: parent=[0x2000329220095f8:0x0:0x0] stripe=2
    EXTENTS:
    (0-254):17259-17513, (255):17514

I don't think that formatting for the parent fid is correct, and feeding it directly into fid2path results in invalid argument. I think the sequence provided is actually the sequence+objid (is this a bug? should this print a 'valid' fid?), so I tried:

    # lfs fid2path /lustre/f1 0x200032922:0x95f8:0x0
    fid2path: error on FID 0x200032922:0x95f8:0x0: No such file or directory

This is expected, since the RPC that causes this OI lookup is trying to destroy the object anyway. My guess is that the MDS has deleted its inode and added all of the objects to the llog for removal. llog processing sent an unlink RPC to the OST, but due to this damaged object it returned -115. I'm not familiar with llog processing, but I would hazard a guess that it retries the unlink (since we are seeing this RPC every ~600 seconds).

So I guess the options at this point are:
Since this has happened before (see |
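The boundary condition quoted above can be shown with a stand-alone snippet. This is not the e2fsprogs source, just the comparison as written: a ">=" test rejects an xattr value that ends exactly at the end of the in-inode xattr area, while ">" only rejects a real overflow:

    /* ea_bound_sketch.c - illustration of the ">=" vs ">" boundary check */
    #include <stdbool.h>
    #include <stdio.h>

    static bool rejected_ge(const char *value, unsigned int size, const char *end)
    {
            return value + size >= end;   /* the check quoted above */
    }

    static bool rejected_gt(const char *value, unsigned int size, const char *end)
    {
            return value + size > end;    /* only flags values running past the end */
    }

    int main(void)
    {
            char body[256];                      /* stand-in for the inode xattr area */
            const char *end = body + sizeof(body);
            const char *lma = end - 24;          /* 24-byte LMA ending exactly at end */

            printf(">= rejects the LMA: %d\n", rejected_ge(lma, 24, end)); /* 1 */
            printf(">  rejects the LMA: %d\n", rejected_gt(lma, 24, end)); /* 0 */
            return 0;
    }

This matches the observed behaviour: with e_value_size forced to 23, the value no longer ends exactly at the boundary and debugfs prints it.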
| Comment by nasf (Inactive) [ 17/Mar/16 ] |
|
The output of debugfs seems wrong, because the OSD got the self FID as [0x100000000:0x320e401:0x0], but the debugfs output is: lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 e4 20 03 00 00 00 " (23). So the value offset is wrong. Which version of e2fsprogs are you using? Is it possible to try a newer e2fsprogs? |
| Comment by Matt Ezell [ 17/Mar/16 ] |
|
Note from my previous comment that I manually set the value length to 23 to avoid the 'invalid EA' message
We are using the latest version available from Intel: e2fsprogs-1.42.13.wc4-7.el6.x86_64.rpm

    [root@f1-oss1d8 ~]# rpm -qi e2fsprogs
    Name        : e2fsprogs                  Relocations: (not relocatable)
    Version     : 1.42.13.wc4                Vendor: (none)
    Release     : 7.el6                      Build Date: Fri Dec 11 19:51:16 2015
    Install Date: Wed Feb 24 17:44:21 2016   Build Host: onyx-8-sde1-el6-x8664.onyx.hpdd.intel.com
    Group       : System Environment/Base    Source RPM: e2fsprogs-1.42.13.wc4-7.el6.src.rpm
    Size        : 3170704                    License: GPLv2
    Signature   : (none)
    URL         : https://downloads.hpdd.intel.com/public/e2fsprogs/
    Summary     : Utilities for managing ext2, ext3, and ext4 filesystems
    Description :
    The e2fsprogs package contains a number of utilities for creating, checking,
    modifying, and correcting any inconsistencies in second, third and fourth
    extended (ext2/ext3/ext4) filesystems. E2fsprogs contains e2fsck (used to
    repair filesystem inconsistencies after an unclean shutdown), mke2fs (used to
    initialize a partition to contain an empty ext2 filesystem), debugfs (used to
    examine the internal structure of a filesystem, to manually repair a corrupted
    filesystem, or to create test cases for e2fsck), tune2fs (used to modify
    filesystem parameters), and most of the other core ext2fs filesystem utilities.
    You should install the e2fsprogs package if you need to manage the performance
    of an ext2, ext3, or ext4 filesystem. |
| Comment by Gerrit Updater [ 18/Mar/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18999 |
| Comment by nasf (Inactive) [ 18/Mar/16 ] |
|
Matt, would you please apply the above patch, which fixes the debugfs issue, to your e2fsprogs, and run the patched debugfs on your system? Thanks! |
| Comment by Andreas Dilger [ 18/Mar/16 ] |
|
Sorry, I committed this comment too quickly. Something strange is going on here. I'm just checking my local MDT and OST filesystems with debugfs 1.42.12.wc1 and it is printing the xattr values properly. Checking with debugfs 1.42.13.wc4 appears to have the same problem; looking into it more closely. |
| Comment by Andreas Dilger [ 18/Mar/16 ] |
|
Updated patch http://review.whamcloud.com/18999 should now resolve the problem introduced in the upstream code. |
| Comment by Matt Ezell [ 18/Mar/16 ] |
|
I used the build from Jenkins and it now correctly shows the EAs for /O/0/d8/54354408:

    lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 e4 20 03 00 00 00 00 " (24)
    lma: fid=[0x100000000:0x320e401:0x0] compat=8 incompat=0
    fid = "f8 95 00 22 29 03 00 02 00 00 00 00 02 00 00 00 " (16)
    fid: parent=[0x2000329220095f8:0x0:0x0] stripe=2 |
| Comment by Matt Ezell [ 18/Mar/16 ] |
|
Andreas - is the parent fid being printed correctly? I ran a quick test on a non-corrupted OST:
# lfs getstripe test
test
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 123
obdidx objid objid group
123 120572 0x1d6fc 0
124 120522 0x1d6ca 0
125 120541 0x1d6dd 0
126 120607 0x1d71f 0
# lfs path2fid test
[0x200000410:0x2254:0x0]
# debugfs -c -R 'stat O/0/d28/120572' /dev/mapper/f1-ddn1a-l24
...
Extended attributes stored in inode body:
  lma = "00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 fc d6 01 00 00 00 00 00 " (24)
  lma: fid=[0x100000000:0x1d6fc:0x0] compat=0 incompat=0
  fid = "54 22 00 10 04 00 00 02 00 00 00 00 00 00 00 00 " (16)
  fid: parent=[0x200000410002254:0x0:0x0] stripe=0

Should I open a separate ticket for that? |
| Comment by Andreas Dilger [ 19/Mar/16 ] |
|
No, the parent doesn't look correct. It should exactly match the output from lfs path2fid, but it looks offset by 3 bytes for some reason. We can add the fix into the same patch, no need for a new ticket. |
| Comment by nasf (Inactive) [ 21/Mar/16 ] |
|
Matt, Is the "test" an existing file (updated from old device) or new created one? If it is the latter case, how do you generate it? and would you please to retest test with -1 level debug collected on the client and related OST? Your system is Lustre-2.5.5 based (both client and server), right? any additional patches on that? (BTW, I have tried Lustre-2.5.5 in my local VM environment, but cannot reproduce your trouble - strange PFID EA. So I need more logs from you.) Thanks! |
| Comment by Matt Ezell [ 21/Mar/16 ] |
|
For this file system, we have a mixture of clients - most are 2.5, but we also have some 1.8 and 2.7. I just ran some tests, and it seems that new files created under 2.5 or 2.7 are OK. But the file I created on the 1.8 client shows the strange PFID EA. The 1.8 clients are going away soon, so no need to debug that client, but we do have a ton of files on disk with the strange PFID EA. My main concern is that a future lfsck run (under 2.8 servers, eventually) might mark these files as damaged and/or orphaned. I'll run some tests with debugging later today and upload relevant logs. Thanks. |
| Comment by nasf (Inactive) [ 21/Mar/16 ] |
|
It is the client that sends the PFID information to the OST on write/setattr. According to your description, the 1.8 client may have sent an "ostid"-formatted PFID to the OST, which caused the strange PFID EA. Usually, as long as the OST-object is NOT a real orphan, then after upgrading your system to Lustre 2.6 or newer, the layout LFSCK will handle it as an unmatched MDT-OST object pair and will correct the PFID EA with the right MDT-object's FID. |
| Comment by Matt Ezell [ 21/Mar/16 ] |
|
I think the strange PFID EA is coming from fid_flatten. It takes the FID and does

    ino = (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) + fid_oid(fid);

to make the 128-bit FID fit in a 64-bit inode number. I didn't specifically see this in the log, but I believe this is what the client is setting as the PFID (a small sketch reproducing this arithmetic against the values in this ticket is at the end of this comment). Since the 1.8 clients are going away, it's not worth looking into further. We'll make sure to test LFSCK in 2.8 against objects created under 1.8 before upgrading the servers.

So, back to the original issue at hand. I guess we need to take a downtime to run e2fsck. There could be two outcomes:
If situation #2 happens, should we manually unlink them? Then, how do we ensure there aren't more damaged objects on this OST? |
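The fid_flatten guess checks out arithmetically. A minimal sketch, using only the formula quoted above (not the client code itself), reproduces the two parent values that debugfs printed from the FIDs seen in this ticket:

    /* flatten_check.c - reproduce the quoted fid_flatten() arithmetic */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t flatten(uint64_t seq, uint32_t oid)
    {
            /* ino = (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) + fid_oid(fid) */
            return (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) + oid;
    }

    int main(void)
    {
            /* the FID fed to fid2path earlier, [0x200032922:0x95f8:0x0]:
             * prints 0x2000329220095f8, exactly the parent shown by debugfs
             * for O/0/d8/54354408 */
            printf("0x%" PRIx64 "\n", flatten(0x200032922ULL, 0x95f8));

            /* the "test" file, where lfs path2fid gave [0x200000410:0x2254:0x0]:
             * prints 0x200000410002254, exactly the parent shown by debugfs
             * for O/0/d28/120572 */
            printf("0x%" PRIx64 "\n", flatten(0x200000410ULL, 0x2254));
            return 0;
    }

Both outputs match the "offset by 3 bytes" parent values byte for byte, which is consistent with an old client writing the flattened 64-bit value into the sequence field of the PFID EA.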
| Comment by Andreas Dilger [ 23/Mar/16 ] |
|
I suspect that e2fsck will just consider these files as hard linked, so you will find them by looking for regular files under O/0/ with nlink=2. |
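For reference, a rough stand-alone sketch of that scan, assuming the OST's ldiskfs is mounted read-only at /mnt/ost (a made-up path; adjust to the real mount point):

    /* find_nlink2.c - list regular files under O/0 with a link count of 2 */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static int check(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf)
    {
            (void)ftwbuf;
            if (typeflag == FTW_F && S_ISREG(sb->st_mode) && sb->st_nlink == 2)
                    printf("suspect object: %s (inode %llu)\n",
                           path, (unsigned long long)sb->st_ino);
            return 0;       /* keep walking */
    }

    int main(void)
    {
            /* the mount point is an assumption for this sketch */
            return nftw("/mnt/ost/O/0", check, 64, FTW_PHYS | FTW_MOUNT) ? 1 : 0;
    }

A plain "find O/0 -type f -links 2" on the mounted ldiskfs does the same job; the point is simply that any extra link on an OST object is suspicious, since OST objects are normally singly referenced.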
| Comment by Gerrit Updater [ 29/Mar/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/18999/ |
| Comment by nasf (Inactive) [ 12/Apr/16 ] |
|
Is there any further information about the issue? Have you run e2fsck as discussed above? |
| Comment by Gerrit Updater [ 12/Apr/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/19493 |
| Comment by Dustin Leverman [ 13/Apr/16 ] |
|
We are in a holding pattern on this until the customer allows us to take an outage. It will probably be another 5-6 weeks before we get the green light. I think it is okay to close this ticket. Our plan is to run e2fsck on the LUN; if that doesn't fix the issue, we will wipe out the 2 OST objects by hand and run lfsck when we update to Lustre 2.8 servers. Thanks, |
| Comment by nasf (Inactive) [ 13/Apr/16 ] |
|
Thanks Dustin for the explanation. Then we will close the ticket after landing Andreas' latest patch. |
| Comment by Gerrit Updater [ 02/May/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/19493/ |
| Comment by nasf (Inactive) [ 09/May/16 ] |
|
Related patches have been landed. |