  Lustre / LU-7867

OI scrubber causing performance issues

Details

    • Type: Question/Request
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.5.5

    Description

      OI scrubber causing performance issues.

      • There are orphaned objects that are causing the OI scrubber to start.
      • Is there an online or offline way of running lfsck at the moment?

    Attachments

    Issue Links

    Activity

            [LU-7867] OI scrubber causing performance issues
            ezell Matt Ezell added a comment -

            I used the build from Jenkins and it now correctly shows the EAs for /O/0/d8/54354408:

              lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 e4 20 03 00 00 00 00 " (24)
              lma: fid=[0x100000000:0x320e401:0x0] compat=8 incompat=0
              fid = "f8 95 00 22 29 03 00 02 00 00 00 00 02 00 00 00 " (16)
              fid: parent=[0x2000329220095f8:0x0:0x0] stripe=2
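
            For reference, a minimal sketch (not the OSD or e2fsprogs code, just an illustration assuming the little-endian u32 compat / u32 incompat / u64 seq / u32 oid / u32 ver layout implied by the decoded line above) of how those 24 lma bytes break down into the fields debugfs printed:

              #include <stdint.h>
              #include <stdio.h>
              #include <string.h>

              /* Decode the raw trusted.lma value dumped by debugfs above.
               * Assumed layout: u32 compat, u32 incompat, u64 fid seq, u32 fid oid,
               * u32 fid ver, all little-endian (a little-endian host is assumed). */
              int main(void)
              {
                  const uint8_t lma[24] = {
                      0x08, 0x00, 0x00, 0x00,                          /* compat = 8   */
                      0x00, 0x00, 0x00, 0x00,                          /* incompat = 0 */
                      0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,  /* fid seq      */
                      0x01, 0xe4, 0x20, 0x03,                          /* fid oid      */
                      0x00, 0x00, 0x00, 0x00,                          /* fid ver      */
                  };
                  uint32_t compat, incompat, oid, ver;
                  uint64_t seq;

                  memcpy(&compat,   lma,      sizeof(compat));
                  memcpy(&incompat, lma + 4,  sizeof(incompat));
                  memcpy(&seq,      lma + 8,  sizeof(seq));
                  memcpy(&oid,      lma + 16, sizeof(oid));
                  memcpy(&ver,      lma + 20, sizeof(ver));

                  /* Prints: lma: fid=[0x100000000:0x320e401:0x0] compat=8 incompat=0 */
                  printf("lma: fid=[0x%llx:0x%x:0x%x] compat=%u incompat=%u\n",
                         (unsigned long long)seq, oid, ver, compat, incompat);
                  return 0;
              }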
            

            adilger Andreas Dilger added a comment -

            Updated patch http://review.whamcloud.com/18999 should now resolve the problem introduced in the upstream code.
            adilger Andreas Dilger added a comment - edited

            Sorry, committed this comment too quickly.

            Something strange is going on here. I'm just checking my local MDT and OST filesystems with debugfs 1.42.12.wc1 and it is printing the xattr values properly. Checking with debugfs 1.42.13.wc4 does show the same problem; I'm looking into it more closely.


            yong.fan nasf (Inactive) added a comment -

            Matt,

            Would you please apply the above patch, which fixes the debugfs issue, to your e2fsprogs, and then run the patched debugfs on your system?

            Thanks!

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18999
            Subject: LU-7867 debugfs: locate EA value correctly
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 4f64aa8135d0865fba88c3ee49981362ec0cb994
            ezell Matt Ezell added a comment -

            Note from my previous comment that I manually set the value length to 23 to avoid the 'invalid EA' message:

            I ran debugfs under gdb and manually set entry->e_value_size = 23.

            We are using the latest version available from Intel: e2fsprogs-1.42.13.wc4-7.el6.x86_64.rpm

            [root@f1-oss1d8 ~]# rpm -qi e2fsprogs
            Name        : e2fsprogs                    Relocations: (not relocatable)
            Version     : 1.42.13.wc4                       Vendor: (none)
            Release     : 7.el6                         Build Date: Fri Dec 11 19:51:16 2015
            Install Date: Wed Feb 24 17:44:21 2016         Build Host: onyx-8-sde1-el6-x8664.onyx.hpdd.intel.com
            Group       : System Environment/Base       Source RPM: e2fsprogs-1.42.13.wc4-7.el6.src.rpm
            Size        : 3170704                          License: GPLv2
            Signature   : (none)
            URL         : https://downloads.hpdd.intel.com/public/e2fsprogs/
            Summary     : Utilities for managing ext2, ext3, and ext4 filesystems
            Description :
            The e2fsprogs package contains a number of utilities for creating,
            checking, modifying, and correcting any inconsistencies in second,
            third and fourth extended (ext2/ext3/ext4) filesystems. E2fsprogs
            contains e2fsck (used to repair filesystem inconsistencies after an
            unclean shutdown), mke2fs (used to initialize a partition to contain
            an empty ext2 filesystem), debugfs (used to examine the internal
            structure of a filesystem, to manually repair a corrupted
            filesystem, or to create test cases for e2fsck), tune2fs (used to
            modify filesystem parameters), and most of the other core ext2fs
            filesystem utilities.
            
            You should install the e2fsprogs package if you need to manage the
            performance of an ext2, ext3, or ext4 filesystem.
            

            yong.fan nasf (Inactive) added a comment -

            The output of debugfs seems wrong: the OSD got the self FID as [0x100000000:0x320e401:0x0], but the debugfs output is lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 e4 20 03 00 00 00 " (23). So the value offset is wrong.

            Which version of e2fsprogs are you using? Is it possible to try a newer e2fsprogs?
            Before we resolve the debugfs issue, please do NOT remove these OST-objects manually, because we are not sure whether they are really the targets to be destroyed.
            ezell Matt Ezell added a comment -

            The check that is causing debugfs to mark the EA data as invalid is

            (!ea_inode && value + entry->e_value_size >= end)
            

            because for LMA, entry->e_value_size = 24 and value + entry->e_value_size = end. I ran debugfs under gdb and manually set entry->e_value_size = 23. This resulted in the following output:

            Inode: 562889   Type: regular    Mode:  0666   Flags: 0x80000
            Generation: 833716884    Version: 0x0000000e:00c7ec1c
            User:   800   Group:   502   Size: 1048576
            File ACL: 0    Directory ACL: 0
            Links: 1   Blockcount: 2048
            Fragment:  Address: 0    Number: 0    Size: 0
             ctime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
             atime: 0x566c289a:c40af440 -- Sat Dec 12 14:00:58 2015
             mtime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
            crtime: 0x552a3831:79d980f4 -- Sun Apr 12 09:17:37 2015
            Size of extra inode fields: 28
            Extended attributes stored in inode body: 
              lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 e4 20 03 00 00 00 " (23)
              fid = "f8 95 00 22 29 03 00 02 00 00 00 00 02 00 00 00 " (16)
              fid: parent=[0x2000329220095f8:0x0:0x0] stripe=2
            EXTENTS:
            (0-254):17259-17513, (255):17514
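
            As an aside, a minimal sketch of the boundary condition described above (simplified names, not the actual e2fsprogs source): when an in-inode EA value ends exactly at the end of the xattr area, value + e_value_size equals end, so a >= comparison flags the entry as invalid even though the value is still fully inside the buffer:

              #include <stdint.h>
              #include <stdio.h>

              /* Simplified model of the in-inode EA bounds check quoted above;
               * the real debugfs code and structures differ. */
              int main(void)
              {
                  char  ibody[256];                    /* stand-in for the inode xattr area   */
                  char *end   = ibody + sizeof(ibody);
                  char *value = end - 24;              /* a 24-byte LMA value ending at 'end' */
                  uint32_t e_value_size = 24;
                  int ea_inode = 0;                    /* value not stored in an EA inode     */

                  /* Mirrors the quoted check: it trips when value + size == end. */
                  if (!ea_inode && value + e_value_size >= end)
                      printf("marked as invalid EA entry (value + e_value_size == end)\n");

                  /* A strict '>' comparison would accept this particular entry. */
                  if (!(value + e_value_size > end))
                      printf("a '>' comparison would have accepted it\n");

                  return 0;
              }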
            

            I don't think the formatting for the parent fid is correct, and feeding it directly into fid2path results in 'invalid argument'. I think the sequence provided is actually the sequence+objid packed together (is this a bug? should this print a 'valid' fid?), so I tried:

            # lfs fid2path /lustre/f1 0x200032922:0x95f8:0x0
            fid2path: error on FID 0x200032922:0x95f8:0x0: No such file or directory
            

            This is expected, since the RPC that causes this OI lookup is trying to destroy the object anyway. My guess is that the MDS deleted its inode and added all the objects to the llog for removal; llog processing then sent an unlink RPC to the OST, but due to this damaged object it returned -115. I'm not familiar with llog processing, but I would hazard a guess that it retries the unlink (since we are seeing this RPC every ~600 seconds).
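
            To make the "sequence+objid" reading concrete, here is a small sketch (the split is only the interpretation being tested above, not a documented format) that decodes those 16 fid bytes as a little-endian u64 plus two u32s and then applies the split behind the FID tried with fid2path, i.e. the low 24 bits of the raw field as the object id and the remaining bits as the sequence:

              #include <stdint.h>
              #include <stdio.h>
              #include <string.h>

              /* Decode the 16-byte "fid" (PFID) value dumped by debugfs above,
               * assuming a little-endian u64 followed by two u32s, which matches
               * the parent=[...] stripe=2 line it printed. */
              int main(void)
              {
                  const uint8_t fidbuf[16] = {
                      0xf8, 0x95, 0x00, 0x22, 0x29, 0x03, 0x00, 0x02,   /* raw "seq" field */
                      0x00, 0x00, 0x00, 0x00,                           /* oid             */
                      0x02, 0x00, 0x00, 0x00,                           /* stripe index    */
                  };
                  uint64_t raw_seq;
                  uint32_t oid, stripe;

                  memcpy(&raw_seq, fidbuf,      sizeof(raw_seq));   /* little-endian host assumed */
                  memcpy(&oid,     fidbuf + 8,  sizeof(oid));
                  memcpy(&stripe,  fidbuf + 12, sizeof(stripe));

                  /* As printed by debugfs: parent=[0x2000329220095f8:0x0:0x0] stripe=2 */
                  printf("raw parent=[0x%llx:0x%x:0x0] stripe=%u\n",
                         (unsigned long long)raw_seq, oid, stripe);

                  /* The guess tried with fid2path: low 24 bits = object id, rest = sequence,
                   * giving [0x200032922:0x95f8:0x0]. */
                  printf("split parent=[0x%llx:0x%llx:0x0]\n",
                         (unsigned long long)(raw_seq >> 24),
                         (unsigned long long)(raw_seq & 0xffffff));
                  return 0;
              }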

            So I guess the options at this point are:

            • Run the newer e2fsck, hope it finds and fixes or unlinks these problematic objects
            • Just unlink these objects, since that's what the MDS is trying to do anyway

            Since this has happened before (see LU-7378), we are worried that this might keep happening. Is there any way to scan the objects and determine if any more are in this state?


            yong.fan nasf (Inactive) added a comment -

            [root@f1-oss1d8 ~]# debugfs -c -R 'stat O/0/d8/54354408' /dev/mapper/f1-ddn1d-l53
            debugfs 1.42.13.wc4 (28-Nov-2015)
            /dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
            Inode: 562889 Type: regular Mode: 0666 Flags: 0x80000
            Generation: 833716884 Version: 0x0000000e:00c7ec1c
            User: 800 Group: 502 Size: 1048576
            File ACL: 0 Directory ACL: 0
            Links: 1 Blockcount: 2048
            Fragment: Address: 0 Number: 0 Size: 0
            ctime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
            atime: 0x566c289a:c40af440 -- Sat Dec 12 14:00:58 2015
            mtime: 0x552a389a:00000000 -- Sun Apr 12 09:19:22 2015
            crtime: 0x552a3831:79d980f4 -- Sun Apr 12 09:17:37 2015
            Size of extra inode fields: 28
            Extended attributes stored in inode body:
            invalid EA entry in inode
            EXTENTS:
            (0-254):17259-17513, (255):17514

            That means the OI maps the FID [0x100000000:0x33d61e8:0x0] to inode 562889 via the name entry "/O/0/d8/54354408", but that name entry is wrong. The inode is actually referenced by the name entry "/O/0/d1/52487169", which has been verified via the FID-in-LMA. The message "invalid EA entry in inode" does not mean the inode is corrupted, because the OSD can still read the LMA EA. I would suggest you update e2fsprogs, which may help, and if we can get the PFID EA from inode 562889, then we can know whether the related MDT-object is still there or not.

            A Lustre OST-object is always singly referenced, so the unrecognised mapping entry "/O/0/d8/54354408" is wrong. I am not sure how the corruption was generated, but your site is not the first one to hit this trouble. In any case, running e2fsck to fix the disk-level corruption is the first step.
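
            For readers following the two name entries above: both match the conventional ldiskfs OST object layout O/<group>/d<objid mod 32>/<objid> (an assumption here, but consistent with both paths in this ticket, with group 0), so the expected name entry can be derived from the object id carried in the LMA, as in this small sketch:

              #include <stdint.h>
              #include <stdio.h>

              /* Derive the expected ldiskfs name entry for an OST object from its
               * object id, assuming the O/<group>/d<objid % 32>/<objid> layout with
               * group 0. The two ids below are the ones discussed in this ticket. */
              static void print_expected_path(uint64_t objid)
              {
                  printf("objid %llu (0x%llx) -> /O/0/d%llu/%llu\n",
                         (unsigned long long)objid, (unsigned long long)objid,
                         (unsigned long long)(objid % 32), (unsigned long long)objid);
              }

              int main(void)
              {
                  print_expected_path(54354408ULL);  /* 0x33d61e8 -> /O/0/d8/54354408 */
                  print_expected_path(52487169ULL);  /* 0x320e401 -> /O/0/d1/52487169 */
                  /* The LMA self FID of inode 562889 has oid 0x320e401 = 52487169, so its
                   * correct name entry is /O/0/d1/52487169; the OI mapping that reaches it
                   * via /O/0/d8/54354408 is the wrong one. */
                  return 0;
              }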


            dustb100 Dustin Leverman added a comment -

            Andreas,
            This LUN has had corruption issues in the past (last April, to be exact). We did run e2fsck on it until it came back clean; however, that was with an older version of e2fsprogs. Perhaps the new version will find issues that the old version could not, and we can check that. It will take a week or two to schedule a downtime with the customer to perform maintenance.

            I don't know if this is telling or not, but $(debugfs -c -R 'stats' <LUN>) is telling us that the LUN is clean. Also, we did not open a Jira ticket on the last corruption issue; it was handled with DDN.

            Thanks,
            Dustin


            adilger Andreas Dilger added a comment -

            Depending on what kind of corruption is detected by e2fsck, it might be best to just let "e2fsck -fy" run and fix the problems, then run ll_recover_lost_found_objs to recover anything left in lost+found.

            Has there been any corruption or other disk errors reported on this system?


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: dustb100 Dustin Leverman
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: