Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.12.2
    • Environment: EL 7.6.1810 x86_64
    • Severity: 3

    Description

      The MDT filesystem filled up; the MDS crashed and would crash again shortly after mounting the filesystem. Disabled mount on boot and mounted the MDT as ldiskfs to make some space. After deleting some files, unmounted and ran fsck.ext4. This has crashed twice after running for 6-7 days: the first time the MDS was unresponsive and had to be rebooted; the second time I could capture a backtrace. I remounted the MDT - it is still showing 100% full (which could well be because the fsck didn't finish). Not sure what to do next.

      Attachments

        Issue Links

          Activity

            [LU-14056] MDT filesystem full, fsck crashing

            cmcl Campbell Mcleay (Inactive) added a comment -

            Thanks for all the help. It might have been the addition of a filer that is being backed up on this cluster that has 284 million files...
            pjones Peter Jones added a comment -

            Looks like we're ok to close out this ticket now


            adilger Andreas Dilger added a comment -

            Looking at the xattr blocks in bmds1-sample-files.tar.gz, it appears that the external xattr block is being used by the "link" xattr, which tracks the hard links for each file. On the sampled files I saw between 10 and 30 hard links per file, with an xattr size between 900 and 3000 bytes because of the relatively long filenames (60-75 bytes each).

            If this is a common workload for you, then there are a couple of options (for future filesystems, at least) to change the formatting options on the MDT from the defaults. The default is to format with 2560 bytes/inode (1024 bytes for the inode itself, plus an average of 1536 bytes/inode for xattrs, directory entries, logs, etc.). Formatting the MDT with "mkfs.lustre --mdt ... --mkfsoptions='-i 5120'" would allow a 4KB xattr block for each inode. While each inode wouldn't necessarily have an xattr block, there are also directory blocks and other needs for that space. Unfortunately, it isn't possible to change this ratio for an existing MDT filesystem without a full backup/restore.
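            The bytes/inode trade-off described above can be sanity-checked with quick arithmetic. A minimal sketch (not from the ticket; the 1 TiB device size is an illustrative assumption, and real mke2fs rounds inode counts to block-group boundaries):

```python
# Approximate inode count mke2fs would create for a given -i
# (bytes-per-inode) ratio. Illustrative estimate only.
def mdt_inode_capacity(device_bytes, bytes_per_inode):
    return device_bytes // bytes_per_inode

ONE_TIB = 1 << 40
default = mdt_inode_capacity(ONE_TIB, 2560)  # Lustre MDT default ratio
roomier = mdt_inode_capacity(ONE_TIB, 5120)  # --mkfsoptions='-i 5120'

# Roughly halving the inode count leaves about one extra 4KB block per
# inode for external xattr blocks, directory blocks, etc.
print(default, roomier)
```

            The trade-off is fewer total inodes in exchange for guaranteed slack space per inode, which suits workloads with many hard links and long filenames like the one described above.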

            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Andreas,

            Yes, it is up and operational. Thanks for all your help!

            Kind regards,

            Campbell

            adilger Andreas Dilger added a comment -

            cmcl Just to confirm, besides the "osp_sync_declare_add()) logging isn't available" message, my understanding is that this filesystem is now up and operational?

            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Andreas,

            Sadly I didn't capture the summary information at the end of the fsck; we had a power cut, so things have been hectic the last two days. I have captured some info on 4 files on the MDS (let me know if you would like more). I will attach them to the ticket. I will also create a new ticket for the log issue.

            Kind regards,

            Campbell

            bmds1-sample-files.tar.gz

            adilger Andreas Dilger added a comment -

            As for the llog message:

            osp_sync_declare_add()) logging isn't available, run LFSCK
            

            it implies that the MDS isn't able to create a recovery log (typically used for OST object deletes). This may eventually become an issue, depending on how this error is handled. Could you please file it as a separate LU ticket so that it can be tracked and fixed properly? The "run LFSCK" part means (AFAIK) that there is a chance of OST objects not being deleted, so deleting files from the filesystem would not reduce space usage on the OSTs.

            I'd need someone else to look into whether this error means "the OST object is deleted immediately, and space will be orphaned only in case of an MDS crash" (i.e. very low severity), or "no OST object is deleted and space may run out quickly" (more serious), and/or whether this issue is specific to a single OST (less important, but that seems to be the case from what I can see) or it affects many/all OSTs (more serious). This can be assigned and resolved (along with a better error message) in the context of the new ticket.
            adilger Andreas Dilger added a comment -

            Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            :
            $ sudo lfs getstripe -d /user_data/
            

            so that means the MDS is mounted without crashing? An improvement at least...

            As for the "lfs getstripe" output, it looks like the default file layout (1 stripe on any OST), so it doesn't seem like a candidate for what is using the MDT space. Do you have the summary lines from the recently-completed e2fsck on the MDT that report the total number of regular files and directories? Each directory consumes at least one 4KB block, so if there are lots of them they could consume a lot of space.

            Alternately, you could try checking a random sampling of files on the MDS via debugfs (don't worry about the "catastrophic mode" message; that is just a faster way to run debugfs, and it is read-only so it does not affect the filesystem, even if mounted):

            # debugfs -c /dev/md127
            debugfs 1.45.6.wc2 (28-Sep-2020)
            /dev/md127: catastrophic mode - not reading inode or group bitmaps
            debugfs: cd /ROOT
            debugfs: stat path/to/some/file
            Inode: 2401450   Type: regular    Mode:  0644   Flags: 0x0
            Generation: 1618288620    Version: 0x000000a3:0011dca7
            User:     0   Group:     0   Size: 2147483648
            File ACL: 2400743
            Links: 1   Blockcount: 8
            Fragment:  Address: 0    Number: 0    Size: 0
             ctime: 0x5e14f5e0:00000000 -- Tue Jan  7 14:19:28 2020
             atime: 0x5f24c5ad:00000000 -- Fri Jul 31 19:30:21 2020
             mtime: 0x5e14f5e0:00000000 -- Tue Jan  7 14:19:28 2020
            crtime: 0x5c9d43de:770dc7c8 -- Thu Mar 28 15:59:58 2019
            Size of extra inode fields: 28
            Extended attributes:
              trusted.lma (24) = 00 00 00 00 00 00 00 00 34 f0 01 00 02 00 00 00 01 00 00 00 00 00 00 00 
              lma: fid=[0x20001f034:0x1:0x0] compat=0 incompat=0
              trusted.link (50)
              trusted.lov (368)
            debugfs: stat path/to/another/file
            :
            

            to see if those files have a non-zero value in the "File ACL:" field, which would indicate something is using an external xattr block. If yes, then could you please dump the xattr block and attach it here so I can see what is stored in it:

            dd if=/dev/md127 of=/tmp/file.xattr bs=4k count=1 skip=2400743
            
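            The manual debugfs check above can be scripted when sampling many files. A minimal sketch (not from the ticket) that parses captured "stat" output offline; the sample text mimics the format shown above, and the device path /dev/md127 in the printed command is taken from the comment:

```python
import re

def external_xattr_block(stat_output):
    """Return the external xattr block number from debugfs stat
    output (the "File ACL:" field), or 0 if there is none."""
    m = re.search(r"File ACL:\s*(\d+)", stat_output)
    return int(m.group(1)) if m else 0

# Sample mimicking the debugfs output format shown above.
sample = """Inode: 2401450   Type: regular    Mode:  0644   Flags: 0x0
File ACL: 2400743
Links: 1   Blockcount: 8"""

blk = external_xattr_block(sample)
if blk:
    # A non-zero block number feeds straight into the dd command above.
    print(f"dd if=/dev/md127 of=/tmp/file.xattr bs=4k count=1 skip={blk}")
```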
            cmcl Campbell Mcleay (Inactive) added a comment -

            cmcl@bravo1 ~ -bash$ sudo lfs getstripe -d /user_data/
            stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1
            
            cmcl@bravo1 ~ -bash$
            

            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Peter,

            Yes, two distinct filesystems.

            Regards,

            campbell

            People

              pjones Peter Jones
              cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 4