[LU-14056] MDT filesystem full, fsck crashing Created: 21/Oct/20  Updated: 16/Nov/20  Resolved: 07/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Campbell Mcleay (Inactive) Assignee: Peter Jones
Resolution: Fixed Votes: 0
Labels: None
Environment:

EL 7.6.1810 x86_64


Attachments: File bmds1-sample-files.tar.gz    
Issue Links:
Related
is related to LU-12674 osp should handle -EINPROGRESS on llo... Resolved
is related to LU-13197 e2fsck fails with 'Internal error: co... Resolved
is related to LU-14098 LustreError: 19003:0:(osp_sync.c:350:... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

MDT filesystem filled up, the MDS crashed and would crash again shortly after mounting the filesystem. Disabled mount on boot and mounted the MDT as ldiskfs to make some space. After deleting some files, unmounted and ran fsck.ext4. This has crashed twice after running for 6-7 days; the first time the MDS was unresponsive and had to be rebooted, the second time I could capture a backtrace. I remounted the MDT - it is still showing 100% full (could well be because the fsck didn't finish). Not sure what to do next.



 Comments   
Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ]

Backtrace from MDS:

Problem in HTREE directory inode 630433964: block #8 not referenced

Invalid HTREE directory inode 630433964 (/ROOT/ARCHIVE/dirvish/filers/odin-vol/20200816/tree/vol/builds/virtual_symtree/b57adc5d59/virtual/lib/python2.7).  Clear HTree index? yes

 

Problem in HTREE directory inode 630433968: block #1 not referenced

Problem in HTREE directory inode 630433968: block #2 not referenced

Problem in HTREE directory inode 630433968: block #3 not referenced

Problem in HTREE directory inode 630433968: block #4 not referenced

Problem in HTREE directory inode 630433968: block #5 not referenced

Problem in HTREE directory inode 630433968: block #6 not referenced

Problem in HTREE directory inode 630433968: block #7 not referenced

Problem in HTREE directory inode 630433968: block #8 not referenced

Invalid HTREE directory inode 630433968 (/ROOT/ARCHIVE/dirvish/filers/odin-vol/20200816/tree/vol/builds/virtual_symtree/d0d9dbaddf/virtual/bin).  Clear HTree index? yes

 

*** Error in `fsck.ext4': munmap_chunk(): invalid pointer: 0x00007f0689755010 ***

======= Backtrace: =========

/lib64/libc.so.6(+0x7f754)[0x7f0d3a06b754]

/lib64/libext2fs.so.2(ext2fs_free_dblist+0x4e)[0x7f0d3ae3682e]

fsck.ext4(e2fsck_pass2+0x2fc)[0x41cefc]

fsck.ext4(e2fsck_run+0x52)[0x410912]

fsck.ext4(main+0x14b9)[0x40c899]

/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d3a00e495]

fsck.ext4[0x40ec93]

======= Memory map: ========

00400000-00447000 r-xp 00000000 08:03 134343878                          /usr/sbin/fsck.ext4

00646000-00647000 r--p 00046000 08:03 134343878                          /usr/sbin/fsck.ext4

00647000-0064d000 rw-p 00047000 08:03 134343878                          /usr/sbin/fsck.ext4

01d63000-5da983000 rw-p 00000000 00:00 0                                 [heap]

7efffd4e2000-7f01bdff0000 rw-p 00000000 00:00 0

7f01bdff0000-7f037eafe000 rw-p 00000000 00:00 0

7f051291f000-7f0519f58000 rw-p 00000000 00:00 0

7f051b30b000-7f0625155000 rw-p 00000000 00:00 0

7f0627454000-7f0c12358000 rw-p 00000000 00:00 0

7f0d1be7c000-7f0d1be91000 r-xp 00000000 08:03 134580339                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1

7f0d1be91000-7f0d1c090000 ---p 00015000 08:03 134580339                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1

7f0d1c090000-7f0d1c091000 r--p 00014000 08:03 134580339                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1

7f0d1c091000-7f0d1c092000 rw-p 00015000 08:03 134580339                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1

7f0d1c1a2000-7f0d338a6000 rw-p 00000000 00:00 0

7f0d338a6000-7f0d39dd0000 r--p 00000000 08:03 134642126                  /usr/lib/locale/locale-archive

7f0d39dd0000-7f0d39de7000 r-xp 00000000 08:03 134578956                  /usr/lib64/libpthread-2.17.so

7f0d39de7000-7f0d39fe6000 ---p 00017000 08:03 134578956                  /usr/lib64/libpthread-2.17.so

7f0d39fe6000-7f0d39fe7000 r--p 00016000 08:03 134578956                  /usr/lib64/libpthread-2.17.so

7f0d39fe7000-7f0d39fe8000 rw-p 00017000 08:03 134578956                  /usr/lib64/libpthread-2.17.so

7f0d39fe8000-7f0d39fec000 rw-p 00000000 00:00 0

7f0d39fec000-7f0d3a1ae000 r-xp 00000000 08:03 134578930                  /usr/lib64/libc-2.17.so

7f0d3a1ae000-7f0d3a3ae000 ---p 001c2000 08:03 134578930                  /usr/lib64/libc-2.17.so

7f0d3a3ae000-7f0d3a3b2000 r--p 001c2000 08:03 134578930                  /usr/lib64/libc-2.17.so

7f0d3a3b2000-7f0d3a3b4000 rw-p 001c6000 08:03 134578930                  /usr/lib64/libc-2.17.so

7f0d3a3b4000-7f0d3a3b9000 rw-p 00000000 00:00 0

7f0d3a3b9000-7f0d3a3bb000 r-xp 00000000 08:03 134578936                  /usr/lib64/libdl-2.17.so

7f0d3a3bb000-7f0d3a5bb000 ---p 00002000 08:03 134578936                  /usr/lib64/libdl-2.17.so

7f0d3a5bb000-7f0d3a5bc000 r--p 00002000 08:03 134578936                  /usr/lib64/libdl-2.17.so

7f0d3a5bc000-7f0d3a5bd000 rw-p 00003000 08:03 134578936                  /usr/lib64/libdl-2.17.so

7f0d3a5bd000-7f0d3a5c5000 r-xp 00000000 08:03 134579517                  /usr/lib64/libe2p.so.2.3

7f0d3a5c5000-7f0d3a7c4000 ---p 00008000 08:03 134579517                  /usr/lib64/libe2p.so.2.3

7f0d3a7c4000-7f0d3a7c5000 r--p 00007000 08:03 134579517                  /usr/lib64/libe2p.so.2.3

7f0d3a7c5000-7f0d3a7c6000 rw-p 00008000 08:03 134579517                  /usr/lib64/libe2p.so.2.3

7f0d3a7c6000-7f0d3a7ca000 r-xp 00000000 08:03 134579178                  /usr/lib64/libuuid.so.1.3.0

7f0d3a7ca000-7f0d3a9c9000 ---p 00004000 08:03 134579178                  /usr/lib64/libuuid.so.1.3.0

7f0d3a9c9000-7f0d3a9ca000 r--p 00003000 08:03 134579178                  /usr/lib64/libuuid.so.1.3.0

7f0d3a9ca000-7f0d3a9cb000 rw-p 00004000 08:03 134579178                  /usr/lib64/libuuid.so.1.3.0

7f0d3a9cb000-7f0d3aa07000 r-xp 00000000 08:03 135196520                  /usr/lib64/libblkid.so.1.1.0

7f0d3aa07000-7f0d3ac06000 ---p 0003c000 08:03 135196520                  /usr/lib64/libblkid.so.1.1.0

7f0d3ac06000-7f0d3ac09000 r--p 0003b000 08:03 135196520                  /usr/lib64/libblkid.so.1.1.0

7f0d3ac09000-7f0d3ac0a000 rw-p 0003e000 08:03 135196520                  /usr/lib64/libblkid.so.1.1.0

7f0d3ac0a000-7f0d3ac0b000 rw-p 00000000 00:00 0

7f0d3ac0b000-7f0d3ac0e000 r-xp 00000000 08:03 134551797                  /usr/lib64/libcom_err.so.2.1

7f0d3ac0e000-7f0d3ae0d000 ---p 00003000 08:03 134551797                  /usr/lib64/libcom_err.so.2.1

7f0d3ae0d000-7f0d3ae0e000 r--p 00002000 08:03 134551797                  /usr/lib64/libcom_err.so.2.1

7f0d3ae0e000-7f0d3ae0f000 rw-p 00003000 08:03 134551797                  /usr/lib64/libcom_err.so.2.1

7f0d3ae0f000-7f0d3ae74000 r-xp 00000000 08:03 134600929                  /usr/lib64/libext2fs.so.2.4

7f0d3ae74000-7f0d3b074000 ---p 00065000 08:03 134600929                  /usr/lib64/libext2fs.so.2.4

7f0d3b074000-7f0d3b075000 r--p 00065000 08:03 134600929                  /usr/lib64/libext2fs.so.2.4

7f0d3b075000-7f0d3b077000 rw-p 00066000 08:03 134600929                  /usr/lib64/libext2fs.so.2.4

7f0d3b077000-7f0d3b099000 r-xp 00000000 08:03 134578918                  /usr/lib64/ld-2.17.so

7f0d3b123000-7f0d3b187000 rw-p 00000000 00:00 0

7f0d3b196000-7f0d3b197000 r--p 00000000 08:03 275338                     /usr/share/locale/en_GB/LC_MESSAGES/libc.mo

7f0d3b1da000-7f0d3b298000 rw-p 00000000 00:00 0

7f0d3b298000-7f0d3b299000 r--p 00021000 08:03 134578918                  /usr/lib64/ld-2.17.so

7f0d3b299000-7f0d3b29a000 rw-p 00022000 08:03 134578918                  /usr/lib64/ld-2.17.so

7f0d3b29a000-7f0d3b29b000 rw-p 00000000 00:00 0

7ffd473d6000-7ffd473f7000 rw-p 00000000 00:00 0                          [stack]

7ffd473fe000-7ffd47400000 r-xp 00000000 00:00 0                          [vdso]

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Signal (6) SIGABRT si_code=SI_TKILL

fsck.ext4[0x42ef71]  

/lib64/libc.so.6(+0x36340)[0x7f0d3a022340]

/lib64/libc.so.6(gsignal+0x37)[0x7f0d3a0222c7]

/lib64/libc.so.6(abort+0x148)[0x7f0d3a0239b8]

/lib64/libc.so.6(+0x78e17)[0x7f0d3a064e17]

/lib64/libc.so.6(+0x7f754)[0x7f0d3a06b754]

/lib64/libext2fs.so.2(ext2fs_free_dblist+0x4e)[0x7f0d3ae3682e]

fsck.ext4(e2fsck_pass2+0x2fc)[0x41cefc]

fsck.ext4(e2fsck_run+0x52)[0x410912]

fsck.ext4(main+0x14b9)[0x40c899]

/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d3a00e495]

fsck.ext4[0x40ec93]
Comment by Andreas Dilger [ 21/Oct/20 ]

What version of e2fsprogs are you using? You should typically be using the latest version, like e2fsprogs-1.45.6-wc1. Note that there is also e2fsprogs-1.45.6-wc2, which has multi-threaded scanning for pass1 and pass5 with "-mN", but that code is new and you will have to decide whether the speed-up of e2fsck (up to 40x for those phases) is worth the risk of running new code (though it has of course passed our own testing).

This crash possibly looks like a memory allocation bug in e2fsck that is hit when there are over 300M directories on the MDT, and which has been fixed since 1.45.2-wc2.

As a general rule, there is no need to run e2fsck after mounting as ldiskfs, but it does seem like there is some corruption for another reason, possibly due to the filesystem being so full?

Do you know what is using all the space on the MDT? Do you have ChangeLogs enabled that are not being consumed? You could check for excess space usage in O/ on the MDT.
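
One way to check both of those (a sketch; the parameter and mount paths below assume the device names from this ticket, and the exact parameter path can vary by Lustre version):

# on the MDS, with the MDT mounted as type lustre: list registered ChangeLog users and their indexes
lctl get_param mdd.*.changelog_users
# with the MDT mounted as ldiskfs (e.g. on /mnt/tmp), see how much space the llog/ChangeLog objects use
du -sh /mnt/tmp/O/*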

Comment by Andreas Dilger [ 21/Oct/20 ]

LU-13197 is the memory allocation issue I was referencing, which was fixed in 1.45.2-wc2 and later.

How big is the MDT in blocks and inodes, and do you have a large number of directories?

Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ]

Hi Andreas,

I'm using e2fsprogs-1.45.2.wc1-0.el7.x86_64. I'll grab the version you mention. I am not sure whether we have changelogs enabled - how would I tell?

The 'O' directory doesn't have much in it by the look of it:

root@bmds1 /u/cmcl bash# du -sh /mnt/tmp/O/*
17M     /mnt/tmp/O/1
136K    /mnt/tmp/O/10
136K    /mnt/tmp/O/200000003
root@bmds1 /u/cmcl bash#

In terms of freeing up space, I deleted a few hundred GB of data (on a 4.4TB filesystem), so it should have dropped below the 100% it is still showing. After unmounting and running fsck.ext4 (-v -f -y), there were a lot of errors reported, many of which it said it fixed, but also many things it said it couldn't fix, for which it reported 'Ignore error?' (I don't recall the full message, sorry; I was redirecting the output to a file but ended up having to copy /dev/null over it as it was beginning to fill up the filesystem).

Should I just run the newer fsck and hope that that will fix it, since it might actually finish? Or is something else going on here?

-Campbell

Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ]
root@bmds1 /u/cmcl bash# df -i /mnt/tmp
Filesystem         Inodes      IUsed      IFree IUse% Mounted on
/dev/md127     3121381376 1124855469 1996525907   37% /mnt/tmp

root@bmds1 /u/cmcl bash# df /mnt/tmp
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/md127     4677011320 4653348652         0 100% /mnt/tmp
root@bmds1 /u/cmcl bash#

So a bit over 1.1 billion inodes. I'm not sure what the number of directories is (though I could run a find and see), but I would expect a very large number. What takes up blocks on an MDT? Directories with large numbers of files (i.e., inode-heavy stuff)?

Comment by Andreas Dilger [ 22/Oct/20 ]

Unless you are using Data-on-MDT, there are only a few things that consume a lot of space on the MDT: ChangeLogs (which take up space under O/) and directories. Also, wide-striped files, or PFL layouts that are too large to fit into the inode, will consume a 4KB block for each file.

One case that happens sometimes is that there are a lot of empty directories because the regular files were deleted but the directories were left behind. Another is if you are using "hard link trees" for backups, where there are lots of directories and hard links to a relatively small number of regular files. Each directory uses at least one 4KB block.

Based on the space usage and inode usage, it does appear that you are using about one 4KB block for each inode, so that might be a large layout xattr? Depending on how your filesystem is used, we can likely come up with some way to reduce that usage once it is up.
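
As a rough sanity check of that ratio, using the df numbers posted above: 4,653,348,652 KB used / 1,124,855,469 allocated inodes ≈ 4.1 KB of block usage per inode, i.e. roughly one 4KB block each, which is consistent with an external xattr block or a directory block being allocated for most inodes.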

Definitely the first step is to get e2fsck to finish with the new 1.45.6-wc1 or -wc2 version.
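
For reference, a minimal sketch of how the multi-threaded check might be invoked, based on the "-mN" option mentioned earlier in this ticket (the thread count and device are the ones used here; check the e2fsprogs-wc documentation for the exact option behaviour):

# with e2fsprogs-1.45.6-wc2 installed, run a full forced check using 28 threads for pass1/pass5
fsck.ext4 -f -y -m28 /dev/md127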

Comment by Campbell Mcleay (Inactive) [ 22/Oct/20 ]

Running the fsck with 28 threads (of 32). It seems to have done about a fifth of the inodes in pass 1 in 5 hours, which seems like a big improvement. The memory usage is creeping up slowly though, from 166GB (66%) to 173GB (69%) RSS out of 256GB. Fingers crossed that it doesn't run out of memory before the check completes!

Comment by Andreas Dilger [ 22/Oct/20 ]

I've had good luck enabling swap for e2fsck using an M.2 PCIe device in the MDS when the filesystem is very large. This is fast enough to allow e2fsck to complete in a reasonable time, even if the node does not have enough RAM.
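
A minimal sketch of adding swap on a fast NVMe device before starting e2fsck (the device name is a placeholder; use a dedicated partition, not one holding data):

mkswap /dev/nvme0n1p1       # /dev/nvme0n1p1 is a hypothetical M.2/NVMe partition
swapon /dev/nvme0n1p1
cat /proc/swaps             # confirm the swap space is active before starting e2fsck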

Comment by Campbell Mcleay (Inactive) [ 26/Oct/20 ]

Hi Andreas,

We're 19/28 done of what I assume is pass 1 of the fsck (i.e., 19 of the 28 threads appear to have finished). Based on that, at a guess it will be another 2 days for that to finish. How much longer (very roughly) do you think the next phases will take (or is it really impossible to give even a rough figure)?

Thanks,

Campbell

Comment by Andreas Dilger [ 27/Oct/20 ]

The parallel e2fsck optimizations were mostly targeted at and tested on very large OSTs, so pass1 and pass5 are optimized, but the MDT will likely also spend a considerable amount of time in pass2 (directory scanning), which is not parallelized yet.

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Hi Andreas,

Good news - the fsck has finished. What are the next steps?

Kind regards,

Campbell

Comment by Andreas Dilger [ 29/Oct/20 ]

It appears from the previous comments that you have already freed up some space from the MDT, so you should hopefully be able to mount it at this point.

You also need to figure out what is consuming all the space. It is either a too-large layout xattr, extra ACLs or user xattrs, or (unlikely) most of the inodes being directories (e.g. lots of symlink trees for backups).

Getting "lfs getstripe -d /mnt/lustre" (or whatever the mount point is) and maybe for a few other files in the filesystem would be useful. Also, "dumpe2fs -h /dev/md127" on the MDS to check what the inode size is.

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Looks like it crashed when I had the MDT mounted; just going to get a console on it.

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Before the crash:

root@bmds1 /u/cmcl bash# lfs getstripe -d /mnt/mdt
lfs getstripe: cannot open '/mnt/mdt': No such device or address (6)
error: getstripe failed for /mnt/mdt.
root@bmds1 /u/cmcl bash# df
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sda3       124849604   16905180  107944424  14% /
devtmpfs        131927228          0  131927228   0% /dev
tmpfs           131962172         88  131962084   1% /dev/shm
tmpfs           131962172      75420  131886752   1% /run
tmpfs           131962172          0  131962172   0% /sys/fs/cgroup
/dev/sda1         1038336     490508     547828  48% /boot
tools:/tools   4148586112 3647906240  500679872  88% /tools
apps:/apps     2116812800 1869961856  246850944  89% /apps
u:/u           3285293504 2128296704 1156996800  65% /u
tmpfs            26392436          0   26392436   0% /run/user/0
tmpfs            26392436          0   26392436   0% /run/user/3407
/dev/md127     4640472172 4339829044          0 100% /mnt/mdt
Comment by Andreas Dilger [ 29/Oct/20 ]

The "lfs getstripe" will only work on the client when the MDT is mounted as type Lustre.

Comment by Andreas Dilger [ 29/Oct/20 ]

The "dumpe2fs -h" should be run on the MDS against the block device.

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Had to hard reboot it; just waiting for it to come back up. Will unmount it immediately after boot-up so it doesn't crash again.

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

The filesystem was mounted as type lustre, so not sure why lfs getstripe didn't work...

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]
root@bmds1 /u/cmcl bash# dumpe2fs -h /dev/md127
dumpe2fs 1.45.6.wc2 (28-Sep-2020)
Filesystem volume name:   bravo-MDT0000
Last mounted on:          /
Filesystem UUID:          5b78ca34-326a-4d37-8f01-9b03ca53d9ca
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3121381376
Block count:              1560682496
Reserved block count:     78034124
Free blocks:              84295922
Free inodes:              2096785937
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         16384
Fragments per group:      16384
Inodes per group:         32768
Inode blocks per group:   4096
RAID stride:              128
RAID stripe width:        256
Flex block group size:    16
Filesystem created:       Thu Jul 13 10:23:19 2017
Last mount time:          Thu Oct 29 16:17:56 2020
Last write time:          Thu Oct 29 16:17:02 2020
Mount count:              4
Maximum mount count:      -1
Last checked:             Thu Oct 22 09:35:52 2020
Check interval:           0 (<none>)
Lifetime writes:          733 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      66a3e161-a2ef-4872-92d2-c796d2cf3ed3
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             4096M
Journal length:           1048576
Journal sequence:         0x6baf5a41
Journal start:            0

root@bmds1 /u/cmcl bash#
Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Will try again with lfs getstripe

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]
/dev/md127 on /mnt/mdt type lustre (ro,svname=bravo-MDT0000,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
root@bmds1 /u/cmcl bash# lfs getstripe -d /mnt/mdt
lfs getstripe: cannot open '/mnt/mdt': No such device or address (6)
error: getstripe failed for /mnt/mdt.
root@bmds1 /u/cmcl bash#
Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

seems like /mnt/mdt is ro

From the messages log:

Oct 29 16:28:02 bmds1 kernel: LDISKFS-fs (md127): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Oct 29 16:28:03 bmds1 kernel: Lustre: MGS: Connection restored to 3bd1b283-9795-c8d2-1f35-b82a09561469 (at 0@lo)
Oct 29 16:28:03 bmds1 kernel: Lustre: Skipped 14 previous similar messages
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(llog_osd.c:988:llog_osd_next_block()) bravo-MDT0000-osd: invalid llog tail at log id [0x1584b:0x1:0x0]:0 offset 4575232 last_rec idx 1 tail idx 0lrt len 20 read_size 4096
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(osp_sync.c:1248:osp_sync_thread()) bravo-OST000b-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(osp_sync.c:1248:osp_sync_thread()) Skipped 1 previous similar message
Oct 29 16:28:04 bmds1 kernel: Lustre: bravo-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Oct 29 16:28:04 bmds1 kernel: Lustre: bravo-MDT0000: in recovery but waiting for the first client to connect
Oct 29 16:28:09 bmds1 kernel: Lustre: MGS: Connection restored to c236af3f-63c3-2c6b-d470-285b33212eea (at 10.21.22.52@tcp)
Oct 29 16:28:12 bmds1 kernel: Lustre: bravo-MDT0000: Connection restored to bravo-MDT0000-lwp-OST0065_UUID (at 10.21.22.56@tcp)
Oct 29 16:28:12 bmds1 kernel: Lustre: Skipped 35 previous similar messages
Oct 29 16:28:14 bmds1 kernel: Lustre: MGS: Connection restored to bravo-MDT0000-lwp-OST004a_UUID (at 10.21.22.55@tcp)
Oct 29 16:28:14 bmds1 kernel: Lustre: Skipped 52 previous similar messages
Oct 29 16:28:20 bmds1 kernel: Lustre: Failing over bravo-MDT0000
Oct 29 16:28:21 bmds1 kernel: LustreError: 26604:0:(ldlm_lib.c:2606:target_stop_recovery_thread()) bravo-MDT0000: Aborting recovery
Oct 29 16:28:21 bmds1 kernel: Lustre: 26577:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Oct 29 16:28:21 bmds1 kernel: Lustre: 26577:0:(ldlm_lib.c:2059:target_recovery_overseer()) Skipped 2 previous similar messages
Oct 29 16:28:21 bmds1 kernel: Lustre: bravo-MDT0000: in recovery but waiting for the first client to connect
Oct 29 16:28:21 bmds1 kernel: Lustre: 26577:0:(mdt_handler.c:6554:mdt_postrecov()) bravo-MDT0000: auto trigger paused LFSCK failed: rc = -6
Oct 29 16:28:21 bmds1 kernel: LustreError: 26364:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9fbb3ab2ba80 x1681903610508336/t0(0) o5->bravo-OST0000-osc-MDT0000@10.21.22.51@tcp:28/4 lens 432/432 e 0 to 0 dl 0 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
Oct 29 16:28:21 bmds1 kernel: LustreError: 26404:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) bravo-OST0014-osc-MDT0000: cannot cleanup orphans: rc = -5
Oct 29 16:28:21 bmds1 kernel: LustreError: 26404:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) Skipped 90 previous similar messages
Oct 29 16:28:21 bmds1 kernel: LustreError: 26364:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 95 previous similar messages
Oct 29 16:28:29 bmds1 kernel: LustreError: 137-5: bravo-MDT0000_UUID: not available for connect from 10.21.22.23@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 29 16:28:30 bmds1 kernel: Lustre: 26604:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1603988904/real 1603988904]  req@ffff9fdb32ebb180 x1681903610510032/t0(0) o251->MGC10.21.22.50@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1603988910 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Never mind, I see from other clusters 'ro' is normal for an MDT

Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

I'll try and make some more space on it by mounting it as ldiskfs. There is a discrepancy in free space though. When mounted as type 'lustre':

dev/md127     4640472172 4339829044          0 100% /mnt/mdt

As type 'ldiskfs':

/dev/md127     4677011320 4339827656   25047168 100% /mnt/tmp
Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ]

Got it down to 98% (mounted as ldiskfs) so far. Running a lot more deletes overnight. Will see how it is tomorrow morning

Comment by Andreas Dilger [ 29/Oct/20 ]

lfs getstripe -d /mnt/mdt

The "lfs getstripe -d" command needs to be run against a client mountpoint (could be a temporary mount on the MDS), not the MDT mountpoint.

There is a discrepancy in free space though. When mounted as type 'lustre'.

Lustre reserves some space for internal use.

Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(llog_osd.c:988:llog_osd_next_block()) bravo-MDT0000-osd: invalid llog tail at log id [0x1584b:0x1:0x0]:0 offset 4575232 last_rec idx 1 tail idx 0lrt len 20 read_size 4096
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(osp_sync.c:1248:osp_sync_thread()) bravo-OST000b-osc-MDT0000: llog process with osp_sync_process_queues failed: -22

It looks like there is a bit of corruption in one of the recovery logs, but this is not the source of the crash (at least from what is shown in the log snippet).

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]

Cleared the MDT down to 70% full. However, it is still kernel panicking shortly after mounting. I'll get what I can from /var/log/messages and paste it into the ticket.

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]

Any suggestions as to what we can do here? Getting a bit desperate.

Comment by Andreas Dilger [ 30/Oct/20 ]

We need the stack trace from the MDS to see what is going wrong. If the MDS panics when the problem is hit, then you would need to capture the serial console, or at least manually type in the basic parts of the stack trace.
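
If a serial console isn't available, one option is netconsole, which forwards kernel console output (including the panic trace) to another host over UDP; this is only a sketch, and the IPs, interface, and port are placeholders:

# on the MDS: send console messages from eth0 to a log host listening on UDP port 6666
modprobe netconsole netconsole=@10.21.22.50/eth0,6666@10.21.22.60/
# on the log host: capture the messages (netcat syntax varies between packages)
nc -u -l 6666 | tee /tmp/bmds1-console.log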

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]

Hi Andreas,

This is all we get:

Message from syslogd@bmds1 at Oct 30 14:10:00 ...
 kernel:LustreError: 12237:0:(osp_sync.c:346:osp_sync_declare_add()) ASSERTION( ctxt ) failed: 

Message from syslogd@bmds1 at Oct 30 14:10:00 ...
 kernel:LustreError: 12237:0:(osp_sync.c:346:osp_sync_declare_add()) LBUG

Message from syslogd@bmds1 at Oct 30 14:10:00 ...
 kernel:Kernel panic - not syncing: LBUG
packet_write_wait: Connection to 10.21.22.50 port 22: Broken pipe

Since it is quite easy to replicate this, perhaps I could enable various debug parameters to get more information?
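
A sketch of one way to do that (the debug mask and buffer size are just examples, and disabling panic_on_lbug assumes that module parameter is available on this build):

# optionally stop an LBUG from panicking the node so the debug log can still be dumped
echo 0 > /sys/module/libcfs/parameters/libcfs_panic_on_lbug
lctl set_param debug_mb=1024        # enlarge the in-memory debug buffer
lctl set_param debug=+trace         # widen the debug mask (debug=-1 enables everything, very verbose)
# ... reproduce the problem by mounting the MDT ...
lctl debug_kernel /tmp/lustre-debug.log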

Comment by Andreas Dilger [ 30/Oct/20 ]

The ticket summary shows you are running Lustre 2.12.2. There is a patch, https://review.whamcloud.com/36348 "LU-12674 osp: handle -EINPROGRESS on llog objects", which landed in 2.12.4 and fixes this crash. I would recommend updating the MDS (and preferably the OSSes) to 2.12.5.

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]

ok, will do. I'll install 2.12.5_62_gac40c31

Comment by Andreas Dilger [ 30/Oct/20 ]

Note that 2.12.5_62_gac40c31 is a build for a patch that hasn't even landed on the 2.12 branch, and as such I would not recommend using it unless there is a specific need for that patch.

The b2_12 branch will have more fixes than 2.12.5, but it has not undergone final release testing (large-scale stress testing, long-term load and recovery testing, etc.), and may still have issues that are not seen in the shorter (but still significant, about 80 hours) testing that is done on each patch before it lands.

That's not to say I'm aware of any specific issues in b2_12 or 2.12.5_62_gac40c31 (and we are approaching a 2.12.6 release), but I thought you might be confused about the builds.

The latest 2.12 development build (pre-2.12.6) is:
https://build.whamcloud.com/job/lustre-b2_12/102/

The actual 2.12.5 release build (recommended for most users) is:
https://build.whamcloud.com/job/lustre-b2_12/87/

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]

Getting the following messages in the MDS log after mounting MDT:

Oct 30 20:30:37 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 74 previous similar messages
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 233 previous similar messages

Does this need actioning? Seems to relate to logging again

Comment by Peter Jones [ 30/Oct/20 ]

Andreas (aka left hand)

Campbell has two tickets on the go - LU-13500 is the other one and Oleg recommended that patch for that issue.

Campbell

Are there two distinct filesystems here?

Peter (aka right hand)

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]

Hi Peter,

Yes, two distinct filesystems.

Regards,

campbell

Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ]
cmcl@bravo1 ~ -bash$ sudo lfs getstripe -d /user_data/
stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1

cmcl@bravo1 ~ -bash$
Comment by Andreas Dilger [ 31/Oct/20 ]
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
:
$ sudo lfs getstripe -d /user_data/

so that means the MDS is mounted without crashing? An improvement at least...

As for the "lfs getstripe" it looks like the default file layout (1 stripe on any OST), so it doesn't seem like a candidate for what is using the MDT space. Do you have the summary lines from the recently-completed e2fsck on the MDT that report the total number of regular files and directories? Each directory consumes at least one 4KB block, so if there are lots of them it could consume a lot of space.

Alternatively, you could try checking a random sampling of files on the MDS via debugfs (don't worry about the "catastrophic mode" message; that is just a faster way to run debugfs, and it is read-only so it does not affect the filesystem, even if mounted):

# debugfs -c /dev/md127
debugfs 1.45.6.wc2 (28-Sep-2020)
/dev/md127: catastrophic mode - not reading inode or group bitmaps
debugfs: cd /ROOT
debugfs: stat path/to/some/file
Inode: 2401450   Type: regular    Mode:  0644   Flags: 0x0
Generation: 1618288620    Version: 0x000000a3:0011dca7
User:     0   Group:     0   Size: 2147483648
File ACL: 2400743
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e14f5e0:00000000 -- Tue Jan  7 14:19:28 2020
 atime: 0x5f24c5ad:00000000 -- Fri Jul 31 19:30:21 2020
 mtime: 0x5e14f5e0:00000000 -- Tue Jan  7 14:19:28 2020
crtime: 0x5c9d43de:770dc7c8 -- Thu Mar 28 15:59:58 2019
Size of extra inode fields: 28
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 34 f0 01 00 02 00 00 00 01 00 00 00 00 00 00 00 
  lma: fid=[0x20001f034:0x1:0x0] compat=0 incompat=0
  trusted.link (50)
  trusted.lov (368)
debugfs: stat path/to/another/file
:

to see if those files have a non-zero value for the "File ACL:" field, which would indicate something is using an xattr block. If yes, then could you please dump the xattr block and attach it here so I can see what is stored in it:

dd if=/dev/md127 of=/tmp/file.xattr bs=4k count=1 skip=2400743
Comment by Andreas Dilger [ 31/Oct/20 ]

As for the llog message:

osp_sync_declare_add()) logging isn't available, run LFSCK

it implies that the MDS isn't able to create a recovery log (typically one for OST object deletes). This may eventually become an issue, depending on how this error is handled. Could you please file that as a separate LU ticket so that it can be tracked and fixed properly. The "run LFSCK" part means (AFAIK) that there is a chance of OST objects not being deleted, so deleting files from the filesystem will not reduce space usage on the OSTs.

I'd need someone else to look into whether this error means "the OST object is deleted immediately, and space will be orphaned only in case of an MDS crash" (i.e. very low severity), or "no OST object is deleted and space may run out quickly" (more serious), and/or whether this issue is specific to a single OST (less important, but seems to be the case from what I can see), or it affects many/all OSTs (more serious). This can be assigned and resolved (along with a better error message) in the context of the new ticket.

Comment by Campbell Mcleay (Inactive) [ 31/Oct/20 ]

Hi Andreas,

Sadly I didn't capture the summary information at the end of the fsck; we had a power cut and so things have been hectic the last two days. I have captured some info on 4 files on the MDS (let me know if you would like more). I will attach them to the ticket. I will also create a new ticket for the log issue.

Kind regards,

Campbell bmds1-sample-files.tar.gz

Comment by Andreas Dilger [ 02/Nov/20 ]

cmcl Just to confirm, besides the "osp_sync_declare_add()) logging isn't available" message, my understanding is that this filesystem is now up and operational?

Comment by Campbell Mcleay (Inactive) [ 02/Nov/20 ]

Hi Andreas,

Yes, it is up and operational. Thanks for all your help!

Kind regards,

Campbell

Comment by Andreas Dilger [ 02/Nov/20 ]

Looking at the xattr blocks in bmds1-sample-files.tar.gz, it appears that the external xattr block is being used by the "link" xattr, which tracks the hard links for each file. On those files I saw between 10 and 30 links, with an xattr size between 900 and 3000 bytes because of the relatively long filenames (60-75 bytes each).

If this is a common workload for you, then there are a couple of options (for future filesystems at least) to change the formatting options on the MDT from the defaults. The default is to format with 2560 bytes/inode (1024 bytes for the inode itself, plus an average of 1536 bytes/inode for xattrs, directory entry, logs, etc.). Formatting the MDT with "mkfs.lustre --mdt ... --mkfsoptions='-i 5120'" would allow a 4KB xattr block for each inode. While each inode wouldn't necessarily have an xattr block, there are also directory blocks and other needs for that space. Unfortunately, it isn't possible to change this ratio for an existing MDT filesystem without a full backup/restore.
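
As a rough illustration with the dumpe2fs numbers above (assuming mke2fs derives the inode count from the device size divided by the bytes-per-inode ratio): 1,560,682,496 blocks x 4KB ≈ 6.4TB of formatted MDT space; the current format provides ~3.1 billion inodes, whereas '-i 5120' would provide roughly 6.4TB / 5120 bytes ≈ 1.25 billion inodes, leaving room for a 4KB xattr or directory block per inode.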

Comment by Peter Jones [ 07/Nov/20 ]

Looks like we're ok to close out this ticket now

Comment by Campbell Mcleay (Inactive) [ 16/Nov/20 ]

Thanks for all the help. It might have been the addition of a filer that is being backed up on this cluster that has 284 million files...
