[LU-14056] MDT filesystem full, fsck crashing Created: 21/Oct/20 Updated: 16/Nov/20 Resolved: 07/Nov/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Campbell Mcleay (Inactive) | Assignee: | Peter Jones |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
EL 7.6.1810 x86_64 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
MDT filesystem filled up; the MDS crashed and would crash again shortly after the filesystem was mounted. Disabled the mount on boot and mounted the MDT as ldiskfs to make some space. After deleting some files, unmounted and ran fsck.ext4. It has crashed twice after running for 6-7 days: the first time the MDS was unresponsive and had to be rebooted; the second time I could capture a backtrace. I remounted the MDT - it is still showing 100% full (which could well be because the fsck didn't finish). Not sure what to do next. |
| Comments |
| Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ] |
|
Backtrace from MDS: Problem in HTREE directory inode 630433964: block #8 not referenced Invalid HTREE directory inode 630433964 (/ROOT/ARCHIVE/dirvish/filers/odin-vol/20200816/tree/vol/builds/virtual_symtree/b57adc5d59/virtual/lib/py thon2.7). Clear HTree index? yes Problem in HTREE directory inode 630433968: block #1 not referenced Problem in HTREE directory inode 630433968: block #2 not referenced Problem in HTREE directory inode 630433968: block #3 not referenced Problem in HTREE directory inode 630433968: block #4 not referenced Problem in HTREE directory inode 630433968: block #5 not referenced Problem in HTREE directory inode 630433968: block #6 not referenced Problem in HTREE directory inode 630433968: block #7 not referenced Problem in HTREE directory inode 630433968: block #8 not referenced Invalid HTREE directory inode 630433968 (/ROOT/ARCHIVE/dirvish/filers/odin-vol/20200816/tree/vol/builds/virtual_symtree/d0d9dbaddf/virtual/bin). Clear HTree index? yes *** Error in `fsck.ext4': munmap_chunk(): invalid pointer: 0x00007f0689755010 *** ======= Backtrace: ========= /lib64/libc.so.6(+0x7f754)[0x7f0d3a06b754] /lib64/libext2fs.so.2(ext2fs_free_dblist+0x4e)[0x7f0d3ae3682e] fsck.ext4(e2fsck_pass2+0x2fc)[0x41cefc] fsck.ext4(e2fsck_run+0x52)[0x410912] fsck.ext4(main+0x14b9)[0x40c899] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d3a00e495] fsck.ext4[0x40ec93] ======= Memory map: ======== 00400000-00447000 r-xp 00000000 08:03 134343878 /usr/sbin/fsck.ext4 00646000-00647000 r--p 00046000 08:03 134343878 /usr/sbin/fsck.ext4 00647000-0064d000 rw-p 00047000 08:03 134343878 /usr/sbin/fsck.ext4 01d63000-5da983000 rw-p 00000000 00:00 0 [heap] 7efffd4e2000-7f01bdff0000 rw-p 00000000 00:00 0 7f01bdff0000-7f037eafe000 rw-p 00000000 00:00 0 7f051291f000-7f0519f58000 rw-p 00000000 00:00 0 7f051b30b000-7f0625155000 rw-p 00000000 00:00 0 7f0627454000-7f0c12358000 rw-p 00000000 00:00 0 7f0d1be7c000-7f0d1be91000 r-xp 00000000 08:03 134580339 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7f0d1be91000-7f0d1c090000 ---p 00015000 08:03 134580339 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7f0d1c090000-7f0d1c091000 r--p 00014000 08:03 134580339 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7f0d1c091000-7f0d1c092000 rw-p 00015000 08:03 134580339 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7f0d1c1a2000-7f0d338a6000 rw-p 00000000 00:00 0 7f0d338a6000-7f0d39dd0000 r--p 00000000 08:03 134642126 /usr/lib/locale/locale-archive 7f0d39dd0000-7f0d39de7000 r-xp 00000000 08:03 134578956 /usr/lib64/libpthread-2.17.so 7f0d39de7000-7f0d39fe6000 ---p 00017000 08:03 134578956 /usr/lib64/libpthread-2.17.so 7f0d39fe6000-7f0d39fe7000 r--p 00016000 08:03 134578956 /usr/lib64/libpthread-2.17.so 7f0d39fe7000-7f0d39fe8000 rw-p 00017000 08:03 134578956 /usr/lib64/libpthread-2.17.so 7f0d39fe8000-7f0d39fec000 rw-p 00000000 00:00 0 7f0d39fec000-7f0d3a1ae000 r-xp 00000000 08:03 134578930 /usr/lib64/libc-2.17.so 7f0d3a1ae000-7f0d3a3ae000 ---p 001c2000 08:03 134578930 /usr/lib64/libc-2.17.so 7f0d3a3ae000-7f0d3a3b2000 r--p 001c2000 08:03 134578930 /usr/lib64/libc-2.17.so 7f0d3a3b2000-7f0d3a3b4000 rw-p 001c6000 08:03 134578930 /usr/lib64/libc-2.17.so 7f0d3a3b4000-7f0d3a3b9000 rw-p 00000000 00:00 0 7f0d3a3b9000-7f0d3a3bb000 r-xp 00000000 08:03 134578936 /usr/lib64/libdl-2.17.so 7f0d3a3bb000-7f0d3a5bb000 ---p 00002000 08:03 134578936 /usr/lib64/libdl-2.17.so 7f0d3a5bb000-7f0d3a5bc000 r--p 00002000 08:03 134578936 /usr/lib64/libdl-2.17.so 7f0d3a5bc000-7f0d3a5bd000 rw-p 00003000 08:03 134578936 /usr/lib64/libdl-2.17.so 7f0d3a5bd000-7f0d3a5c5000 r-xp 00000000 08:03 
134579517 /usr/lib64/libe2p.so.2.3 7f0d3a5c5000-7f0d3a7c4000 ---p 00008000 08:03 134579517 /usr/lib64/libe2p.so.2.3 7f0d3a7c4000-7f0d3a7c5000 r--p 00007000 08:03 134579517 /usr/lib64/libe2p.so.2.3 7f0d3a7c5000-7f0d3a7c6000 rw-p 00008000 08:03 134579517 /usr/lib64/libe2p.so.2.3 7f0d3a7c6000-7f0d3a7ca000 r-xp 00000000 08:03 134579178 /usr/lib64/libuuid.so.1.3.0 7f0d3a7ca000-7f0d3a9c9000 ---p 00004000 08:03 134579178 /usr/lib64/libuuid.so.1.3.0 7f0d3a9c9000-7f0d3a9ca000 r--p 00003000 08:03 134579178 /usr/lib64/libuuid.so.1.3.0 7f0d3a9ca000-7f0d3a9cb000 rw-p 00004000 08:03 134579178 /usr/lib64/libuuid.so.1.3.0 7f0d3a9cb000-7f0d3aa07000 r-xp 00000000 08:03 135196520 /usr/lib64/libblkid.so.1.1.0 7f0d3aa07000-7f0d3ac06000 ---p 0003c000 08:03 135196520 /usr/lib64/libblkid.so.1.1.0 7f0d3ac06000-7f0d3ac09000 r--p 0003b000 08:03 135196520 /usr/lib64/libblkid.so.1.1.0 7f0d3ac09000-7f0d3ac0a000 rw-p 0003e000 08:03 135196520 /usr/lib64/libblkid.so.1.1.0 7f0d3ac0a000-7f0d3ac0b000 rw-p 00000000 00:00 0 7f0d3ac0b000-7f0d3ac0e000 r-xp 00000000 08:03 134551797 /usr/lib64/libcom_err.so.2.1 7f0d3ac0e000-7f0d3ae0d000 ---p 00003000 08:03 134551797 /usr/lib64/libcom_err.so.2.1 7f0d3ae0d000-7f0d3ae0e000 r--p 00002000 08:03 134551797 /usr/lib64/libcom_err.so.2.1 7f0d3ae0e000-7f0d3ae0f000 rw-p 00003000 08:03 134551797 /usr/lib64/libcom_err.so.2.1 7f0d3ae0f000-7f0d3ae74000 r-xp 00000000 08:03 134600929 /usr/lib64/libext2fs.so.2.4 7f0d3ae74000-7f0d3b074000 ---p 00065000 08:03 134600929 /usr/lib64/libext2fs.so.2.4 7f0d3b074000-7f0d3b075000 r--p 00065000 08:03 134600929 /usr/lib64/libext2fs.so.2.4 7f0d3b075000-7f0d3b077000 rw-p 00066000 08:03 134600929 /usr/lib64/libext2fs.so.2.4 7f0d3b077000-7f0d3b099000 r-xp 00000000 08:03 134578918 /usr/lib64/ld-2.17.so 7f0d3b123000-7f0d3b187000 rw-p 00000000 00:00 0 7f0d3b196000-7f0d3b197000 r--p 00000000 08:03 275338 /usr/share/locale/en_GB/LC_MESSAGES/libc.mo 7f0d3b1da000-7f0d3b298000 rw-p 00000000 00:00 0 7f0d3b298000-7f0d3b299000 r--p 00021000 08:03 134578918 /usr/lib64/ld-2.17.so 7f0d3b299000-7f0d3b29a000 rw-p 00022000 08:03 134578918 /usr/lib64/ld-2.17.so 7f0d3b29a000-7f0d3b29b000 rw-p 00000000 00:00 0 7ffd473d6000-7ffd473f7000 rw-p 00000000 00:00 0 [stack] 7ffd473fe000-7ffd47400000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Signal (6) SIGABRT si_code=SI_TKILL fsck.ext4[0x42ef71] /lib64/libc.so.6(+0x36340)[0x7f0d3a022340] /lib64/libc.so.6(gsignal+0x37)[0x7f0d3a0222c7] /lib64/libc.so.6(abort+0x148)[0x7f0d3a0239b8] /lib64/libc.so.6(+0x78e17)[0x7f0d3a064e17] /lib64/libc.so.6(+0x7f754)[0x7f0d3a06b754] /lib64/libext2fs.so.2(ext2fs_free_dblist+0x4e)[0x7f0d3ae3682e] fsck.ext4(e2fsck_pass2+0x2fc)[0x41cefc] fsck.ext4(e2fsck_run+0x52)[0x410912] fsck.ext4(main+0x14b9)[0x40c899] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d3a00e495] fsck.ext4[0x40ec93] |
| Comment by Andreas Dilger [ 21/Oct/20 ] |
|
What version of e2fsprogs are you using? You should typically be using the latest version, like e2fsprogs-1.45.6-wc1. Note that there is also e2fsprogs-1.45.6-wc2, which has multi-threaded scanning for pass1 and pass5 with "-mN", but that code is new and you will have to decide whether the speedup of e2fsck (up to 40x for those phases) is worth the risk of running new code (though it has of course passed our own testing). This crash looks like it could be a memory-allocation bug in e2fsck, hit when there are over 300M directories on the MDT, that has been fixed since 1.45.2-wc2. As a general rule there is no need to run e2fsck after mounting as ldiskfs, but it does seem like there is some corruption for another reason, possibly because the filesystem is so full. Do you know what is using all the space on the MDT? Do you have ChangeLogs enabled that are not being consumed? You could also check for excess space usage in O/ on the MDT. |
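A quick way to check both of those points (the changelog parameter name is the usual one; the ldiskfs mount point /mnt/tmp is an assumption taken from the later comments):
lctl get_param mdd.*.changelog_users   # any registered users listed here mean ChangeLogs are enabled
du -sh /mnt/tmp/O/*                    # space used under O/ with the MDT mounted as ldiskfs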
| Comment by Andreas Dilger [ 21/Oct/20 ] |
|
How big is the MDT in blocks and inodes, and do you have a large number of directories? |
| Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ] |
|
Hi Andreas, I'm using e2fsprogs-1.45.2.wc1-0.el7.x86_64; I'll grab the version you mention. I am not sure whether we have ChangeLogs enabled - how would I tell? The 'O' directory doesn't have much in it by the look of it:
root@bmds1 /u/cmcl bash# du -sh /mnt/tmp/O/*
17M   /mnt/tmp/O/1
136K  /mnt/tmp/O/10
136K  /mnt/tmp/O/200000003
root@bmds1 /u/cmcl bash#
In terms of freeing up space, I deleted a few hundred GB of data (on a 4.4TB filesystem), so usage should have dropped below the 100% it is still showing. After unmounting and running fsck.ext4 (-v -f -y), there were a lot of errors reported, many of which it said it fixed, but also many it said it couldn't fix, for which it reported 'Ignore error?' (I don't recall the full message, sorry - I was redirecting the output to a file but ended up having to copy /dev/null over it as it was beginning to fill up the filesystem). Should I just run the newer fsck and hope that it will fix things, since it might actually finish? Or is something else going on here? -Campbell |
| Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ] |
root@bmds1 /u/cmcl bash# df -i /mnt/tmp
Filesystem      Inodes      IUsed       IFree       IUse% Mounted on
/dev/md127      3121381376  1124855469  1996525907  37%   /mnt/tmp
root@bmds1 /u/cmcl bash# df /mnt/tmp
Filesystem      1K-blocks   Used        Available   Use%  Mounted on
/dev/md127      4677011320  4653348652  0           100%  /mnt/tmp
root@bmds1 /u/cmcl bash#
So a bit over 1.1 billion inodes. I'm not sure what the number of directories is (though I could run a find and see), but I would expect a very large number. What takes up blocks on an MDT? Directories with large numbers of files (i.e., inode-heavy stuff)? |
| Comment by Andreas Dilger [ 22/Oct/20 ] |
|
Unless you are using Data-on-MDT, there are only a few things that consume a lot of space on the MDT. ChangeLogs would take up space under O/, as do directories. If you have wide-striped files, or PFL layouts that are larger than can fit into the inode, they will consume a 4KB block for each file. One case that happens sometimes is that there are a lot of empty directories because the regular files were deleted but the directories were left behind. Another is "hard link trees" used for backups, where there are lots of directories and hard links to a relatively smaller number of regular files. Each directory uses at least one 4KB block. Based on the space and inode usage, it does appear that you are using about one 4KB block for each inode, so that might be a large layout xattr. Depending on how your filesystem is used, we can likely come up with some way to reduce that usage once it is up. Definitely the first step is to get e2fsck to finish with the new 1.45.6-wc1 or -wc2 version. |
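A rough check of that ratio from the df output above: 4,653,348,652 KB used ÷ 1,124,855,469 inodes ≈ 4.1 KB of allocated blocks per in-use inode, i.e. roughly one extra 4KB block per inode. One way to sample the xattr sizes of a file on the ldiskfs mount (the path here is hypothetical):
getfattr -d -m - -e hex /mnt/tmp/ROOT/path/to/some/file   # a large trusted.lov value would point at a wide or PFL layout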
| Comment by Campbell Mcleay (Inactive) [ 22/Oct/20 ] |
|
Running the fsck with 28 threads (of 32). It seems to have done about a fifth of the inodes in pass 1 in 5 hours, which is a big improvement. The memory usage is creeping up slowly though, from 166GB (66%) to 173GB (69%) RSS of 256GB. Fingers crossed that it doesn't run out of memory before the check completes! |
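Presumably the invocation was something along these lines (the -m option is the multi-threading flag mentioned earlier; the exact command line is an assumption):
e2fsck -f -y -m28 /dev/md127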
| Comment by Andreas Dilger [ 22/Oct/20 ] |
|
I've had good luck enabling swap for e2fsck using an M.2 PCIe device in the MDS when the filesystem is very large. This is fast enough to allow e2fsck to complete in a reasonable time, even if the node does not have enough RAM. |
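A minimal sketch of that approach, assuming the NVMe device shows up as /dev/nvme0n1 (hypothetical name) and can be dedicated to swap:
mkswap /dev/nvme0n1
swapon /dev/nvme0n1
swapon -s      # confirm the extra swap is active before starting e2fsck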
| Comment by Campbell Mcleay (Inactive) [ 26/Oct/20 ] |
|
Hi Andreas, We're 19/28 done of what I assume is pass 1 of the fsck (i.e., 19 of the 28 threads appear to have finished). Based on that, at a guess it'll be another 2 days for that to finish. How much longer (very roughly) do you think the next phases will take, or is it really impossible to give even a rough figure? Thanks, Campbell |
| Comment by Andreas Dilger [ 27/Oct/20 ] |
|
The parallel e2fsck optimizations were mostly targeted at, and tested on, very large OSTs, so pass1 and pass5 are optimized, but the MDT will likely also spend a considerable amount of time in pass2 (directory scanning), which is not parallelized yet. |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Hi Andreas, Good news - the fsck has finished. What are the next steps? Kind regards, Campbell |
| Comment by Andreas Dilger [ 29/Oct/20 ] |
|
It appears from the previous comments that you have already freed up some space from the MDT, so you should hopefully be able to mount it at this point. You also need to figure out what is consuming all the space. It is either a too-large layout xattr, extra ACLs or user xattrs, or (unlikely) most of the inodes being directories (e.g. lots of symlink trees for backups). Getting "lfs getstripe -d /mnt/lustre" (or whatever the mount point is), and maybe the same for a few other files in the filesystem, would be useful. Also run "dumpe2fs -h /dev/md127" on the MDS to check what the inode size is. |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Looks like it crashed when I had the MDT mounted, just going to get a console on it |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Before the crash:
root@bmds1 /u/cmcl bash# lfs getstripe -d /mnt/mdt
lfs getstripe: cannot open '/mnt/mdt': No such device or address (6)
error: getstripe failed for /mnt/mdt.
root@bmds1 /u/cmcl bash# df
Filesystem      1K-blocks   Used        Available   Use%  Mounted on
/dev/sda3       124849604   16905180    107944424   14%   /
devtmpfs        131927228   0           131927228   0%    /dev
tmpfs           131962172   88          131962084   1%    /dev/shm
tmpfs           131962172   75420       131886752   1%    /run
tmpfs           131962172   0           131962172   0%    /sys/fs/cgroup
/dev/sda1       1038336     490508      547828      48%   /boot
tools:/tools    4148586112  3647906240  500679872   88%   /tools
apps:/apps      2116812800  1869961856  246850944   89%   /apps
u:/u            3285293504  2128296704  1156996800  65%   /u
tmpfs           26392436    0           26392436    0%    /run/user/0
tmpfs           26392436    0           26392436    0%    /run/user/3407
/dev/md127      4640472172  4339829044  0           100%  /mnt/mdt |
| Comment by Andreas Dilger [ 29/Oct/20 ] |
|
The "lfs getstripe" will only work on the client when the MDT is mounted as type Lustre. |
| Comment by Andreas Dilger [ 29/Oct/20 ] |
|
The "dumpe2fs -h" should be run on the MDS against the block device. |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Had to hard reboot it, just waiting for it to come back up. Will unmount it immediately after boot so it doesn't crash again |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
The filesystem was mounted as type lustre, so not sure why lfs getstripe didn't work... |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
root@bmds1 /u/cmcl bash# dumpe2fs -h /dev/md127
dumpe2fs 1.45.6.wc2 (28-Sep-2020)
Filesystem volume name:   bravo-MDT0000
Last mounted on:          /
Filesystem UUID:          5b78ca34-326a-4d37-8f01-9b03ca53d9ca
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3121381376
Block count:              1560682496
Reserved block count:     78034124
Free blocks:              84295922
Free inodes:              2096785937
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         16384
Fragments per group:      16384
Inodes per group:         32768
Inode blocks per group:   4096
RAID stride:              128
RAID stripe width:        256
Flex block group size:    16
Filesystem created:       Thu Jul 13 10:23:19 2017
Last mount time:          Thu Oct 29 16:17:56 2020
Last write time:          Thu Oct 29 16:17:02 2020
Mount count:              4
Maximum mount count:      -1
Last checked:             Thu Oct 22 09:35:52 2020
Check interval:           0 (<none>)
Lifetime writes:          733 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      66a3e161-a2ef-4872-92d2-c796d2cf3ed3
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             4096M
Journal length:           1048576
Journal sequence:         0x6baf5a41
Journal start:            0
root@bmds1 /u/cmcl bash# |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Will try again with lfs getstripe |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
/dev/md127 on /mnt/mdt type lustre (ro,svname=bravo-MDT0000,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
root@bmds1 /u/cmcl bash# lfs getstripe -d /mnt/mdt
lfs getstripe: cannot open '/mnt/mdt': No such device or address (6)
error: getstripe failed for /mnt/mdt.
root@bmds1 /u/cmcl bash# |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Seems like /mnt/mdt is ro. From the messages log:
Oct 29 16:28:02 bmds1 kernel: LDISKFS-fs (md127): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Oct 29 16:28:03 bmds1 kernel: Lustre: MGS: Connection restored to 3bd1b283-9795-c8d2-1f35-b82a09561469 (at 0@lo)
Oct 29 16:28:03 bmds1 kernel: Lustre: Skipped 14 previous similar messages
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(llog_osd.c:988:llog_osd_next_block()) bravo-MDT0000-osd: invalid llog tail at log id [0x1584b:0x1:0x0]:0 offset 4575232 last_rec idx 1 tail idx 0lrt len 20 read_size 4096
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(osp_sync.c:1248:osp_sync_thread()) bravo-OST000b-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(osp_sync.c:1248:osp_sync_thread()) Skipped 1 previous similar message
Oct 29 16:28:04 bmds1 kernel: Lustre: bravo-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Oct 29 16:28:04 bmds1 kernel: Lustre: bravo-MDT0000: in recovery but waiting for the first client to connect
Oct 29 16:28:09 bmds1 kernel: Lustre: MGS: Connection restored to c236af3f-63c3-2c6b-d470-285b33212eea (at 10.21.22.52@tcp)
Oct 29 16:28:12 bmds1 kernel: Lustre: bravo-MDT0000: Connection restored to bravo-MDT0000-lwp-OST0065_UUID (at 10.21.22.56@tcp)
Oct 29 16:28:12 bmds1 kernel: Lustre: Skipped 35 previous similar messages
Oct 29 16:28:14 bmds1 kernel: Lustre: MGS: Connection restored to bravo-MDT0000-lwp-OST004a_UUID (at 10.21.22.55@tcp)
Oct 29 16:28:14 bmds1 kernel: Lustre: Skipped 52 previous similar messages
Oct 29 16:28:20 bmds1 kernel: Lustre: Failing over bravo-MDT0000
Oct 29 16:28:21 bmds1 kernel: LustreError: 26604:0:(ldlm_lib.c:2606:target_stop_recovery_thread()) bravo-MDT0000: Aborting recovery
Oct 29 16:28:21 bmds1 kernel: Lustre: 26577:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Oct 29 16:28:21 bmds1 kernel: Lustre: 26577:0:(ldlm_lib.c:2059:target_recovery_overseer()) Skipped 2 previous similar messages
Oct 29 16:28:21 bmds1 kernel: Lustre: bravo-MDT0000: in recovery but waiting for the first client to connect
Oct 29 16:28:21 bmds1 kernel: Lustre: 26577:0:(mdt_handler.c:6554:mdt_postrecov()) bravo-MDT0000: auto trigger paused LFSCK failed: rc = -6
Oct 29 16:28:21 bmds1 kernel: LustreError: 26364:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9fbb3ab2ba80 x1681903610508336/t0(0) o5->bravo-OST0000-osc-MDT0000@10.21.22.51@tcp:28/4 lens 432/432 e 0 to 0 dl 0 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
Oct 29 16:28:21 bmds1 kernel: LustreError: 26404:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) bravo-OST0014-osc-MDT0000: cannot cleanup orphans: rc = -5
Oct 29 16:28:21 bmds1 kernel: LustreError: 26404:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) Skipped 90 previous similar messages
Oct 29 16:28:21 bmds1 kernel: LustreError: 26364:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 95 previous similar messages
Oct 29 16:28:29 bmds1 kernel: LustreError: 137-5: bravo-MDT0000_UUID: not available for connect from 10.21.22.23@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 29 16:28:30 bmds1 kernel: Lustre: 26604:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1603988904/real 1603988904] req@ffff9fdb32ebb180 x1681903610510032/t0(0) o251->MGC10.21.22.50@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1603988910 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Never mind, I see from other clusters 'ro' is normal for an MDT |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
I'll try and make some more space on it by mounting it as ldiskfs. There is a discrepancy in free space though. When mounted as type 'lustre':
/dev/md127 4640472172 4339829044 0 100% /mnt/mdt
As type 'ldiskfs':
/dev/md127 4677011320 4339827656 25047168 100% /mnt/tmp |
| Comment by Campbell Mcleay (Inactive) [ 29/Oct/20 ] |
|
Got it down to 98% (mounted as ldiskfs) so far. Running a lot more deletes overnight. Will see how it is tomorrow morning |
| Comment by Andreas Dilger [ 29/Oct/20 ] |
The "lfs getstripe -d" command needs to be run against a client mountpoint (could be a temporary mount on the MDS), not the MDT mountpoint.
Lustre reserves some space for internal use.
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(llog_osd.c:988:llog_osd_next_block()) bravo-MDT0000-osd: invalid llog tail at log id [0x1584b:0x1:0x0]:0 offset 4575232 last_rec idx 1 tail idx 0lrt len 20 read_size 4096
Oct 29 16:28:03 bmds1 kernel: LustreError: 26385:0:(osp_sync.c:1248:osp_sync_thread()) bravo-OST000b-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
It looks like there is a bit of corruption in one of the recovery logs, but this is not the source of the crash (at least from what is shown in the log snippet). |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
Cleared the MDT down to 70% full. However, it is still kernel panicking shortly after mounting. I'll get what I can from /var/log/messages and paste it into the ticket |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
Any suggestions as to what we can do? Getting a bit desperate here |
| Comment by Andreas Dilger [ 30/Oct/20 ] |
|
We need the stack trace from the MDS to see what is going wrong. If the MDS panics when the problem is hit, then you would need to capture the serial console, or at least manually type in the basic parts of the stack trace. |
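On EL7, one way to capture it if a serial console isn't available - assuming the kexec-tools/kdump packages are installed and crashkernel memory is reserved - is to let kdump save the panic messages:
systemctl enable kdump && systemctl start kdump
# after the next panic, the stack trace should be in /var/crash/<timestamp>/vmcore-dmesg.txt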
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
Hi Andreas, This is all we get:
Message from syslogd@bmds1 at Oct 30 14:10:00 ...
kernel:LustreError: 12237:0:(osp_sync.c:346:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
Message from syslogd@bmds1 at Oct 30 14:10:00 ...
kernel:LustreError: 12237:0:(osp_sync.c:346:osp_sync_declare_add()) LBUG
Message from syslogd@bmds1 at Oct 30 14:10:00 ...
kernel:Kernel panic - not syncing: LBUG
packet_write_wait: Connection to 10.21.22.50 port 22: Broken pipe
Since it is quite easy to replicate this, perhaps I could enable various debug parameters to get more information? |
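If more detail is wanted before the LBUG, the standard lctl debug controls would be the place to start (the mask and buffer size here are only examples):
lctl set_param debug=+info              # widen the kernel debug mask
lctl set_param debug_mb=512             # enlarge the in-memory debug buffer
lctl dk /tmp/lustre-debug.log           # dump the buffer after reproducing the problem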
| Comment by Andreas Dilger [ 30/Oct/20 ] |
|
The ticket summary shows you are running Lustre 2.12.2. There is a patch https://review.whamcloud.com/36348 " |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
ok, will do. I'll install 2.12.5_62_gac40c31 |
| Comment by Andreas Dilger [ 30/Oct/20 ] |
|
Note that 2.12.5_62_gac40c31 is a build for a patch that hasn't even landed on the 2.12 branch, and as such I would not recommend using it unless there is a specific need for that patch. The b2_12 branch will have more fixes than 2.12.5, but it has not undergone final release testing (large-scale stress testing, long-term load and recovery testing, etc.), and may still have issues that are not seen in the shorter (but still significant, about 80h) testing that is done on each patch before it lands. Not that I'm aware of any specific issues in b2_12 or 2.12.5_62_gac40c31, and we are approaching a 2.12.6 release, but I thought you might be confused about the builds.
The latest 2.12 development build (pre-2.12.6) is:
The actual 2.12.5 release build (recommended for most users) is: |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
Getting the following messages in the MDS log after mounting the MDT:
Oct 30 20:30:37 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 74 previous similar messages
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 233 previous similar messages
Does this need actioning? It seems to relate to logging again. |
| Comment by Peter Jones [ 30/Oct/20 ] |
|
Andreas (aka left hand) - Campbell has two tickets on the go; LU-13500 is the other one, and Oleg recommended that patch for that issue.
Campbell - Are there two distinct filesystems here?
Peter (aka right hand) |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
Hi Peter, Yes, two distinct filesystems. Regards, Campbell |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
cmcl@bravo1 ~ -bash$ sudo lfs getstripe -d /user_data/
stripe_count: 1 stripe_size: 1048576 pattern: 0 stripe_offset: -1
cmcl@bravo1 ~ -bash$ |
| Comment by Andreas Dilger [ 31/Oct/20 ] |
So that means the MDS is mounted without crashing? An improvement at least... As for the "lfs getstripe", it looks like the default file layout (1 stripe on any OST), so it doesn't seem like a candidate for what is using the MDT space. Do you have the summary lines from the recently-completed e2fsck on the MDT that report the total number of regular files and directories? Each directory consumes at least one 4KB block, so if there are lots of them they could consume a lot of space. Alternately, you could try checking a random sampling of files on the MDS via debugfs (don't worry about the "catastrophic mode" message, that is just a faster way to run debugfs, and it is read-only so does not affect the filesystem, even if mounted):
# debugfs -c /dev/md127
debugfs 1.45.6.wc2 (28-Sep-2020)
/dev/md127: catastrophic mode - not reading inode or group bitmaps
debugfs: cd /ROOT
debugfs: stat path/to/some/file
Inode: 2401450   Type: regular   Mode: 0644   Flags: 0x0
Generation: 1618288620   Version: 0x000000a3:0011dca7
User: 0   Group: 0   Size: 2147483648
File ACL: 2400743   Links: 1   Blockcount: 8
Fragment: Address: 0   Number: 0   Size: 0
ctime: 0x5e14f5e0:00000000 -- Tue Jan 7 14:19:28 2020
atime: 0x5f24c5ad:00000000 -- Fri Jul 31 19:30:21 2020
mtime: 0x5e14f5e0:00000000 -- Tue Jan 7 14:19:28 2020
crtime: 0x5c9d43de:770dc7c8 -- Thu Mar 28 15:59:58 2019
Size of extra inode fields: 28
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 34 f0 01 00 02 00 00 00 01 00 00 00 00 00 00 00
  lma: fid=[0x20001f034:0x1:0x0] compat=0 incompat=0
  trusted.link (50)
  trusted.lov (368)
debugfs: stat path/to/another/file
This is to see if those files have a non-zero value for the "File ACL:" field, which would indicate something is using an xattr block. If yes, then could you please dump the xattr block and attach it here so I can see what is stored in it:
dd if=/dev/md127 of=/tmp/file.xattr bs=4k count=1 skip=2400743 |
| Comment by Andreas Dilger [ 31/Oct/20 ] |
|
As for the llog message "osp_sync_declare_add()) logging isn't available, run LFSCK": it implies that the MDS isn't able to create a recovery log (typically for OST object deletes). This may eventually become an issue, depending on how this error is handled. Could you please file that as a separate LU ticket so that it can be tracked and fixed properly. The "run LFSCK" part means (AFAIK) that there is a chance of OST objects not being deleted, so deleting files from the filesystem will not reduce space usage on the OSTs. I'd need someone else to look into whether this error means "the OST object is deleted immediately, and space will be orphaned only in case of an MDS crash" (i.e. very low severity), or "no OST object is deleted and space may run out quickly" (more serious), and/or whether this issue is specific to a single OST (less important, but seems to be the case from what I can see), or affects many/all OSTs (more serious). This can be assigned and resolved (along with a better error message) in the context of the new ticket. |
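If LFSCK does end up being needed, the usual way to start and monitor it from the MDS is roughly as follows (the layout type is only an example; the appropriate type depends on what the message is actually complaining about):
lctl lfsck_start -M bravo-MDT0000 -t layout
lctl get_param mdd.bravo-MDT0000.lfsck_layout   # shows LFSCK status and progress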
| Comment by Campbell Mcleay (Inactive) [ 31/Oct/20 ] |
|
Hi Andreas, Sadly I didn't capture the summary information at the end of the fsck - we had a power cut, so things have been hectic the last two days. I have captured some info on 4 files on the MDS (let me know if you would like more) and will attach it to the ticket. I will also create a new ticket for the log issue. Kind regards, Campbell bmds1-sample-files.tar.gz |
| Comment by Andreas Dilger [ 02/Nov/20 ] |
|
cmcl Just to confirm, besides the "osp_sync_declare_add()) logging isn't available" message, my understanding is that this filesystem is now up and operational? |
| Comment by Campbell Mcleay (Inactive) [ 02/Nov/20 ] |
|
Hi Andreas, Yes, it is up and operational. Thanks for all your help! Kind regards, Campbell |
| Comment by Andreas Dilger [ 02/Nov/20 ] |
|
Looking at the xattr blocks in bmds1-sample-files.tar.gz: if this is a common workload for you, then there are a couple of options (for future filesystems at least) to change the formatting options on the MDT from the defaults. The default is to format with 2560 bytes/inode (1024 bytes for the inode itself, plus an average of 1536 bytes/inode for xattrs, directory entries, logs, etc.). Formatting the MDT with "mkfs.lustre --mdt ... --mkfsoptions='-i 5120'" would allow a 4KB xattr block for each inode. While each inode wouldn't necessarily have an xattr block, there are also directory blocks and other needs for that space. Unfortunately, it isn't possible to change this ratio for an existing MDT filesystem without a full backup/restore. |
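For reference, the dumpe2fs output earlier in the ticket implies the existing ratio on this MDT: 1,560,682,496 blocks × 4096 bytes ÷ 3,121,381,376 inodes ≈ 2048 bytes per inode, which is even tighter than the 2560-byte default described above.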
| Comment by Peter Jones [ 07/Nov/20 ] |
|
Looks like we're ok to close out this ticket now |
| Comment by Campbell Mcleay (Inactive) [ 16/Nov/20 ] |
|
Thanks for all the help. It might have been the addition of a filer, which has 284 million files, being backed up onto this cluster... |