[LU-13452] MDT is 100% full, cannot delete files Created: 14/Apr/20 Updated: 20/Apr/20 Resolved: 20/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Campbell Mcleay (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 7.2.1511, lustre version 2.10.7-1 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
MDS filesystem is full, and we cannot free space on it. It will crash (kernel panic) when trying to delete files.

Apr 13 16:01:50 emds1 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(osd_handler.c:7131:osd_mount()) echo-MDT0000-osd: failed to set lma on /dev/md0 root inode
Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_config.c:558:class_setup()) setup echo-MDT0000-osd failed (-30)
Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_mount.c:203:lustre_start_simple()) echo-MDT0000-osd setup error -30
Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_mount_server.c:1848:server_fill_super()) Unable to start osd on /dev/md0: -30
Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-30)
Apr 13 16:02:01 emds1 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Apr 13 16:02:01 emds1 kernel: Lustre: MGS: Connection restored to 8f792be4-fada-1d75-0dbd-ec8601cdce7f (at 0@lo)
Apr 13 16:02:01 emds1 kernel: LustreError: 11438:0:(genops.c:478:class_register_device()) echo-OST0000-osc-MDT0000: already exists, won't add
Apr 13 16:02:01 emds1 kernel: LustreError: 11438:0:(obd_config.c:1682:class_config_llog_handler()) MGC10.23.22.104@tcp: cfg command failed: rc = -17
Apr 13 16:02:01 emds1 kernel: Lustre: cmd=cf001 0:echo-OST0000-osc-MDT0000 1:osp 2:echo-MDT0000-mdtlov_UUID
Apr 13 16:02:01 emds1 kernel: LustreError: 15c-8: MGC10.23.22.104@tcp: The configuration from log 'echo-MDT0000' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 13 16:02:01 emds1 kernel: LustreError: 11380:0:(obd_mount_server.c:1389:server_start_targets()) failed to start server echo-MDT0000: -17
Apr 13 16:02:01 emds1 kernel: LustreError: 11380:0:(obd_mount_server.c:1882:server_fill_super()) Unable to start targets: -17
Apr 13 16:02:01 emds1 kernel: Lustre: Failing over echo-MDT0000
Apr 13 16:02:07 emds1 kernel: Lustre: 11380:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1586818921/real 1586818921] req@ffff8d748ab38000 x1663898946110400/t0(0) o251->MGC10.23.22.104@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1586818927 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 13 16:02:08 emds1 kernel: Lustre: server umount echo-MDT0000 complete
Apr 13 16:02:08 emds1 kernel: LustreError: 11380:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-17) |
| Comments |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Looking at usage over the last 30 weeks, something changed abruptly at the end of February and you can see it very clearly in the attached graph. We don't believe our workflows have changed. 'df -i' output shows it is only 35% full inode-wise.

We've tried mounting the MDT as ldiskfs and having a look around. Nothing obvious (to my eyes) immediately shows up as strange. I did try removing an old directory in there just to see if I could free up a bit of space (knowing this would leave orphans on the OSTs we'd later have to clean up with lfsck). Watching with 'df -k' I can see the used space go down, but the free space remains at 0. dumpe2fs -h shows free blocks.

No changelog users:

cat /proc/fs/lustre/mdd/echo-MDT0000/changelog_users
current index: 0
ID    index

Following that I tried remounting it and then using the MDS as a client to try and delete files. This immediately kernel panics.

df -i output:
/dev/md0               2463408928 842379642 1621029286 35% /mnt/mdt
10.23.22.104@tcp:/echo 2463408928 842379642 1621029286 35% /mnt/echo

df -k output:
/dev/md0                  3662011924    3662011924            0 100% /mnt/mdt
10.23.22.104@tcp:/echo 3156734724504 3022320163920 102578680848  97% /mnt/echo

dumpe2fs -h output:
Filesystem volume name:   echo-MDT0000
Last mounted on:          /
Filesystem UUID:          9ae293d6-bef3-4b02-a52d-1890fe21861b
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2463408928
Block count:              1539776448
Reserved block count:     76988415
Free blocks:              5959371
Free inodes:              1621029388
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         20472
Fragments per group:      20472
Inodes per group:         32752
Inode blocks per group:   8188
RAID stride:              32
RAID stripe width:        64
Flex block group size:    16
Filesystem created:       Fri Aug  9 11:07:47 2019
Last mount time:          Tue Apr 14 10:49:56 2020
Last write time:          Tue Apr 14 10:49:56 2020
Mount count:              74
Maximum mount count:      -1
Last checked:             Fri Aug  9 11:07:47 2019
Check interval:           0 (<none>)
Lifetime writes:          337 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               1024
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9d4302bc-bff9-46ea-a1d2-f020ad493de8
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             4100M
Journal length:           1048576
Journal sequence:         0x004018cc
Journal start:            1 |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Also worth noting, we tried an e2fsck on /dev/md0 (the block device under /mnt/mdt) - this threw a huge number of errors but most seemed to be down to not being able to change anything because the filesystem is full. One other thing we noticed which just seems strange is that the total size reported is 3.5TB, despite md0 being just over 6TB. The block count in dumpe2fs suggests that it should be just over 6TB. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Things you could check: do you have multiple MDTs? |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Note my previous comment was written before I had seen yours. It would be helpful to get some of the output from e2fsck.

The 3.5TB vs 6TB discrepancy is expected. The 2.4B inodes each consume 1KB, for a total of 2.4TB, and that space is not included in the total/used space for "df"; otherwise users would be unhappy because their filesystem was 40% full right after formatting.

There are still almost 6M blocks free, about 24GB, so it isn't clear to me what is "full". One possibility is that it is unhappy when the free block count is below the reserved block count; reaching that state should only be possible from writes by root. You could try setting the reserved block count to zero via "tune2fs -m 0 /dev/md0" and see if that allows you to modify the filesystem again.

As for what is consuming the space, it is pretty clear that something was aggressively and steadily consuming blocks from the MDT. You could also check that the oi.16.N files are not outrageously large. For 800M used inodes they might be in the neighborhood of 50GB in size. Not that we've seen a problem there, but just trying to figure it out. You could also check "du -sk" of the other top-level directories besides "ROOT/" to see if they are consuming a lot of space. |
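To put rough numbers on that reserved-block theory, here is the comparison using the dumpe2fs figures reported above (a quick shell sketch; the block size is 4096 bytes, so one block is 4 KB):

free_blocks=5959371         # "Free blocks" from dumpe2fs -h
reserved_blocks=76988415    # "Reserved block count" from dumpe2fs -h
echo $(( free_blocks * 4 )) KB free           # ~23.8 GB
echo $(( reserved_blocks * 4 )) KB reserved   # ~308 GB, the default root reserve
# free < reserved, which is consistent with the theory that non-root
# allocations already fail, and is what "tune2fs -m 0 /dev/md0" would clear.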
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Thanks for the quick reply. To go through the bits...
I have a 9.3GB screenlog from the e2fsck that I'm compressing with gzip -9 now. We stopped the fsck early because it seemed quite stuck. Let me know if there's somewhere I can upload a large file. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Following the tune2fs, mounted as ldiskfs it does now show free space:

/dev/md0  3690846660 3667009320 23837340 100% /mnt/ldiskfs-mount

When mounted as lustre it still shows 100%:

/dev/md0  3662011924 3662011924 0 100% /mnt/mdt

Unfortunately, trying to delete anything from the filesystem seems to have immediately panic'd it again. No output in dmesg before the crash, but two lines in syslog:

Apr 14 12:10:03 emds1 kernel: LustreError: 22194:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
Apr 14 12:10:03 emds1 kernel: LustreError: 22194:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
FYI, nothing outside of ROOT seems big:

emds1 /mnt/ldiskfs-mount # du -sh `ls -1 | grep -v ROOT`
0       BATCHID
4.0K    CATALOGS
0       changelog_catalog
0       changelog_users
1.5M    CONFIGS
8.0K    fld
0       hsm_actions
8.0K    last_rcvd
260K    LFSCK
4.0K    lfsck_bookmark
4.0K    lfsck_layout
8.0K    lfsck_namespace
16K     lost+found
4.0K    lov_objid
4.0K    lov_objseq
8.0K    NIDTBL_VERSIONS
8.0K    nodemap
19M     O
1.9G    oi.16.0
1.9G    oi.16.1
1.9G    oi.16.10
1.9G    oi.16.11
1.9G    oi.16.12
1.9G    oi.16.13
1.9G    oi.16.14
1.9G    oi.16.15
1.9G    oi.16.16
1.9G    oi.16.17
1.9G    oi.16.18
1.9G    oi.16.19
1.9G    oi.16.2
2.0G    oi.16.20
1.9G    oi.16.21
1.9G    oi.16.22
1.9G    oi.16.23
2.0G    oi.16.24
1.9G    oi.16.25
1.9G    oi.16.26
1.9G    oi.16.27
1.9G    oi.16.28
1.9G    oi.16.29
1.9G    oi.16.3
1.9G    oi.16.30
1.9G    oi.16.31
1.9G    oi.16.32
1.9G    oi.16.33
1.9G    oi.16.34
2.0G    oi.16.35
1.9G    oi.16.36
1.9G    oi.16.37
1.9G    oi.16.38
1.9G    oi.16.39
1.9G    oi.16.4
1.9G    oi.16.40
1.9G    oi.16.41
1.9G    oi.16.42
1.9G    oi.16.43
1.9G    oi.16.44
1.9G    oi.16.45
1.9G    oi.16.46
1.9G    oi.16.47
1.9G    oi.16.48
1.9G    oi.16.49
1.9G    oi.16.5
1.9G    oi.16.50
1.9G    oi.16.51
1.9G    oi.16.52
1.9G    oi.16.53
1.9G    oi.16.54
1.9G    oi.16.55
1.9G    oi.16.56
1.9G    oi.16.57
1.9G    oi.16.58
1.9G    oi.16.59
1.9G    oi.16.6
1.9G    oi.16.60
1.9G    oi.16.61
1.9G    oi.16.62
1.9G    oi.16.63
1.9G    oi.16.7
2.0G    oi.16.8
1.9G    oi.16.9
4.0K    OI_scrub
4.0K    PENDING
60K     quota_master
52K     quota_slave
4.0K    REMOTE_PARENT_DIR
4.0K    reply_data
4.0K    seq_ctl
4.0K    seq_srv
4.0K    update_log
4.0K    update_log_dir |
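As an aside, summing just the OI files from that listing gives their combined footprint (a rough shell sketch; the path assumes the same ldiskfs mount point used above):

du -csk /mnt/ldiskfs-mount/oi.16.* | tail -1
# From the human-readable sizes above: 60 files at ~1.9G plus 4 at ~2.0G,
# i.e. roughly 122G for the 64 OI files combined.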
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Can you get the full stack trace from the LBUG/crash via a serial console or IPMI, or just take a photo with a connected monitor? As for the e2fsck output, if it is many, many MB of the same message over and over again, I don't need to see all of that. The important thing would be the start of the run until it starts hitting errors, and then a sampling of the most common messages after that. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Also, collecting the full debug logs from Lustre during mount would be useful, using the following process:

mds# modprobe libcfs
mds# lctl set_param debug=-1 debug_mb=1024
mds# lctl set_param panic_on_lbug=0
mds# mount -t lustre /dev/md0 /mnt/mdt &
[wait until MDS is mounted]
client# [mount client, then unlink a file, presumably hitting an error]
mds# lctl dk | gzip -9 > /tmp/debug.log.gz |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Based on the statfs results from mounting as "lustre" and "ldiskfs", it looks like the difference is 0.78%, which exactly matches OSD_STATFS_RESERVED_SHIFT=7, or 1/128th of the space. In order to get above that threshold, it looks like we would need to free up about 5GB of space, but there doesn't appear to be any easy place to get that space. Since you are running Lustre 2.10 you won't have Data-on-MDT, so there shouldn't be (m)any large files in the ROOT/ directory, but there could be large (empty?) directories, which is why the e2fsck output may be interesting. Also, the debug log and/or stack trace from the MDS will shed light on what piece of code is hitting the problem. |
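For reference, the arithmetic behind the 0.78% and ~5GB figures, using the df -k totals from the earlier comments (a shell sketch; all values are 1K blocks):

ldiskfs_total=3690846660    # df -k total when mounted as ldiskfs
lustre_total=3662011924     # df -k total when mounted as lustre
ldiskfs_free=23837340       # df -k free when mounted as ldiskfs
echo $(( ldiskfs_total - lustre_total ))        # 28834736 KB withheld from the lustre statfs
echo $(( ldiskfs_total / 128 ))                 # 28834739 KB = 1/128th of the device
echo $(( ldiskfs_total / 128 - ldiskfs_free ))  # ~4997399 KB, i.e. the ~5GB still to free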
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Was watching the Dell iDRAC after setting the debug params you noted. Interestingly it let me rm a single file, so I ran a recursive rm on an old directory and it pretty much immediately hung up. The only errors output are:

Message from syslogd@emds1 at Apr 14 13:07:35 ...
kernel:LustreError: 14902:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:

Message from syslogd@emds1 at Apr 14 13:07:35 ...
kernel:LustreError: 14902:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

emds1 / #
emds1 / #
emds1 / #
emds1 / # lctl dk | gzip -9 > /tmp/debug.log.gz
emds1 / # lctl dk
Debug log: 0 lines, 0 kept, 0 dropped, 0 bad.

The e2fsck output is a lot of:

Inode XXXX is a zero-length directory. Clear?

and

Missing '.' in directory inode 988692386. Fix? yes
Setting filetype for entry '.' in
Error writing block 4246401459379240960 (Invalid argument). Ignore error? yes
Error writing block 4246401467969175552 (Invalid argument). Ignore error? yes
/ROOT/ARCHIVE/dirvish/filers/gungnir-vol/20200321/tree/vol/builds/dneg_houdini_coretools_binaries/4.0.3/a0fc0c34b5/bin (988692386) to 2.
Missing '..' in directory inode 988692386. Fix? yes
Setting filetype for entry '..' in /ROOT/ARCHIVE/dirvish/filers/gungnir-vol/20200321/tree/vol/builds/dneg_houdini_coretools_binaries/4.0.3/a0fc0c34b5/bin (988692386) to 2.
Error reading block 4246401476559110144 (Invalid argument) while reading directory block. Ignore error? yes
Force rewrite? yes
Directory inode 988692388, block #0, offset 0: directory corrupted
Salvage? yes

Probably worth saying the RAID itself has reported no errors anywhere I can see. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
As briefly mentioned before, we have a bunch of old stuff on this filesystem that we can definitely delete. In order to free up that extra 5GB or more, I could delete directly from the ldiskfs mount and would then have to rely on lfsck later to clean up the orphaned objects...? Also, since you mention we're on 2.10, we're not averse to upgrading if it brings more useful debug output or bug fixes etc. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
OK, based on your e2fsck summary, it is pointing to a problem in the ldiskfs code where it allocates the directory inode, then hits an out-of-space error trying to allocate the block for the directory, but doesn't clean up the directory inode on error. Then, when e2fsck runs on the filesystem, it tries to allocate blocks for all of these zero-length directories, fails, and then hits errors trying to insert "." and ".." entries into the non-existent directory blocks. That is just speculation based on the errors, and not something we can do anything about yet.

On the MDS, when you dumped the debug log, had you previously set debug=-1 to capture all the logs? You could check for files like /tmp/lustre-log.<timestamp> that may contain the log dumped by the LBUG (if the MDS has not been rebooted yet). Also, is there a stack trace shown by dmesg? That would also be useful to attach here. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Stack trace from dmesg (strangely wasn't on the iDRAC console, so I'd missed it):

[ 1060.397830] LustreError: 11759:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
[ 1060.399600] LustreError: 11759:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
[ 1060.401333] Pid: 11759, comm: mdt00_006 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon Mar 25 19:46:55 UTC 2019
[ 1060.401335] Call Trace:
[ 1060.401353]  [<ffffffffc0a337cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 1060.401372]  [<ffffffffc0a3387c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 1060.401382]  [<ffffffffc153b159>] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 1060.401398]  [<ffffffffc152a299>] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 1060.401408]  [<ffffffffc1487c6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 1060.401431]  [<ffffffffc146c4a5>] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 1060.401445]  [<ffffffffc1477ac6>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 1060.401458]  [<ffffffffc147884b>] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 1060.401470]  [<ffffffffc14d2531>] mdd_declare_finish_unlink+0x91/0x1f0 [mdd]
[ 1060.401484]  [<ffffffffc14e312b>] mdd_unlink+0x52b/0xbf0 [mdd]
[ 1060.401498]  [<ffffffffc13ca898>] mdt_reint_unlink+0xc28/0x11d0 [mdt]
[ 1060.401522]  [<ffffffffc13cdca3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 1060.401541]  [<ffffffffc13af18b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 1060.401555]  [<ffffffffc13bafa7>] mdt_reint+0x67/0x140 [mdt]
[ 1060.401570]  [<ffffffffc0fbd7ca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 1060.401654]  [<ffffffffc0f6605b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 1060.401707]  [<ffffffffc0f697a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 1060.401778]  [<ffffffff94ec1c31>] kthread+0xd1/0xe0
[ 1060.401795]  [<ffffffff95574c37>] ret_from_fork_nospec_end+0x0/0x39
[ 1060.401834]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 1060.401947] LustreError: dumping log to /tmp/lustre-log.1586896595.11759
[ 1065.165171] Lustre: echo-MDT0000: Connection restored to MGC10.23.22.104@tcp_0 (at 0@lo)
[ 1065.165180] Lustre: Skipped 35 previous similar messages
[ 1201.484927] INFO: task mdt00_006:11759 blocked for more than 120 seconds.
[ 1201.486843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1201.488696] mdt00_006       D ffffa15403355140     0 11759      2 0x00000000
[ 1201.488703] Call Trace:
[ 1201.488733]  [<ffffffffc0a3b510>] ? libcfs_debug_dumplog+0xc0/0x180 [libcfs]
[ 1201.488748]  [<ffffffff95567c49>] schedule+0x29/0x70
[ 1201.488761]  [<ffffffffc0a338ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
[ 1201.488781]  [<ffffffffc153b159>] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 1201.488797]  [<ffffffffc152a299>] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 1201.488826]  [<ffffffffc1487c6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 1201.488841]  [<ffffffffc146c420>] ? lod_xattr_list+0x150/0x150 [lod]
[ 1201.488856]  [<ffffffffc146c4a5>] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 1201.488873]  [<ffffffffc1477ac6>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 1201.488888]  [<ffffffffc147884b>] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 1201.488911]  [<ffffffffc14e82a1>] ? mdd_env_info+0x21/0x60 [mdd]
[ 1201.488923]  [<ffffffffc14d2531>] mdd_declare_finish_unlink+0x91/0x1f0 [mdd]
[ 1201.488938]  [<ffffffffc14e312b>] mdd_unlink+0x52b/0xbf0 [mdd]
[ 1201.488965]  [<ffffffffc13ca898>] mdt_reint_unlink+0xc28/0x11d0 [mdt]
[ 1201.488988]  [<ffffffffc13cdca3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 1201.489006]  [<ffffffffc13af18b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 1201.489024]  [<ffffffffc13bafa7>] mdt_reint+0x67/0x140 [mdt]
[ 1201.489118]  [<ffffffffc0fbd7ca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 1201.489196]  [<ffffffffc0f6605b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 1201.489273]  [<ffffffffc0f62688>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 1201.489281]  [<ffffffff94ed67c2>] ? default_wake_function+0x12/0x20
[ 1201.489289]  [<ffffffff94ecba9b>] ? __wake_up_common+0x5b/0x90
[ 1201.489364]  [<ffffffffc0f697a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 1201.489441]  [<ffffffffc0f68d10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 1201.489449]  [<ffffffff94ec1c31>] kthread+0xd1/0xe0
[ 1201.489455]  [<ffffffff94ec1b60>] ? insert_kthread_work+0x40/0x40
[ 1201.489461]  [<ffffffff95574c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 1201.489467]  [<ffffffff94ec1b60>] ? insert_kthread_work+0x40/0x40

Attaching the lustre-log file. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
I wouldn't suggest upgrading to 2.12 before this situation is resolved. I am not aware of any particular fix that would address this issue. It's not clear whether the "delete 5GB of space" will help, or if there is some other data structure that is having problems. AFAICS, the 0.78% free threshold for reserved space is only affecting the df output, but there is already 23GB of free space on the MDT, and at least root should be able to delete files, if the MDS didn't keep crashing.

Getting the console log (via dmesg if the MDS doesn't actually reboot) would be helpful. You might need to increase the debug level for the console messages if it isn't showing anything, like "echo 7 > /proc/sys/kernel/printk" to print more information to the console.

One other question - what version of e2fsck are you using? You should use the latest release (1.45.2-wc1) from https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/RPMS/x86_64/ which may help with repairing or removing the broken directories. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
We are using:

e2fsprogs-libs-1.42.13.wc6-7.el7.x86_64
e2fsprogs-1.42.13.wc6-7.el7.x86_64

I can upgrade these two now. Do you recommend starting another e2fsck at this point, or waiting to see what the log/stack trace show first? |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Is it possible that you have a tunable parameter in your config log to set debug=0? It seems that there are lots of debug logs until a certain point in the mount, then it stops entirely until the osp_sync_declare_add()) ASSERTION( ctxt ) failed message is printed:

:
:
00000100:02000000:5.0:1586896576.159231:0:11420:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) echo-MDT0000: Connection restored to 10.23.22.105@tcp (at 10.23.22.105@tcp)
00000100:02000000:1.0:1586896576.159236:0:11674:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) echo-MDT0000: Connection restored to 10.23.22.105@tcp (at 10.23.22.105@tcp)
00000004:00040000:20.0:1586896595.964383:0:11759:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
00000004:00040000:20.0:1586896595.966157:0:11759:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

so there is about a 20s gap in the logs. It seems that after mounting the MDS you may need to run "lctl set_param debug=-1" again and then re-capture the logs to get the background of what the client was doing before the LBUG, but I'm still looking through what is in the current log to see what I can find. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
I've been running that debug=-1 line each time. Do you know where I'd look for the tunable you mention? Edit: Nevermind - just re-read your comment and see I need to set it again after the mount. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
I would recommend updating e2fsck, and then running a read-only "e2fsck -fn" (under script so the output is logged) to see if it behaves better. The stack trace at least shows us what the client is doing that is causing problems. Clearly unlinking, which isn't a surprise, but details are helpful.

Is it possible that you changed the default layout at the start of March to have a very wide striping, or to use PFL with a large number of components, which would cause all the files to allocate large xattr blocks? Alternately, did you add a lot of ACLs, or applications using other xattrs (SELinux?) that might have done the same? |
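One way to capture that read-only run in a log, assuming the same device as above (the log filename is just an example):

script -c "e2fsck -fn /dev/md0" /root/e2fsck-fn.$(date +%Y%m%d).log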
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
None of those were done as far as I know. It's just been sat there for months being a target for rsync to back up the same set of servers. I'll try running the fsck now in a logged screen. Guessing this will take several hours so if there's anything else you'd like me to do first, let me know. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
Isn't this just the time I'm spending navigating to the directory where I'm going to do the rm? I think I mentioned before that I'm using the MDS as the client. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Hmm, is your backup something like "BackupPC" which is creating large numbers of hard links to the same files? That would also cause Lustre to allocate an external xattr block for each file after a certain number of hard links to each file. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Yes - we use dirvish which does create hardlinks for incrementals. We have been using it for years though without change. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Given the nature of what's on here and how long the fsck is going to take to run, I'm tempted to run it with -y unless you recommend otherwise now that those 23GB are free. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Another engineer working on this issue has informed me that this problem may already be resolved via patch https://review.whamcloud.com/38224. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Would we need to upgrade/downgrade to any particular version? |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
There is a build at https://build.whamcloud.com/job/lustre-reviews/73314/ for RHEL7.6 x86_64 that should be installed on the MDS and preferably OSS nodes also. It is based on the 2.10.8 release, or you can build your own if that is normally what you do. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
Note that you should be able to install the Lustre server kernel on pretty much any RHEL7 distro, unless you have other kernel modules (e.g. IB) that also have dependencies. In that case, you might be better off to rebuild the Lustre modules for your specific kernel. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
I'll install that on the OSSs now while I wait on the fsck. The fsck seems to be running faster than expected: 22% through on pass 1. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
The previously-mentioned patch should address the LBUG when unlinking files. However, there are still directories that have problems due to running out of space that will need e2fsck to fix. I suspect if you try to delete them via Lustre or ldiskfs they will cause the MDT to be remounted read-only (-30 error). |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Ok. I'm gonna leave the e2fsck to complete (hopefully) and in the meantime I'll install the updated lustre packages. Once it's done, reboots all 'round and I'll let you know how it goes. Thanks for all your help so far. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
The other problem to understand/address is what is consuming the space on the MDT. You could check this if you know some files that were backed up recently, and run "debugfs -c -R 'stat /ROOT/path/to/backup/file' /dev/md0" to dump the inode information. The "catastrophic mode" message is not cause for alarm, just the "-c" flag skipping loading of unnecessary filesystem metadata to run faster. This should dump something like:

debugfs 1.45.2.wc1 (27-May-2019)
/tmp/testfs-mdt1: catastrophic mode - not reading inode or group bitmaps
Inode: 171   Type: regular    Mode:  0644   Flags: 0x0
Generation: 225492012    Version: 0x00000005:000099b6
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 3525837
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e8d87f3:00000000 -- Wed Apr  8 02:14:43 2020
 atime: 0x5e8d87f3:00000000 -- Wed Apr  8 02:14:43 2020
 mtime: 0x5e8d87f3:00000000 -- Wed Apr  8 02:14:43 2020
crtime: 0x5e8d87f3:45e6e870 -- Wed Apr  8 02:14:43 2020
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 42 23 00 00 02 00 00 00 bd 03 00 00 00 00 00 00
  lma: fid=[0x200002342:0x3bd:0x0] compat=0 incompat=0
  trusted.lov (56)
  security.selinux (37) = "unconfined_u:object_r:unlabeled_t:s0\000"
  trusted.link (968)
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
BLOCKS:

Of interest is the "File ACL:" line being non-zero (indicating an external xattr block) and the "trusted.link (968)" xattr line showing that it is large (at least larger than can fit into the remaining ~450 bytes of xattr space available in the inode). If that is the case, you might be able to reduce this xattr space usage by removing one of your older backup copies, or some old directory trees that have lots of files with these large xattrs. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
You could also use "debugfs -c -R 'stat <inode_number> [<inode_number> ...]' /dev/md0" to dump the attributes of a few problematic inodes previously reported by e2fsck, if they haven't already been fixed. That might give us an idea of what went wrong with them. |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
I'll do that as soon as the fsck completes and before I mount as lustre to try and delete stuff. It's certainly likely that we'll exceed the 450 bytes on almost all the files, because we maintain (usually) 2 or 3 weeks of daily hard-linked backups for each server, so up to roughly 21 hard links per file. Again, though, this hasn't changed recently; it's been this way for some time, and I'm not sure what would have caused such a steep linear increase in the number of files/space used. |
| Comment by Andreas Dilger [ 14/Apr/20 ] |
|
When you have a chance (not urgent) it would be useful to check whether the trend for "df -i" on the MDT (if you track this history) is increasing similarly to what is shown in df.png. That would at least tell us if the extra usage is new files (if both used space and inodes are growing at about the same rate) or extra space usage per file or elsewhere (if space usage is increasing while inode usage is roughly constant). |
| Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ] |
|
Very annoyingly we noticed that we didn't have inode stats turned on in collectd. We've corrected this but only as of today unfortunately. |
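Until the newly enabled collectd inode stats accumulate some history, a minimal cron-driven stopgap for sampling both numbers might look like this (the mount point is the one from this ticket; the script path and log file are just examples):

#!/bin/bash
# /usr/local/sbin/mdt-df-sample.sh, run hourly via cron, e.g.:
#   0 * * * * root /usr/local/sbin/mdt-df-sample.sh
{
  echo "$(date +%FT%T) $(df -k /mnt/mdt | tail -1)"   # block usage of the MDT
  echo "$(date +%FT%T) $(df -i /mnt/mdt | tail -1)"   # inode usage of the MDT
} >> /var/log/mdt-df-history.log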
| Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ] |
|
Unfortunately:

emds1 /root # e2fsck -fvy -C 0 /dev/md0
e2fsck 1.45.2.wc1 (27-May-2019)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Internal error: couldn't find dir_info for 2391487120.
e2fsck: aborted |
| Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ] |
|
A couple of debugfs examples:

debugfs -c -R 'stat /ROOT/ARCHIVE/dirvish/filers/vannfs31/20200328/tree/user_data/CRB_RESTORES/IO-82556/CRB/ldev_2d_elements/SCAN/S_ldev_2d_elements_blood_element_squib_large_flat_a_002_s01/2156x1806/s_ldev_2d_elements_blood_element_squib_large_flat_a_002_s01.1117.exr' /dev/md0
debugfs 1.45.2.wc1 (27-May-2019)
/dev/md0: catastrophic mode - not reading inode or group bitmaps
Inode: 1734097604   Type: regular    Mode:  0444   Flags: 0x0
Generation: 294602363    Version: 0x00000025:b9345598
User:  4014   Group:    20   Project:     0   Size: 0
File ACL: 1084109575
Links: 22   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e901af0:00000000 -- Fri Apr 10 00:06:24 2020
 atime: 0x5e6b3965:00000000 -- Fri Mar 13 00:42:29 2020
 mtime: 0x5a21f22f:00000000 -- Fri Dec  1 16:22:07 2017
crtime: 0x5e6b3965:c27fd8ec -- Fri Mar 13 00:42:29 2020
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 68 82 00 00 02 00 00 00 83 ac 01 00 00 00 00 00
  lma: fid=[0x200008268:0x1ac83:0x0] compat=0 incompat=0
  trusted.lov (56)
  trusted.link (1916)
BLOCKS:

debugfs -c -R 'stat /ROOT/ARCHIVE/dirvish/filers/gungnir-vol/20200404/tree/vol/builds/usd/0.7.0/e2f93f71e4/lib/python/pxr/Pcp/__init__.py' /dev/md0
debugfs 1.45.2.wc1 (27-May-2019)
/dev/md0: catastrophic mode - not reading inode or group bitmaps
Inode: 820727609   Type: regular    Mode:  0644   Flags: 0x0
Generation: 3289393243    Version: 0x00000025:9a66a677
User:   518   Group:    20   Project:     0   Size: 0
File ACL: 0
Links: 6   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e95e738:6ca5e29c -- Tue Apr 14 09:39:20 2020
 atime: 0x5d63cb50:00000000 -- Mon Aug 26 05:06:40 2019
 mtime: 0x579ffa2d:00000000 -- Mon Aug  1 18:41:01 2016
crtime: 0x5d63cb50:820f5640 -- Mon Aug 26 05:06:40 2019
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 f5 16 00 00 02 00 00 00 15 67 01 00 00 00 00 00
  lma: fid=[0x2000016f5:0x16715:0x0] compat=0 incompat=0
  trusted.lov (56)
  trusted.link (285)
BLOCKS: |
| Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ] |
|
I am, however, now able to delete files without a panic. |
| Comment by Andreas Dilger [ 15/Apr/20 ] |
|
The dir_info e2fsck error appears to be the same as a previously reported e2fsck bug. That bug was hit at another site that has a very large number of directories (over 180M directories), which is unusual for most cases, but in the case of your backup trees there are lots of directories with relatively few files each. The updated e2fsck was confirmed to fix the problem on their filesystem. |
| Comment by Andreas Dilger [ 15/Apr/20 ] |
|
As for the debugfs stat output, it definitely shows that the "link" xattr is large in at least some cases, and would consume an extra block for each such inode. Also, based on the previous e2fsck issue, it seems that there are a very large number of directories compared to regular files, and each directory will also consume at least one block. |
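A rough back-of-the-envelope check of that, using the trusted.link sizes from the debugfs samples above (a sketch; the per-link cost varies with filename length):

echo $(( 1916 / 22 ))   # ~87 bytes per hard link for the 22-link file
echo $(( 285 / 6 ))     # ~47 bytes per hard link for the 6-link file
# With only ~450 bytes of in-inode xattr space to spare, somewhere around
# 5-10 hard links per file is enough to spill into an external 4KB xattr block.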
| Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ] |
|
Our directory trees are ridiculously deep and overused for structuring data, so this doesn't surprise me. I'm still not sure what changed at the end of Feb, though, so we're going to have to watch this carefully. Checking this morning, backups are still running and seem to be somewhat stable, so I'll let a good backup complete and then try to take it offline to run a new e2fsck. After we've had a successful e2fsck, I'd like to upgrade to 2.12.4, but would it be sensible to run an lfsck prior to doing that, or after, to get all the updates/bug-fixes? |
| Comment by Andreas Dilger [ 15/Apr/20 ] |
|
If you run e2fsck on the MDT to repair any problems in the local MDT filesystem, then running LFSCK is not strictly required, as it is mostly doing garbage collection and handling cases where there is some inconsistency between the MDT and OSTs. Generally, LFSCK has been getting better with newer releases of Lustre, so it is probably better to wait until after the upgrade if you want to run it, and unless there are visible problems with the filesystem you may want to wait until there is a good time to run it (e.g. planned system outage). |
| Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ] |
|
Will do. The priority of this ticket can be dropped if you like since the filesystem is now up and running. I'll continue to report back on the e2fsck progress. Thank you for your help getting things working so quickly. |
| Comment by Peter Jones [ 18/Apr/20 ] |
|
Any update, cmcl? |
| Comment by Andreas Dilger [ 19/Apr/20 ] |
|
I ran some tests on a local filesystem with master, filling up the MDT with DOM files and directories, and while there were some -28 = -ENOSPC errors printed on the console, I didn't have any problems with deleting the files afterward. |
| Comment by Campbell Mcleay (Inactive) [ 20/Apr/20 ] |
|
After deleting a huge number of files and getting a couple of good backups through, I managed to run an fsck over the weekend (results attached). With that done, everything remounted and seemed to function fine. I've since upgraded to 2.12.4 and we seem to be back in business. Thanks again for all your help. |
| Comment by Peter Jones [ 20/Apr/20 ] |
|
Excellent - thanks for the update! |