[LU-13452] MDT is 100% full, cannot delete files Created: 14/Apr/20  Updated: 20/Apr/20  Resolved: 20/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Campbell Mcleay (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 7.2.1511, lustre version 2.10.7-1


Attachments: PNG File df.png     File lustre-log.1586896595.11759.gz     File lustre-log.1586896595.11759.txt.gz     File screenlog.0.gz    
Issue Links:
Related
is related to LU-12674 osp should handle -EINPROGRESS on llo... Resolved
is related to LU-13197 e2fsck fails with 'Internal error: co... Resolved
is related to LU-13453 ENOSPC on OI insert shouldn't leak inodes Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The MDT filesystem is 100% full and we cannot free space on it. The MDS crashes (kernel panic) when we try to delete files.

Apr 13 16:01:50 emds1 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc

Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(osd_handler.c:7131:osd_mount()) echo-MDT0000-osd: failed to set lma on /dev/md0 root inode

Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_config.c:558:class_setup()) setup echo-MDT0000-osd failed (-30)

Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_mount.c:203:lustre_start_simple()) echo-MDT0000-osd setup error -30

Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_mount_server.c:1848:server_fill_super()) Unable to start osd on /dev/md0: -30

Apr 13 16:01:50 emds1 kernel: LustreError: 11368:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-30)

Apr 13 16:02:01 emds1 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc

Apr 13 16:02:01 emds1 kernel: Lustre: MGS: Connection restored to 8f792be4-fada-1d75-0dbd-ec8601cdce7f (at 0@lo)

Apr 13 16:02:01 emds1 kernel: LustreError: 11438:0:(genops.c:478:class_register_device()) echo-OST0000-osc-MDT0000: already exists, won't add

Apr 13 16:02:01 emds1 kernel: LustreError: 11438:0:(obd_config.c:1682:class_config_llog_handler()) MGC10.23.22.104@tcp: cfg command failed: rc = -17

Apr 13 16:02:01 emds1 kernel: Lustre:    cmd=cf001 0:echo-OST0000-osc-MDT0000  1:osp  2:echo-MDT0000-mdtlov_UUID  

Apr 13 16:02:01 emds1 kernel: LustreError: 15c-8: MGC10.23.22.104@tcp: The configuration from log 'echo-MDT0000' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.

Apr 13 16:02:01 emds1 kernel: LustreError: 11380:0:(obd_mount_server.c:1389:server_start_targets()) failed to start server echo-MDT0000: -17

Apr 13 16:02:01 emds1 kernel: LustreError: 11380:0:(obd_mount_server.c:1882:server_fill_super()) Unable to start targets: -17

Apr 13 16:02:01 emds1 kernel: Lustre: Failing over echo-MDT0000

Apr 13 16:02:07 emds1 kernel: Lustre: 11380:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1586818921/real 1586818921]  req@ffff8d748ab38000 x1663898946110400/t0(0) o251->MGC10.23.22.104@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1586818927 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Apr 13 16:02:08 emds1 kernel: Lustre: server umount echo-MDT0000 complete

Apr 13 16:02:08 emds1 kernel: LustreError: 11380:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-17)


 Comments   
Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Looking at usage over the last 30 weeks, something changed abruptly at the end of February; it is very obvious in the attached graph. We don't believe our workflows have changed.

'df -i' output shows it is only 35% full inode-wise.

We've tried mounting the MDT as ldiskfs and having a look around. Nothing obvious (to my eyes) immediately stands out as strange.

I did try removing an old directory in there just to see if I could free up a bit of space (knowing this would leave orphans on the OSTs we'd later have to clean up with lfsck). Watching with 'df -k' I can see the used space go down but the free space remains at 0.

dumpe2fs -h shows free blocks.

No changelog users:

cat /proc/fs/lustre/mdd/echo-MDT0000/changelog_users 
current index: 0
ID    index

Following that, I tried remounting it and then using the MDS as a client to try to delete files. This immediately causes a kernel panic.

df -i output:

/dev/md0               2463408928 842379642 1621029286   35% /mnt/mdt
10.23.22.104@tcp:/echo 2463408928 842379642 1621029286   35% /mnt/echo

df -k output:

/dev/md0                  3662011924    3662011924            0 100% /mnt/mdt
10.23.22.104@tcp:/echo 3156734724504 3022320163920 102578680848  97% /mnt/echo

dumpe2fs -h output:

Filesystem volume name:   echo-MDT0000
Last mounted on:          /
Filesystem UUID:          9ae293d6-bef3-4b02-a52d-1890fe21861b
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2463408928
Block count:              1539776448
Reserved block count:     76988415
Free blocks:              5959371
Free inodes:              1621029388
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         20472
Fragments per group:      20472
Inodes per group:         32752
Inode blocks per group:   8188
RAID stride:              32
RAID stripe width:        64
Flex block group size:    16
Filesystem created:       Fri Aug  9 11:07:47 2019
Last mount time:          Tue Apr 14 10:49:56 2020
Last write time:          Tue Apr 14 10:49:56 2020
Mount count:              74
Maximum mount count:      -1
Last checked:             Fri Aug  9 11:07:47 2019
Check interval:           0 (<none>)
Lifetime writes:          337 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          1024
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9d4302bc-bff9-46ea-a1d2-f020ad493de8
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             4100M
Journal length:           1048576
Journal sequence:         0x004018cc
Journal start:            1
Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Also worth noting, we tried an e2fsck on /dev/md0 (the block device under /mnt/mdt) - this threw a huge number of errors but most seemed to be down to not being able to change anything because the filesystem is full.

One other thing we noticed which just seems strange is that the total size reported is 3.5TB, despite md0 being just over 6TB. The block count in dumpe2fs suggests that it should be just over 6TB.

Comment by Andreas Dilger [ 14/Apr/20 ]

Things you could check:

  • run "e2fsck -fp" on the MDS to check for errors
  • mount as type ldiskfs and check the "lost+found" directory to see if there are files there
  • run "du -sk O/*" to see if there are a lot of llog files or other objects that are not visible in the namespace. A particular directory there (e.g. "1" or "10") might be consuming a lot of space for ChangeLogs, recovery logs, unlink logs, etc.

Do you have multiple MDTs?
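
A minimal sketch of the checks above, assuming the MDT device is /dev/md0 (unmounted for the e2fsck) and an arbitrary scratch mountpoint /mnt/mdt-ldiskfs:

mds# e2fsck -fp /dev/md0
mds# mount -t ldiskfs -o ro /dev/md0 /mnt/mdt-ldiskfs
mds# ls /mnt/mdt-ldiskfs/lost+found
mds# du -sk /mnt/mdt-ldiskfs/O/*
mds# umount /mnt/mdt-ldiskfs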

Comment by Andreas Dilger [ 14/Apr/20 ]

Note my previous comment was written without having seen your comment first. It would be helpful to get some of the output from e2fsck.

The 3.5TB vs 6TB discrepancy is expected. The 2.4B inodes each consume 1KB, for a total of 2.4TB, and that space is not included in the total/used space for "df"; otherwise users would be unhappy to see their filesystem 40% full immediately after formatting.
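
Back-of-the-envelope, from the dumpe2fs and df figures above (rough numbers; the remaining few tens of GB go to the journal and other internal reservations):

  1,539,776,448 blocks x 4 KiB ≈ 6,159,105,792 KiB   (the ~6 TB raw size)
  2,463,408,928 inodes x 1 KiB ≈ 2,463,408,928 KiB   (the ~2.3 TB inode table)
  6,159,105,792 - 2,463,408,928 ≈ 3,695,696,864 KiB  (roughly the ~3.5 TB total that df reports)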

There are still almost 6M blocks free, about 24GB, so it isn't clear to me what is "full"? One possibility is that it is unhappy when the free blocks count < reserved block count? This should only be possible from writes by root. You could try setting the reserved block count to zero via "tune2fs -m 0 /dev/md0" and see if that allows you to modify the filesystem again.

As for what is consuming the space, it is pretty clear that something was aggressively and steadily consuming blocks from the MDT. You could also check that the oi.16.N files are not outrageously large; for 800M used inodes they might be in the neighborhood of 50GB in size. Not that we've seen a problem there, but it is worth ruling out. You could also check "du -sk" of the other top-level directories besides "ROOT/" to see if they are consuming a lot of space.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Thanks for the quick reply. To go through the bits...

  • lost+found is empty
  • du -sk O/* shows O/1 19M and O/10 136K
  • Only one MDT
  • I'll try the tune2fs and report back

I have a 9.3GB screenlog from the e2fsck that I'm just gzip -9 ing. We stopped the fsck early because it seemed quite stuck. Let me know if there's somewhere I can upload a large file.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Following the tune2fs, mounted as ldiskfs it does now show free space:

/dev/md0          3690846660 3667009320  23837340 100% /mnt/ldiskfs-mount

When mounted as lustre it still shows 100%:

/dev/md0          3662011924 3662011924         0 100% /mnt/mdt

Unfortunately trying to delete anything from the filesystem seems to have immediately panic'd it again. No output in dmesg before the crash but two lines in syslog:

Apr 14 12:10:03 emds1 kernel: LustreError: 22194:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed: 
Apr 14 12:10:03 emds1 kernel: LustreError: 22194:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

FYI, nothing outside of ROOT seems big:

emds1 /mnt/ldiskfs-mount # du -sh `ls -1 | grep -v ROOT`
0	BATCHID
4.0K	CATALOGS
0	changelog_catalog
0	changelog_users
1.5M	CONFIGS
8.0K	fld
0	hsm_actions
8.0K	last_rcvd
260K	LFSCK
4.0K	lfsck_bookmark
4.0K	lfsck_layout
8.0K	lfsck_namespace
16K	lost+found
4.0K	lov_objid
4.0K	lov_objseq
8.0K	NIDTBL_VERSIONS
8.0K	nodemap
19M	O
1.9G	oi.16.0
1.9G	oi.16.1
1.9G	oi.16.10
1.9G	oi.16.11
1.9G	oi.16.12
1.9G	oi.16.13
1.9G	oi.16.14
1.9G	oi.16.15
1.9G	oi.16.16
1.9G	oi.16.17
1.9G	oi.16.18
1.9G	oi.16.19
1.9G	oi.16.2
2.0G	oi.16.20
1.9G	oi.16.21
1.9G	oi.16.22
1.9G	oi.16.23
2.0G	oi.16.24
1.9G	oi.16.25
1.9G	oi.16.26
1.9G	oi.16.27
1.9G	oi.16.28
1.9G	oi.16.29
1.9G	oi.16.3
1.9G	oi.16.30
1.9G	oi.16.31
1.9G	oi.16.32
1.9G	oi.16.33
1.9G	oi.16.34
2.0G	oi.16.35
1.9G	oi.16.36
1.9G	oi.16.37
1.9G	oi.16.38
1.9G	oi.16.39
1.9G	oi.16.4
1.9G	oi.16.40
1.9G	oi.16.41
1.9G	oi.16.42
1.9G	oi.16.43
1.9G	oi.16.44
1.9G	oi.16.45
1.9G	oi.16.46
1.9G	oi.16.47
1.9G	oi.16.48
1.9G	oi.16.49
1.9G	oi.16.5
1.9G	oi.16.50
1.9G	oi.16.51
1.9G	oi.16.52
1.9G	oi.16.53
1.9G	oi.16.54
1.9G	oi.16.55
1.9G	oi.16.56
1.9G	oi.16.57
1.9G	oi.16.58
1.9G	oi.16.59
1.9G	oi.16.6
1.9G	oi.16.60
1.9G	oi.16.61
1.9G	oi.16.62
1.9G	oi.16.63
1.9G	oi.16.7
2.0G	oi.16.8
1.9G	oi.16.9
4.0K	OI_scrub
4.0K	PENDING
60K	quota_master
52K	quota_slave
4.0K	REMOTE_PARENT_DIR
4.0K	reply_data
4.0K	seq_ctl
4.0K	seq_srv
4.0K	update_log
4.0K	update_log_dir
Comment by Andreas Dilger [ 14/Apr/20 ]

Can you get the full stack trace from the LBUG/crash via a serial console or IPMI, or just take a photo of the console on a connected monitor?

As for the e2fsck output, if it is many, many MB of the same message over and over again, I don't need to see all of that. The important thing would be the start of the run until it starts hitting errors, and then a sampling of the most common messages after that.

Comment by Andreas Dilger [ 14/Apr/20 ]

Also, collecting the full debug logs from Lustre during mount would be useful using the following process:

mds# modprobe libcfs
mds# lctl set_param debug=-1 debug_mb=1024
mds# lctl set_param panic_on_lbug=0
mds# mount -t lustre /dev/md0 /mnt/mdt &
[wait until MDS is mounted]
client# [mount client, then unlink a file, presumably hitting an error]
mds# lctl dk | gzip -9 > /tmp/debug.log.gz
Comment by Andreas Dilger [ 14/Apr/20 ]

Based on the statfs results from mounting as "lustre" and "ldiskfs", it looks like the difference is 0.78%, which exactly matches OSD_STATFS_RESERVED_SHIFT=7, or 1/128th of the space. In order to get above that threshold, it looks like we would need to free up about 5GB of space, but there doesn't appear to be any easy place to get that space. Since you are running Lustre 2.10 you won't have Data-on-MDT, so there shouldn't be (m)any large files in the ROOT/ directory, but there could be large (empty?) directories, which is why the e2fsck output may be interesting.
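
Roughly, from the two df outputs above:

  ldiskfs total:  3,690,846,660 KiB
  lustre  total:  3,662,011,924 KiB
  difference:        28,834,736 KiB ≈ 3,690,846,660 / 128  (the 1/128 reservation)
  ldiskfs free:      23,837,340 KiB
  shortfall:      28,834,736 - 23,837,340 ≈ 4,997,396 KiB ≈ 5 GB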

Also, the debug log and/or stack trace from the MDS will shed light on what piece of code is hitting the problem.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

I was watching the Dell iDRAC after setting the debug params you noted. Interestingly, it let me rm a single file, so I ran a recursive rm on an old directory, and it pretty much immediately hung up. The only errors output were:

Message from syslogd@emds1 at Apr 14 13:07:35 ...
 kernel:LustreError: 14902:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed: 

Message from syslogd@emds1 at Apr 14 13:07:35 ...
 kernel:LustreError: 14902:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

emds1 / # 
emds1 / # 
emds1 / # 
emds1 / # lctl dk | gzip -9 > /tmp/debug.log.gz
emds1 / # lctl dk 
Debug log: 0 lines, 0 kept, 0 dropped, 0 bad.

The e2fsck output is a lot of:

Inode XXXX is a zero-length directory. Clear?

and

Missing '.' in directory inode 988692386.
Fix? yes

Setting filetype for entry '.' in /ROOT/ARCHIVE/dirvish/filers/gungnir-vol/20200321/tree/vol/builds/dneg_houdini_coretools_binaries/4.0.3/a0fc0c34b5/bin (988692386) to 2.

Error writing block 4246401459379240960 (Invalid argument).  Ignore error? yes

Error writing block 4246401467969175552 (Invalid argument).  Ignore error? yes
Missing '..' in directory inode 988692386.
Fix? yes

Setting filetype for entry '..' in /ROOT/ARCHIVE/dirvish/filers/gungnir-vol/20200321/tree/vol/builds/dneg_houdini_coretools_binaries/4.0.3/a0fc0c34b5/bin (988692386) to 2.
Error reading block 4246401476559110144 (Invalid argument) while reading directory block.  Ignore error? yes

Force rewrite? yes

Directory inode 988692388, block #0, offset 0: directory corrupted
Salvage? yes

Probably worth saying the RAID itself has reported no errors anywhere I can see.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

it looks like we would need to free up about 5GB of space

As briefly mentioned before, we have a bunch of old stuff on this filesystem that we can definitely delete. To free up that extra 5GB or more, could I remove files directly from the ldiskfs mount and then rely on lfsck later to clean up the orphaned objects?

Also since you mention we're on 2.10, we're not averse to upgrading if it brings with it more useful debug output or bug fixes etc.

Comment by Andreas Dilger [ 14/Apr/20 ]

OK, based on your e2fsck summary, it is pointing to a problem in the ldiskfs code where it allocates the directory inode, then hits an out-of-space error trying to allocate the block for the directory, but doesn't clean up the directory inode on error. Then, when e2fsck is run on the filesystem, it tries to allocate blocks for all of these zero-length directories and fails, and then hits errors trying to insert "." and ".." entries into the non-existent directory blocks. That is just speculation based on the errors, and not something we can do anything about yet.

On the MDS, when you dumped the debug log, did you previously set debug=-1 to capture all the logs? You could check for files like /tmp/lustre-log.<timestamp> that may contain the log dumped by the LBUG (if MDS has not been rebooted yet). Also, is there a stack trace shown by dmesg? That would also be useful to attach here.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Stack trace from dmesg (strangely wasn't on iDRAC console so I'd missed it):

[ 1060.397830] LustreError: 11759:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed: 
[ 1060.399600] LustreError: 11759:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
[ 1060.401333] Pid: 11759, comm: mdt00_006 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon Mar 25 19:46:55 UTC 2019
[ 1060.401335] Call Trace:
[ 1060.401353]  [<ffffffffc0a337cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 1060.401372]  [<ffffffffc0a3387c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 1060.401382]  [<ffffffffc153b159>] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 1060.401398]  [<ffffffffc152a299>] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 1060.401408]  [<ffffffffc1487c6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 1060.401431]  [<ffffffffc146c4a5>] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 1060.401445]  [<ffffffffc1477ac6>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 1060.401458]  [<ffffffffc147884b>] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 1060.401470]  [<ffffffffc14d2531>] mdd_declare_finish_unlink+0x91/0x1f0 [mdd]
[ 1060.401484]  [<ffffffffc14e312b>] mdd_unlink+0x52b/0xbf0 [mdd]
[ 1060.401498]  [<ffffffffc13ca898>] mdt_reint_unlink+0xc28/0x11d0 [mdt]
[ 1060.401522]  [<ffffffffc13cdca3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 1060.401541]  [<ffffffffc13af18b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 1060.401555]  [<ffffffffc13bafa7>] mdt_reint+0x67/0x140 [mdt]
[ 1060.401570]  [<ffffffffc0fbd7ca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 1060.401654]  [<ffffffffc0f6605b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 1060.401707]  [<ffffffffc0f697a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 1060.401773]  
[ 1060.401778] [<ffffffff94ec1c31>] kthread+0xd1/0xe0
[ 1060.401779]  
[ 1060.401795] [<ffffffff95574c37>] ret_from_fork_nospec_end+0x0/0x39
[ 1060.401796]  
[ 1060.401834] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1060.401947] LustreError: dumping log to /tmp/lustre-log.1586896595.11759
[ 1065.165171] Lustre: echo-MDT0000: Connection restored to MGC10.23.22.104@tcp_0 (at 0@lo)
[ 1065.165180] Lustre: Skipped 35 previous similar messages
[ 1201.484927] INFO: task mdt00_006:11759 blocked for more than 120 seconds.
[ 1201.486843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1201.488696] mdt00_006       D ffffa15403355140     0 11759      2 0x00000000
[ 1201.488703] Call Trace:
[ 1201.488733]  [<ffffffffc0a3b510>] ? libcfs_debug_dumplog+0xc0/0x180 [libcfs]
[ 1201.488748]  [<ffffffff95567c49>] schedule+0x29/0x70
[ 1201.488761]  [<ffffffffc0a338ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
[ 1201.488781]  [<ffffffffc153b159>] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 1201.488797]  [<ffffffffc152a299>] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 1201.488826]  [<ffffffffc1487c6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 1201.488841]  [<ffffffffc146c420>] ? lod_xattr_list+0x150/0x150 [lod]
[ 1201.488856]  [<ffffffffc146c4a5>] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 1201.488873]  [<ffffffffc1477ac6>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 1201.488888]  [<ffffffffc147884b>] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 1201.488911]  [<ffffffffc14e82a1>] ? mdd_env_info+0x21/0x60 [mdd]
[ 1201.488923]  [<ffffffffc14d2531>] mdd_declare_finish_unlink+0x91/0x1f0 [mdd]
[ 1201.488938]  [<ffffffffc14e312b>] mdd_unlink+0x52b/0xbf0 [mdd]
[ 1201.488965]  [<ffffffffc13ca898>] mdt_reint_unlink+0xc28/0x11d0 [mdt]
[ 1201.488988]  [<ffffffffc13cdca3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 1201.489006]  [<ffffffffc13af18b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 1201.489024]  [<ffffffffc13bafa7>] mdt_reint+0x67/0x140 [mdt]
[ 1201.489118]  [<ffffffffc0fbd7ca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 1201.489196]  [<ffffffffc0f6605b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 1201.489273]  [<ffffffffc0f62688>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 1201.489281]  [<ffffffff94ed67c2>] ? default_wake_function+0x12/0x20
[ 1201.489289]  [<ffffffff94ecba9b>] ? __wake_up_common+0x5b/0x90
[ 1201.489364]  [<ffffffffc0f697a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 1201.489441]  [<ffffffffc0f68d10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 1201.489449]  [<ffffffff94ec1c31>] kthread+0xd1/0xe0
[ 1201.489455]  [<ffffffff94ec1b60>] ? insert_kthread_work+0x40/0x40
[ 1201.489461]  [<ffffffff95574c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 1201.489467]  [<ffffffff94ec1b60>] ? insert_kthread_work+0x40/0x40

Attaching the lustre-log file.

Comment by Andreas Dilger [ 14/Apr/20 ]

I wouldn't suggest upgrading to 2.12 before this situation is resolved. I am not aware of any particular fix that would address this issue.

It's not clear whether the "delete 5GB of space" will help, or if there is some other data structure that is having problems. AFAICS, the 0.78% free threshold for reserved space is only affecting the df output, but there is already 23GB of free space on the MDT, and at least root should be able to delete files, if the MDS didn't keep crashing. Getting the console log (via dmesg if the MDS doesn't actually reboot) would be helpful. You might need to increase the debug level for the console messages if it isn't showing anything, like "echo 7 > /proc/sys/kernel/printk" to print more information to the console.

One other question - what version of e2fsck are you using? You should use the latest release (1.45.2-wc1) from https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/RPMS/x86_64/ which may help with repairing or removing the broken directories.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

We are using:

e2fsprogs-libs-1.42.13.wc6-7.el7.x86_64
e2fsprogs-1.42.13.wc6-7.el7.x86_64

I can upgrade these two now. Do you recommend starting another e2fsck at this point, or waiting to see what the log/stack trace show first?

Comment by Andreas Dilger [ 14/Apr/20 ]

Is it possible that you have a tunable parameter in your config log that sets debug=0? There are lots of debug logs until a certain point in the mount, then they stop entirely until the osp_sync_declare_add() ASSERTION( ctxt ) failed message is printed:

:
:
00000100:02000000:5.0:1586896576.159231:0:11420:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) echo-MDT0000: Connection restored to 10.23.22.105@tcp (at 10.23.22.105@tcp)
00000100:02000000:1.0:1586896576.159236:0:11674:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) echo-MDT0000: Connection restored to 10.23.22.105@tcp (at 10.23.22.105@tcp)
00000004:00040000:20.0:1586896595.964383:0:11759:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed: 
00000004:00040000:20.0:1586896595.966157:0:11759:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

so there is about a 20s gap in the logs. It seems that after mounting the MDS you may need to run "lctl set_param debug=-1" again and then re-capture the logs to get the background of what the client was doing before the LBUG, but I'm still looking through what is in the current log to see what I can find.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

I've been running that debug=-1 line each time. Do you know where I'd look for the tunable you mention?

Edit: Nevermind - just re-read your comment and see I need to set it again after the mount.

Comment by Andreas Dilger [ 14/Apr/20 ]

I would recommend updating e2fsck, and then running a read-only "e2fsck -fn" (under script so the output is logged) to see if it behaves better.
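
For example (a hypothetical invocation; adjust the device and log path to suit):

mds# script -c "e2fsck -fn /dev/md0" /root/e2fsck-fn.log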

The stack trace at least shows us what the client is doing that is causing problems. Clearly unlinking, which isn't a surprise, but details are helpful.

Is it possible that you changed the default layout at the start of March to use very wide striping, or PFL with a large number of components, which would cause all the files to allocate large xattr blocks? Alternatively, did you add a lot of ACLs, or applications using other xattrs (SELinux?) that might have done the same?

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

None of those were done as far as I know. It's just been sat there for months being a target for rsync to back up the same set of servers.

I'll try running the fsck now in a logged screen. Guessing this will take several hours so if there's anything else you'd like me to do first, let me know.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

so there is about a 20s gap in the logs

Isn't this just the time I'm spending navigating to the directory where I'm going to do the rm? I think I mentioned before that I'm using the MDS as the client.

Comment by Andreas Dilger [ 14/Apr/20 ]

Hmm, is your backup something like "BackupPC" which is creating large numbers of hard links to the same files? That would also cause Lustre to allocate an external xattr block for each file after a certain number of hard links to each file.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Yes - we use dirvish which does create hardlinks for incrementals. We have been using it for years though without change.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Given the nature of what's on here and how long the fsck is going to take to run, I'm tempted to run it with -y unless you recommend otherwise now that those 23GB are free.

Comment by Andreas Dilger [ 14/Apr/20 ]

Another engineer working on this issue has informed me that this problem may already be resolved via patch https://review.whamcloud.com/38224 "LU-12674 osp: handle -EINPROGRESS on llog objects", which was backported to b2_10.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Would we need to upgrade/downgrade to any particular version?

Comment by Andreas Dilger [ 14/Apr/20 ]

There is a build at https://build.whamcloud.com/job/lustre-reviews/73314/ for RHEL7.6 x86_64 that should be installed on the MDS and preferably OSS nodes also. It is based on the 2.10.8 release, or you can build your own if that is normally what you do.

Comment by Andreas Dilger [ 14/Apr/20 ]

Note that you should be able to install the Lustre server kernel on pretty much any RHEL7 distro, unless you have other kernel modules (e.g. IB) that also have dependencies. In that case, you might be better off to rebuild the Lustre modules for your specific kernel.

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

I'll install that on the OSSs now while I wait on the fsck.

fsck seems to be running faster than expected. 22% through on pass 1.

Comment by Andreas Dilger [ 14/Apr/20 ]

The previously-mentioned LU-12674 patch should fix the "ASSERTION ( ctxt )" crashes on the MDS, so it might be possible to delete some files without crashing.

However, there are still directories that have problems due to running out of space that will need e2fsck to fix. I suspect if you try to delete them via Lustre or ldiskfs they will cause the MDT to be remounted read-only (-30 error).

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Ok. I'm gonna leave the e2fsck to complete (hopefully) and in the meantime I'll install the updated lustre packages. Once it's done, reboots all 'round and I'll let you know how it goes.

Thanks for all your help so far.

Comment by Andreas Dilger [ 14/Apr/20 ]

The other problem to understand/address is what is consuming the space on the MDT. You could check this if you know some files that were backed up recently, and run "debugfs -c -R 'stat /ROOT/path/to/backup/file' /dev/md0" to dump the inode information. The "catastrophic mode" message is not cause for alarm, just the "-c" flag skipping loading of unnecessary filesystem metadata to run faster. This should dump something like:

debugfs 1.45.2.wc1 (27-May-2019)
/tmp/testfs-mdt1: catastrophic mode - not reading inode or group bitmaps
Inode: 171   Type: regular    Mode:  0644   Flags: 0x0
Generation: 225492012    Version: 0x00000005:000099b6
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 3525837
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e8d87f3:00000000 -- Wed Apr  8 02:14:43 2020
 atime: 0x5e8d87f3:00000000 -- Wed Apr  8 02:14:43 2020
 mtime: 0x5e8d87f3:00000000 -- Wed Apr  8 02:14:43 2020
crtime: 0x5e8d87f3:45e6e870 -- Wed Apr  8 02:14:43 2020
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 42 23 00 00 02 00 00 00 bd 03 00 00 00 00 00 00 
  lma: fid=[0x200002342:0x3bd:0x0] compat=0 incompat=0
  trusted.lov (56)
  security.selinux (37) = "unconfined_u:object_r:unlabeled_t:s0\000"
  trusted.link (968)
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
BLOCKS:

and of interest is the "File ACL:" line being non-zero (indicating an external xattr block) and the "trusted.link (968)" xattr line showing that it is large (at least larger than can fit into the remaining ~450 bytes of xattr space available in the inode). If that is the case, you might be able to reduce this xattr space usage by removing one of your older backup copies, or some old directory trees that have lots of files with these large xattrs.

Comment by Andreas Dilger [ 14/Apr/20 ]

You could also use "debugfs -c -R 'stat <inode_number> [<inode_number> ...]' /dev/md0" to dump the attributes of a few problematic inodes previously reported by e2fsck, if they haven't already been fixed. That might give us an idea of what went wrong with them.
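
For example, using one of the inode numbers from the e2fsck output above (a hypothetical invocation; note that debugfs wants literal angle brackets around a numeric inode):

debugfs -c -R 'stat <988692386>' /dev/md0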

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

I'll do that as soon as the fsck completes and before I mount as lustre to try and delete stuff. It's certainly likely that we'll exceed the 450 bytes on almost all the files, because we maintain (usually) 2 or 3 weeks of daily hard-linked backups for each server, so up to roughly 21 hard links per file.

Again, though, this hasn't changed recently; it's been this way for some time, and I'm not sure what would have caused such a steep linear increase in the number of files / space used.

Comment by Andreas Dilger [ 14/Apr/20 ]

When you have a chance (not urgent) it would be useful to check whether the trend for "df -i" on the MDT (if you track this history) is increasing similarly to what is shown in df.png. That would at least tell us whether the extra usage is new files (if both used space and inode counts are growing at about the same rate) or extra space usage per file or elsewhere (if space usage is increasing while inode usage is roughly constant).

Comment by Campbell Mcleay (Inactive) [ 14/Apr/20 ]

Very annoyingly we noticed that we didn't have inode stats turned on in collectd. We've corrected this but only as of today unfortunately.

Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ]

Unfortunately:

emds1 /root # e2fsck -fvy -C 0 /dev/md0
e2fsck 1.45.2.wc1 (27-May-2019)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure                                           
Internal error: couldn't find dir_info for 2391487120.
e2fsck: aborted
Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ]

A couple of debugfs examples:

debugfs -c -R 'stat /ROOT/ARCHIVE/dirvish/filers/vannfs31/20200328/tree/user_data/CRB_RESTORES/IO-82556/CRB/ldev_2d_elements/SCAN/S_ldev_2d_elements_blood_element_squib_large_flat_a_002_s01/2156x1806/s_ldev_2d_elements_blood_element_squib_large_flat_a_002_s01.1117.exr' /dev/md0
debugfs 1.45.2.wc1 (27-May-2019)
/dev/md0: catastrophic mode - not reading inode or group bitmaps
Inode: 1734097604   Type: regular    Mode:  0444   Flags: 0x0
Generation: 294602363    Version: 0x00000025:b9345598
User:  4014   Group:    20   Project:     0   Size: 0
File ACL: 1084109575
Links: 22   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e901af0:00000000 -- Fri Apr 10 00:06:24 2020
 atime: 0x5e6b3965:00000000 -- Fri Mar 13 00:42:29 2020
 mtime: 0x5a21f22f:00000000 -- Fri Dec  1 16:22:07 2017
crtime: 0x5e6b3965:c27fd8ec -- Fri Mar 13 00:42:29 2020
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 68 82 00 00 02 00 00 00 83 ac 01 00 00 00 00 00 
  lma: fid=[0x200008268:0x1ac83:0x0] compat=0 incompat=0
  trusted.lov (56)
  trusted.link (1916)
BLOCKS:
debugfs -c -R 'stat /ROOT/ARCHIVE/dirvish/filers/gungnir-vol/20200404/tree/vol/builds/usd/0.7.0/e2f93f71e4/lib/python/pxr/Pcp/__init__.py' /dev/md0
debugfs 1.45.2.wc1 (27-May-2019)
/dev/md0: catastrophic mode - not reading inode or group bitmaps
Inode: 820727609   Type: regular    Mode:  0644   Flags: 0x0
Generation: 3289393243    Version: 0x00000025:9a66a677
User:   518   Group:    20   Project:     0   Size: 0
File ACL: 0
Links: 6   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e95e738:6ca5e29c -- Tue Apr 14 09:39:20 2020
 atime: 0x5d63cb50:00000000 -- Mon Aug 26 05:06:40 2019
 mtime: 0x579ffa2d:00000000 -- Mon Aug  1 18:41:01 2016
crtime: 0x5d63cb50:820f5640 -- Mon Aug 26 05:06:40 2019
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 f5 16 00 00 02 00 00 00 15 67 01 00 00 00 00 00 
  lma: fid=[0x2000016f5:0x16715:0x0] compat=0 incompat=0
  trusted.lov (56)
  trusted.link (285)
BLOCKS:
Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ]

I am however able now to delete files without a panic.
I'm going to try and clear space and see if we can get a backup through overnight and then check the logs tomorrow. Looks like another fsck is going to be required...

Comment by Andreas Dilger [ 15/Apr/20 ]

The dir_info e2fsck error appears to be the same as LU-13197, which has a patch to fix it. There is a RHEL7 build of e2fsprogs that is known to fix this specific issue:

https://build.whamcloud.com/job/e2fsprogs-reviews/arch=x86_64,distro=el7/862/artifact/_topdir/RPMS/x86_64/

This e2fsck bug was hit at another site that has a very large number of directories (over 180M), which is unusual for most cases, but in the case of your hard-linked backup trees there are lots of directories with relatively few files each. The updated e2fsck was confirmed to fix the problem on their filesystem.

Comment by Andreas Dilger [ 15/Apr/20 ]

As for the debugfs stat output, it definitely shows that the "link" xattr is large in at least some cases, and would consume an extra block for each such inode. Also, based on the previous e2fsck issue, it seems that there are a very large number of directories compared to regular files, and each directory will also consume at least one block. Based on LU-13197, the filesystem must have at least 180M directories for only 850M inodes, so only about 5 files per directory (although this doesn't take into account the number of hard links).
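
Back-of-the-envelope, from the first debugfs example above (assuming the link entries are roughly equal in size):

  trusted.link: 1916 bytes / 22 links ≈ 87 bytes per link entry
  ~450 bytes of free in-inode xattr space / ~87 bytes ≈ 5 hard links before the
      link xattr spills into an external 4 KiB block for that file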

Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ]

Our directory trees are ridiculously deep and over-structured for the data they hold, so this doesn't surprise me. I'm still not sure what changed at the end of February though, so we're going to have to watch this carefully.

Checking this morning, backups are still running and seem to be somewhat stable so I'll let a good backup complete then try and take it offline to run a new e2fsck.

After we've had a successful e2fsck, I'd like to upgrade to 2.12.4 but would it be sensible to run an lfsck prior to doing that, or after to get all the updates/bug-fixes?

Comment by Andreas Dilger [ 15/Apr/20 ]

If you run e2fsck on the MDT to repair any problems in the local MDT filesystem, then running LFSCK is not strictly required, as it is mostly doing garbage collection and handling cases where there is some inconsistency between the MDT and OSTs.  Generally, LFSCK has been getting better with newer releases of Lustre, so it is probably better to wait until after the upgrade if you want to run it, and unless there are visible problems with the filesystem you may want to wait until there is a good time to run it (e.g. planned system outage).
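
If you do run it after the upgrade, a minimal sketch (assuming the target name echo-MDT0000 from the logs above):

mds# lctl lfsck_start -M echo-MDT0000 -t all
mds# lctl get_param mdd.echo-MDT0000.lfsck_namespace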

Comment by Campbell Mcleay (Inactive) [ 15/Apr/20 ]

Will do. The priority of this ticket can be dropped if you like since the filesystem is now up and running. I'll continue to report back on the e2fsck progress.

Thank you for your help getting things working so quickly.

Comment by Peter Jones [ 18/Apr/20 ]

Any update, cmcl?

Comment by Andreas Dilger [ 19/Apr/20 ]

I ran some tests on a local filesystem with master, filling up the MDT with DOM files and directories, and while there were some -28 = -ENOSPC errors printed on the console, I didn't have any problems with deleting the files afterward.

Comment by Campbell Mcleay (Inactive) [ 20/Apr/20 ]

After deleting a huge number of files and getting a couple of good backups through, I managed to run an fsck over the weekend (results attached). With that done, everything remounted and seemed to function fine.

I've since upgraded to 2.12.4 and we seem to be back in business.

Thanks again for all your help.

Comment by Peter Jones [ 20/Apr/20 ]

Excellent - thanks for the update!
