Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Upstream
-
3
-
9223372036854775807
Description
The ENOSPC errors correspond to 'no more free slots in catalog' entries in kern log on kjcf05:
May 15 04:29:25 kjcf05n02 kernel: Lustre: 4160426:0:(llog_cat.c:99:llog_cat_new_log()) kjcf05-MDT0000-osd: there are no more free slots in catalog [0x20044e180:0x1:0x0] May 15 04:29:25 kjcf05n02 kernel: LustreError: 4160426:0:(llog_cat.c:604:llog_cat_add_rec()) kjcf05-MDT0000-osd: initialization error: rc = -28 May 15 04:29:25 kjcf05n02 kernel: LustreError: 4160426:0:(update_trans.c:985:top_trans_stop()) kjcf05-MDT0000-osd: write updates failed: rc = -28 May 15 04:29:25 kjcf05n02 kernel: LustreError: 4160426:0:(mdt_reint.c:2458:mdt_reint_migrate()) kjcf05-MDT0000: migrate [0x2404de2c2:0x8672:0x0]/file.mdtest.0.85682421 failed: rc = -28 May 15 05:18:57 kjcf05n02 kernel: Lustre: 4160957:0:(llog_cat.c:99:llog_cat_new_log()) kjcf05-MDT0000-osd: there are no more free slots in catalog [0x20044e180:0x1:0x0] May 15 05:18:57 kjcf05n02 kernel: LustreError: 4160957:0:(llog_cat.c:604:llog_cat_add_rec()) kjcf05-MDT0000-osd: initialization error: rc = -28 May 15 05:18:57 kjcf05n02 kernel: LustreError: 4160957:0:(update_trans.c:985:top_trans_stop()) kjcf05-MDT0000-osd: write updates failed: rc = -28 May 15 05:18:57 kjcf05n02 kernel: LustreError: 4160957:0:(mdt_reint.c:2458:mdt_reint_migrate()) kjcf05-MDT0000: migrate [0x2404de2c2:0x8672:0x0]/file.mdtest.0.26347188 failed: rc = -28 May 15 06:07:49 kjcf05n02 kernel: Lustre: 4160957:0:(llog_cat.c:99:llog_cat_new_log()) kjcf05-MDT0000-osd: there are no more free slots in catalog [0x20044e180:0x1:0x0] May 15 06:07:49 kjcf05n02 kernel: LustreError: 4160957:0:(llog_cat.c:604:llog_cat_add_rec()) kjcf05-MDT0000-osd: initialization error: rc = -28 May 15 06:07:49 kjcf05n02 kernel: LustreError: 4160957:0:(update_trans.c:985:top_trans_stop()) kjcf05-MDT0000-osd: write updates failed: rc = -28 May 15 06:07:49 kjcf05n02 kernel: LustreError: 4160957:0:(mdt_reint.c:2458:mdt_reint_migrate()) kjcf05-MDT0000: migrate [0x2404de2c2:0x8672:0x0]/file.mdtest.0.67717455 failed: rc = -28
The catalog fid referenced in those messages is an entry in update_log_dir:
[root@kjcf05n02 ~]# debugfs -c -R "ls -l update_log_dir" /dev/md65 | grep 0x20044e180:0x1:0x0 debugfs 1.46.5.cr1 (16-Jan-2023) 599 100644 (17) 0 0 200896 31-Dec-1969 18:00 [0x20044e180:0x1:0x0] [root@kjcf05n02 ~]# debugfs -c -R "stat update_log_dir/[0x20044e180:0x1:0x0]" /dev/md65 debugfs 1.46.5.cr1 (16-Jan-2023) Inode: 599 Type: regular Mode: 0644 Flags: 0x84000 Generation: 2402083947 Version: 0x00000000:00000000 User: 0 Group: 0 Project: 0 Size: 200896 File ACL: 0 Links: 1 Blockcount: 408 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x00000000:fffffff8 -- Wed Dec 31 18:00:00 1969 atime: 0x00000000:fffffff8 -- Wed Dec 31 18:00:00 1969 mtime: 0x00000000:fffffff8 -- Wed Dec 31 18:00:00 1969 crtime: 0x6741e7df:c0263814 -- Sat Nov 23 08:34:07 2024 Size of extra inode fields: 32 Extended attributes: lma: fid=[0x20044e180:0x1:0x0] compat=0 incompat=0 EXTENTS: (ETB0):98135, (0-8):6535168-6535176, (9):394376486, (10):2957103091, (11):2957934592, (12):3237838848, (13):3237871616, (14):2821917483, (15):2868494042, (16):2868641792, (17):4109631703, (18):4109673074, (19):4109677162, (20):4109675070, (21):4109665932, (22):4109675601, (23):4109676331, (24):4109678320, (25):4109676825, (26):4109673492, (27):4109672666, (28):4109681925, (29):4109680396, (30):4109679087, (31):4109682347, (32):4109674739, (33):663508, (34):668610, (35):6636 31, (36):666147, (37):677800, (38):675893, (39):871055, (40):1011408, (41):1037513, (42):1043165, (43):1042269, (44):1168484, (45):1160481, (46):1159442, (47):1171613, (48):1161554, (49):1178359
In case it might be interesting, I dumped that update_log_dir entry yesterday, ran llog_reader on it:
[root@kjcf05n02 ~]# llog_reader /tmp/psg/uld.fid_0x20044e180\:0x1\:0x0 | more rec #2607 type=1064553b len=64 offset 199552 Header size : 32768 llh_size : 64 Time : Mon Nov 25 01:07:46 2024 Number of records: 1 cat_idx: 2606 last_idx: 2607 Flags: 0x12 Target uuid : ----------------------- #2607 (064)fid=[0x2004f5d18:0x23ca:0x0] path=update_log_dir/[0x2004f5d18:0x23ca:0x0]
There is a race between llog_cat_new_log() and llog_cat_cleanup() for almost empty catalog.
llog_cat_clenaup()->llog_cat_set_first_idx()
changes
llh->llh_cat_idx
to the next one (lgh_last_idx). And after it, cancel records decrease llh_count at llog_cancel_rec(). If llog_cat_new_log() happens between llog_cat_set_first_idx() and llog_cancel_rec(), it relies on wrong llh_count and makes a conclusion that catalog is full.
I think I'm running into this issue. I first noticed it after a power outage on an older system still running lustre 2.15.1 and I see this when attempting to mount the MDS.
to try to fix the issue I rewrote the configuration for the MDT and all OSTs and attempted to bring the file system back up, and end up with a very similar error and the same ultimate result: the MDT attempts to mount and then fails to mount.
Relevant logs below.