Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-19102

ENOSPC 'no more free slots in catalog' entries in kern

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Upstream
    • 3
    • 9223372036854775807

    Description

      The ENOSPC errors correspond to 'no more free slots in catalog' entries in kern log on kjcf05:

      May 15 04:29:25 kjcf05n02 kernel: Lustre: 4160426:0:(llog_cat.c:99:llog_cat_new_log()) kjcf05-MDT0000-osd: there are no more free slots in catalog [0x20044e180:0x1:0x0]
      May 15 04:29:25 kjcf05n02 kernel: LustreError: 4160426:0:(llog_cat.c:604:llog_cat_add_rec()) kjcf05-MDT0000-osd: initialization error: rc = -28
      May 15 04:29:25 kjcf05n02 kernel: LustreError: 4160426:0:(update_trans.c:985:top_trans_stop()) kjcf05-MDT0000-osd: write updates failed: rc = -28
      May 15 04:29:25 kjcf05n02 kernel: LustreError: 4160426:0:(mdt_reint.c:2458:mdt_reint_migrate()) kjcf05-MDT0000: migrate [0x2404de2c2:0x8672:0x0]/file.mdtest.0.85682421 failed: rc = -28
      
      May 15 05:18:57 kjcf05n02 kernel: Lustre: 4160957:0:(llog_cat.c:99:llog_cat_new_log()) kjcf05-MDT0000-osd: there are no more free slots in catalog [0x20044e180:0x1:0x0]
      May 15 05:18:57 kjcf05n02 kernel: LustreError: 4160957:0:(llog_cat.c:604:llog_cat_add_rec()) kjcf05-MDT0000-osd: initialization error: rc = -28
      May 15 05:18:57 kjcf05n02 kernel: LustreError: 4160957:0:(update_trans.c:985:top_trans_stop()) kjcf05-MDT0000-osd: write updates failed: rc = -28
      May 15 05:18:57 kjcf05n02 kernel: LustreError: 4160957:0:(mdt_reint.c:2458:mdt_reint_migrate()) kjcf05-MDT0000: migrate [0x2404de2c2:0x8672:0x0]/file.mdtest.0.26347188 failed: rc = -28
      
      May 15 06:07:49 kjcf05n02 kernel: Lustre: 4160957:0:(llog_cat.c:99:llog_cat_new_log()) kjcf05-MDT0000-osd: there are no more free slots in catalog [0x20044e180:0x1:0x0]
      May 15 06:07:49 kjcf05n02 kernel: LustreError: 4160957:0:(llog_cat.c:604:llog_cat_add_rec()) kjcf05-MDT0000-osd: initialization error: rc = -28
      May 15 06:07:49 kjcf05n02 kernel: LustreError: 4160957:0:(update_trans.c:985:top_trans_stop()) kjcf05-MDT0000-osd: write updates failed: rc = -28
      May 15 06:07:49 kjcf05n02 kernel: LustreError: 4160957:0:(mdt_reint.c:2458:mdt_reint_migrate()) kjcf05-MDT0000: migrate [0x2404de2c2:0x8672:0x0]/file.mdtest.0.67717455 failed: rc = -28
      

      The catalog fid referenced in those messages is an entry in update_log_dir:

      [root@kjcf05n02 ~]# debugfs -c -R "ls -l update_log_dir" /dev/md65 | grep 0x20044e180:0x1:0x0
      debugfs 1.46.5.cr1 (16-Jan-2023)
          599  100644 (17)      0      0   200896 31-Dec-1969 18:00 [0x20044e180:0x1:0x0]
      [root@kjcf05n02 ~]# debugfs -c -R "stat update_log_dir/[0x20044e180:0x1:0x0]" /dev/md65
      debugfs 1.46.5.cr1 (16-Jan-2023)
      Inode: 599   Type: regular    Mode:  0644   Flags: 0x84000
      Generation: 2402083947    Version: 0x00000000:00000000
      User:     0   Group:     0   Project:     0   Size: 200896
      File ACL: 0
      Links: 1   Blockcount: 408
      Fragment:  Address: 0    Number: 0    Size: 0
       ctime: 0x00000000:fffffff8 -- Wed Dec 31 18:00:00 1969
       atime: 0x00000000:fffffff8 -- Wed Dec 31 18:00:00 1969
       mtime: 0x00000000:fffffff8 -- Wed Dec 31 18:00:00 1969
      crtime: 0x6741e7df:c0263814 -- Sat Nov 23 08:34:07 2024
      Size of extra inode fields: 32
      Extended attributes:
        lma: fid=[0x20044e180:0x1:0x0] compat=0 incompat=0
      EXTENTS:
      (ETB0):98135, (0-8):6535168-6535176, (9):394376486, (10):2957103091, (11):2957934592, (12):3237838848, (13):3237871616, (14):2821917483, (15):2868494042, (16):2868641792, (17):4109631703, (18):4109673074, (19):4109677162, (20):4109675070, 
      (21):4109665932, (22):4109675601, (23):4109676331, (24):4109678320, (25):4109676825, (26):4109673492, (27):4109672666, (28):4109681925, (29):4109680396, (30):4109679087, (31):4109682347, (32):4109674739, (33):663508, (34):668610, (35):6636
      31, (36):666147, (37):677800, (38):675893, (39):871055, (40):1011408, (41):1037513, (42):1043165, (43):1042269, (44):1168484, (45):1160481, (46):1159442, (47):1171613, (48):1161554, (49):1178359
      

      In case it might be interesting, I dumped that update_log_dir entry yesterday, ran llog_reader on it:

      [root@kjcf05n02 ~]# llog_reader /tmp/psg/uld.fid_0x20044e180\:0x1\:0x0 | more
      rec #2607 type=1064553b len=64 offset 199552
      Header size : 32768      llh_size : 64
      Time : Mon Nov 25 01:07:46 2024
      Number of records: 1    cat_idx: 2606   last_idx: 2607
      Flags: 0x12 
      Target uuid : 
      -----------------------
      #2607 (064)fid=[0x2004f5d18:0x23ca:0x0] path=update_log_dir/[0x2004f5d18:0x23ca:0x0]
      

      There is a race between llog_cat_new_log() and llog_cat_cleanup() for almost empty catalog.

      llog_cat_clenaup()->llog_cat_set_first_idx()

      changes

      llh->llh_cat_idx

      to the next one (lgh_last_idx). And after it, cancel records decrease llh_count at llog_cancel_rec(). If llog_cat_new_log() happens between llog_cat_set_first_idx() and llog_cancel_rec(), it relies on wrong llh_count and makes a conclusion that catalog is full.

      Attachments

        Activity

          [LU-19102] ENOSPC 'no more free slots in catalog' entries in kern
          jstroik Jesse Stroik added a comment -

          I think I'm running into this issue. I first noticed it after a power outage on an older system still running lustre 2.15.1 and I see this when attempting to mount the MDS.

          to try to fix the issue I rewrote the configuration for the MDT and all OSTs and attempted to bring the file system back up, and end up with a very similar error and the same ultimate result: the MDT attempts to mount and then fails to mount.

          Relevant logs below.

          [  362.960255] Lustre: 4482:0:(llog_cat.c:101:llog_cat_new_log()) arc15-OST0001-osc-MDT0000: there are no more free slots in catalog [0x5:0x1:0x0]:0
          [  362.974834] LustreError: 4482:0:(osp_sync.c:1553:osp_sync_init()) arc15-OST0001-osc-MDT0000: can't initialize llog: rc = -28
          [  362.987480] LustreError: 4482:0:(obd_config.c:774:class_setup()) setup arc15-OST0001-osc-MDT0000 failed (-28)
          [  362.998565] LustreError: 4482:0:(obd_config.c:2001:class_config_llog_handler()) MGC172.16.23.25@o2ib: cfg command failed: rc = -28
          [  363.011677] Lustre:    cmd=cf003 0:arc15-OST0001-osc-MDT0000  1:arc15-OST0001_UUID  2:172.16.23.18@o2ib  [  363.024096] LustreError: 15c-8: MGC172.16.23.25@o2ib: Confguration from log arc15-MDT0000 failed from MGS -28. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
          [  363.045259] LustreError: 4381:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server arc15-MDT0000: -28
          [  363.057622] LustreError: 4381:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -28
          [  363.068427] Lustre: Failing over arc15-MDT0000

          jstroik Jesse Stroik added a comment - I think I'm running into this issue. I first noticed it after a power outage on an older system still running lustre 2.15.1 and I see this when attempting to mount the MDS. to try to fix the issue I rewrote the configuration for the MDT and all OSTs and attempted to bring the file system back up, and end up with a very similar error and the same ultimate result: the MDT attempts to mount and then fails to mount. Relevant logs below. [  362.960255] Lustre: 4482:0:(llog_cat.c:101:llog_cat_new_log()) arc15-OST0001-osc-MDT0000: there are no more free slots in catalog [0x5:0x1:0x0] :0 [  362.974834] LustreError: 4482:0:(osp_sync.c:1553:osp_sync_init()) arc15-OST0001-osc-MDT0000: can't initialize llog: rc = -28 [  362.987480] LustreError: 4482:0:(obd_config.c:774:class_setup()) setup arc15-OST0001-osc-MDT0000 failed (-28) [  362.998565] LustreError: 4482:0:(obd_config.c:2001:class_config_llog_handler()) MGC172.16.23.25@o2ib: cfg command failed: rc = -28 [  363.011677] Lustre:    cmd=cf003 0:arc15-OST0001-osc-MDT0000  1:arc15-OST0001_UUID  2:172.16.23.18@o2ib   [  363.024096] LustreError: 15c-8: MGC172.16.23.25@o2ib: Confguration from log arc15-MDT0000 failed from MGS -28. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info [  363.045259] LustreError: 4381:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server arc15-MDT0000: -28 [  363.057622] LustreError: 4381:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -28 [  363.068427] Lustre: Failing over arc15-MDT0000

          "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59730
          Subject: LU-19102 llog: enospc catalog fix
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 5ed53f236bf86f65644ba9cea1587ba5894e7859

          gerrit Gerrit Updater added a comment - "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59730 Subject: LU-19102 llog: enospc catalog fix Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 5ed53f236bf86f65644ba9cea1587ba5894e7859

          People

            aboyko Alexander Boyko
            aboyko Alexander Boyko
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: