Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13372

fir-MDD0001: there are no more free slots in catalog changelog_catalog

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.12.3, Lustre 2.12.4
    • None
    • CentOS 7.6
    • 3
    • 9223372036854775807

    Description

      Hi,

      We hit a LBUG on 2.12.3 last night:

      [8325133.774837] LustreError: 59728:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14075 already set in log bitmap
      [8325133.786320] LustreError: 59728:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
      [8325133.793471] Pid: 59728, comm: mdt_rdpg02_021 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      

      Also, after a restart:

      Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14096 already set in log bitmap
      Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
      Mar 18 08:35:18 fir-md1-s2 kernel: Pid: 34211, comm: mdt_rdpg01_053 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      Message from syslogd@fir-md1-s2 at Mar 18 08:35:18 ...
       kernel:LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
      Mar 18 08:35:18 fir-md1-s2 kernel: Call Trace:
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0ccd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0ccd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0df72ba>] llog_osd_write_rec+0x16ca/0x1730 [obdclass]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc18fa06f>] mdd_changelog_write_rec+0x2f/0x120 [mdd]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0de81ab>] llog_write_rec+0xcb/0x520 [obdclass]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0dece0c>] llog_cat_new_log+0x62c/0xce0 [obdclass]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0ded6bb>] llog_cat_add_rec+0x1fb/0x880 [obdclass]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0de5180>] llog_add+0x80/0x1a0 [obdclass]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc18fa2dd>] mdd_changelog_store+0x17d/0x520 [mdd]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc1907374>] mdd_changelog_data_store_by_fid+0x1d4/0x350 [mdd]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc19096c2>] mdd_changelog_data_store+0x142/0x200 [mdd]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc190ab38>] mdd_close+0xae8/0xf30 [mdd]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc179597e>] mdt_mfd_close+0x3fe/0x860 [mdt]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc179b291>] mdt_close_internal+0x121/0x220 [mdt]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc179b5b0>] mdt_close+0x220/0x780 [mdt]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc117636a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc111d24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc1120bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffff890c2e81>] kthread+0xd1/0xe0
      Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffff89777c24>] ret_from_fork_nospec_begin+0xe/0x21
      

      We stopped the changelog reader (Robinhood) for this MDT and then we were able to start it, but then we hit the following issues:
       

      [Wed Mar 18 10:28:44 2020][ 520.331132] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) fir-MDD0001: there are no more free slots in catalog changelog_catalog
      [Wed Mar 18 10:28:44 2020][ 520.343129] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) Skipped 159923 previous similar messages
      [Wed Mar 18 10:28:44 2020][ 520.353111] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) fir-MDD0001: next log does not exist!
      [Wed Mar 18 10:28:44 2020][ 520.363117] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) Skipped 137757 previous similar messages
      [Wed Mar 18 10:28:48 2020][ 524.957681] LustreError: 21095:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 1, name = 'alignment.eigen.indiv', t = [0x240049459:0x9f67:0x0], p = [0x2400478b1:0x1e9d6:0x0]: rc = -28
      [Wed Mar 18 10:28:54 2020][ 530.656413] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 6, name = 'alignment.eigen.indiv', t = [0x240049419:0xea19:0x0], p = [0x24004ac39:0x22e6:0x0]: rc = -5
      [Wed Mar 18 10:28:54 2020][ 530.676655] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) Skipped 2 previous similar messages
      

      Full console log attached as fir-md1-s2_console-changelogs_no_more_free_slots_2_12_4.log

      Then, we upgraded the MDS from 2.12.3 to 2.13.4 and the issue was still there.

      Also users reported the following errors also when creating new files:

      [zhengyuh@sh02-ln01 login /scratch/users/zhengyuh/$ touch aa
      
      touch: cannot touch ‘aa’: Bad address
      

      Because this was kind of an emergency, I applied the following procedure to recreate the changelog files:

      [root@fir-md1-s2 1]# llog_reader changelog_catalog >/tmp/mdt1_changelog_catalog
      [root@fir-md1-s2 1]# llog_reader changelog_users
      Bit 1 of 1 not set
      rec #2 type=1064553b len=64 offset 8256
      Header size : 8192
      Time : Thu Jan 24 14:04:30 2019
      Number of records: 1
      Target uuid : 
      -----------------------
      #02 (064)id=[0x410fe:0x1:0x0]:0 path=O/1/d30/266494
      [root@fir-md1-s2 1]# llog_reader changelog_users >/tmp/mdt1_changelog_users
      [root@fir-md1-s2 1]# mv changelog_catalog chagnelog_catalog.bak
      [root@fir-md1-s2 1]# mv changelog_users changelog_users.bak
      [root@fir-md1-s2 1]# llog_reader O/1/d29/680157 >/tmp/mdt1_680157
      [root@fir-md1-s2 1]# ls O/1/d5/5
      O/1/d5/5
      [root@fir-md1-s2 1]# mv O/1/d5/5 O/1/d5/5.bak
      [root@fir-md1-s2 1]# mv O/1/d29/680157 O/1/d29/680157.bak
      

      I'm attaching the corresponding files for investigation.

      This MDT is far from being full:

      [root@fir-md1-s2 ~]# df -h -t lustre
      Filesystem                  Size  Used Avail Use% Mounted on
      /dev/mapper/md1-rbod1-mdt1   18T  826G   16T   5% /mnt/fir/mdt/1
      [root@fir-md1-s2 ~]# df -i -t lustre
      Filesystem                    Inodes     IUsed     IFree IUse% Mounted on
      /dev/mapper/md1-rbod1-mdt1 288005760 111873725 176132035   39% /mnt/fir/mdt/1
      

      Do you have an idea of what happened? How can we avoid this issue in the future? Thanks!

      Attachments

        Activity

          People

            tappro Mikhail Pershin
            sthiell Stephane Thiell
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated: