Details
-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
Lustre 2.12.3, Lustre 2.12.4
-
None
-
CentOS 7.6
-
3
-
9223372036854775807
Description
Hi,
We hit a LBUG on 2.12.3 last night:
[8325133.774837] LustreError: 59728:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14075 already set in log bitmap [8325133.786320] LustreError: 59728:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG [8325133.793471] Pid: 59728, comm: mdt_rdpg02_021 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Also, after a restart:
Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14096 already set in log bitmap Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG Mar 18 08:35:18 fir-md1-s2 kernel: Pid: 34211, comm: mdt_rdpg01_053 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019 Message from syslogd@fir-md1-s2 at Mar 18 08:35:18 ... kernel:LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG Mar 18 08:35:18 fir-md1-s2 kernel: Call Trace: Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0ccd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0ccd87c>] lbug_with_loc+0x4c/0xa0 [libcfs] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0df72ba>] llog_osd_write_rec+0x16ca/0x1730 [obdclass] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc18fa06f>] mdd_changelog_write_rec+0x2f/0x120 [mdd] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0de81ab>] llog_write_rec+0xcb/0x520 [obdclass] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0dece0c>] llog_cat_new_log+0x62c/0xce0 [obdclass] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0ded6bb>] llog_cat_add_rec+0x1fb/0x880 [obdclass] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0de5180>] llog_add+0x80/0x1a0 [obdclass] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc18fa2dd>] mdd_changelog_store+0x17d/0x520 [mdd] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc1907374>] mdd_changelog_data_store_by_fid+0x1d4/0x350 [mdd] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc19096c2>] mdd_changelog_data_store+0x142/0x200 [mdd] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc190ab38>] mdd_close+0xae8/0xf30 [mdd] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc179597e>] mdt_mfd_close+0x3fe/0x860 [mdt] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc179b291>] mdt_close_internal+0x121/0x220 [mdt] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc179b5b0>] mdt_close+0x220/0x780 [mdt] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc117636a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc111d24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc1120bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffff890c2e81>] kthread+0xd1/0xe0 Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffff89777c24>] ret_from_fork_nospec_begin+0xe/0x21
We stopped the changelog reader (Robinhood) for this MDT and then we were able to start it, but then we hit the following issues:
[Wed Mar 18 10:28:44 2020][ 520.331132] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) fir-MDD0001: there are no more free slots in catalog changelog_catalog [Wed Mar 18 10:28:44 2020][ 520.343129] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) Skipped 159923 previous similar messages [Wed Mar 18 10:28:44 2020][ 520.353111] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) fir-MDD0001: next log does not exist! [Wed Mar 18 10:28:44 2020][ 520.363117] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) Skipped 137757 previous similar messages [Wed Mar 18 10:28:48 2020][ 524.957681] LustreError: 21095:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 1, name = 'alignment.eigen.indiv', t = [0x240049459:0x9f67:0x0], p = [0x2400478b1:0x1e9d6:0x0]: rc = -28 [Wed Mar 18 10:28:54 2020][ 530.656413] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 6, name = 'alignment.eigen.indiv', t = [0x240049419:0xea19:0x0], p = [0x24004ac39:0x22e6:0x0]: rc = -5 [Wed Mar 18 10:28:54 2020][ 530.676655] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) Skipped 2 previous similar messages
Full console log attached as fir-md1-s2_console-changelogs_no_more_free_slots_2_12_4.log
Then, we upgraded the MDS from 2.12.3 to 2.13.4 and the issue was still there.
Also users reported the following errors also when creating new files:
[zhengyuh@sh02-ln01 login /scratch/users/zhengyuh/$ touch aa touch: cannot touch ‘aa’: Bad address
Because this was kind of an emergency, I applied the following procedure to recreate the changelog files:
[root@fir-md1-s2 1]# llog_reader changelog_catalog >/tmp/mdt1_changelog_catalog [root@fir-md1-s2 1]# llog_reader changelog_users Bit 1 of 1 not set rec #2 type=1064553b len=64 offset 8256 Header size : 8192 Time : Thu Jan 24 14:04:30 2019 Number of records: 1 Target uuid : ----------------------- #02 (064)id=[0x410fe:0x1:0x0]:0 path=O/1/d30/266494 [root@fir-md1-s2 1]# llog_reader changelog_users >/tmp/mdt1_changelog_users [root@fir-md1-s2 1]# mv changelog_catalog chagnelog_catalog.bak [root@fir-md1-s2 1]# mv changelog_users changelog_users.bak [root@fir-md1-s2 1]# llog_reader O/1/d29/680157 >/tmp/mdt1_680157 [root@fir-md1-s2 1]# ls O/1/d5/5 O/1/d5/5 [root@fir-md1-s2 1]# mv O/1/d5/5 O/1/d5/5.bak [root@fir-md1-s2 1]# mv O/1/d29/680157 O/1/d29/680157.bak
I'm attaching the corresponding files for investigation.
This MDT is far from being full:
[root@fir-md1-s2 ~]# df -h -t lustre Filesystem Size Used Avail Use% Mounted on /dev/mapper/md1-rbod1-mdt1 18T 826G 16T 5% /mnt/fir/mdt/1 [root@fir-md1-s2 ~]# df -i -t lustre Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/md1-rbod1-mdt1 288005760 111873725 176132035 39% /mnt/fir/mdt/1
Do you have an idea of what happened? How can we avoid this issue in the future? Thanks!