[LU-13372] fir-MDD0001: there are no more free slots in catalog changelog_catalog Created: 18/Mar/20 Updated: 25/Mar/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Mikhail Pershin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | CentOS 7.6 |
| Severity: | 3 |
| Description |
|
Hi,

We hit an LBUG on 2.12.3 last night:

[8325133.774837] LustreError: 59728:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14075 already set in log bitmap
[8325133.786320] LustreError: 59728:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
[8325133.793471] Pid: 59728, comm: mdt_rdpg02_021 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019

Also, after a restart:

Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14096 already set in log bitmap
Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
Mar 18 08:35:18 fir-md1-s2 kernel: Pid: 34211, comm: mdt_rdpg01_053 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019

Message from syslogd@fir-md1-s2 at Mar 18 08:35:18 ...
kernel:LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG

Mar 18 08:35:18 fir-md1-s2 kernel: Call Trace:
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0ccd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0ccd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0df72ba>] llog_osd_write_rec+0x16ca/0x1730 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc18fa06f>] mdd_changelog_write_rec+0x2f/0x120 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0de81ab>] llog_write_rec+0xcb/0x520 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0dece0c>] llog_cat_new_log+0x62c/0xce0 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0ded6bb>] llog_cat_add_rec+0x1fb/0x880 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc0de5180>] llog_add+0x80/0x1a0 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc18fa2dd>] mdd_changelog_store+0x17d/0x520 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc1907374>] mdd_changelog_data_store_by_fid+0x1d4/0x350 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc19096c2>] mdd_changelog_data_store+0x142/0x200 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc190ab38>] mdd_close+0xae8/0xf30 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc179597e>] mdt_mfd_close+0x3fe/0x860 [mdt]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc179b291>] mdt_close_internal+0x121/0x220 [mdt]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc179b5b0>] mdt_close+0x220/0x780 [mdt]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc117636a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc111d24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffffc1120bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffff890c2e81>] kthread+0xd1/0xe0
Mar 18 08:35:18 fir-md1-s2 kernel: [<ffffffff89777c24>] ret_from_fork_nospec_begin+0xe/0x21

We stopped the changelog reader (Robinhood) for this MDT and were then able to start the MDT, but we then hit the following issues:

[Wed Mar 18 10:28:44 2020][  520.331132] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) fir-MDD0001: there are no more free slots in catalog changelog_catalog
[Wed Mar 18 10:28:44 2020][  520.343129] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) Skipped 159923 previous similar messages
[Wed Mar 18 10:28:44 2020][  520.353111] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) fir-MDD0001: next log does not exist!
[Wed Mar 18 10:28:44 2020][  520.363117] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) Skipped 137757 previous similar messages
[Wed Mar 18 10:28:48 2020][  524.957681] LustreError: 21095:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 1, name = 'alignment.eigen.indiv', t = [0x240049459:0x9f67:0x0], p = [0x2400478b1:0x1e9d6:0x0]: rc = -28
[Wed Mar 18 10:28:54 2020][  530.656413] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 6, name = 'alignment.eigen.indiv', t = [0x240049419:0xea19:0x0], p = [0x24004ac39:0x22e6:0x0]: rc = -5
[Wed Mar 18 10:28:54 2020][  530.676655] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) Skipped 2 previous similar messages

The full console log is attached as fir-md1-s2_console-changelogs_no_more_free_slots_2_12_4.log.

Then we upgraded the MDS from 2.12.3 to 2.12.4 and the issue was still there. Users also reported the following errors when creating new files:

[zhengyuh@sh02-ln01 login /scratch/users/zhengyuh]$ touch aa
touch: cannot touch ‘aa’: Bad address

Because this was kind of an emergency, I applied the following procedure to recreate the changelog files:

[root@fir-md1-s2 1]# llog_reader changelog_catalog >/tmp/mdt1_changelog_catalog
[root@fir-md1-s2 1]# llog_reader changelog_users
Bit 1 of 1 not set
rec #2 type=1064553b len=64 offset 8256
Header size : 8192
Time : Thu Jan 24 14:04:30 2019
Number of records: 1
Target uuid :
-----------------------
#02 (064)id=[0x410fe:0x1:0x0]:0 path=O/1/d30/266494
[root@fir-md1-s2 1]# llog_reader changelog_users >/tmp/mdt1_changelog_users
[root@fir-md1-s2 1]# mv changelog_catalog chagnelog_catalog.bak
[root@fir-md1-s2 1]# mv changelog_users changelog_users.bak
[root@fir-md1-s2 1]# llog_reader O/1/d29/680157 >/tmp/mdt1_680157
[root@fir-md1-s2 1]# ls O/1/d5/5
O/1/d5/5
[root@fir-md1-s2 1]# mv O/1/d5/5 O/1/d5/5.bak
[root@fir-md1-s2 1]# mv O/1/d29/680157 O/1/d29/680157.bak

I'm attaching the corresponding files for investigation.

This MDT is far from full:

[root@fir-md1-s2 ~]# df -h -t lustre
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/md1-rbod1-mdt1   18T  826G   16T   5% /mnt/fir/mdt/1
[root@fir-md1-s2 ~]# df -i -t lustre
Filesystem                     Inodes     IUsed     IFree IUse% Mounted on
/dev/mapper/md1-rbod1-mdt1  288005760 111873725 176132035   39% /mnt/fir/mdt/1

Do you have an idea of what happened? How can we avoid this issue in the future?

Thanks!
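
A possible way to avoid this in the future (a sketch only, using standard lctl commands; the MDT device name and the cl3 reader ID are taken from this ticket) is to watch the registered changelog consumers and deregister any reader that has permanently stopped clearing records:

# List the changelog consumers registered on this MDT; the output shows the
# current changelog index and, for each user (e.g. cl3), the last index it
# has cleared
lctl get_param mdd.fir-MDT0001.changelog_users

# If a reader is dead and will never catch up, deregistering it lets the MDT
# cancel that reader's outstanding records and reclaim catalog slots
lctl --device fir-MDT0001 changelog_deregister cl3

Note that deregistering discards the pending records for that reader, so a consumer like Robinhood would need a full filesystem scan to resynchronize afterwards. |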
| Comments |
| Comment by Stephane Thiell [ 18/Mar/20 ] |
|
Additional information: Robinhood was a bit behind on this changelog reader, but by less than 48 hours:

2020/03/18 08:37:25 [9994/1] STATS | ChangeLog reader #1:
2020/03/18 08:37:25 [9994/1] STATS | fs_name = fir
2020/03/18 08:37:25 [9994/1] STATS | mdt_name = MDT0001
2020/03/18 08:37:25 [9994/1] STATS | reader_id = cl3
2020/03/18 08:37:25 [9994/1] STATS | records read = 22713006
2020/03/18 08:37:25 [9994/1] STATS | interesting records = 36470
2020/03/18 08:37:25 [9994/1] STATS | suppressed records = 20170655
2020/03/18 08:37:25 [9994/1] STATS | records pending = 15325
2020/03/18 08:37:25 [9994/1] STATS | status = busy
2020/03/18 08:37:25 [9994/1] STATS | last received: rec_id=33853864223, rec_time=2020/03/16 15:18:15.110496, received at 2020/03/17 21:04:32.001942
2020/03/18 08:37:25 [9994/1] STATS | receive speed: 0.00 rec/sec, log/real time ratio: 0.00
2020/03/18 08:37:25 [9994/1] STATS | last pushed: rec_id=33844744542, rec_time=2020/03/16 15:13:27.045815, pushed at 2020/03/18 08:35:18.442094
2020/03/18 08:37:25 [9994/1] STATS | push speed: 3781.59 rec/sec, log/real time ratio: 0.12
2020/03/18 08:37:25 [9994/1] STATS | last committed: rec_id=33844741446, rec_time=2020/03/16 15:13:26.962887, committed at 2020/03/18 08:35:18.462873
2020/03/18 08:37:25 [9994/1] STATS | commit speed: 4479.34 rec/sec, log/real time ratio: 0.14
2020/03/18 08:37:25 [9994/1] STATS | last cleared: rec_id=33844716263, rec_time=2020/03/16 15:13:26.236501, cleared at 2020/03/18 08:35:17.940335
2020/03/18 08:37:25 [9994/1] STATS | clear speed: 4844.73 rec/sec, log/real time ratio: 0.15
2020/03/18 08:37:25 [9994/1] STATS | ChangeLog stats:
2020/03/18 08:37:25 [9994/1] STATS | MARK: 0, CREAT: 2520877, MKDIR: 296, HLINK: 0, SLINK: 83, MKNOD: 0, UNLNK: 2521000
2020/03/18 08:37:25 [9994/1] STATS | RMDIR: 33, RENME: 543, RNMTO: 0, OPEN: 0, CLOSE: 10253618, LYOUT: 82, TRUNC: 7416466
2020/03/18 08:37:25 [9994/1] STATS | SATTR: 7, XATTR: 0, HSM: 0, MTIME: 1, CTIME: 0, ATIME: 0, MIGRT: 0, FLRW: 0, RESYNC: 0
2020/03/18 08:37:25 [9994/1] STATS | GXATR: 0, NOPEN: 0

BTW, we still have a lot of problems with the changelog readers in 2.12, leading to stuck readers... We have several workarounds in place (like restarting Robinhood), but that isn't always enough to keep up.
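
The same lag is also visible from the MDS side, complementing the Robinhood statistics above; a minimal sketch using a standard lctl parameter, with the MDT name and cl3 reader ID as in this ticket:

# The gap between "current index" and the index listed for cl3 is the number
# of records the reader has not yet cleared; rerunning this periodically
# shows whether the reader is catching up or falling further behind
watch -n 60 "lctl get_param -n mdd.fir-MDT0001.changelog_users" |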
| Comment by Peter Jones [ 19/Mar/20 ] |
|
Mike, could you please advise? Thanks, Peter |
| Comment by Mikhail Pershin [ 23/Mar/20 ] |
|
That could be related to a miscalculation of the llog catalog index when the catalog wraps around, but that is just a quick thought. I am analyzing the supplied files right now.
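
For reference, the catalog state can be inspected offline with llog_reader, as was done in the description; a sketch assuming the MDT is mounted as ldiskfs, from the same working directory as the transcript above:

# Dump the changelog catalog header and records; a catalog that has wrapped
# shows a record count close to the bitmap limit (roughly 64k slots) and
# index values that have cycled back past earlier entries
llog_reader changelog_catalog > /tmp/mdt1_changelog_catalog
head -20 /tmp/mdt1_changelog_catalog |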
| Comment by Stephane Thiell [ 24/Mar/20 ] |
|
Thanks, Mike, for taking a look! Now that we are using 2.12.4 on all servers for this system (Fir), we see occasional backtraces like the following on the MDS, and Robinhood still has problems processing changelogs:

Mar 24 06:42:49 fir-md1-s2 kernel: LNet: Service thread pid 22333 was inactive for 202.34s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 24 06:42:49 fir-md1-s2 kernel: Pid: 22333, comm: mdt00_007 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Mar 24 06:42:49 fir-md1-s2 kernel: Call Trace:
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffb0788c47>] call_rwsem_down_write_failed+0x17/0x30
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e58649>] llog_cat_id2handle+0x69/0x5b0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e59670>] llog_cat_cancel_records+0x120/0x3c0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc1891264>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e536df>] llog_process_thread+0x82f/0x18e0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e5484c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e59b49>] llog_cat_process_cb+0x239/0x250 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e536df>] llog_process_thread+0x82f/0x18e0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e5484c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e565b1>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc0e5675e>] llog_cat_process+0x2e/0x30 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc188da34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc188fb60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc1892c13>] mdd_changelog_clear+0x503/0x690 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc1895d03>] mdd_iocontrol+0x163/0x540 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc17179dc>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc1718374>] mdt_set_info+0x484/0x490 [mdt]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffb04c2e81>] kthread+0xd1/0xe0
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 24 06:42:49 fir-md1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Mar 24 06:42:49 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1585057369.22333
Mar 24 06:42:59 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to eceee209-ec05-4 (at 10.50.6.54@o2ib2)
Mar 24 06:43:20 fir-md1-s2 kernel: LNet: Service thread pid 22333 completed after 233.43s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
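
The backtrace above is the record-cancellation path: mdd_changelog_clear walks the whole catalog via llog_cat_process, which can keep a service thread busy for minutes when the backlog is large. One mitigation on the consumer side (a sketch only, using the standard lfs changelog_clear syntax with names from this ticket) is to acknowledge records in smaller increments so each cancellation pass stays short:

# Clear records for reader cl3 only up to a chosen record id; the endrec here
# is the "last cleared" id from the Robinhood stats above, purely as an example
lfs changelog_clear fir-MDT0001 cl3 33844716263

# An endrec of 0 clears everything outstanding for that reader
lfs changelog_clear fir-MDT0001 cl3 0 |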
| Comment by Sergey Cheremencev [ 12/Jan/22 ] |
|
I hit a similar LBUG; see the details in LU-15444. |