[LU-13372] fir-MDD0001: there are no more free slots in catalog changelog_catalog Created: 18/Mar/20  Updated: 25/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Mikhail Pershin
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6


Attachments: Text File fir-md1-s2_console-changelogs_no_more_free_slots_2_12_4.log     HTML File mdt1_680157     HTML File mdt1_changelog_catalog     HTML File mdt1_changelog_users    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

We hit an LBUG on 2.12.3 last night:

[8325133.774837] LustreError: 59728:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14075 already set in log bitmap
[8325133.786320] LustreError: 59728:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
[8325133.793471] Pid: 59728, comm: mdt_rdpg02_021 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019

Also, after a restart:

Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:617:llog_osd_write_rec()) fir-MDT0001-osd: index 14096 already set in log bitmap
Mar 18 08:35:18 fir-md1-s2 kernel: LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
Mar 18 08:35:18 fir-md1-s2 kernel: Pid: 34211, comm: mdt_rdpg01_053 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Message from syslogd@fir-md1-s2 at Mar 18 08:35:18 ...
 kernel:LustreError: 34211:0:(llog_osd.c:619:llog_osd_write_rec()) LBUG
Mar 18 08:35:18 fir-md1-s2 kernel: Call Trace:
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0ccd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0ccd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0df72ba>] llog_osd_write_rec+0x16ca/0x1730 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc18fa06f>] mdd_changelog_write_rec+0x2f/0x120 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0de81ab>] llog_write_rec+0xcb/0x520 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0dece0c>] llog_cat_new_log+0x62c/0xce0 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0ded6bb>] llog_cat_add_rec+0x1fb/0x880 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc0de5180>] llog_add+0x80/0x1a0 [obdclass]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc18fa2dd>] mdd_changelog_store+0x17d/0x520 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc1907374>] mdd_changelog_data_store_by_fid+0x1d4/0x350 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc19096c2>] mdd_changelog_data_store+0x142/0x200 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc190ab38>] mdd_close+0xae8/0xf30 [mdd]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc179597e>] mdt_mfd_close+0x3fe/0x860 [mdt]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc179b291>] mdt_close_internal+0x121/0x220 [mdt]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc179b5b0>] mdt_close+0x220/0x780 [mdt]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc117636a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc111d24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffffc1120bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffff890c2e81>] kthread+0xd1/0xe0
Mar 18 08:35:18 fir-md1-s2 kernel:  [<ffffffff89777c24>] ret_from_fork_nospec_begin+0xe/0x21

We stopped the changelog reader (Robinhood) for this MDT, after which we were able to start the MDT, but we then hit the following issues:

[Wed Mar 18 10:28:44 2020][ 520.331132] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) fir-MDD0001: there are no more free slots in catalog changelog_catalog
[Wed Mar 18 10:28:44 2020][ 520.343129] Lustre: 21084:0:(llog_cat.c:98:llog_cat_new_log()) Skipped 159923 previous similar messages
[Wed Mar 18 10:28:44 2020][ 520.353111] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) fir-MDD0001: next log does not exist!
[Wed Mar 18 10:28:44 2020][ 520.363117] LustreError: 21179:0:(llog_cat.c:530:llog_cat_current_log()) Skipped 137757 previous similar messages
[Wed Mar 18 10:28:48 2020][ 524.957681] LustreError: 21095:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 1, name = 'alignment.eigen.indiv', t = [0x240049459:0x9f67:0x0], p = [0x2400478b1:0x1e9d6:0x0]: rc = -28
[Wed Mar 18 10:28:54 2020][ 530.656413] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) fir-MDD0001: cannot store changelog record: type = 6, name = 'alignment.eigen.indiv', t = [0x240049419:0xea19:0x0], p = [0x24004ac39:0x22e6:0x0]: rc = -5
[Wed Mar 18 10:28:54 2020][ 530.676655] LustreError: 20878:0:(mdd_dir.c:1065:mdd_changelog_ns_store()) Skipped 2 previous similar messages

Full console log attached as fir-md1-s2_console-changelogs_no_more_free_slots_2_12_4.log
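
For reference, rc = -28 is -ENOSPC and rc = -5 is -EIO: llog_cat_new_log() reports "no more free slots" when it cannot find a free slot in the catalog's header bitmap, either because every slot is still occupied by a plain llog holding uncleared records, or because the catalog's index bookkeeping has gone wrong. A quick way to check whether a registered reader is pinning old records is to compare its cleared index against the MDT's current index (a minimal sketch; the exact changelog_users output format varies between Lustre versions):

# current changelog index vs. the index each registered user has cleared up to
lctl get_param mdd.fir-MDT0001.changelog_users

# if a user (e.g. cl3) is far behind and its consumer is stopped for good,
# acknowledge its records up to the current last record
lfs changelog_clear fir-MDT0001 cl3 0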

Then, we upgraded the MDS from 2.12.3 to 2.12.4, and the issue was still there.

Users also reported the following errors when creating new files:

[zhengyuh@sh02-ln01 login /scratch/users/zhengyuh]$ touch aa
touch: cannot touch ‘aa’: Bad address

Because this was kind of an emergency, I applied the following procedure to recreate the changelog files:

[root@fir-md1-s2 1]# llog_reader changelog_catalog >/tmp/mdt1_changelog_catalog
[root@fir-md1-s2 1]# llog_reader changelog_users
Bit 1 of 1 not set
rec #2 type=1064553b len=64 offset 8256
Header size : 8192
Time : Thu Jan 24 14:04:30 2019
Number of records: 1
Target uuid : 
-----------------------
#02 (064)id=[0x410fe:0x1:0x0]:0 path=O/1/d30/266494
[root@fir-md1-s2 1]# llog_reader changelog_users >/tmp/mdt1_changelog_users
[root@fir-md1-s2 1]# mv changelog_catalog chagnelog_catalog.bak
[root@fir-md1-s2 1]# mv changelog_users changelog_users.bak
[root@fir-md1-s2 1]# llog_reader O/1/d29/680157 >/tmp/mdt1_680157
[root@fir-md1-s2 1]# ls O/1/d5/5
O/1/d5/5
[root@fir-md1-s2 1]# mv O/1/d5/5 O/1/d5/5.bak
[root@fir-md1-s2 1]# mv O/1/d29/680157 O/1/d29/680157.bak

I'm attaching the corresponding files for investigation.
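
For completeness, a less invasive alternative that might have worked here (untested in this situation, and assuming Robinhood's user is the only one registered, e.g. cl3) is to deregister the changelog user so its pending records no longer pin the old plain llogs, then register a fresh user for Robinhood:

# with Robinhood stopped, drop the stale changelog consumer
lctl --device fir-MDT0001 changelog_deregister cl3

# register a new consumer and note the returned ID (clN) for the Robinhood config
lctl --device fir-MDT0001 changelog_register

Whether deregistration alone would have recovered a catalog already in this state is unclear.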

This MDT is far from being full:

[root@fir-md1-s2 ~]# df -h -t lustre
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/md1-rbod1-mdt1   18T  826G   16T   5% /mnt/fir/mdt/1
[root@fir-md1-s2 ~]# df -i -t lustre
Filesystem                    Inodes     IUsed     IFree IUse% Mounted on
/dev/mapper/md1-rbod1-mdt1 288005760 111873725 176132035   39% /mnt/fir/mdt/1

Do you have an idea of what happened? How can we avoid this issue in the future? Thanks!
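
As a preventive measure (a sketch only; the changelog_users output format differs slightly between versions), the backlog of each registered reader can be monitored against the MDT's current changelog index, with an alert raised before it grows too large, and any user whose consumer has been decommissioned should be deregistered so it does not pin records forever:

# report each changelog user's backlog on this MDT
lctl get_param -n mdd.fir-MDT0001.changelog_users | awk '
    /current index:/ { cur = $3 }
    /^cl[0-9]+/      { printf "%s is behind by %d records\n", $1, cur - $2 }
'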



 Comments   
Comment by Stephane Thiell [ 18/Mar/20 ]

Additional information: Robinhood's changelog reader for this MDT was a bit behind, but by less than 48 hours:

2020/03/18 08:37:25 [9994/1] STATS | ChangeLog reader #1:
2020/03/18 08:37:25 [9994/1] STATS |    fs_name    =   fir
2020/03/18 08:37:25 [9994/1] STATS |    mdt_name   =   MDT0001
2020/03/18 08:37:25 [9994/1] STATS |    reader_id  =   cl3
2020/03/18 08:37:25 [9994/1] STATS |    records read        = 22713006
2020/03/18 08:37:25 [9994/1] STATS |    interesting records = 36470
2020/03/18 08:37:25 [9994/1] STATS |    suppressed records  = 20170655
2020/03/18 08:37:25 [9994/1] STATS |    records pending     = 15325
2020/03/18 08:37:25 [9994/1] STATS |    status              = busy
2020/03/18 08:37:25 [9994/1] STATS |    last received: rec_id=33853864223, rec_time=2020/03/16 15:18:15.110496, received at 2020/03/17 21:04:32.001942
2020/03/18 08:37:25 [9994/1] STATS |        receive speed: 0.00 rec/sec, log/real time ratio: 0.00
2020/03/18 08:37:25 [9994/1] STATS |    last pushed: rec_id=33844744542, rec_time=2020/03/16 15:13:27.045815, pushed at 2020/03/18 08:35:18.442094
2020/03/18 08:37:25 [9994/1] STATS |        push speed: 3781.59 rec/sec, log/real time ratio: 0.12
2020/03/18 08:37:25 [9994/1] STATS |    last committed: rec_id=33844741446, rec_time=2020/03/16 15:13:26.962887, committed at 2020/03/18 08:35:18.462873
2020/03/18 08:37:25 [9994/1] STATS |        commit speed: 4479.34 rec/sec, log/real time ratio: 0.14
2020/03/18 08:37:25 [9994/1] STATS |    last cleared: rec_id=33844716263, rec_time=2020/03/16 15:13:26.236501, cleared at 2020/03/18 08:35:17.940335
2020/03/18 08:37:25 [9994/1] STATS |        clear speed: 4844.73 rec/sec, log/real time ratio: 0.15
2020/03/18 08:37:25 [9994/1] STATS |    ChangeLog stats:
2020/03/18 08:37:25 [9994/1] STATS |    MARK: 0, CREAT: 2520877, MKDIR: 296, HLINK: 0, SLINK: 83, MKNOD: 0, UNLNK: 2521000
2020/03/18 08:37:25 [9994/1] STATS |    RMDIR: 33, RENME: 543, RNMTO: 0, OPEN: 0, CLOSE: 10253618, LYOUT: 82, TRUNC: 7416466
2020/03/18 08:37:25 [9994/1] STATS |    SATTR: 7, XATTR: 0, HSM: 0, MTIME: 1, CTIME: 0, ATIME: 0, MIGRT: 0, FLRW: 0, RESYNC: 0
2020/03/18 08:37:25 [9994/1] STATS |    GXATR: 0, NOPEN: 0

BTW, we still have a lot of problems with the changelog readers in 2.12, leading to stuck readers... We have several workarounds in place (like restarting Robinhood), but they aren't always enough to keep up.

Comment by Peter Jones [ 19/Mar/20 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Mikhail Pershin [ 23/Mar/20 ]

That could be related to miscalculations of the llog catalog index when it wraps around, but that is just a quick thought. I am analyzing the supplied files right now.
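
For a sense of scale (rough numbers only, assuming the default 8 KiB llog header is used almost entirely as a record bitmap, i.e. on the order of 64K slots per catalog and 64K records per plain llog):

# upper bound on catalog slots / records per plain llog from an 8 KiB header bitmap
echo $(( 8192 * 8 ))                        # 65536
# approximate plain llogs consumed, from the last Robinhood rec_id reported above
echo $(( 33853864223 / 65536 ))             # ~516569
# approximate full passes over a ~65536-slot catalog
echo $(( 33853864223 / 65536 / 65536 ))     # ~7

So at rec_id ~33.8 billion this catalog would already have wrapped around several times, consistent with the wrap-around suspicion.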

Comment by Stephane Thiell [ 24/Mar/20 ]

Thanks Mike for taking a look! Now that we are using 2.12.4 on all servers for this system (Fir), we see occasional backtraces like the following one on the MDS, and Robinhood still has problems processing changelogs:

Mar 24 06:42:49 fir-md1-s2 kernel: LNet: Service thread pid 22333 was inactive for 202.34s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 24 06:42:49 fir-md1-s2 kernel: Pid: 22333, comm: mdt00_007 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Mar 24 06:42:49 fir-md1-s2 kernel: Call Trace:
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffb0788c47>] call_rwsem_down_write_failed+0x17/0x30
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e58649>] llog_cat_id2handle+0x69/0x5b0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e59670>] llog_cat_cancel_records+0x120/0x3c0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc1891264>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e536df>] llog_process_thread+0x82f/0x18e0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e5484c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e59b49>] llog_cat_process_cb+0x239/0x250 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e536df>] llog_process_thread+0x82f/0x18e0 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e5484c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e565b1>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc0e5675e>] llog_cat_process+0x2e/0x30 [obdclass]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc188da34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc188fb60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc1892c13>] mdd_changelog_clear+0x503/0x690 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc1895d03>] mdd_iocontrol+0x163/0x540 [mdd]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc17179dc>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc1718374>] mdt_set_info+0x484/0x490 [mdt]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 24 06:42:49 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 24 06:42:49 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1585057369.22333
Mar 24 06:42:59 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to eceee209-ec05-4 (at 10.50.6.54@o2ib2)
Mar 24 06:43:20 fir-md1-s2 kernel: LNet: Service thread pid 22333 completed after 233.43s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

Comment by Sergey Cheremencev [ 12/Jan/22 ]

I faced a similar LBUG; see details in LU-15444.
