[LU-11980] When attempting to enable changelogs system is locking up with "no more free slots in catalog" Created: 20/Feb/19  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Joe Mervini Assignee: Mikhail Pershin
Resolution: Done Votes: 0
Labels: None
Environment:

Dell R720 servers running TOSS 3.1-4.1 (RHEL 7.3)/Lustre 2.8.0.9 IB attached to DDN 7700.


Epic/Theme: Lustre-2.8.0
Rank (Obsolete): 9223372036854775807

 Description   

We have been working with changelogs recently, and on this particular file system, when changelogs were enabled, retrieving changelog information was very slow when specifying a short range above the current index. Because of that poor response, changelogs have been registered and deregistered a number of times.

Today we attempted to re-register changelogs; the current index never changed, and after just a couple of minutes the system froze.

One thing just occurred to me: Should we clear the changelogs before re-registering? We haven't done that.

 

dmesg after a reboot (without deregistering changelogs) produced the following output prior to the hang:

[Wed Feb 20 10:42:41 2019] LDISKFS-fs (dm-0): recovery complete
[Wed Feb 20 10:42:41 2019] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
[Wed Feb 20 10:42:43 2019] Lustre: MGS: Connection restored to f81759a0-aecf-98f8-d1e0-a6cf70343cdd (at 0@lo)
[Wed Feb 20 10:42:43 2019] LustreError: 137-5: gscratch-MDT0000_UUID: not available for connect from 192.168.163.154@o2ib8 (no target). If you are running an HA pair check that the target is mounted on the oth.
[Wed Feb 20 10:42:43 2019] Lustre: gscratch-MDT0000: Not available for connect from 192.168.146.62@o2ib7 (not set up)
[Wed Feb 20 10:42:43 2019] Lustre: MGS: Connection restored to (at 192.168.129.20@o2ib4)
[Wed Feb 20 10:42:43 2019] Lustre: Skipped 21 previous similar messages
[Wed Feb 20 10:42:43 2019] sd 7:0:0:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[Wed Feb 20 10:42:44 2019] Lustre: gscratch-MDT0000: Not available for connect from 192.168.133.66@o2ib4 (not set up)
[Wed Feb 20 10:42:44 2019] Lustre: Skipped 24 previous similar messages
[Wed Feb 20 10:42:44 2019] Lustre: MGS: Connection restored to (at 192.168.163.242@o2ib8)
[Wed Feb 20 10:42:44 2019] Lustre: Skipped 71 previous similar messages
[Wed Feb 20 10:42:46 2019] Lustre: MGS: Connection restored to (at 192.168.128.111@o2ib4)
[Wed Feb 20 10:42:46 2019] Lustre: Skipped 144 previous similar messages
[Wed Feb 20 10:42:50 2019] Lustre: MGS: Connection restored to 670522c1-3eaa-288f-b21f-556fb85bf7bb (at 192.168.81.28@tcp2)
[Wed Feb 20 10:42:50 2019] Lustre: Skipped 336 previous similar messages
[Wed Feb 20 10:42:51 2019] Lustre: 5685:0:(llog_cat.c:956:llog_cat_reverse_process()) catalog 0x7:10 crosses index zero
[Wed Feb 20 10:42:51 2019] Lustre: gscratch-MDD0000: changelog on
[Wed Feb 20 10:42:52 2019] Lustre: gscratch-MDT0000: Will be in recovery for at least 5:00, or until 6653 clients reconnect
[Wed Feb 20 10:42:58 2019] Lustre: gscratch-MDT0000: Connection restored to (at 192.168.146.61@o2ib7)
[Wed Feb 20 10:42:58 2019] Lustre: Skipped 1892 previous similar messages
[Wed Feb 20 10:43:14 2019] Lustre: gscratch-MDT0000: Connection restored to 41d8e069-ae86-1be2-dc41-7835d4a0c892 (at 192.168.133.43@o2ib4)
[Wed Feb 20 10:43:14 2019] Lustre: Skipped 1630 previous similar messages
[Wed Feb 20 10:43:46 2019] Lustre: MGS: Connection restored to c858508f-9415-a50e-e377-4c75ecc97e6a (at 192.168.160.40@o2ib8)
[Wed Feb 20 10:43:46 2019] Lustre: Skipped 7165 previous similar messages
[Wed Feb 20 10:43:53 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:59. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:43:53 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:58. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:43:53 2019] Lustre: Skipped 68 previous similar messages
[Wed Feb 20 10:43:54 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:57. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:43:54 2019] Lustre: Skipped 127 previous similar messages
[Wed Feb 20 10:43:56 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:55. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:43:56 2019] Lustre: Skipped 261 previous similar messages
[Wed Feb 20 10:43:58 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f9ad07500 x1623562148962904/t0(737400040385) o35->c1db237f-2447-dcf4-1
[Wed Feb 20 10:44:00 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:51. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:44:00 2019] Lustre: Skipped 656 previous similar messages
[Wed Feb 20 10:44:00 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880fe010ac50 x1623924373420932/t0(737400041101) o35->28ad855d-72b3-7388-1
[Wed Feb 20 10:44:02 2019] LustreError: 6378:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f9a40ce00 x1623286549665444/t0(737400045945) o101->ed31116f-3909-bacf1
[Wed Feb 20 10:44:08 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:43. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:44:08 2019] Lustre: Skipped 591 previous similar messages
[Wed Feb 20 10:44:24 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:27. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:44:24 2019] Lustre: Skipped 1055 previous similar messages
[Wed Feb 20 10:44:32 2019] LustreError: 6261:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff8807c7e6a700 x1624190622706812/t0(737400044160) o101->48ebd6c1-d5c4-15b01
[Wed Feb 20 10:44:32 2019] LustreError: 6261:0:(ldlm_lib.c:2778:target_queue_recovery_request()) Skipped 1 previous similar message
[Wed Feb 20 10:44:37 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f99ce9800 x1623562475389600/t0(737400021311) o35->f9a9f0dd-fe45-980e-1
[Wed Feb 20 10:44:50 2019] Lustre: gscratch-MDT0000: Connection restored to 4f538c64-8ac4-c14d-4e39-050d775a27fb (at 192.168.145.204@o2ib7)
[Wed Feb 20 10:44:50 2019] Lustre: Skipped 7950 previous similar messages
[Wed Feb 20 10:44:56 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 2:55. If you do not want to wait more, please abort the recovery by force.
[Wed Feb 20 10:44:56 2019] Lustre: Skipped 3929 previous similar messages
[Wed Feb 20 10:44:59 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff8800af8b2700 x1623562148962904/t0(737400040385) o35->c1db237f-2447-dcf4-1
[Wed Feb 20 10:45:33 2019] LustreError: 6313:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f99236900 x1624190622706812/t0(737400044160) o101->48ebd6c1-d5c4-15b01
[Wed Feb 20 10:45:33 2019] LustreError: 6313:0:(ldlm_lib.c:2778:target_queue_recovery_request()) Skipped 3 previous similar messages

 

Snippet from syslog:

Feb 20 10:18:09 gmds1 dbus[3789]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 20 10:18:49 gmds1 kernel: [2070231.187993] Lustre: gscratch-MDD0000: changelog on
Feb 20 10:18:49 gmds1 kernel: [2070231.196897] Lustre: 7104:0:(llog_cat.c:93:llog_cat_new_log()) gscratch-MDD0000: there are no more free slots in catalog
Feb 20 10:18:49 gmds1 kernel: [2070231.210447] LustreError: 6714:0:(mdd_dir.c:887:mdd_changelog_ns_store()) gscratch-MDD0000: cannot store changelog record: type = 8, name = 'rest.colvars.state.old', t = [0x2001755d7:0xdc8:0x0], p = [0x200171e8a:0x12d2d:0x0]: rc = -28
Feb 20 10:18:49 gmds1 kernel: [2070231.221771] LustreError: 7227:0:(llog_cat.c:385:llog_cat_current_log()) gscratch-MDD0000: next log does not exist!
Feb 20 10:18:49 gmds1 kernel: [2070231.249353] LustreError: 6714:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 3 previous similar messages
Feb 20 10:18:49 gmds1 kernel: [2070231.697900] Lustre: 7237:0:(llog_cat.c:93:llog_cat_new_log()) gscratch-MDD0000: there are no more free slots in catalog
Feb 20 10:18:49 gmds1 kernel: [2070231.711550] Lustre: 7237:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 278 previous similar messages
Feb 20 10:18:49 gmds1 kernel: [2070231.723222] LustreError: 6711:0:(mdd_dir.c:887:mdd_changelog_ns_store()) gscratch-MDD0000: cannot store changelog record: type = 8, name = 'rest.colvars.state.old', t = [0x20017748a:0x3684:0x0], p = [0x200171e8a:0x12c65:0x0]: rc = -28
Feb 20 10:18:49 gmds1 kernel: [2070231.749406] LustreError: 6711:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 62 previous similar messages
Feb 20 10:18:49 gmds1 kernel: [2070231.814879] LustreError: 6909:0:(llog_cat.c:385:llog_cat_current_log()) gscratch-MDD0000: next log does not exist!
Feb 20 10:18:49 gmds1 kernel: [2070231.828077] LustreError: 6909:0:(llog_cat.c:385:llog_cat_current_log()) Skipped 43 previous similar messages
Feb 20 10:18:50 gmds1 kernel: [2070232.707459] Lustre: 6628:0:(llog_cat.c:93:llog_cat_new_log()) gscratch-MDD0000: there are no more free slots in catalog
Feb 20 10:18:50 gmds1 kernel: [2070232.721229] Lustre: 6628:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 273 previous similar messages
Feb 20 10:18:51 gmds1 kernel: [2070232.998096] LustreError: 6711:0:(mdd_dir.c:887:mdd_changelog_ns_store()) gscratch-MDD0000: cannot store changelog recor



 Comments   
Comment by Peter Jones [ 22/Feb/19 ]

Mike

What do you advise here?

Peter

Comment by Mikhail Pershin [ 26/Feb/19 ]

Yes, you have to clear the changelogs manually on re-registration if you want such behavior, or it can be managed by a policy engine, e.g. Robinhood. At the moment you need to purge the changelog records. That can be done with the

lfs changelog_clear <mdtname> <user ID> 0

command, where 'user ID' is the registration ID, so register a user first if needed.
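
For illustration, a minimal sketch of that sequence, assuming the MDT is gscratch-MDT0000 and that registration returns the hypothetical user ID cl1:

lctl --device gscratch-MDT0000 changelog_register   # prints the assigned changelog user ID, e.g. cl1
lfs changelog_clear gscratch-MDT0000 cl1 0           # endrec 0 clears all records for that user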

Comment by Joe Mervini [ 13/Mar/19 ]

Just reviewing the manual: lctl changelog_deregister is supposed to clear the changelogs. That aside, right now when we attempt to re-register a changelog user, the system freezes before we can issue the lfs changelog_clear command. How do you suggest we proceed?
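
For reference, that deregistration takes the form (cl1 being an example user ID):

lctl --device gscratch-MDT0000 changelog_deregister cl1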

Comment by Mikhail Pershin [ 15/Mar/19 ]

Joe, are there client/server logs available from the time when the system freezes? Also, could you get the output of

lctl get_param 'mdd.*.changelog*'

 on the MDS?

Comment by Joe Mervini [ 18/Mar/19 ]

Mikhail,

We haven't made any attempts to re-register the changelog user since the last time it froze because it is a production system. I won't have any client-side logs, but I'll look at what might be available on the server side.

Here is the output you requested:

MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO OPEN CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT
mdd.gscratch-MDT0000.changelog_users=current index: 11390996174

Comment by Mikhail Pershin [ 29/Mar/19 ]

Joe, that means you have no current changelog users; the changelogs are supposed to be cleared when the last user is deregistered, but that didn't happen. If there is an opportunity to try registration again, maybe at maintenance time, could you try it and collect Lustre debug logs on the server? Set the maximum debug level beforehand with lctl set_param debug=-1
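
For example, the collection on the MDS could look like this (a rough sketch only; the output file path is just an example):

lctl set_param debug=-1            # enable all debug flags
lctl clear                         # empty the kernel debug buffer before the test
# ... attempt the changelog registration that triggers the freeze ...
lctl dk > /tmp/lustre-debug.log    # dump the kernel debug buffer to a file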

If you are unable to register a user and clear the logs manually, it is possible to do that during downtime/maintenance via the following steps (a rough command sketch follows the list):

  • remount the MDT using the backing filesystem (ldiskfs).
  • use llog_reader <mountpoint>/changelog_catalog to parse the catalog, extract the path=PATH components of the sublog records, and then remove all of those files from the filesystem.
  • remove changelog_catalog.
  • do the same with changelog_users.

This will remove all changelogs; they will be recreated upon server start.
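
A rough sketch of those steps, assuming the MDT device is /dev/dm-0 and using /mnt/mdt as a temporary mount point (both just examples):

mount -t ldiskfs /dev/dm-0 /mnt/mdt
llog_reader /mnt/mdt/changelog_catalog     # note the path=... entries of the sublog records
rm /mnt/mdt/<each path printed above>      # remove the sublog files
rm /mnt/mdt/changelog_catalog /mnt/mdt/changelog_users
umount /mnt/mdt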
