[LU-9055] MDS crash due to changelog being full Created: 26/Jan/17  Updated: 01/Mar/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: christopher coffey Assignee: Bruno Faccini (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

RHEL 6.7, 2.6.32-573.12.1.el6_lustre.x86_64, o2ib, Red Hat OFED, 165 TB spread over 7 OSS/OSTs


Issue Links:
Related
is related to LU-1586 no free catalog slots for log Resolved
is related to LU-7340 ChangeLogs catalog full condition sho... Resolved
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

Hello,

We enabled changelogs on our MDS server to prepare for using robinhood. We made the change and then began setting up robinhood. Robinhood appeared to be working well for reporting, and we started looking at an HSM scenario in which, for one test user, we would move files older than a given age to an alternate filesystem. There were no errors in the Lustre logs relating to robinhood until we enabled the robinhood policy. At the exact moment we enabled the policy, this error started being written to the Lustre log:

Jan 18 14:02:57 mds1 kernel: Lustre: 14021:0:(llog_cat.c:817:llog_cat_process_or_fork()) catlog 0x6:10 crosses index zero

The message repeated every 10 minutes (I presume driven by robinhood) for 4 days, until the MDS system crashed. Also, after powering the MDS back up:

Jan 22 20:37:59 mds1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait.
Jan 22 20:37:59 mds1 kernel:
Jan 22 20:38:41 mds1 kernel: LDISKFS-fs (dm-2): recovery complete
Jan 22 20:38:41 mds1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
Jan 22 20:38:42 mds1 kernel: Lustre: MGS: Connection restored to MGC172.16.3.231@o2ib_0 (at 0@lo)
Jan 22 20:38:42 mds1 kernel: Lustre: 2956:0:(llog_cat.c:924:llog_cat_reverse_process()) catalog 0x6:10 crosses index zero
Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDD0000: changelog on
Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDT0000: Will be in recovery for at least 5:00, or until 26 clients reconnect
Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDT0000: Denying connection for new client f95bf979-9378-6cb7-798c-b5dfbec3221a(at 172.16.3.106@o2ib), waiting for 26 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
Jan 22 20:44:42 mds1 kernel: Lustre: blizzard-MDT0000: recovery is timed out, evict stale exports
Jan 22 20:44:42 mds1 kernel: Lustre: blizzard-MDT0000: disconnecting 1 stale clients
Jan 22 20:44:42 mds1 kernel: Lustre: 3256:0:(llog_cat.c:93:llog_cat_new_log()) blizzard-MDD0000: there are no more free slots in catalog

A bit later the MDS would crash. This happened 3 more times when we attempted to start the MDS server. We ran the following commands to turn off the changelogs and attempt to clear them:

[root@mds1 ~]# lctl --device blizzard-MDT0000 changelog_deregister cl1
[root@mds1 ~]# lctl get_param mdd.blizzard-MDT0000.changelog_users
mdd.blizzard-MDT0000.changelog_users=current index: 12720211297
ID    index

The shell we fired the deregister off in eventually exited, but the lctl process kept running. It has now been running for 70+ hrs ... I have no idea if it's normal for it to run this long. I also saw the following in /tmp/lustre-log1485199531.6179, which was written about 2 hrs after the above command was run and hasn't been written to since:

llog_cat.c llog_cat_process_cb processing log 0x21f25:1:0 at index 3513 of catalog 0x6:1
llog_cat.c llog_cat_cleanup cancel plain log at index 3513 of catalog 0x6:10
llog_cat.c llog_cat_process_cb processing log 0x21f26:1:0 at index 3514 of catalog 0x6:10
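For reference, my understanding is that catalog slots only get freed when a registered reader clears the records it has already consumed (RobinHood is supposed to issue these clears itself as it reads). Doing it by hand would look roughly like the following, using the consumer ID cl1 and the current index reported above as an illustrative end record (a sketch only, we have not run this):

[root@mds1 ~]# lfs changelog blizzard-MDT0000 | head
[root@mds1 ~]# lfs changelog_clear blizzard-MDT0000 cl1 12720211297
[root@mds1 ~]# lctl get_param mdd.blizzard-MDT0000.changelog_users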

Let me know if you need more information!



 Comments   
Comment by Bruno Faccini (Inactive) [ 26/Jan/17 ]

Hello,
According to your description, it is unclear to me whether you have been running with ChangeLogs enabled and no reader (only RobinHood?) for a long period of time, and/or whether a huge amount of Lustre activity may have occurred.
My best guess is that there is an orphan LLOG entry in the ChangeLog Catalog that has caused a wrap and then a full situation...

Can you debugfs the MDT device and dump the changelog_catalog and changelog_users files?
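For example, something like this (a sketch; substitute the actual MDT device, /dev/dm-2 per your mount messages, and whatever output paths you prefer):

[root@mds1 ~]# debugfs -c -R 'dump /changelog_catalog /tmp/changelog_catalog' /dev/dm-2
[root@mds1 ~]# debugfs -c -R 'dump /changelog_users /tmp/changelog_users' /dev/dm-2

The -c option opens the device in catastrophic (read-only) mode, so nothing is written to the MDT.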

Also, the last messages you have provided seem to indicate that the ChangeLogs cleanup is ongoing, but if you don't care about the current ChangeLogs content and want to start from a clean state, you can unmount/stop the MDT, remount it as ldiskfs, then move/backup the changelog_catalog and changelog_users files to a safe place (and remove them from the MDT) for further analysis, then restart/remount the MDT.
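Roughly, the sequence would be (a sketch only; the /dev/dm-2 device and the mount points are assumptions, adjust to your configuration):

[root@mds1 ~]# umount /mnt/mdt                       # stop the MDT (your actual Lustre mount point)
[root@mds1 ~]# mkdir -p /mnt/mdt-ldiskfs /root/changelog_backup
[root@mds1 ~]# mount -t ldiskfs /dev/dm-2 /mnt/mdt-ldiskfs
[root@mds1 ~]# cp -a /mnt/mdt-ldiskfs/changelog_catalog /mnt/mdt-ldiskfs/changelog_users /root/changelog_backup/
[root@mds1 ~]# rm /mnt/mdt-ldiskfs/changelog_catalog /mnt/mdt-ldiskfs/changelog_users
[root@mds1 ~]# umount /mnt/mdt-ldiskfs
[root@mds1 ~]# mount -t lustre /dev/dm-2 /mnt/mdt    # restart the MDT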
