[LU-9055] MDS crash due to changelog being full Created: 26/Jan/17 Updated: 01/Mar/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | christopher coffey | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.7, 2.6.32-573.12.1.el6_lustre.x86_64, o2ib, redhat ofed, 165TB spread over 7OSS/OST |
||
| Issue Links: |
|
||||||||||||
| Severity: | 4 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Hello, We enabled changelogs on our mds server to prepare for using robinhood. We made the change and then began setting up robinhood. We got robinhood working well for reporting it seemed and began to look at setting up a HSM scenario where for one test user, we'd move files matching an age to a alternate filesystem. There were no errors in the lustre logs relating to robinhood up until we enabled the robinhood policy. Once enabling the robinhood policy, this error started being written to the lustre log at the same exact moment: Jan 18 14:02:57 mds1 kernel: Lustre: 14021:0:(llog_cat.c:817:llog_cat_process_or_fork()) catlog 0x6:10 crosses index zero Every 10 minutes (i presume from robinhood) for 4 days until the mds system crashed. Also, after powering up the mds: Jan 22 20:37:59 mds1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait. And then a bit later the mds would crash. This would happen 3 other times we attempted to start the mds server. We ran the following command to turn off the changelogs and attempt to clear them: [root@mds1 ~]# lctl --device blizzard-MDT0000 changelog_deregister cl1" The shell we fired the process off in ended up ending, but lctl continued running. The lctl command has been running now for 70+ hrs ... I have no idea if its normal for it to run this long. I also saw this in /tmp/lustre-log1485199531.6179 which was about 2 hrs after the above command was run, and it hasn't been written to since then: llog_cat.cllog_cat_process_cbprocessing log 0x21f25:1:0 at index 3513 of catalog 0x6:1 llog_cat.cllog_cat_cleanupcancel plain log at index 3513 of catalog 0x6:10 llog_cat.cllog_cat_process_cbprocessing log 0x21f26:1:0 at index 3514 of catalog 0x6:10 Let me know if you need more information! |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 26/Jan/17 ] |
|
Hello, Can you debugfs the MDT device and dump the changelog_catalog and changelog_users files ? Also the last msgs you have provided seem to indicate that the ChangeLogs cleanup is on-going, but if you don't mind about the current ChangeLogs content, and want to start from a clean state, you can umount/stop the MDT, then remount it as ldiskfs, the move/backup+remove the changelog_catalog and changelog_users files in a safe place for further analysis, then restart/remount the MDT. |