LU-9055: MDS crash due to changelog being full


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: RHEL 6.7, 2.6.32-573.12.1.el6_lustre.x86_64, o2ib, Red Hat OFED, 165 TB spread over 7 OSS/OSTs
    • Severity: 4

    Description

      Hello,

      We enabled changelogs on our MDS server to prepare for using Robinhood. We made the change and then began setting up Robinhood. We got Robinhood working well for reporting, it seemed, and began looking at setting up an HSM scenario where, for one test user, we'd move files over a certain age to an alternate filesystem. There were no errors in the Lustre logs relating to Robinhood until we enabled the Robinhood policy. At the exact moment we enabled the policy, this error started being written to the Lustre log:

      Jan 18 14:02:57 mds1 kernel: Lustre: 14021:0:(llog_cat.c:817:llog_cat_process_or_fork()) catlog 0x6:10 crosses index zero
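
      (For context, changelogs on the MDT are enabled by registering a changelog consumer, which assigns a user ID such as the cl1 that appears in the commands further down. A minimal sketch of that registration step, assuming the default changelog_mask, would be:)

      [root@mds1 ~]# lctl --device blizzard-MDT0000 changelog_register
      [root@mds1 ~]# lctl get_param mdd.blizzard-MDT0000.changelog_mask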

      That message was then logged every 10 minutes (I presume driven by Robinhood) for 4 days, until the MDS crashed. Also, after powering the MDS back up:

      Jan 22 20:37:59 mds1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait.
      Jan 22 20:37:59 mds1 kernel:
      Jan 22 20:38:41 mds1 kernel: LDISKFS-fs (dm-2): recovery complete
      Jan 22 20:38:41 mds1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
      Jan 22 20:38:42 mds1 kernel: Lustre: MGS: Connection restored to MGC172.16.3.231@o2ib_0 (at 0@lo)
      Jan 22 20:38:42 mds1 kernel: Lustre: 2956:0:(llog_cat.c:924:llog_cat_reverse_process()) catalog 0x6:10 crosses index zero
      Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDD0000: changelog on
      Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDT0000: Will be in recovery for at least 5:00, or until 26 clients reconnect
      Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDT0000: Denying connection for new client f95bf979-9378-6cb7-798c-b5dfbec3221a(at 172.16.3.106@o2ib), waiting for 26 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
      Jan 22 20:44:42 mds1 kernel: Lustre: blizzard-MDT0000: recovery is timed out, evict stale exports
      Jan 22 20:44:42 mds1 kernel: Lustre: blizzard-MDT0000: disconnecting 1 stale clients
      Jan 22 20:44:42 mds1 kernel: Lustre: 3256:0:(llog_cat.c:93:llog_cat_new_log()) blizzard-MDD0000: there are no more free slots in catalog

      And then, a bit later, the MDS would crash. This happened 3 more times when we attempted to start the MDS server. We ran the following commands to turn off the changelogs and attempt to clear them:

      [root@mds1 ~]# lctl --device blizzard-MDT0000 changelog_deregister cl1
      [root@mds1 ~]# lctl get_param mdd.blizzard-MDT0000.changelog_users
      mdd.blizzard-MDT0000.changelog_users=
      current index: 12720211297
      ID    index
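
      (For reference, a sketch of an alternative we did not run: if the goal is only to discard old records while keeping the consumer registered, records can be acknowledged with lfs changelog_clear. For example, the following would mark everything up to the current last record as consumed for user cl1, assuming cl1 were still registered:)

      [root@mds1 ~]# lfs changelog_clear blizzard-MDT0000 cl1 0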

      The shell we fired the changelog_deregister off from eventually exited, but lctl continued running. The lctl command has now been running for 70+ hrs ... I have no idea if it's normal for it to run this long. I also saw this in /tmp/lustre-log1485199531.6179, which was written about 2 hrs after the above command was run and hasn't been written to since:

      llog_cat.c llog_cat_process_cb processing log 0x21f25:1:0 at index 3513 of catalog 0x6:1
      llog_cat.c llog_cat_cleanup cancel plain log at index 3513 of catalog 0x6:10
      llog_cat.c llog_cat_process_cb processing log 0x21f26:1:0 at index 3514 of catalog 0x6:10
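
      (A hypothetical way to check whether the long-running lctl is still making progress would be to look at its kernel stack and dump the current Lustre debug buffer, e.g.:)

      [root@mds1 ~]# cat /proc/$(pidof lctl)/stack
      [root@mds1 ~]# lctl dk /tmp/lustre-dk-$(date +%s).log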

      Let me know if you need more information!

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: cbc christopher coffey (Inactive)
              Votes: 0
              Watchers: 4
