
update_log_dir consuming 1.1TB on MDT0000

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • Lustre: Build Version: 2.8.0_5.chaos

    Description

      On zinc, a DNE filesystem with 16 MDTs, the pool containing MDT0000 (zinc1) ran out of space. Upon inspection, we found that 1.1 TB is occupied by files contained in update_log_dir. The rest of the MDT occupies about 300MB, which is about the same as the space used by each of the other 15 MDTs.

          Activity

            [LU-8794] update_log_dir consuming 1.1TB on MDT0000
            ofaaland Olaf Faaland added a comment -

            I was unable to reproduce the problem after it was initially encountered, and we have not seen it since on test or production systems, perhaps because we have not been testing DNE2 and use very few remote directories. Closing.

            di.wang Di Wang added a comment - - edited

            http://review.whamcloud.com/18028 (LU-6838) might help here, but as explained there, the plain log size limit is around 800M, which probably cannot explain why the update log files reached 1T. Something is strange here. Anyway, I think the suggestion on LU-8714 is the way to go.

            ofaaland Olaf Faaland added a comment -

            Note that this ticket is purely for trying to figure out why the update logs are occupying so much space. There is a separate ticket, LU-8787, for how to recover.

            If the contents of the MDT won't help us learn what happened, we can just close the ticket until it happens again and we can get better information.
            We have started monitoring space used in the pool containing the MDT, and will be more likely to notice if the volume of update logs increases.
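
The kind of space check described above could be sketched as follows. This is only an illustration using the Python standard library, not the monitoring actually deployed on zinc; the function names and the 80% threshold are assumptions, and `path` stands for any mount point backed by the pool in question.

```python
import shutil


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero total; treat them as empty.
        return 0.0
    return usage.used / usage.total


def check_space(path, warn_at=0.80):
    """Return a warning string if usage is at or above `warn_at`, else None.

    The 0.80 default is an arbitrary illustrative threshold.
    """
    frac = usage_fraction(path)
    if frac >= warn_at:
        return f"WARNING: {path} is {frac:.0%} full"
    return None
```

Run periodically (for example from cron) and logged, a check like this would make steady growth in update log volume visible well before the pool fills.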

            ofaaland Olaf Faaland added a comment -

            Unfortunately I cannot be certain of the filesystem activity that caused this. We were not monitoring the space usage in the pool (although we are now).

            I also cannot provide debug logs from the MDTs, as we discovered the problem after a reboot of the servers.

            The only information available is syslog output for the servers and the contents of the MDT itself.

            Di Wang suggested I can delete the contents of update_log_dir. Let me know if you need any information about its contents before I do that.

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Lai,

            Can you please take a look at this issue?

            Thanks.
            Joe
            ofaaland Olaf Faaland added a comment -

            There are 158 files in update_log_dir:
            68 with size > 10GB
            29 with 10GB > size >= 1GB
            7 with 1GB > size >= 1MB
            44 with size < 1MB

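A size breakdown like the one above can be produced with a short script. The sketch below is not the tool used for this ticket. It assumes the MDT's backing filesystem is mounted so that update_log_dir is visible as an ordinary directory, it treats GB and MB as binary units, and it places a file of exactly 10GB in the top bucket. The names `bucket_for` and `summarize` are illustrative.

```python
import os
from collections import Counter

MIB = 1024 ** 2
GIB = 1024 ** 3


def bucket_for(size):
    """Return the label of the size bucket that `size` (in bytes) falls into."""
    if size >= 10 * GIB:
        return ">= 10GB"
    if size >= GIB:
        return "1GB to 10GB"
    if size >= MIB:
        return "1MB to 1GB"
    return "< 1MB"


def summarize(directory):
    """Count regular files in `directory` per bucket and sum their sizes.

    Uses st_size (apparent size); allocated space can differ on
    filesystems with compression or sparse files.
    """
    counts = Counter()
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            size = entry.stat(follow_symlinks=False).st_size
            counts[bucket_for(size)] += 1
            total += size
    return counts, total
```

Calling `summarize()` on the directory returns the per-bucket counts and the total number of bytes, which can be compared against the space reported as used by the pool.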

            People

              laisiyao Lai Siyao
              ofaaland Olaf Faaland