[LU-8794] update_log_dir consuming 1.1TB on MDT0000 Created: 02/Nov/16  Updated: 29/Nov/17  Resolved: 29/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Lustre: Build Version: 2.8.0_5.chaos


Issue Links:
Related
is related to LU-8714 too many update logs during soak-test. Open
is related to LU-6838 update llog become too big before it ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On zinc, a DNE filesystem with 16 MDTs, the pool containing MDT0000 (zinc1) ran out of space. Upon inspection, we found that 1.1 TB is occupied by files contained in update_log_dir. The rest of the MDT occupies about 300MB, which is about the same as the space used by each of the other 15 MDTs.



 Comments   
Comment by Olaf Faaland [ 02/Nov/16 ]

There are 158 files in update_log_dir.
68 size > 10GB
29 10GB > size >= 1GB
7 1GB > size >= 1MB
44 size < 1MB
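
A breakdown like the one above can be reproduced with a short script. This is a minimal sketch, not the method actually used for this ticket: it assumes the directory is readable through a regular mount, that GB and MB mean powers of 1024, and that a file of exactly 10GB belongs in the top bucket. The function names `bucket_sizes` and `dir_file_sizes` are hypothetical.

```python
import os

# Assumption: binary units; the ticket does not say whether GB means 2**30 or 10**9.
MB = 1024 ** 2
GB = 1024 ** 3


def bucket_sizes(sizes):
    """Count file sizes (in bytes) in the four size buckets listed above."""
    counts = {
        "size >= 10GB": 0,
        "10GB > size >= 1GB": 0,
        "1GB > size >= 1MB": 0,
        "size < 1MB": 0,
    }
    for size in sizes:
        if size >= 10 * GB:
            counts["size >= 10GB"] += 1
        elif size >= GB:
            counts["10GB > size >= 1GB"] += 1
        elif size >= MB:
            counts["1GB > size >= 1MB"] += 1
        else:
            counts["size < 1MB"] += 1
    return counts


def dir_file_sizes(path):
    """Return the sizes of the regular files directly inside path (no recursion)."""
    return [
        entry.stat(follow_symlinks=False).st_size
        for entry in os.scandir(path)
        if entry.is_file(follow_symlinks=False)
    ]
```

Calling `bucket_sizes(dir_file_sizes(path))`, with `path` pointing at wherever update_log_dir is visible on the server, returns the four counts; `sum(dir_file_sizes(path))` gives the total space to compare against the 1.1 TB figure.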

Comment by Joseph Gmitter (Inactive) [ 03/Nov/16 ]

Hi Lai,

Can you please take a look at this issue?

Thanks.
Joe

Comment by Olaf Faaland [ 03/Nov/16 ]

Unfortunately I cannot be certain of the filesystem activity that caused this. We were not monitoring the space usage in the pool (although we are now).

I also cannot provide debug logs from the MDTs, as we discovered the problem after a reboot of the servers.

The only information available is syslog output for the servers and the contents of the MDT itself.

Di Wang suggested I can delete the contents of update_log_dir. Let me know if you need any information about its contents before I do that.

Comment by Olaf Faaland [ 04/Nov/16 ]

Note that this ticket is purely for trying to figure out why the update logs are occupying so much space. There is a separate ticket, LU-8787, for how to recover.

If the contents of the MDT won't help us learn what happened, we can just close the ticket until it happens again and we can get better information.
We have started monitoring space used in the pool containing the MDT, and will be more likely to notice if the volume of update logs increases.

Comment by Di Wang [ 04/Nov/16 ]

http://review.whamcloud.com/18028 (LU-6838) might help here, but as explained there, the plain log size limit is around 800M, so that probably cannot explain why the update log files reached 1T. Something is strange here. Anyway, I think the suggestion on LU-8714 is the way to go.

Comment by Olaf Faaland [ 29/Nov/17 ]

I was unable to reproduce the problem after it was initially encountered, and we have not seen it since on test or production systems, perhaps because we have not been testing DNE2 and use very few remote directories. Closing.
