[LU-8794] update_log_dir consuming 1.1TB on MDT0000 Created: 02/Nov/16 Updated: 29/Nov/17 Resolved: 29/Nov/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
Lustre: Build Version: 2.8.0_5.chaos |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
On zinc, a DNE filesystem with 16 MDTs, the pool containing MDT0000 (zinc1) ran out of space. Upon inspection, we found that 1.1 TB is occupied by files contained in update_log_dir. The rest of the MDT occupies about 300 MB, which is about the same as the space used by each of the other 15 MDTs. |
| Comments |
| Comment by Olaf Faaland [ 02/Nov/16 ] |
|
There are 158 files in update_log_dir. |
| Comment by Joseph Gmitter (Inactive) [ 03/Nov/16 ] |
|
Hi Lai, Can you please take a look at this issue? Thanks. |
| Comment by Olaf Faaland [ 03/Nov/16 ] |
|
Unfortunately I cannot be certain of the filesystem activity that caused this. We were not monitoring the space usage in the pool (although we are now). I also cannot provide debug logs from the MDTs, as we discovered the problem after a reboot of the servers. The only information available is syslog output for the servers and the contents of the MDT itself. Di Wang suggested I can delete the contents of update_log_dir. Let me know if you need any information about its contents before I do that. |
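For anyone wanting to record what update_log_dir holds before deleting it, the following is a minimal sketch, not something used in this ticket. It assumes the MDT's backing filesystem is mounted locally so that update_log_dir is reachable as an ordinary directory; the function name `summarize_dir` and the example path are hypothetical. It reports apparent file sizes (`st_size`), which may differ from the space actually allocated in the pool.

```python
import os


def summarize_dir(path):
    """Return (file_count, total_bytes, entries) for regular files directly under path.

    entries is a list of (name, size_bytes) tuples sorted largest first.
    Sizes are apparent sizes (st_size), not allocated blocks.
    """
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            # Count only regular files; do not follow symlinks or descend into subdirectories.
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                entries.append((entry.name, size))
    entries.sort(key=lambda item: item[1], reverse=True)
    total_bytes = sum(size for _, size in entries)
    return len(entries), total_bytes, entries
```

Calling `summarize_dir("/mnt/mdt0/update_log_dir")` (a hypothetical mount point) and printing the first few entries would show which log files account for most of the space.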
| Comment by Olaf Faaland [ 04/Nov/16 ] |
|
Note that this ticket is purely for trying to figure out why the update logs are occupying so much space. There is a separate ticket, If the contents of the MDT won't help us learn what happened, we can just close the ticket until it happens again and we can get better information. |
| Comment by Di Wang [ 04/Nov/16 ] |
|
http://review.whamcloud.com/18028 ( |
| Comment by Olaf Faaland [ 29/Nov/17 ] |
|
I was unable to reproduce the problem after it was initially encountered, and we have not seen it since on test or production systems, perhaps because we have not been testing DNE2 and use very few remote directories. Closing. |