[LU-8787] zpool containing MDT0000 out of space Created: 01/Nov/16 Updated: 02/Nov/17 Resolved: 02/Nov/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | nasf (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | llnl |
| Environment: | Lustre: Build Version: 2.8.0_5.chaos |
| Description |
|
On a DNE file system, MDT0000 ran out of space while one or more other MDTs were in recovery.

    2016-10-31 18:26:53 [20537.964631] Lustre: Skipped 1 previous similar message
    2016-10-31 18:26:58 [20542.793836] LustreError: 31561:0:(osd_handler.c:223:osd_trans_start()) lsh-MDT0000: failed to start transaction due to ENOSPC. Metadata overhead is underestimated or grant_ratio is too low.
    2016-10-31 18:26:58 [20542.815473] LustreError: 31561:0:(osd_handler.c:223:osd_trans_start()) Skipped 39 previous similar messages
    2016-10-31 18:26:58 [20542.827434] LustreError: 31561:0:(llog_cat.c:744:llog_cat_cancel_records()) lsh-OST0009-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -28
    2016-10-31 18:26:58 [20542.843771] LustreError: 31561:0:(osp_sync.c:1031:osp_sync_process_committed()) lsh-OST0009-osc-MDT0000: can't cancel record: -28

Obviously the first step is to increase the capacity of the pool. However, after that is done, is further action required? Should I run lfsck, or do anything else? |
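For the capacity step itself, a minimal sketch (the pool name lsh-mdt0 and the device paths are assumptions; substitute the actual pool and disk names):

    # Check how full the pool backing MDT0000 is
    zpool list lsh-mdt0
    zfs list -o name,used,avail -r lsh-mdt0

    # Grow the pool by adding another mirrored vdev (device names are placeholders)
    zpool add lsh-mdt0 mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB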
| Comments |
| Comment by Peter Jones [ 02/Nov/16 ] |
|
Fan Yong, could you please look into this issue? Thanks, Peter |
| Comment by Olaf Faaland [ 02/Nov/16 ] |
|
I find that update_log_dir is taking 1.1T out of the 1.4T available. Shall I create a separate ticket for that? It seems far too large to me, but maybe I'm wrong. |
| Comment by Olaf Faaland [ 02/Nov/16 ] |
|
There are 158 files in update_log_dir. |
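For anyone hitting this later: one way to confirm where the space went on a ZFS-backed MDT is to look at the dataset and its objects directly (a sketch; lsh-mdt0/mdt0 is an assumed pool/dataset name):

    # Dataset-level usage
    zfs list -o name,used,avail lsh-mdt0/mdt0

    # Per-object listing with sizes; a handful of very large objects here
    # would be consistent with oversized update llogs
    zdb -dd lsh-mdt0/mdt0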
| Comment by Olaf Faaland [ 02/Nov/16 ] |
|
Created a separate ticket for that. This ticket is only for the procedure to be followed when an MDT fills up, since that could happen in production and we need to know how to recover. Thanks, |
| Comment by nasf (Inactive) [ 03/Nov/16 ] |
|
According to the current DNE implementation, cross-MDT operations are recorded in detail as llogs under update_log_dir for recovery purposes. For most use cases the llog is append-only, so if there are many cross-MDT operations the llog will become huge. If you can describe the operations you ran before the out-of-space condition, that may help us judge the issue. And if there are any Lustre kernel debug logs from MDT0000, that would be even better. |
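To make "cross-MDT operation" concrete: in DNE, anything that touches more than one MDT goes through the update llog, for example creating a directory on a different MDT than its parent (a sketch; the mount point and MDT index are examples):

    # The parent lives on MDT0000; the new directory is placed on MDT0001,
    # so the create spans two MDTs and is recorded in the update llog
    lfs mkdir -i 1 /mnt/lsh/remote_dir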
| Comment by Di Wang [ 03/Nov/16 ] |
|
You can delete the update_log* files manually, as we did on |
| Comment by Olaf Faaland [ 04/Nov/16 ] |
|
Di, |
| Comment by Olaf Faaland [ 04/Nov/16 ] |
|
nasf, |
| Comment by nasf (Inactive) [ 04/Nov/16 ] |
Generally, the llog growing after the reboot means there was something to be recovered. But as long as your recovery completed successfully after the reboot, your namespace should be in a consistent state even though you removed the llogs, unless there was some inconsistency before the reboot (with a ZFS backend, that should be a very rare case).
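If you want to verify consistency anyway, a namespace LFSCK can be started from MDT0000 (a sketch; the device name lsh-MDT0000 is from this ticket, and the flags assume the Lustre 2.8 lctl):

    # Start a namespace LFSCK; -A extends it to all MDTs
    lctl lfsck_start -M lsh-MDT0000 -t namespace -A

    # Watch progress and final status
    lctl get_param -n mdd.lsh-MDT0000.lfsck_namespace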
| Comment by Di Wang [ 04/Nov/16 ] |
|
Olaf: |
| Comment by nasf (Inactive) [ 26/Nov/16 ] |
|
Olaf, |
| Comment by Olaf Faaland [ 02/Nov/17 ] |
|
The basic advice, that we should delete the update logs and then run lfsck, is a sufficient answer. This occurred during DNE2 testing with Lustre 2.8, which we have decided not to pursue any further; instead we will test DNE2 when we start testing Lustre 2.10.x. So we will test the advice only if we encounter the problem again, and in that case we will file a new ticket. |
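For the record, the sequence that advice amounts to, as a rough outline (mount points are placeholders, and the llog-removal step is backend-specific and not spelled out in this ticket):

    # 1. Stop the affected MDT
    umount /mnt/lustre/mdt0

    # 2. Remove the update_log / update_log_dir records, per Di Wang's advice
    #    (backend-specific; this ticket does not document the exact commands
    #    for a ZFS OSD)

    # 3. Restart the target and let recovery complete
    mount -t lustre lsh-mdt0/mdt0 /mnt/lustre/mdt0

    # 4. Verify namespace consistency
    lctl lfsck_start -M lsh-MDT0000 -t namespace -A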