[LU-5812] osc_request.c:853:osc_announce_cached() dirty 129051 + 1573 - 0 > system dirty_max 130608 Created: 27/Oct/14 Updated: 03/Jun/21 Resolved: 29/Nov/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Jian Yu |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
IBM BG/Q system vulcan |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16298 |
| Description |
|
Intermittent occurrences of the following in console logs:
2014-10-25 04:32:08.953847 {RMP22Oc150844422} [mmcs]{0}.7.0: LustreError: 3262:0:(osc_request.c:853:osc_announce_cached()) fsv-OST0045-osc-c0000003e09df1c0: dirty 129051 + 1573 - 0 > system dirty_max 130608
2014-10-25 04:32:08.954403 {RMP22Oc150844422} [mmcs]{0}.7.0: LustreError: 3262:0:(osc_request.c:853:osc_announce_cached()) Skipped 152 previous similar messages
|
| Comments |
| Comment by Olaf Faaland [ 27/Oct/14 ] |
|
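The error is initially reported 33 times (not counting the "n previous similar messages" lines). In those 33 cases, 345 <= obd_dirty_pages <= 5787. The message is only printed when the summed page counts exceed the system dirty_max of 130608, so obd_unstable_pages must be high (>= 124,821, i.e. 130608 - 5787). |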
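The bound quoted above can be reproduced with a few lines of C. This is only a sketch of the arithmetic: `min_unstable_pages()` is a hypothetical helper, not a Lustre function, and it ignores the small fudge factor in the real check, so the result is a conservative lower bound.

```c
#include <assert.h>

/* Hypothetical helper (not part of Lustre): a conservative lower bound on
 * obd_unstable_pages implied by the error, given the other page counters.
 * The error requires unstable + dirty - transit to exceed dirty_max, so
 * unstable must be at least dirty_max - dirty + transit. */
static long min_unstable_pages(long dirty, long transit, long dirty_max)
{
    assert(dirty >= 0 && transit >= 0 && dirty_max >= 0);
    return dirty_max - dirty + transit;
}
```

Calling `min_unstable_pages(5787, 0, 130608)` gives the 124,821 figure from the comment, assuming no pages in transit as in the sample log line. With the smallest observed dirty count, 345, the bound is higher still.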
| Comment by Olaf Faaland [ 27/Oct/14 ] |
|
console logs from I/O nodes |
| Comment by Peter Jones [ 28/Oct/14 ] |
|
Yu, Jian: Could you please advise on this issue? Thanks, Peter |
| Comment by Jian Yu [ 28/Oct/14 ] |
|
Hi Olaf, Could you please check whether the client build contained patch http://review.whamcloud.com/10937 or not? Thanks. |
| Comment by Christopher Morrone [ 28/Oct/14 ] |
|
It does not. |
| Comment by Jian Yu [ 04/Nov/14 ] |
|
Thank you Chris for the info.

The "dirty 129051 + 1573 - 0 > system dirty_max 130608" error message was printed by the following code in 2.5.3-0.13morrone ( https://github.com/chaos/lustre ):

    static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
                                    long writing_bytes)
    {
            // ...
            CERROR("%s: dirty %d + %d - %d > system dirty_max %d\n",
                   cli->cl_import->imp_obd->obd_name,
                   cfs_atomic_read(&obd_unstable_pages),
                   cfs_atomic_read(&obd_dirty_pages),
                   cfs_atomic_read(&obd_dirty_transit_pages),
                   obd_max_dirty_pages);
            // ...
    }

In Lustre 2.5.3, the code is:

    static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
                                    long writing_bytes)
    {
            // ...
            CERROR("dirty %d - %d > system dirty_max %d\n",
                   cfs_atomic_read(&obd_dirty_pages),
                   cfs_atomic_read(&obd_dirty_transit_pages),
                   obd_max_dirty_pages);
            // ...
    }

So in the logged message, 129051 is obd_unstable_pages and 1573 is obd_dirty_pages. The difference is that obd_unstable_pages is tracked and counted in 2.5.3-0.13morrone by the patch http://review.whamcloud.com/4245 for

If we do not want to revert the patch from 2.5.3-0.13morrone, we need to figure out why the number of unstable pages is large. |
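To see why only the patched tree reports the error for the logged values, the two conditions can be compared side by side. The sketch below is illustrative only: `struct dirty_counters` and both function names are hypothetical stand-ins for the global atomic counters, and it assumes both trees use the same `+ 1` fudge factor in the comparison.

```c
#include <assert.h>

/* Hypothetical snapshot of the global counters read by
 * osc_announce_cached(); the real code reads atomics. */
struct dirty_counters {
    long unstable;   /* obd_unstable_pages */
    long dirty;      /* obd_dirty_pages */
    long transit;    /* obd_dirty_transit_pages */
    long max_dirty;  /* obd_max_dirty_pages */
};

/* Condition with unstable pages counted (2.5.3-0.13morrone behaviour). */
static int exceeds_with_unstable(const struct dirty_counters *c)
{
    assert(c);
    return c->unstable + c->dirty - c->transit > c->max_dirty + 1;
}

/* Condition without unstable pages (assumed stock 2.5.3 behaviour). */
static int exceeds_without_unstable(const struct dirty_counters *c)
{
    assert(c);
    return c->dirty - c->transit > c->max_dirty + 1;
}
```

With the values from the log line (129051 unstable, 1573 dirty, 0 in transit, limit 130608), the first condition holds and the second does not, which is consistent with the message appearing only where unstable pages are counted.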
| Comment by Jian Yu [ 06/Nov/14 ] |
|
On master branch, patch http://review.whamcloud.com/4245/ for I'll back-port those patches to Lustre b2_5 branch. |
| Comment by Jian Yu [ 07/Nov/14 ] |
|
Here are the back-ported patches for Lustre b2_5 branch:
With the above patches, the unstable pages tracking and counting issues were fixed. |
| Comment by Jian Yu [ 21/Nov/14 ] |
|
The first 5 patches are ready to land. The sixth patch needs to be re-implemented on Lustre b2_5 branch. |
| Comment by Jian Yu [ 01/Dec/14 ] |
|
The sixth patch from http://review.whamcloud.com/10003 heavily depends on the patches for

Hi Jinshan,
With the first 5 patches applied on the Lustre b2_5 branch, may I just make the following change for the sixth one, so as to quiet the error message in this ticket?

diff --git a/lustre/osc/osc_request.c b/lustre/osc/osc_request.c
index 1c42033..dcfc660 100644
--- a/lustre/osc/osc_request.c
+++ b/lustre/osc/osc_request.c
@@ -839,16 +839,14 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
                        cli->cl_dirty_pages, cli->cl_dirty_transit,
                        cli->cl_dirty_max_pages);
                 oa->o_undirty = 0;
-        } else if (unlikely(cfs_atomic_read(&obd_unstable_pages) +
-                            cfs_atomic_read(&obd_dirty_pages) -
+        } else if (unlikely(cfs_atomic_read(&obd_dirty_pages) -
                             cfs_atomic_read(&obd_dirty_transit_pages) >
                             (long)(obd_max_dirty_pages + 1))) {
                 /* The cfs_atomic_read() allowing the cfs_atomic_inc() are
                  * not covered by a lock thus they may safely race and trip
                  * this CERROR() unless we add in a small fudge factor (+1). */
-                CERROR("%s: dirty %d + %d - %d > system dirty_max %d\n",
+                CERROR("%s: dirty %d - %d > system dirty_max %d\n",
                        cli->cl_import->imp_obd->obd_name,
-                       cfs_atomic_read(&obd_unstable_pages),
                        cfs_atomic_read(&obd_dirty_pages),
                        cfs_atomic_read(&obd_dirty_transit_pages),
                        obd_max_dirty_pages);
|
| Comment by Jinshan Xiong (Inactive) [ 11/May/15 ] |
|
Yes, that looks good. |
| Comment by Jian Yu [ 11/May/15 ] |
|
Thank you, Jinshan. The sixth patch http://review.whamcloud.com/12615 contains the above change. |
| Comment by D. Marc Stearman (Inactive) [ 19/Feb/16 ] |
|
Olaf, is this in our local releases? |
| Comment by Olaf Faaland [ 22/Feb/16 ] |
|
Marc, Chris,
It appears to me that some of the patches are not in our lustre 2.5.5-3chaos stack. Details:

http://review.whamcloud.com/12604 partial
* 9722ebf LU-2139 osc: Track and limit "unstable" pages
http://review.whamcloud.com/12605 full
* 666430c LU-2139 osc: Track number of "unstable" pages per osc
http://review.whamcloud.com/12606 full
* 534ef35 LU-2139 osc: Use SOFT_SYNC to urge server commit
http://review.whamcloud.com/12612 full
* 003f186 LU-2139 ofd: Do async commit if SOFT_SYNC is seen
http://review.whamcloud.com/12613 not found
http://review.whamcloud.com/12615 not found

where "partial" means some of the files changed in the patch shown on Gerrit were not changed by the patch in the 2.5.5-3chaos stack.
-Olaf |
| Comment by Christopher Morrone [ 22/Feb/16 ] |
|
If the first one is not exact, it is likely because that is not the most recent version of the patch. You have no way to know that from the information presented online, unfortunately. |
| Comment by Olaf Faaland [ 29/Nov/17 ] |
|
Although this was never resolved, the impact of this issue is low and we are moving our clients from Lustre 2.5 to 2.8. Closing. |
| Comment by Homer Li (Inactive) [ 03/Jun/21 ] |
|
Lustre 2.12.6 has the same issue. |