[LU-5812] osc_request.c:853:osc_announce_cached() dirty 129051 + 1573 - 0 > system dirty_max 130608 Created: 27/Oct/14  Updated: 03/Jun/21  Resolved: 29/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Jian Yu
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

IBM BG/Q system vulcan
Lustre client Build Version: 2.5.3-0.13morrone-0.13morrone--PRISTINE-2.6.32-431.1.1.bgq.3blueos.V1R2M2.bl2.2_4.ppc64
Lustre server Build Version: 2.4.2-17chaos-17chaos--PRISTINE-2.6.32-431.29.2.1chaos.ch5.2.x86_64


Attachments: File R00-ID-J00.log.lustre     File R00-ID-J01.log.lustre     File R00-ID-J02.log.lustre     File R00-ID-J03.log.lustre    
Severity: 3
Rank (Obsolete): 16298

 Description   

Intermittent occurrences of the following in console logs:

2014-10-25 04:32:08.953847 {RMP22Oc150844422} [mmcs]{0}.7.0: LustreError: 3262:0:(osc_request.c:853:osc_announce_cached()) fsv-OST0045-osc-c0000003e09df1c0: dirty 129051 + 1573 - 0 > system dirty_max 130608
2014-10-25 04:32:08.954403 {RMP22Oc150844422} [mmcs]{0}.7.0: LustreError: 3262:0:(osc_request.c:853:osc_announce_cached()) Skipped 152 previous similar messages


 Comments   
Comment by Olaf Faaland [ 27/Oct/14 ]

The error is initially reported 33 times (not counting the "n previous similar messages"). In those 33 cases:

345 <= obd_dirty_pages <= 5787
obd_dirty_transit_pages = 0

So obd_unstable_pages must be high (>= 124,821).
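
For reference, that bound follows from the condition that prints the message (a sketch of the check in osc_announce_cached(), not a verbatim quote of the source):

        /* With obd_dirty_pages <= 5787, obd_dirty_transit_pages == 0 and
         * obd_max_dirty_pages == 130608, this can only be true when
         * obd_unstable_pages > 130608 - 5787 = 124,821. */
        if (cfs_atomic_read(&obd_unstable_pages) +
            cfs_atomic_read(&obd_dirty_pages) -
            cfs_atomic_read(&obd_dirty_transit_pages) > obd_max_dirty_pages)
                CERROR("%s: dirty %d + %d - %d > system dirty_max %d\n", ...);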

Comment by Olaf Faaland [ 27/Oct/14 ]

console logs from I/O nodes

Comment by Peter Jones [ 28/Oct/14 ]

Yu, Jian

Could you please advise on this issue?

Thanks

Peter

Comment by Jian Yu [ 28/Oct/14 ]

Hi Olaf,

Could you please check whether the client build contained patch http://review.whamcloud.com/10937 or not? Thanks.

Comment by Christopher Morrone [ 28/Oct/14 ]

It does not.

Comment by Jian Yu [ 04/Nov/14 ]

Thank you Chris for the info.

The "dirty 129051 + 1573 - 0 > system dirty_max 130608" error message was printed by the following codes on 2.5.3-0.13morrone ( https://github.com/chaos/lustre ):

static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
                                long writing_bytes)
{
        /* ... */
        CERROR("%s: dirty %d + %d - %d > system dirty_max %d\n",
               cli->cl_import->imp_obd->obd_name,
               cfs_atomic_read(&obd_unstable_pages),
               cfs_atomic_read(&obd_dirty_pages),
               cfs_atomic_read(&obd_dirty_transit_pages),
               obd_max_dirty_pages);
        /* ... */
}

On stock Lustre 2.5.3, the code is:

static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
                                long writing_bytes)
{
        /* ... */
        CERROR("dirty %d - %d > system dirty_max %d\n",
               cfs_atomic_read(&obd_dirty_pages),
               cfs_atomic_read(&obd_dirty_transit_pages),
               obd_max_dirty_pages);
        /* ... */
}

The difference is that obd_unstable_pages is tracked and counted on 2.5.3-0.13morrone by patch http://review.whamcloud.com/4245 for LU-2139, which was originally added to Lustre 2.5.3 but was ultimately reverted due to regressions (LU-3274 and LU-3277).
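
For background, "unstable" pages here are pages whose write RPC the OST has already replied to but whose transaction has not yet been committed to stable storage, so the client keeps them pinned for possible replay. A conceptual sketch of the accounting the LU-2139 patch adds (the helper names below are illustrative only, not the actual patch code):

        /* Conceptual sketch only; function names are hypothetical. */
        static void note_write_replied(int nr_pages)
        {
                /* write RPC replied, but its transno is not yet committed */
                cfs_atomic_add(nr_pages, &obd_unstable_pages);
        }

        static void note_transno_committed(int nr_pages)
        {
                /* server commit callback: pages are now stable and can be freed */
                cfs_atomic_sub(nr_pages, &obd_unstable_pages);
        }

When clients write heavily and the OSTs commit slowly, obd_unstable_pages can approach obd_max_dirty_pages, which is consistent with the high values Olaf reported above.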

If we do not want to revert the patch from 2.5.3-0.13morrone, we need to figure out why the number of unstable pages is so large.

Comment by Jian Yu [ 06/Nov/14 ]

On the master branch, patch http://review.whamcloud.com/4245/ for LU-2139 was reverted, and the revised version in http://review.whamcloud.com/6284 landed later.
After that, patch http://review.whamcloud.com/10003 for LU-4841 landed to revise the counting of unstable pages, which should resolve the issue in this ticket.

I'll back-port those patches to Lustre b2_5 branch.

Comment by Jian Yu [ 07/Nov/14 ]

Here are the back-ported patches for Lustre b2_5 branch:

  1. http://review.whamcloud.com/12604 (from http://review.whamcloud.com/6284)
  2. http://review.whamcloud.com/12605 (from http://review.whamcloud.com/4374)
  3. http://review.whamcloud.com/12606 (from http://review.whamcloud.com/4375)
  4. http://review.whamcloud.com/12612 (from http://review.whamcloud.com/5935)
  5. http://review.whamcloud.com/12613 (from http://review.whamcloud.com/8215)
  6. http://review.whamcloud.com/12615 (from http://review.whamcloud.com/10003)

With the above patches, the unstable pages tracking and counting issues were fixed.

Comment by Jian Yu [ 21/Nov/14 ]

The first 5 patches are ready to land. The sixth patch needs to be re-implemented on Lustre b2_5 branch.

Comment by Jian Yu [ 01/Dec/14 ]

The sixth patch, from http://review.whamcloud.com/10003, depends heavily on the patches for LU-3321, which are not likely to be back-ported to the Lustre b2_5 branch.

Hi Jinshan,

With the first 5 patches applied on the Lustre b2_5 branch, for the sixth one, may I just make the following change to quiet the error message in this ticket?

diff --git a/lustre/osc/osc_request.c b/lustre/osc/osc_request.c
index 1c42033..dcfc660 100644
--- a/lustre/osc/osc_request.c
+++ b/lustre/osc/osc_request.c
@@ -839,16 +839,14 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
                       cli->cl_dirty_pages, cli->cl_dirty_transit,
                       cli->cl_dirty_max_pages);
                oa->o_undirty = 0;
-       } else if (unlikely(cfs_atomic_read(&obd_unstable_pages) +
-                           cfs_atomic_read(&obd_dirty_pages) -
+       } else if (unlikely(cfs_atomic_read(&obd_dirty_pages) -
                            cfs_atomic_read(&obd_dirty_transit_pages) >
                            (long)(obd_max_dirty_pages + 1))) {
                /* The cfs_atomic_read() allowing the cfs_atomic_inc() are
                 * not covered by a lock thus they may safely race and trip
                 * this CERROR() unless we add in a small fudge factor (+1). */
-               CERROR("%s: dirty %d + %d - %d > system dirty_max %d\n",
+               CERROR("%s: dirty %d - %d > system dirty_max %d\n",
                       cli->cl_import->imp_obd->obd_name,
-                      cfs_atomic_read(&obd_unstable_pages),
                       cfs_atomic_read(&obd_dirty_pages),
                       cfs_atomic_read(&obd_dirty_transit_pages),
                       obd_max_dirty_pages);
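
With only this hunk applied, the check falls back to the stock 2.5.3 condition while keeping the OBD name in the message (a sketch of the resulting code path, not a verbatim quote of the patched file):

        } else if (unlikely(cfs_atomic_read(&obd_dirty_pages) -
                            cfs_atomic_read(&obd_dirty_transit_pages) >
                            (long)(obd_max_dirty_pages + 1))) {
                CERROR("%s: dirty %d - %d > system dirty_max %d\n",
                       cli->cl_import->imp_obd->obd_name,
                       cfs_atomic_read(&obd_dirty_pages),
                       cfs_atomic_read(&obd_dirty_transit_pages),
                       obd_max_dirty_pages);

In other words, obd_unstable_pages is dropped from both the comparison and the message, so an accumulation of unstable pages alone can no longer trip this CERROR.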

Comment by Jinshan Xiong (Inactive) [ 11/May/15 ]

Yes, that looks good.

Comment by Jian Yu [ 11/May/15 ]

Thank you, Jinshan. The sixth patch http://review.whamcloud.com/12615 contains the above change.

Comment by D. Marc Stearman (Inactive) [ 19/Feb/16 ]

Olaf, is this in our local releases?

Comment by Olaf Faaland [ 22/Feb/16 ]

Marc, Chris,

It appears to me that some of the patches are not in our lustre 2.5.5-3chaos stack. Details:

http://review.whamcloud.com/12604       partial         * 9722ebf LU-2139 osc: Track and limit "unstable" pages
http://review.whamcloud.com/12605       full            * 666430c LU-2139 osc: Track number of "unstable" pages per osc
http://review.whamcloud.com/12606       full            * 534ef35 LU-2139 osc: Use SOFT_SYNC to urge server commit
http://review.whamcloud.com/12612       full            * 003f186 LU-2139 ofd: Do async commit if SOFT_SYNC is seen
http://review.whamcloud.com/12613       not found
http://review.whamcloud.com/12615       not found

where "partial" means some of the files changed in the patch shown on gerritt were not changed by the patch in the 2.5.5-3chaos stack.

-Olaf

Comment by Christopher Morrone [ 22/Feb/16 ]

If the first one is not exact, it is likely because that is not the most recent version of the patch. Unfortunately, you have no way to know that from the information presented online.

Comment by Olaf Faaland [ 29/Nov/17 ]

Although this was never resolved, the impact of this issue is low and we are moving our clients from Lustre 2.5 to 2.8. Closing.

Comment by Homer Li (Inactive) [ 03/Jun/21 ]

Lustre 2.12.6 has the same issue.
