Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
3
-
5129
Description
Some examples, taken for convenience from LU-1153 logs but which are very frequent on T100 site :
=================================================================================================
Feb 28 09:37:28 compute-01-32 kernel: LustreError: 2567:0:(osc_request.c:797:osc_announce_cached()) dirty 2028 - 2028 > system dirty_max 647168 Feb 28 09:37:40 compute-01-32 kernel: LustreError: 2568:0:(osc_request.c:797:osc_announce_cached()) dirty 2040 - 2040 > system dirty_max 647168 Feb 28 09:37:46 compute-01-32 kernel: LustreError: 2565:0:(osc_request.c:797:osc_announce_cached()) dirty 1954 - 1954 > system dirty_max 647168 Feb 28 09:37:47 compute-01-32 kernel: LustreError: 2561:0:(osc_request.c:797:osc_announce_cached()) dirty 1975 - 1976 > system dirty_max 647168 Feb 28 09:37:50 compute-01-32 kernel: LustreError: 2566:0:(osc_request.c:797:osc_announce_cached()) dirty 2001 - 2002 > system dirty_max 647168 Feb 28 09:38:09 compute-01-32 kernel: LustreError: 2567:0:(osc_request.c:797:osc_announce_cached()) dirty 1946 - 1946 > system dirty_max 647168 Feb 28 09:39:20 compute-01-32 kernel: LustreError: 2615:0:(osc_request.c:797:osc_announce_cached()) dirty 2017 - 2017 > system dirty_max 647168 Feb 28 09:39:47 compute-01-32 kernel: LustreError: 2615:0:(osc_request.c:797:osc_announce_cached()) dirty 1847 - 1848 > system dirty_max 647168
=================================================================================================
this comes from the following piece of code :
=============================================
. } else if (cfs_atomic_read(&obd_dirty_pages) - cfs_atomic_read(&obd_dirty_transit_pages) > obd_max_dirty_pages + 1){ /* The cfs_atomic_read() allowing the cfs_atomic_inc() are * not covered by a lock thus they may safely race and trip * this CERROR() unless we add in a small fudge factor (+1). */ CERROR("dirty %d - %d > system dirty_max %d\n", cfs_atomic_read(&obd_dirty_pages), cfs_atomic_read(&obd_dirty_transit_pages), obd_max_dirty_pages);
=============================================
where both obd_dirty_pages/obd_dirty_transit_pages are "likely to change" signed ints and obd_max_dirty_pages unsigned which an easily explain the wrong error msgs ...
This is still true in current master and, if I am right and there is no tricky stuff I may have missed there ..., the following patch should fix :
==================================================================================================================================================
diff --git a/lustre/osc/osc_request.c b/lustre/osc/osc_request.c index af60e4b..5f75329 100644 --- a/lustre/osc/osc_request.c +++ b/lustre/osc/osc_request.c @@ -788,6 +788,7 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa, long writing_bytes) { obd_flag bits = OBD_MD_FLBLOCKS|OBD_MD_FLGRANT; + int diff; LASSERT(!(oa->o_valid & bits)); @@ -798,9 +799,9 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa, CERROR("dirty %lu - %lu > dirty_max %lu\n", cli->cl_dirty, cli->cl_dirty_transit, cli->cl_dirty_max); oa->o_undirty = 0; - } else if (cfs_atomic_read(&obd_dirty_pages) - - cfs_atomic_read(&obd_dirty_transit_pages) > - obd_max_dirty_pages + 1){ + } else if (((diff = (cfs_atomic_read(&obd_dirty_pages) - + cfs_atomic_read(&obd_dirty_transit_pages))) > + obd_max_dirty_pages + 1) && (diff > 0)){ /* The cfs_atomic_read() allowing the cfs_atomic_inc() are * not covered by a lock thus they may safely race and trip * this CERROR() unless we add in a small fudge factor (+1). */
==================================================================================================================================================