Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
3
-
12950
Description
The console message rate limiting logic which is used by CDEBUG_LIMIT, and therefore by CWARN, CERROR, CNETERR, and CEMERG, has a bug which causes some messages to be printed when they should be skipped.
For example, here are three messages from the same function on one node:
2013-05-24T22:37:30.674942-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043] req@ffff880fef070c00 x1435963081072428/t0(0) o400->snx11014-OST0006-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2013-05-24T22:37:30.683792-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 11 previous similar messages 2013-05-24T22:37:30.800884-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043] req@ffff880fef070800 x1435963081072424/t0(0) o400->snx11014-OST0005-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
These messages are all from the same function and the same thread. The first message is printed at 22:37.30. At the same time, it prints "Skipped %d previous similar messages." That is correct. What is not correct is that the next ptlrpc_expire_one_request message is printed at 22:37.30. This message should have been skipped by the rate limiting logic.
I have identified the problem in the code, and I have tested a fix. The problem is with this code:
if (cfs_time_after(cfs_time_current(), cdls->cdls_next + libcfs_console_max_delay + cfs_time_seconds(10))) { /* last timeout was a long time ago */ cdls->cdls_delay /= libcfs_console_backoff * 4; } else { cdls->cdls_delay *= libcfs_console_backoff; if (cdls->cdls_delay < libcfs_console_min_delay) cdls->cdls_delay = libcfs_console_min_delay; else if (cdls->cdls_delay > libcfs_console_max_delay) cdls->cdls_delay = libcfs_console_max_delay; }
The code which clamps cdls_delay in between the libcfs_console_min_delay and libcfs_console_max_delay should be moved outside of the else statement. Otherwise, the cdls_delay can be set lower than the libcfs_console_min_delay, causing the next message to be printed sooner than it should.
I have also noticed that this bug causes the first two messages through libcfs_debug_vmsg2 for a given cdls struct to be printed, no matter how close they are in time.
I will upload my patch for this issue.