[LU-4711] Console rate limiting logic sets cdls_delay smaller than the minimum in libcfs_debug_vmsg2 Created: 04/Mar/14 Updated: 20/May/14 Resolved: 12/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ryan Haasken | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12950 |
| Description |
|
The console message rate limiting logic which is used by CDEBUG_LIMIT, and therefore by CWARN, CERROR, CNETERR, and CEMERG, has a bug which causes some messages to be printed when they should be skipped. For example, here are three messages from the same function on one node: 2013-05-24T22:37:30.674942-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043] req@ffff880fef070c00 x1435963081072428/t0(0) o400->snx11014-OST0006-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2013-05-24T22:37:30.683792-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 11 previous similar messages 2013-05-24T22:37:30.800884-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043] req@ffff880fef070800 x1435963081072424/t0(0) o400->snx11014-OST0005-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 These messages are all from the same function and the same thread. The first message is printed at 22:37.30. At the same time, it prints "Skipped %d previous similar messages." That is correct. What is not correct is that the next ptlrpc_expire_one_request message is printed at 22:37.30. This message should have been skipped by the rate limiting logic. I have identified the problem in the code, and I have tested a fix. The problem is with this code: if (cfs_time_after(cfs_time_current(), cdls->cdls_next + libcfs_console_max_delay + cfs_time_seconds(10))) { /* last timeout was a long time ago */ cdls->cdls_delay /= libcfs_console_backoff * 4; } else { cdls->cdls_delay *= libcfs_console_backoff; if (cdls->cdls_delay < libcfs_console_min_delay) cdls->cdls_delay = libcfs_console_min_delay; else if (cdls->cdls_delay > libcfs_console_max_delay) cdls->cdls_delay = libcfs_console_max_delay; } The code which clamps cdls_delay in between the libcfs_console_min_delay and libcfs_console_max_delay should be moved outside of the else statement. Otherwise, the cdls_delay can be set lower than the libcfs_console_min_delay, causing the next message to be printed sooner than it should. I have also noticed that this bug causes the first two messages through libcfs_debug_vmsg2 for a given cdls struct to be printed, no matter how close they are in time. I will upload my patch for this issue. |
| Comments |
| Comment by Ryan Haasken [ 05/Mar/14 ] |
|
The patch has been uploaded: http://review.whamcloud.com/#/c/9503/ |
| Comment by Ryan Haasken [ 05/Mar/14 ] |
|
In order to test this patch, I added a readable /proc interface at /proc/fs/lustre/test-console-limit that just did a CERROR and then returned a string. (I could have added this CERROR to an existing /proc interface or to some other part of the code, but I wanted it to be quick and easy to execute that code.) I also changed the module parameters libcfs_console_min/max_delay to be set in seconds and converted to jiffies. Then I set them to 5 and 20 seconds, respectively. For debugging purposes, I also added a printk to libcfs_debug_vmsg2 that printed the value of cdls_delay if it was less than libcfs_console_min_delay. I executed a test script that read the proc interface in a way that would demonstrate the bug. Here are the relevant console messages from that test script: 2014-03-04T13:25:36.828018-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:25:36.865375-06:00 c0-0c0s2n2 LustreError: 8624:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:25:36.865405-06:00 c0-0c0s2n2 LustreError: 8625:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:25:41.930256-06:00 c0-0c0s2n2 LustreError: 8635:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:25:41.930305-06:00 c0-0c0s2n2 LustreError: 8635:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 8 previous similar messages 2014-03-04T13:26:21.963698-06:00 c0-0c0s2n2 cdls_delay=312 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:26:21.963749-06:00 c0-0c0s2n2 LustreError: 8638:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:26:56.977699-06:00 c0-0c0s2n2 cdls_delay=39 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:26:57.018967-06:00 c0-0c0s2n2 LustreError: 8641:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:27:32.035007-06:00 c0-0c0s2n2 cdls_delay=4 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:27:32.035057-06:00 c0-0c0s2n2 LustreError: 8644:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:28:07.043974-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:28:07.084192-06:00 c0-0c0s2n2 LustreError: 8646:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:28:42.100546-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:28:42.100595-06:00 c0-0c0s2n2 LustreError: 8649:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:29:17.134376-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:29:17.134425-06:00 c0-0c0s2n2 LustreError: 8653:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:29:17.248881-06:00 c0-0c0s2n2 LustreError: 8655:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. As you can see, the first two messages occur almost at the same time, so the second one should have been skipped. Then we can see that if we wait a while between messages, the cdls_delay continues decreasing below libcfs_console_min_delay. This allows the last message to be printed when it should have been skipped. I then applied the fix from 9503 and re-ran the test script (and executed one extra "cat /proc/fs/lustre/test-console-limit" after ~6 minutes). Here are the relevant console messages from that run: 2014-03-04T14:18:03.556798-06:00 c0-0c0s2n2 LustreError: 9119:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:18:08.601399-06:00 c0-0c0s2n2 LustreError: 9130:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:18:08.601442-06:00 c0-0c0s2n2 LustreError: 9130:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 9 previous similar messages 2014-03-04T14:18:48.639191-06:00 c0-0c0s2n2 LustreError: 9134:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:19:23.667557-06:00 c0-0c0s2n2 LustreError: 9138:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:19:58.690327-06:00 c0-0c0s2n2 LustreError: 9140:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:20:33.717131-06:00 c0-0c0s2n2 LustreError: 9142:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:21:08.738812-06:00 c0-0c0s2n2 LustreError: 9146:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:21:43.764265-06:00 c0-0c0s2n2 LustreError: 9148:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:27:06.534620-06:00 c0-0c0s2n2 LustreError: 9156:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:27:06.534666-06:00 c0-0c0s2n2 LustreError: 9156:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 1 previous similar message As you can see, none of my CERROR messages were printed within libcfs_console_min_delay jiffies of each other. I have attached my patch which adds the proc interface, allows you to set libcfs_console_min/max_delay in seconds, and adds the extra printk. I have also attached my test script. |
| Comment by Ryan Haasken [ 05/Mar/14 ] |
|
This is the bash test script which writes to the new /proc interface. |
| Comment by Ryan Haasken [ 05/Mar/14 ] |
|
This is the patch that adds the test-console-limit proc interface, changes the libcfs_console_min/max_delay to seconds, and adds a printk statement to libcfs_debug_vmsg2. |
| Comment by Ryan Haasken [ 21/Apr/14 ] |
|
The patch has landed, and this ticket can be resolved. |