Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4711

Console rate limiting logic sets cdls_delay smaller than the minimum in libcfs_debug_vmsg2

Details

    • 3
    • 12950

    Description

      The console message rate limiting logic which is used by CDEBUG_LIMIT, and therefore by CWARN, CERROR, CNETERR, and CEMERG, has a bug which causes some messages to be printed when they should be skipped.

      For example, here are three messages from the same function on one node:

      2013-05-24T22:37:30.674942-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043]  req@ffff880fef070c00 x1435963081072428/t0(0) o400->snx11014-OST0006-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      2013-05-24T22:37:30.683792-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      2013-05-24T22:37:30.800884-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043]  req@ffff880fef070800 x1435963081072424/t0(0) o400->snx11014-OST0005-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      These messages are all from the same function and the same thread. The first message is printed at 22:37.30. At the same time, it prints "Skipped %d previous similar messages." That is correct. What is not correct is that the next ptlrpc_expire_one_request message is printed at 22:37.30. This message should have been skipped by the rate limiting logic.

      I have identified the problem in the code, and I have tested a fix. The problem is with this code:

                      if (cfs_time_after(cfs_time_current(), cdls->cdls_next +
                                                             libcfs_console_max_delay
                                                             + cfs_time_seconds(10))) {
                              /* last timeout was a long time ago */
                              cdls->cdls_delay /= libcfs_console_backoff * 4;
                      } else {
                              cdls->cdls_delay *= libcfs_console_backoff;
      
                              if (cdls->cdls_delay < libcfs_console_min_delay)
                                      cdls->cdls_delay = libcfs_console_min_delay;
                              else if (cdls->cdls_delay > libcfs_console_max_delay)
                                      cdls->cdls_delay = libcfs_console_max_delay;
                      }
      

      The code which clamps cdls_delay in between the libcfs_console_min_delay and libcfs_console_max_delay should be moved outside of the else statement. Otherwise, the cdls_delay can be set lower than the libcfs_console_min_delay, causing the next message to be printed sooner than it should.

      I have also noticed that this bug causes the first two messages through libcfs_debug_vmsg2 for a given cdls struct to be printed, no matter how close they are in time.

      I will upload my patch for this issue.

      Attachments

        Activity

          [LU-4711] Console rate limiting logic sets cdls_delay smaller than the minimum in libcfs_debug_vmsg2
          haasken Ryan Haasken added a comment -

          The patch has landed, and this ticket can be resolved.

          haasken Ryan Haasken added a comment - The patch has landed, and this ticket can be resolved.
          haasken Ryan Haasken added a comment -

          This is the patch that adds the test-console-limit proc interface, changes the libcfs_console_min/max_delay to seconds, and adds a printk statement to libcfs_debug_vmsg2.

          haasken Ryan Haasken added a comment - This is the patch that adds the test-console-limit proc interface, changes the libcfs_console_min/max_delay to seconds, and adds a printk statement to libcfs_debug_vmsg2.
          haasken Ryan Haasken added a comment -

          This is the bash test script which writes to the new /proc interface.

          haasken Ryan Haasken added a comment - This is the bash test script which writes to the new /proc interface.
          haasken Ryan Haasken added a comment -

          In order to test this patch, I added a readable /proc interface at /proc/fs/lustre/test-console-limit that just did a CERROR and then returned a string. (I could have added this CERROR to an existing /proc interface or to some other part of the code, but I wanted it to be quick and easy to execute that code.)

          I also changed the module parameters libcfs_console_min/max_delay to be set in seconds and converted to jiffies. Then I set them to 5 and 20 seconds, respectively. For debugging purposes, I also added a printk to libcfs_debug_vmsg2 that printed the value of cdls_delay if it was less than libcfs_console_min_delay.

          I executed a test script that read the proc interface in a way that would demonstrate the bug.

          Here are the relevant console messages from that test script:

          2014-03-04T13:25:36.828018-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:25:36.865375-06:00 c0-0c0s2n2 LustreError: 8624:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:25:36.865405-06:00 c0-0c0s2n2 LustreError: 8625:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:25:41.930256-06:00 c0-0c0s2n2 LustreError: 8635:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:25:41.930305-06:00 c0-0c0s2n2 LustreError: 8635:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 8 previous similar messages
          2014-03-04T13:26:21.963698-06:00 c0-0c0s2n2 cdls_delay=312 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:26:21.963749-06:00 c0-0c0s2n2 LustreError: 8638:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:26:56.977699-06:00 c0-0c0s2n2 cdls_delay=39 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:26:57.018967-06:00 c0-0c0s2n2 LustreError: 8641:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:27:32.035007-06:00 c0-0c0s2n2 cdls_delay=4 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:27:32.035057-06:00 c0-0c0s2n2 LustreError: 8644:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:28:07.043974-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:28:07.084192-06:00 c0-0c0s2n2 LustreError: 8646:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:28:42.100546-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:28:42.100595-06:00 c0-0c0s2n2 LustreError: 8649:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:29:17.134376-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250
          2014-03-04T13:29:17.134425-06:00 c0-0c0s2n2 LustreError: 8653:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T13:29:17.248881-06:00 c0-0c0s2n2 LustreError: 8655:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          

          As you can see, the first two messages occur almost at the same time, so the second one should have been skipped. Then we can see that if we wait a while between messages, the cdls_delay continues decreasing below libcfs_console_min_delay. This allows the last message to be printed when it should have been skipped.

          I then applied the fix from 9503 and re-ran the test script (and executed one extra "cat /proc/fs/lustre/test-console-limit" after ~6 minutes). Here are the relevant console messages from that run:

          2014-03-04T14:18:03.556798-06:00 c0-0c0s2n2 LustreError: 9119:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:18:08.601399-06:00 c0-0c0s2n2 LustreError: 9130:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:18:08.601442-06:00 c0-0c0s2n2 LustreError: 9130:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 9 previous similar messages
          2014-03-04T14:18:48.639191-06:00 c0-0c0s2n2 LustreError: 9134:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:19:23.667557-06:00 c0-0c0s2n2 LustreError: 9138:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:19:58.690327-06:00 c0-0c0s2n2 LustreError: 9140:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:20:33.717131-06:00 c0-0c0s2n2 LustreError: 9142:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:21:08.738812-06:00 c0-0c0s2n2 LustreError: 9146:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:21:43.764265-06:00 c0-0c0s2n2 LustreError: 9148:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:27:06.534620-06:00 c0-0c0s2n2 LustreError: 9156:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic.
          2014-03-04T14:27:06.534666-06:00 c0-0c0s2n2 LustreError: 9156:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 1 previous similar message
          

          As you can see, none of my CERROR messages were printed within libcfs_console_min_delay jiffies of each other.

          I have attached my patch which adds the proc interface, allows you to set libcfs_console_min/max_delay in seconds, and adds the extra printk. I have also attached my test script.

          haasken Ryan Haasken added a comment - In order to test this patch, I added a readable /proc interface at /proc/fs/lustre/test-console-limit that just did a CERROR and then returned a string. (I could have added this CERROR to an existing /proc interface or to some other part of the code, but I wanted it to be quick and easy to execute that code.) I also changed the module parameters libcfs_console_min/max_delay to be set in seconds and converted to jiffies. Then I set them to 5 and 20 seconds, respectively. For debugging purposes, I also added a printk to libcfs_debug_vmsg2 that printed the value of cdls_delay if it was less than libcfs_console_min_delay. I executed a test script that read the proc interface in a way that would demonstrate the bug. Here are the relevant console messages from that test script: 2014-03-04T13:25:36.828018-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:25:36.865375-06:00 c0-0c0s2n2 LustreError: 8624:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:25:36.865405-06:00 c0-0c0s2n2 LustreError: 8625:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:25:41.930256-06:00 c0-0c0s2n2 LustreError: 8635:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:25:41.930305-06:00 c0-0c0s2n2 LustreError: 8635:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 8 previous similar messages 2014-03-04T13:26:21.963698-06:00 c0-0c0s2n2 cdls_delay=312 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:26:21.963749-06:00 c0-0c0s2n2 LustreError: 8638:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:26:56.977699-06:00 c0-0c0s2n2 cdls_delay=39 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:26:57.018967-06:00 c0-0c0s2n2 LustreError: 8641:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:27:32.035007-06:00 c0-0c0s2n2 cdls_delay=4 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:27:32.035057-06:00 c0-0c0s2n2 LustreError: 8644:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:28:07.043974-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:28:07.084192-06:00 c0-0c0s2n2 LustreError: 8646:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:28:42.100546-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:28:42.100595-06:00 c0-0c0s2n2 LustreError: 8649:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:29:17.134376-06:00 c0-0c0s2n2 cdls_delay=0 is less thanlibcfs_console_min_delay=1250 2014-03-04T13:29:17.134425-06:00 c0-0c0s2n2 LustreError: 8653:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T13:29:17.248881-06:00 c0-0c0s2n2 LustreError: 8655:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. As you can see, the first two messages occur almost at the same time, so the second one should have been skipped. Then we can see that if we wait a while between messages, the cdls_delay continues decreasing below libcfs_console_min_delay. This allows the last message to be printed when it should have been skipped. I then applied the fix from 9503 and re-ran the test script (and executed one extra "cat /proc/fs/lustre/test-console-limit" after ~6 minutes). Here are the relevant console messages from that run: 2014-03-04T14:18:03.556798-06:00 c0-0c0s2n2 LustreError: 9119:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:18:08.601399-06:00 c0-0c0s2n2 LustreError: 9130:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:18:08.601442-06:00 c0-0c0s2n2 LustreError: 9130:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 9 previous similar messages 2014-03-04T14:18:48.639191-06:00 c0-0c0s2n2 LustreError: 9134:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:19:23.667557-06:00 c0-0c0s2n2 LustreError: 9138:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:19:58.690327-06:00 c0-0c0s2n2 LustreError: 9140:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:20:33.717131-06:00 c0-0c0s2n2 LustreError: 9142:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:21:08.738812-06:00 c0-0c0s2n2 LustreError: 9146:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:21:43.764265-06:00 c0-0c0s2n2 LustreError: 9148:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:27:06.534620-06:00 c0-0c0s2n2 LustreError: 9156:0:(linux-module.c:255:obd_proc_read_console_limit()) haasken is testing the console rate limiting logic. 2014-03-04T14:27:06.534666-06:00 c0-0c0s2n2 LustreError: 9156:0:(linux-module.c:255:obd_proc_read_console_limit()) Skipped 1 previous similar message As you can see, none of my CERROR messages were printed within libcfs_console_min_delay jiffies of each other. I have attached my patch which adds the proc interface, allows you to set libcfs_console_min/max_delay in seconds, and adds the extra printk. I have also attached my test script.
          haasken Ryan Haasken added a comment -

          The patch has been uploaded: http://review.whamcloud.com/#/c/9503/

          haasken Ryan Haasken added a comment - The patch has been uploaded: http://review.whamcloud.com/#/c/9503/

          People

            cliffw Cliff White (Inactive)
            haasken Ryan Haasken
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: