Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4711

Console rate limiting logic sets cdls_delay smaller than the minimum in libcfs_debug_vmsg2

    XMLWordPrintable

Details

    • 3
    • 12950

    Description

      The console message rate limiting logic which is used by CDEBUG_LIMIT, and therefore by CWARN, CERROR, CNETERR, and CEMERG, has a bug which causes some messages to be printed when they should be skipped.

      For example, here are three messages from the same function on one node:

      2013-05-24T22:37:30.674942-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043]  req@ffff880fef070c00 x1435963081072428/t0(0) o400->snx11014-OST0006-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      2013-05-24T22:37:30.683792-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      2013-05-24T22:37:30.800884-05:00 c2-0c0s7n3 Lustre: 7197:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369453043/real 1369453043]  req@ffff880fef070800 x1435963081072424/t0(0) o400->snx11014-OST0005-osc-ffff881039030c00@10.10.100.6@o2ib1003:28/4 lens 224/224 e 0 to 1 dl 1369453050 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      These messages are all from the same function and the same thread. The first message is printed at 22:37.30. At the same time, it prints "Skipped %d previous similar messages." That is correct. What is not correct is that the next ptlrpc_expire_one_request message is printed at 22:37.30. This message should have been skipped by the rate limiting logic.

      I have identified the problem in the code, and I have tested a fix. The problem is with this code:

                      if (cfs_time_after(cfs_time_current(), cdls->cdls_next +
                                                             libcfs_console_max_delay
                                                             + cfs_time_seconds(10))) {
                              /* last timeout was a long time ago */
                              cdls->cdls_delay /= libcfs_console_backoff * 4;
                      } else {
                              cdls->cdls_delay *= libcfs_console_backoff;
      
                              if (cdls->cdls_delay < libcfs_console_min_delay)
                                      cdls->cdls_delay = libcfs_console_min_delay;
                              else if (cdls->cdls_delay > libcfs_console_max_delay)
                                      cdls->cdls_delay = libcfs_console_max_delay;
                      }
      

      The code which clamps cdls_delay in between the libcfs_console_min_delay and libcfs_console_max_delay should be moved outside of the else statement. Otherwise, the cdls_delay can be set lower than the libcfs_console_min_delay, causing the next message to be printed sooner than it should.

      I have also noticed that this bug causes the first two messages through libcfs_debug_vmsg2 for a given cdls struct to be printed, no matter how close they are in time.

      I will upload my patch for this issue.

      Attachments

        Activity

          People

            cliffw Cliff White (Inactive)
            haasken Ryan Haasken
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: