[LU-4533] rpc_stats histogram does not support max_rpcs_in_flight greater than 31

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.5.0
    • 3
    • 12395

    Description

      The "rpcs in flight" histogram displayed by reading the proc file /proc/fs/lustre/osc/*/rpc_stats does not show values higher than 31. When max_rpcs_in_flight is set to a value greater than 31, we should see rows for "rpcs in flight" values greater than 31. Instead, all RPCs sent when there are 31 or more RPCs in flight are counted in the last bucket of the histogram (the row labeled 31), as the output below shows:

                              read                    write
      rpcs in flight        rpcs   % cum % |       rpcs   % cum %
      0:                       0   0   0   |          0   0   0
      1:                     504   5   5   |        621  30  30
      2:                     330   3   8   |        405  20  51
      3:                     337   3  12   |          1   0  51
      4:                     349   3  16   |          1   0  51
      5:                     338   3  19   |          1   0  51
      6:                     325   3  23   |          1   0  51
      7:                     327   3  26   |          1   0  51
      8:                     324   3  30   |          1   0  51
      9:                     307   3  33   |          1   0  51
      10:                    306   3  36   |          1   0  51
      11:                    306   3  40   |          1   0  51
      12:                    301   3  43   |          1   0  51
      13:                    291   3  46   |          1   0  51
      14:                    283   3  49   |          1   0  51
      15:                    278   2  52   |          2   0  51
      16:                    276   2  55   |          1   0  51
      17:                    270   2  58   |          1   0  51
      18:                    270   2  61   |          1   0  52
      19:                    266   2  63   |          1   0  52
      20:                    265   2  66   |          1   0  52
      21:                    262   2  69   |          2   0  52
      22:                    263   2  72   |          4   0  52
      23:                    262   2  75   |          3   0  52
      24:                    263   2  77   |          1   0  52
      25:                    262   2  80   |          2   0  52
      26:                    261   2  83   |          1   0  52
      27:                    260   2  86   |          1   0  52
      28:                    259   2  89   |          3   0  52
      29:                    256   2  91   |          2   0  53
      30:                    256   2  94   |          1   0  53
      31:                    512   5 100   |        939  46 100
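
      The pile-up in row 31 is what one would expect if the tally routine clamps out-of-range values into the last bucket instead of dropping them. Below is a minimal userspace sketch of that behaviour, assuming such clamping; hist_sketch and hist_sketch_tally are hypothetical names for illustration, not Lustre's API, and the kernel structure's spinlock is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the macro quoted in the description. */
#define OBD_HIST_MAX 32

/* Userspace stand-in (hypothetical); the kernel struct also has oh_lock. */
struct hist_sketch {
	unsigned long buckets[OBD_HIST_MAX];
};

/*
 * Hypothetical tally helper: any value at or beyond OBD_HIST_MAX is
 * clamped into the last bucket (index OBD_HIST_MAX - 1, i.e. row 31).
 */
static void hist_sketch_tally(struct hist_sketch *h, unsigned int value)
{
	assert(h != NULL);
	if (value >= OBD_HIST_MAX)
		value = OBD_HIST_MAX - 1;
	h->buckets[value]++;
}
```

      With this logic, tallying 31, 40 and 64 all increments the same counter, which would explain why row 31 carries far more RPCs than its neighbours.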
      

      According to the current version of the Lustre manual, the valid range for max_rpcs_in_flight is between 1 and 256. Those values should be supported by this histogram. The maximum value for max_rpcs_in_flight is determined by the value of this preprocessor macro:

      #define OSC_MAX_RIF_MAX         256
      

      The size of the obd_histogram struct is determined by a preprocessor macro as well:

      /* if we find more consumers this could be generalized */
      #define OBD_HIST_MAX 32
      struct obd_histogram {
              spinlock_t      oh_lock;
              unsigned long   oh_buckets[OBD_HIST_MAX];
      };
      

      It looks like the histogram for recording the number of rpcs in flight has the greatest space requirements, so it would be a sufficient fix to define OBD_HIST_MAX as OSC_MAX_RIF_MAX. However, this would grow the oh_buckets array in every obd_histogram from 32 to 256 entries, making each structure about 8 times larger. I'm not sure yet whether this would be a significant increase.
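
      To put a rough number on the size increase, here is a small userspace sketch. It assumes an int as a stand-in for spinlock_t (whose real size depends on the kernel configuration); hist_32, hist_256, lock_stand_in_t and bucket_bytes are hypothetical names used only for this comparison.

```c
#include <assert.h>
#include <stddef.h>

#define OSC_MAX_RIF_MAX 256
#define OBD_HIST_MAX    32

/* Assumption: stand-in for spinlock_t, whose size varies by kernel config. */
typedef int lock_stand_in_t;

/* Layout with the current bucket count. */
struct hist_32 {
	lock_stand_in_t oh_lock;
	unsigned long   oh_buckets[OBD_HIST_MAX];
};

/* Layout if OBD_HIST_MAX were raised to OSC_MAX_RIF_MAX. */
struct hist_256 {
	lock_stand_in_t oh_lock;
	unsigned long   oh_buckets[OSC_MAX_RIF_MAX];
};

/* Bytes used by the bucket array alone for a given bucket count. */
static size_t bucket_bytes(size_t nbuckets)
{
	assert(nbuckets > 0);
	return nbuckets * sizeof(unsigned long);
}
```

      On a platform where unsigned long is 8 bytes, bucket_bytes(OBD_HIST_MAX) is 256 and bucket_bytes(OSC_MAX_RIF_MAX) is 2048, so each histogram would grow by 1792 bytes. Whether that matters depends on how many obd_histogram instances exist per device, which this sketch does not measure.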

      Another option would be to generalize the obd_histogram struct to use a flexible array for oh_buckets, but this would require a lot more work, and all obd_histogram structures would need to be dynamically allocated.
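
      For comparison, the flexible-array option might look roughly like the following userspace sketch. It is an illustration only: dyn_hist, dyn_hist_alloc and dyn_hist_tally are hypothetical names, calloc stands in for the kernel allocator, and the spinlock is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical variant of obd_histogram whose bucket count is chosen
 * at allocation time rather than fixed by OBD_HIST_MAX.
 */
struct dyn_hist {
	unsigned int  dh_nbuckets;
	unsigned long dh_buckets[];	/* C99 flexible array member */
};

/* Allocate a zeroed histogram with nbuckets buckets; NULL on failure. */
static struct dyn_hist *dyn_hist_alloc(unsigned int nbuckets)
{
	struct dyn_hist *h;

	assert(nbuckets > 0);
	h = calloc(1, sizeof(*h) + nbuckets * sizeof(h->dh_buckets[0]));
	if (h != NULL)
		h->dh_nbuckets = nbuckets;
	return h;
}

/* Count one sample, clamping out-of-range values into the last bucket. */
static void dyn_hist_tally(struct dyn_hist *h, unsigned int value)
{
	if (value >= h->dh_nbuckets)
		value = h->dh_nbuckets - 1;
	h->dh_buckets[value]++;
}
```

      A caller tracking rpcs in flight could size the histogram from the configured limit, for example dyn_hist_alloc(max_rpcs_in_flight + 1), and release it with free(). The cost noted above still applies: every user of the structure would have to switch from an embedded struct to a pointer with explicit allocation and cleanup.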

          Activity

            adilger Andreas Dilger added a comment -

            Attached my old_prototype_dynamic_histogram.patch. It only allocates buckets on demand as they are used. It is from the time of LU-7990, so about Lustre 2.9.

            cfaber#1 Colin Faber [X] (Inactive) added a comment -

            Has there been any further progress here?
            hornc Chris Horn added a comment -

            I am re-upping this patch since we never saw the dynamic obd_histogram allocation that was the reason we abandoned this change originally.

            gerrit Gerrit Updater added a comment -

            Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/31236
            Subject: LU-4533 lprocfs: Increase OBD_HIST_MAX to OSC_MAX_RIF_MAX
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 87f1cfd9f4fb9ed200c3bb14e65f5993faf82e0c
            haasken Ryan Haasken added a comment -

            I have abandoned this patch:

            http://review.whamcloud.com/#/c/9930/

            Andreas Dilger says he has a patch in the works to do dynamic allocation of the obd_histogram buckets. That will be a much better long-term solution.

            haasken Ryan Haasken added a comment - - edited

            Here is a patch which just sets OBD_HIST_MAX to 256 to match the value of OSC_MAX_RIF_MAX. This may not be the best solution because of the extra space used by every obd_histogram, but it fixes this bug.

            http://review.whamcloud.com/#/c/9930/

            Here is an example of the output from reading /proc/fs/lustre/osc/*/rpc_stats when max_rpcs_in_flight=64:

                              read                    write
            rpcs in flight        rpcs   % cum % |       rpcs   % cum %
            0:                       0   0   0   |          0   0   0
            1:                       2 100 100   |       4592  25  25
            2:                       0   0 100   |       3216  17  43
            3:                       0   0 100   |       2390  13  56
            4:                       0   0 100   |       1966  10  67
            5:                       0   0 100   |       1663   9  76
            6:                       0   0 100   |       1292   7  84
            7:                       0   0 100   |        907   5  89
            8:                       0   0 100   |        592   3  92
            9:                       0   0 100   |        414   2  94
            10:                      0   0 100   |        254   1  96
            11:                      0   0 100   |        155   0  96
            12:                      0   0 100   |        107   0  97
            13:                      0   0 100   |         78   0  97
            14:                      0   0 100   |         56   0  98
            15:                      0   0 100   |         38   0  98
            16:                      0   0 100   |         32   0  98
            17:                      0   0 100   |         28   0  98
            18:                      0   0 100   |         23   0  98
            19:                      0   0 100   |         22   0  99
            20:                      0   0 100   |         22   0  99
            21:                      0   0 100   |         17   0  99
            22:                      0   0 100   |         14   0  99
            23:                      0   0 100   |         11   0  99
            24:                      0   0 100   |         12   0  99
            25:                      0   0 100   |         11   0  99
            26:                      0   0 100   |         10   0  99
            27:                      0   0 100   |         14   0  99
            28:                      0   0 100   |         16   0  99
            29:                      0   0 100   |          9   0  99
            30:                      0   0 100   |          7   0  99
            31:                      0   0 100   |          5   0  99
            32:                      0   0 100   |          7   0  99
            33:                      0   0 100   |          4   0  99
            34:                      0   0 100   |          3   0  99
            35:                      0   0 100   |          2   0  99
            36:                      0   0 100   |          2   0  99
            37:                      0   0 100   |          1   0  99
            38:                      0   0 100   |          1   0  99
            39:                      0   0 100   |          1   0 100
            

            As you can see, now the histogram goes past 31 and all values of rpcs_in_flight which occurred are reported properly in the histogram.

            People

              wc-triage WC Triage
              haasken Ryan Haasken