
Restore missing proc information for LMT

Details

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 8169

    Description

      Lustre's proc seems to have had a number of regressions. LMT's ltop is no longer able to find many of the values it used to display.

      In particular, brw_stats from obdfilter is gone, and does not appear to have been replaced after the OSD work. At minimum, that was used by ltop to report the number of bulk rpcs handled.
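
      For context, a rough sketch (not ltop's actual code) of the kind of lookup involved; the OST name is only an example, and the osd-* path reflects the post-OSD layout discussed in the comments below:

      # rough sketch: find brw_stats in either the old obdfilter location or
      # the per-OSD location, then show the "pages per bulk r/w" histogram
      # that ltop used to derive bulk RPC counts (OST name is an example)
      OST=lustre-OST0000
      for f in /proc/fs/lustre/obdfilter/$OST/brw_stats \
               /proc/fs/lustre/osd-*/$OST/brw_stats; do
          [ -r "$f" ] && sed -n '/pages per bulk/,/^$/p' "$f" && break
      done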

      The MDS display is also missing a number of values.

      We don't necessarily need to put them back exactly how they were before, but we need to export them in some way that will make them usable for folks.

      It would be best to decide on interfaces before 2.4.0 is locked in.

          Activity

            [LU-3300] Restore missing proc information for LMT
            jhammond John Hammond added a comment -

            Robert, I broke the osd statfs handlers. A patch is forthcoming.

            rread Robert Read added a comment -

            I noticed with recent 2.4 builds that lmt is failing to capture metrics on the MDS because several files are empty. However, this worked in the 2.4.63 builds I was using for my LUG testing, so this is a recent regression. The files exist in both the lod and osd-ldiskfs directories and are empty in both:

            [ec2-user@mds0 ~]$ head /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/*
            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/activeobd <==
            8

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/blocksize <==

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/desc_uuid <==
            scratch-MDT0000-mdtlov_UUID

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/filesfree <==

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/filestotal <==

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/kbytesavail <==

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/kbytesfree <==

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/kbytestotal <==

            ==> /proc/fs/lustre/lod/scratch-MDT0000-mdtlov/numobd <==
            8
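
            A rough sketch for enumerating which of these proc files come back empty (the directories are the lod and osd-ldiskfs ones from the listing above):

            # rough sketch: report proc files under lod/ and osd-ldiskfs/ that
            # return no data when read
            for f in /proc/fs/lustre/lod/*/* /proc/fs/lustre/osd-ldiskfs/*/*; do
                [ -f "$f" ] || continue
                [ -z "$(cat "$f" 2>/dev/null)" ] && echo "empty: $f"
            done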

            pjones Peter Jones added a comment -

            I discussed this with Marc Stearman today. This does not need to hold up the release, but it will be a support priority to find ways to enable LLNL sysadmins to perform common tasks. Marc will provide a prioritized list of those that matter most to LLNL.


            adilger Andreas Dilger added a comment -

            Chris, does John's proposal for using the obdfilter "stats" data address your needs for LMT? This would be a way for LMT to get aggregate IO and per-client IO stats that works on both 2.1 and 2.4.

            It would also be possible to create the brw_stats for ZFS with just the RPC ("page") information to start with, but based on your comment I don't know if this is what you want. It isn't clear to me if it will be possible to add the ZFS block IO stats later or not. Would just having the "page" information in ZFS brw_stats be useful? This would allow the admins to at least see whether the clients are submitting poorly formed RPCs. I don't think it would be too hard to do just that part.

            jhammond John Hammond added a comment -

            The obdfilter stats should provide a good view of utilization on a global and per client level, regardless of the backend. I have personally used them to this effect.

            # cat /proc/fs/lustre/obdfilter/lustre-OST0000/stats
            snapshot_time             1368063951.465558 secs.usecs
            read_bytes                7 samples [bytes] 4096 1048576 5251072
            write_bytes               1287 samples [bytes] 4096 4096 5271552
            get_info                  8 samples [reqs]
            connect                   1 samples [reqs]
            reconnect                 4 samples [reqs]
            disconnect                1 samples [reqs]
            statfs                    9278 samples [reqs]
            create                    4 samples [reqs]
            destroy                   3 samples [reqs]
            sync                      1282 samples [reqs]
            preprw                    1294 samples [reqs]
            commitrw                  1294 samples [reqs]
            ping                      9588 samples [reqs]
            # cat /proc/fs/lustre/obdfilter/lustre-OST0000/exports/0@lo/stats
            snapshot_time             1368063954.225882 secs.usecs
            read_bytes                7 samples [bytes] 4096 1048576 5251072
            write_bytes               1287 samples [bytes] 4096 4096 5271552
            get_info                  8 samples [reqs]
            disconnect                1 samples [reqs]
            create                    4 samples [reqs]
            destroy                   3 samples [reqs]
            sync                      1282 samples [reqs]
            preprw                    1294 samples [reqs]
            commitrw                  1294 samples [reqs]
            ping                      9591 samples [reqs]
            

            If something is missing then please say so. Not that I can guarantee anything for 2.4, but I would like to know.

            As an aside, if LLNL/LMT/you depend on some aspect of proc then it would be good to have some sanity tests to verify that it doesn't go away, along with a comment in the test to the effect that LLNL/LMT/you will be unhappy if it does. I finally get to say that to somebody, rather than have it said to me. I thought I would enjoy it more. Weird.
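
            A minimal sketch of such a check (plain shell, not an actual sanity.sh test; the target name is an example, and read_bytes/write_bytes only show up once some IO has run against the target):

            # minimal sketch: fail loudly if the stats entries a monitor
            # depends on disappear from the OST stats file
            STATS=/proc/fs/lustre/obdfilter/lustre-OST0000/stats
            for key in read_bytes write_bytes preprw commitrw; do
                grep -q "^$key " "$STATS" || echo "MISSING: $key in $STATS"
            done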


            morrone Christopher Morrone (Inactive) added a comment -

            I don't really want to fully recreate brw_stats as it exists now. But we need a consistent way to get the same information from both ldiskfs and zfs osds, one that lets us fully fill in the information that ltop has always presented.

            Proc information is important to your customers. Proc has regressed, and is incomplete. I think that means that lustre isn't ready for RCs yet.


            adilger Andreas Dilger added a comment -

            Doh, forgot about the lack of brw_stats for ZFS. Alex's patch only helps if brw_stats is available in the first place.

            Many of the disk brw_stats might be available on an aggregate basis, if the plumbing is available in ZFS. The per-client and per-job brw_stats are much trickier because the ZFS IO is not allocated or submitted to disk until long after the service thread has completed processing the request.

            The RPC information like "pages per bulk r/w" and "discontiguous pages" could be available independent of the OSD type. These should really be OFD statistics, maybe in a new "rpc_stats" file, possibly in YAML format? These could also be available on a per-client or per-request basis. This might be enough for your debugging purposes?

            Information like "disk I/Os in flight", "I/O time", and "disk I/O size" might be obtainable on an aggregate basis from ZFS. The "discontiguous blocks" and "disk fragmented I/Os" would be much harder to collect for writes without deep hooks into the ZFS IO scheduler. Some of this information could be extracted for reads by hacking into the ZFS block pointers to get the physical disk blocks.

            As for getting this into 2.4.0, I don't think that is very likely, since we are very close to making an RC1 tag. I don't think it would be unreasonable to call this a regression and fix it for 2.4.1 if it can be done cleanly.


            morrone Christopher Morrone (Inactive) added a comment -

            Ah, that explains why I couldn't find brw_stats. That is a problem.

            For MDS I won't have time to figure it out before I disappear on vacation. But some of the values never show anything but zeroes. Robert's pointer about md_stats could be the problem.

            Another missing item is the per-client brw_stats on servers. That used to be our admins' main method of determining who was overloading a server when server loads went through the roof. What do we do to handle that now?
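
            Until per-client brw_stats come back, a rough sketch along these lines can at least rank clients by bytes written using the per-export stats files shown earlier in this ticket (field 7 is the cumulative byte count on the write_bytes line):

            # rough sketch: rank client NIDs by total bytes written, using the
            # per-export obdfilter stats files shown earlier ($7 is the
            # cumulative byte count on the write_bytes line)
            for f in /proc/fs/lustre/obdfilter/*/exports/*/stats; do
                nid=$(basename "$(dirname "$f")")
                bytes=$(awk '/^write_bytes/ {print $7}' "$f")
                echo "${bytes:-0} $nid"
            done | sort -rn | head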


            jhammond John Hammond added a comment -

            Under ldiskfs, the Lustre code had direct access to the block device request queue, making brw_stats possible; they are really just enhanced block device stats. With ZFS backends there is no such access, so no brw_stats.

            During proc init for the ofd device, if the underlying osd has a brw_stats file in proc, we create a symlink to it from the obdfilter/*/ directory. If the underlying osd device does not, then there will be no symlink.
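
            A quick way to confirm which case applies on a given server (target name is an example):

            # quick check: is brw_stats exposed under obdfilter, and is it a
            # symlink into the underlying osd directory?
            ls -l /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats \
                  /proc/fs/lustre/osd-*/lustre-OST0000/brw_stats 2>/dev/null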

            There remain stats in /proc/fs/lustre/obdfilter/lustre-OST0000/stats for bulk transfers.

            Which MDS/MDT values are you missing?


            adilger Andreas Dilger added a comment -

            Chris, the brw_stats symlink was recently added back in http://review.whamcloud.com/5873 (LU-3106), so maybe that isn't in your tree yet?

            There is also a separate patch http://review.whamcloud.com/4618 (LU-2096), which adds symlinks from obdfilter/ to ofd/ for some of the other tunables so that the names are cleaned up. I haven't refreshed that patch lately since it isn't really a 2.4.0 priority.

            rread Robert Read added a comment -

            LU-3296 is another issue related to disappearing md_stats.

            I do see brw_stats in obdfilter, but it's a symlink to osd-ldiskfs/*/brw_stats.


            People

              Assignee: emoly.liu Emoly Liu
              Reporter: morrone Christopher Morrone (Inactive)