
Troubleshooting poor filesystem performance

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.3
    • None
    • Environment: RHEL 6.4, Lustre client kernel 2.6.32-279.2.1 x86_64;
      96 TB filesystem comprising 6 x 16 TB OSTs on 2 OSSs (3 OSTs per OSS)
    • 10955

    Description

      On a Lustre client with 96 TB filesystem mounted over Infiniband, we seem to be having write performance issues and need assistance with troubleshooting. On this particular client, we receive ~600,000 binary data files daily from an external source. Each file is anywhere from 100 bytes to 1 MB. These small files are then appended to data files already on the system from previous writes/appends.

      The speed at which the files are appended is very poor compared to a less powerful system we have that runs an old version of HP's IBRIX filesystem. What commands or troubleshooting steps can I take to find out where the bottleneck is in processing?

      Note this is a DOD classified system not connected to the internet, so supplying logs may be problematic. I will be glad to answer any questions about the filesystem layout and configuration itself, as well as discuss output from any commands you recommend I run.

      Thanks,
      George Jackson

      Attachments

        Activity

          [LU-4080] Troubleshooting poor filesystem performance

          jacksong George Jackson (Inactive) added a comment -

          For any file within the filesystem that we are appending and reading, lfs getstripe reports:

          lmm_stripe_count: 6
          lmm_stripe_size: 1048576
          lmm_stripe_offset: 4

          The offset is different for each file but the count and size are the same. The output goes on to list the obdidx and objid info but I think the count and size were what you wanted.

          I am unable to provide any more information on this issue at this time. Until I can figure out more of what's going on, please feel free to close this ticket; I will open a new ticket with more detailed information based on what you have provided so far.

          Thanks,
          George Jackson

          green Oleg Drokin added a comment -

          70-100 samples every 5 seconds is 14-20 conflicting file accesses per second, which is usually a signal that something tried to read the file very soon after it was appended. So this is a kind of concurrency where one client writes a file and the other clients read it quite soon afterwards; if the same file then needs appending again, the reading clients also need to drop their caches, and so on (all coordinated via internal Lustre locking).
          You can use the lfs getstripe command with the name of a file you are interested in to see how it is striped.
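          For example (the path is hypothetical), the invocation and the fields of interest look roughly like this:

              lfs getstripe /lustre/fs/daily_feed.dat
              # lmm_stripe_count:  6          <- number of OSTs the file is striped across
              # lmm_stripe_size:   1048576    <- stripe size in bytes (1 MiB)
              # lmm_stripe_offset: 4          <- index of the first OST used
              # ...followed by the obdidx/objid table listing the objects on each OST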
          BTW, another disadvantage of these small unaligned writes is that before you can even write somewhere, the partial file block needs to be fetched first.

          If the reading clients never read more than one piece of sequential data, switching them to directio probably would have helped, if not for the directio alignment requirements. Unfortunately, unaligned directio is not supported in 2.1 clients.
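          As a rough illustration of the alignment constraint (target path invented), a direct write whose size is not block-aligned is typically rejected, while a page-aligned one can go through:

              # unaligned 100-byte direct write: typically rejected (EINVAL) by the direct I/O path
              dd if=/dev/zero of=/lustre/fs/out.dat bs=100 count=1 oflag=direct
              # 4096-byte (page-aligned) size and offset: acceptable to direct I/O
              dd if=/dev/zero of=/lustre/fs/out.dat bs=4096 count=1 oflag=direct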

          Essentially, cache ping-pong is a situation where:
          client1: writes some data into file1 (obtains a lock on the region to write, which in the case of O_APPEND is from the known end of file to infinity, then populates the local cache first with clean data that it reads from the server for the partial page affected, then with dirty data from the application)
          client2: attempts to read some data from file1 (blocks on the attempt to lock the intended file region for read)
          client1: receives a request to drop the lock on the region in file1 it holds (writes all dirty cache pages to the server, then drops the lock)
          client2: reads some data from file1 (gets the lock, reads the data from the server and populates its local cache with it).
          client1: attempts to write some data to file1 again (blocks on the attempt to lock the file region in file1 again, causing client2 to drop all of its cache and release the lock ...)

          and so on.
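          A contrived two-client reproduction of this pattern could look like the following (paths and sizes are made up); while both loops run, the ldlm_bl_callback counter on each client should climb steadily:

              # on client1 (the writer): append a small record every second
              while true; do head -c 100 /dev/urandom >> /lustre/fs/file1; sleep 1; done

              # on client2 (a reader): read back the tail of the same file shortly afterwards
              while true; do tail -c 4096 /lustre/fs/file1 > /dev/null; sleep 1; done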


          jacksong George Jackson (Inactive) added a comment -

          Thank you both for your comments. To answer Oleg's questions first:

          The typical size for files entering the filesystem from the remote host is anywhere from ~100 bytes up to 3 MB. Over the course of a day, the file being appended to by these incremental files grows by as much as 3 GB. There is no file locking during this operation, which is indeed a file open with the O_APPEND flag, after which the binary data is written. Afterwards, it becomes data at rest for an indefinite period of time, to be read by user applications. I am not sure what the typical striping is. Is there a command I could run or a file I could look at to find this? We do stripe data across the 6 OSTs, if that is what you mean.
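          For illustration only (file names are placeholders), that write pattern amounts to something like:

              # each incoming ~100 B - 3 MB file is appended onto the growing daily file;
              # the shell's >> redirection opens the target with O_APPEND
              cat /incoming/rec-000123.bin >> /lustre/fs/daily_feed.dat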

          Although we have 5 clients that mount the fs, only one client does the writes/appends. All other clients only perform reads. So no concurrency.

          My apologies for not quantifying what I mean by poor. I am currently gathering some stats to give you more detail on the performance I'm seeing and will provide them at a later date.

          I'm not familiar with the cache ping-pong term you mentioned, but I looked at the ldlm_bl_callback samples in /proc/fs/lustre/ldlm/services/ldlm_cbd/stats and found that, over the course of one minute, the first number incremented by 70-95 samples every 5 seconds. Is that significant?

          For Andreas' questions:

          I couldn't determine what sync function is being used but am trying to get in touch with the developer of the application that does the writes/appends for an answer.

          I'm studying the possibility of initially staging the incremental files on a local filesystem, but it doesn't look good. The internal disks we have on these hosts are only 300 GB, so we rely mostly on the SAN (which is what we use for Lustre) for all our storage needs.

          I think I see what you are talking about in your assessment of file writes. Is there some type of configuration parameter we can set to make processing smaller files more efficient? In other words, in our scenario of writing mostly smaller files (<200KB), what tuning parameters should we be using?

          Thanks again for your help,
          George Jackson

          adilger Andreas Dilger added a comment - - edited

          Another question is whether the files are being sync'd to disk (e.g. application calling fsync() for each file, or opening with O_SYNC, or marking the directory dirsync)? That would definitely kill performance right off the bat.
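          One way to confirm this from the writer side (the PID is a placeholder) would be to trace the application's file-related system calls and look for O_SYNC in the open() flags, or an fsync()/fdatasync() after each file's writes:

              strace -f -tt -e trace=open,write,fsync,fdatasync,close -p <writer_pid> -o /tmp/writer.strace
              grep -cE 'fsync|fdatasync|O_SYNC' /tmp/writer.strace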

          The other major question is whether the 600k files/day could initially be staged into a separate filesystem (e.g. fast locally-attached SSD with ext4) before being appended onto the larger files in Lustre?
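          A rough sketch of that staging idea (directories are hypothetical): land the incoming files on a local ext4/SSD area first, then fold each batch into the large Lustre file in one streaming append, so the per-file overhead described below is amortized:

              # fold a staged batch into the big Lustre file in a single streaming append
              # (find | xargs avoids the shell's argument-length limit for very large batches;
              #  real use would also need to preserve arrival order and clean up the staged files)
              find /staging/incoming -type f -name '*.bin' -print0 | xargs -0 cat >> /lustre/fs/daily_feed.dat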

          Writing a 100-byte file into Lustre currently requires at least four separate RPCs (open/create, OST lock enqueue, OST write, and close, not counting the OST lock cancel, which may be batched with another RPC). At a minimum, each file written needs:

            ~64 bytes MDT filename
          + 512 byte MDT inode
          + 128 byte MDT replay log
          + 32 byte MDT object index
          + 256 byte OST inode
          + 128 byte OST replay log
          * 2 because of metadata journaling
          + file size rounded up to 4096 byte multiple
          

          This 2 kB+ of overhead is already averaged over many file updates and does not include operations that happen asynchronously in the background; it would be even higher for individual files. It also involves many separate IOs (i.e. seeks), though these are also amortized over many files. The overhead is the same regardless of file size, so it only becomes acceptable (e.g. < 1%) for files over about 200 kB in size.
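          To put rough, purely illustrative numbers on that, using the figures above:

              # 100-byte file: ~2 kB of metadata plus one 4096-byte data block to store 100 useful bytes
              echo 'scale=0; (2048 + 4096) / 100' | bc      # ~61x the payload
              # 200 kB file: the same ~2 kB of metadata is about 1% of the data written
              echo 'scale=3; 2048 / (200 * 1024)' | bc      # ~0.010, i.e. ~1%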

          That isn't to say that there isn't anything to be done to improve the performance (I can't really comment without knowing exactly what "good" and "poor" are in your scenario), but there are currently limits on how efficient small file IO can get. We are already working on a feature for significantly reducing the small file IO overhead for future releases, but this is the situation today.

          green Oleg Drokin added a comment -

          What's the typical size for the files that are being appended? What's the typical striping for such files?
          When you say append, is it done by opening the file with the O_APPEND flag and then writing some data, or how do you do it?
          How much concurrency is there (as in, none = every client writes only to files that it alone touches at all times, or high = there are frequently many clients appending to the same file, or in quick succession)? How many clients are there?

          When you say slow, is there some quantifiable number behind it, like "open takes 2 seconds" or "an append-write takes 0.1 seconds for every 100k", and ideally a specific operation that is slow?

          Cache ping-pong between nodes might result in significant write slowdowns. You can check for it in /proc/fs/lustre/ldlm/services/ldlm_cbd/stats: see whether the first number on the ldlm_bl_callback line increases at a fast pace.
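          For example, a quick way to sample that counter (same path as above) every few seconds:

              while true; do
                  date +%T
                  grep ldlm_bl_callback /proc/fs/lustre/ldlm/services/ldlm_cbd/stats
                  sleep 5
              done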

          pjones Peter Jones added a comment -

          Oleg

          Could you please assist with this one?

          Thanks

          Peter


          People

            Assignee: green Oleg Drokin
            Reporter: jacksong George Jackson (Inactive)
            Votes: 0
            Watchers: 7
