Lustre / LU-4906

rm -rf triggers too much MDS_READPAGE

Details


    Description

      On 2.5.57-79-ge7f99a5 I see that rm -rf on a directory with n regular files causes O(n) MDS_READPAGE requests to be sent. I ran the following on 2.4.3 and master:

      export MOUNT_2=y
      llmount.sh
      
      cd /mnt/lustre
      echo clear | tee /proc/fs/lustre/mdc/*/stats
      mkdir d
      touch d/{0..255}
      cat /proc/fs/lustre/mdc/*/stats
      
      cd /mnt/lustre2
      echo clear | tee /proc/fs/lustre/mdc/*/stats
      rm -rf d
      cat /proc/fs/lustre/mdc/*/stats
      

      On 2.4.3:

      ## mkdir and touch
      req_waittime              773 samples [usec] 65 55988 339933 3295555639
      req_active                773 samples [reqs] 1 1 773 773
      mds_close                 256 samples [usec] 65 486 58958 16078566
      mds_reint                 257 samples [usec] 93 826 74764 29250026
      ldlm_enqueue              259 samples [usec] 191 1423 150223 115570903
      seq_query                 1 samples [usec] 55988 55988 55988 3134656144
      ## rm -rf
      snapshot_time             1397509895.872324 secs.usecs
      req_waittime              258 samples [usec] 50 1054 32830 6165600
      req_active                258 samples [reqs] 1 1 258 258
      ldlm_cancel               258 samples [usec] 50 1054 32830 6165600
      snapshot_time             1397509895.872354 secs.usecs
      req_waittime              524 samples [usec] 45 9854 212336 225452482
      req_active                524 samples [reqs] 1 4 876 1660
      mds_close                 1 samples [usec] 390 390 390 152100
      mds_reint                 257 samples [usec] 331 9854 151751 209416267
      mds_readpage              3 samples [usec] 271 323 902 272634
      ldlm_enqueue              261 samples [usec] 45 736 59120 15595504
      ldlm_cancel               2 samples [usec] 64 109 173 15977
      

      On master:

      ## mkdir and touch:
      snapshot_time             1397507941.992796 secs.usecs
      snapshot_time             1397507941.992828 secs.usecs
      req_waittime              1282 samples [usec] 50 2674 364172 203043372
      req_active                1282 samples [reqs] 1 1 1282 1282
      mds_close                 256 samples [usec] 61 640 50251 15984775
      mds_reint                 257 samples [usec] 89 1045 60223 21230981
      mds_getxattr              256 samples [usec] 50 658 38276 9345158
      ldlm_enqueue              513 samples [usec] 82 2674 215422 156482458
      ## rm -rf
      snapshot_time             1397507954.948995 secs.usecs
      req_waittime              991 samples [usec] 31 5949 371017 322109413
      req_active                991 samples [reqs] 1 9 2132 6404
      mds_close                 1 samples [usec] 126 126 126 15876
      mds_reint                 257 samples [usec] 168 5949 173654 221727790
      mds_readpage              132 samples [usec] 158 2173 44232 21316906
      mds_getxattr              60 samples [usec] 31 345 5769 828911
      ldlm_enqueue              423 samples [usec] 44 2496 123479 70146809
      ldlm_cancel               118 samples [usec] 65 891 23757 8073121
      snapshot_time             1397507954.949096 secs.usecs
      req_waittime              1 samples [usec] 108 108 108 11664
      req_active                1 samples [reqs] 1 1 1 1
      ldlm_cancel               1 samples [usec] 108 108 108 11664
      
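The per-RPC sample counts in the dumps above can be extracted mechanically. A minimal sketch (the mds_readpage lines are inlined here for illustration; the name/samples column layout is taken from the output above):

```shell
# Print the sample count for a given RPC name from an mdc stats dump.
samples() {
  awk -v rpc="$1" '$1 == rpc { print $2 }'
}

# mds_readpage lines copied from the 2.4.3 and master dumps above.
stats_243='mds_readpage              3 samples [usec] 271 323 902 272634'
stats_master='mds_readpage              132 samples [usec] 158 2173 44232 21316906'

echo "2.4.3 mds_readpage:  $(echo "$stats_243" | samples mds_readpage)"
echo "master mds_readpage: $(echo "$stats_master" | samples mds_readpage)"
```

Running the same extraction against /proc/fs/lustre/mdc/*/stats on a live client gives the counts directly.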

      (If you noticed that ldlm_enqueue and mds_reint are present, it's because I used http://review.whamcloud.com/#/c/6223/1 which is awesome and has been landable for nearly one year (still on patch set 1), but nobody ever reviews it.)


          Activity


            I very strongly suspect the landing of this fix has broken lustre builds on 3.12 kernels. See comment in LU-4416.

bogl Bob Glossman (Inactive) added a comment

            Patch landed to Master. Remaining patch will be tracked under a new ticket.

jlevi Jodi Levi (Inactive) added a comment
            di.wang Di Wang added a comment -

I added another patch, http://review.whamcloud.com/10695, based on 10622, to add a striped-entry page cache to the master inode. Please check.


Performance test results like these would be great to include in the patch commit message.

adilger Andreas Dilger added a comment
            di.wang Di Wang added a comment -

With patch http://review.whamcloud.com/#/c/10622/, "rm -rf" on a directory with 10k files goes from 64 seconds to 8 seconds in my local test.

With the patch:

            [root@mds tests]# ./createmany -o /mnt/lustre/test1/f- 10000
            total: 10000 creates in 11.72 seconds: 853.50 creates/second
            [root@mds tests]# time rm -rf /mnt/lustre/test1
            
            real	0m8.151s
            user	0m0.012s
            sys	0m2.242s
            

Without the patch:

            [root@mds tests]# ./createmany -o /mnt/lustre/test1/f- 10000
            total: 10000 creates in 11.98 seconds: 835.02 creates/second
            [root@mds tests]# ls /mnt/lustre/test1/ | wc
              10000   10000   68890
            [root@mds tests]# time rm -rf /mnt/lustre/test1
            
            real	1m4.496s
            user	0m0.014s
            sys	0m3.832s
            
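For the record, the wall-clock numbers above imply roughly an 8x improvement. A one-liner to compute it (timings copied from the runs above):

```shell
# real time without the patch vs. with it, taken from the runs above
before=64.496
after=8.151
awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1fx faster\n", b / a }'
```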
            di.wang Di Wang added a comment -

Hmm, for a non-striped directory it does not matter whether we cache it in llite or mdc, IMHO. But for a striped directory, caching the dir-entry pages in the MDC layer helps LMV build dir-entry pages in hash order from all of the stripes.

In the current readdir implementation, we iterate over the dir entries one by one (i.e. go through the whole stack for each entry), which is what causes this performance issue. I am working on a patch to build the dir-entry pages page by page in the LMV layer, which should resolve the issue. There are still some issues with the patch; I will post it soon.


I think the easiest and best solution here is to have llite cache the readdir directory pages on the file handle if the lock is cancelled, either until the file handle is closed or until rewinddir() is called. That would avoid extra READPAGE RPCs in many cases.

            From the readdir() and rewinddir() specifications in SUSv2: http://pubs.opengroup.org/onlinepubs/007908799/xsh/rewinddir.html

            If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir_r() returns an entry for that file is unspecified.

            The readdir_r() function may buffer several directory entries per actual read operation; the readdir_r() function marks for update the st_atime field of the directory each time the directory is actually read.

            The rewinddir() function resets the position of the directory stream to which dirp refers to the beginning of the directory. It also causes the directory stream to refer to the current state of the corresponding directory, as a call to opendir() would have done.

adilger Andreas Dilger added a comment

            Di, it seems this is caused by the DNE2 readdir changes. Could you please look at this c

adilger Andreas Dilger added a comment

Currently, llite cannot be aware of the directory ldlm lock ping-pong happening in MDC, so statahead cannot know that it should stop itself to avoid the bad performance unless someone tells it to.

yong.fan nasf (Inactive) added a comment

The statahead behaviour has changed since the directory page cache moved from llite to MDC for striped directories.

Originally, the directory page cache was in llite, and the directory ldlm lock was only held while fetching a directory page from the MDS; after that the lock was released. The statahead thread would then traverse the directory pages without holding the directory ldlm lock. Even if someone was performing "rm -rf" and cancelling (or ELC'ing) the directory ldlm lock from the client, statahead still held the page reference. It is not a serious issue for statahead to get -ENOENT because some name entries have been removed, since that is a statahead-internal failure and invisible to applications. The statahead thread can still go ahead and pre-fetch as much as possible, so there was not much directory ldlm lock ping-pong.

But the situation changed after the directory pages moved to MDC. In the current implementation, when statahead traverses the directory it does not hold the page reference; it needs to re-verify the directory ldlm lock for every name entry. That means there is directory ldlm lock ping-pong between the statahead thread and the "rm -rf" thread for every name entry.

In summary, the original directory ldlm lock ping-pong was per page; the current one is per name entry. That is why performance is so bad.

yong.fan nasf (Inactive) added a comment
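The per-page vs. per-entry difference can be made concrete with some rough arithmetic (the entries-per-page figure is an assumption for illustration; the real number depends on name lengths and page size):

```shell
entries=10000        # files in the directory, as in the benchmark above
entries_per_page=100 # assumed average dir entries per cached page
# Per-page revalidation: at most one lock bounce per directory page.
echo "per-page bounces:  $(( (entries + entries_per_page - 1) / entries_per_page ))"
# Per-entry revalidation: up to one lock bounce per name entry.
echo "per-entry bounces: $entries"
```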

            People

              di.wang Di Wang
              jhammond John Hammond
Votes: 0
Watchers: 8
