LU-3381

lock enqueue fails for open(".lustre/fid", ...)

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • Lustre 2.4 client

    Description

      While watching the performance reported by Robinhood as it scanned a test filesystem with 24 million files, I noticed something strange. Performance was gradually decreasing over time because the STAGE_GET_INFO_FS stage was taking longer and longer. That's not good.

      May 22 13:31:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   1 | Done:   0 | Total:  20898 | ms/op: 1.00
      May 22 13:32:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   0 | Done:   0 | Total:  54724 | ms/op: 0.49
      May 22 13:33:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   0 | Done:   0 | Total:  95518 | ms/op: 0.36
      May 22 13:34:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   0 | Done:   0 | Total: 133057 | ms/op: 0.31
      May 22 13:35:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   0 | Done:   0 | Total: 141859 | ms/op: 0.31
      May 22 13:36:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   0 | Done:   0 | Total: 173883 | ms/op: 0.29
      May 22 13:37:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   5 | Done:   0 | Total: 210361 | ms/op: 0.58
      May 22 13:38:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   3 | Done:   0 | Total: 262009 | ms/op: 1.49
      May 22 13:39:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   6 | Done:   0 | Total: 316537 | ms/op: 2.07
      May 22 13:40:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   4 | Done:   1 | Total: 373078 | ms/op: 2.47
      May 22 13:41:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   5 | Done:   0 | Total: 432204 | ms/op: 2.74
      May 22 13:42:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   6 | Done:   0 | Total: 492933 | ms/op: 2.93
      May 22 13:43:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   2 | Done:   0 | Total: 554324 | ms/op: 3.10
      May 22 13:44:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   7 | Done:   0 | Total: 612107 | ms/op: 3.25
      May 22 13:45:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   6 | Done:   0 | Total: 665969 | ms/op: 3.40
      May 22 13:46:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   5 | Done:   0 | Total: 720286 | ms/op: 3.52
      May 22 13:47:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   6 | Done:   0 | Total: 776099 | ms/op: 3.62
      May 22 13:48:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   6 | Done:   1 | Total: 834779 | ms/op: 3.72
      May 22 13:49:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   5 | Done:   0 | Total: 879068 | ms/op: 3.90
      May 22 13:50:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   4 | Done:   0 | Total: 932166 | ms/op: 4.01
      May 22 13:51:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   5 | Done:   0 | Total: 986457 | ms/op: 4.11
      May 22 13:52:48 zeno66 robinhood[12124]: STATS |  2: STAGE_GET_INFO_FS    | Wait:     0 | Curr:   5 | Done:   1 | Total: 1014814 | ms/op: 4.31
      ...
      

      After a little investigation it was clear that the open(".lustre/fid", O_RDONLY) call was, on average, getting slower and slower. From what I observed it could take as little as 0.1 ms or as long as 15 ms. The subsequent GETSTRIPE ioctl() was always fast.

      The Lustre debug logs show that all the time is being spent in do_lookup() because there is never a valid dentry on the client. That means that for every open we attempt to enqueue a lock for this special file. That lock enqueue fails on the MDT with ELDLM_LOCK_ABORTED, but the open still appears to succeed. To make matters worse, the client serializes the enqueues because they are all IT_OPEN.

      00010000:00010000:0.0:1369264019.850032:0:11631:0:(ldlm_lockd.c:1426:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=301, rc=0) ns: mdt-zeno-MDT0000_UUID lock: ffff8802f011a000/0xed969b9b44bc7d27 lrc: 1/0,0 mode: --/CR res: 8589934594/1 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: 172.19.1.68@o2ib199 remote: 0xe4f855be29d7308d expref: 12 pid: 11631 timeout: 0 lvb_type: 0
      

      To get a handle on what was going on I wrote a trivial reproducer which just opens and closes a specified file repeatedly. If you run just a few iterations and grab the Lustre debug logs you can easily see the failing enqueues.

      I suspect if the enqueues were allowed to succeed this would be a non-issue since we would have a valid read lock on the client. However, I haven't tested that and it's not at all clear to me what you guys are planning to do with that .lustre directory. Perhaps you can propose a fix.

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <errno.h>
      
      int
      main(int argc, char **argv)
      {
              long int start, last;
              int error, fd, i, j, i_max, j_max;
              char *path;
      
              if (argc != 4) {
                      printf("usage: %s <path> <i_max> <j_max>\n", argv[0]);
                      return (1);
              }
      
              start = last = time(0);
              path = argv[1];
              i_max = strtol(argv[2], NULL, 10);
              j_max = strtol(argv[3], NULL, 10);
      
              printf("file:  %s\n", path);
              printf("i_max: %d\n", i_max);
              printf("j_max: %d\n", j_max);
      
              /* i_max passes, each printing one progress line and doing
               * j_max open()/close() pairs on the target path. */
              for (i = 0; i < i_max; i++) {
                      printf("%4d / %4d - %ld sec total, %ld sec delta\n", i, i_max,
                          time(0) - start, time(0) - last);
                      last = time(0);
      
                      for (j = 0; j < j_max; j++) {
                              fd = open(path, O_RDONLY);
                              if (fd == -1) {
                                      printf("open() = %d\n", errno);
                                      return (1);
                              }
      
                              error = close(fd);
                              if (error) {
                                      printf("close() = %d\n", errno);
                                      return (1);
                              }
                      }
              }
      
              return (0);
      }
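
      For reference, a typical invocation of the reproducer (the binary name, mount point, and iteration counts here are only examples) looks something like: gcc -o openclose openclose.c, then ./openclose /mnt/lustre/.lustre/fid 100 10000 to run 100 batches of 10000 open()/close() pairs against the .lustre/fid directory.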
      

      Attachments

        Issue Links

          Activity

            [LU-3381] lock enqueue fails for open(".lustre/fid", ...)

            This is a bug in RobinHood.

            jhammond John Hammond added a comment

            > Do you know if there is any reason that robinhood does not just hold .lustre/fid open?
            robinhood does not explicitly open .lustre/fid, but it uses liblustreapi calls that do open it (e.g. llapi_fid2path).

            I don't like performing the ioctl directly from robinhood code, as this bypasses the abstraction liblustreapi provides.

            Optimizing the .lustre/fid open would require adding llapi calls that take the root fd as an argument.
            But I don't like that much either, as the need for a root fd in the llapi code is implementation-dependent and may change.

            leibovici-cea Thomas LEIBOVICI - CEA (Inactive) added a comment
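
            For illustration, the kind of llapi addition being discussed might look like the prototypes below. llapi_fid2path() is the existing call; the *_at() variant is purely hypothetical, does not exist in liblustreapi, and is sketched only to make the "root fd as argument" idea concrete.

              /* Existing liblustreapi call used (indirectly) by robinhood:
               * resolves a FID string to a path relative to the mount point. */
              int llapi_fid2path(const char *device, const char *fidstr, char *path,
                                 int pathlen, long long *recno, int *linkno);

              /* Hypothetical variant taking an already-open root fd so the
               * library would not need to re-open the mount point (or
               * .lustre/fid) on every call -- NOT part of liblustreapi. */
              int llapi_fid2path_at(int root_fd, const char *fidstr, char *path,
                                    int pathlen, long long *recno, int *linkno);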

            I think there is potentially a red herring here with the locking. It is normal Lustre protocol that trying to lock a parent directory for lookup will not return a directory lock, but instead return a lock on the file. If the client was returned a directory lock during lookup, it would cause a large amount of contention among all of the clients, since the directory lock would be revoked each time another client wants to create a file therein.

            It looks like this problem is actually related to LU-2613, which is caused by the client caching the open/close RPCs in memory if it is not doing any other updates to the filesystem. A simple workaround listed in that bug is to periodically "touch" any file on the node running the scanner, which will cause a new transaction commit and flush the cache. This might take some time if it hasn't been done previously, since there will be two saved RPCs in memory for every file that was opened/closed.

            adilger Andreas Dilger added a comment
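
            A minimal sketch of that workaround, assuming it runs as a separate process on the node doing the scan (the scratch path, mode, and 60-second interval are arbitrary placeholders):

              #define _POSIX_C_SOURCE 200809L
              #include <fcntl.h>
              #include <unistd.h>
              #include <sys/stat.h>

              int
              main(void)
              {
                      /* Any file on the same Lustre filesystem will do; this
                       * path is just an example. */
                      const char *scratch = "/mnt/lustre/.commit-nudge";
                      int fd;

                      for (;;) {
                              fd = open(scratch, O_WRONLY | O_CREAT, 0600);
                              if (fd >= 0) {
                                      /* Same effect as "touch": update the timestamps,
                                       * generating a new MDT transaction whose commit
                                       * lets the client drop its saved open/close RPCs. */
                                      futimens(fd, NULL);
                                      close(fd);
                              }
                              sleep(60);
                      }

                      return (0);
              }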

            > to clarify, how is this specific to .lustre/fid?

            It may not be. My concern is that what I expected to happen when opening .lustre/fid was for the first open to enqueue a CR lock on the directory. Then on all subsequent opens, because we've cached a lock on the directory, we shouldn't need to revalidate the dentry and issue another enqueue to the server. What I'm seeing is that for every open on that directory (perhaps all directories?) we're forced to resend an enqueue to revalidate the dentry. Presumably this is because we were never properly granted the lock. But the end result is that opening the directory can be slow and we're generating a LOT of extra RPCs. Perhaps this is expected behavior for directories, but why that would be isn't immediately clear to me.

            > Do you know if there is any reason that robinhood does not just hold .lustre/fid open?

            I've no idea, I haven't looked at their code yet. Perhaps because this is during the scan which might run for many days (or weeks) and they didn't want to keep it open. I'm sure they assumed (as I did) that open/closing the directory would be very cheap once we'd initially acquired the lock. That is the point of the ldlm after all...

            behlendorf Brian Behlendorf added a comment
            jhammond John Hammond added a comment -

            Brian, to clarify, how is this specific to .lustre/fid? Or are you suggesting that .lustre/fid be handled specially in this case?

            Do you know if there is any reason that robinhood does not just hold .lustre/fid open? (Re your example, I think you'll see much better performance if you hoist the open out of the two loops...)

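            A minimal sketch of the "hold it open" approach, assuming the per-entry work can be done relative to a long-lived directory fd (the mount point and the example FID string are placeholders; a real scanner would loop over FIDs from its scan or changelog):

              #define _POSIX_C_SOURCE 200809L
              #include <stdio.h>
              #include <fcntl.h>
              #include <unistd.h>

              int
              main(void)
              {
                      const char *fidstr = "[0x200000400:0x1:0x0]";  /* example only */
                      int dot_lustre_fd, fd;

                      /* Open .lustre/fid once and keep it for the whole scan. */
                      dot_lustre_fd = open("/mnt/lustre/.lustre/fid", O_RDONLY);
                      if (dot_lustre_fd == -1) {
                              perror("open .lustre/fid");
                              return (1);
                      }

                      /* Per-entry work: open by FID relative to the cached fd so
                       * the directory itself is never re-opened; the per-file
                       * GETSTRIPE/fid2path ioctls would be issued on "fd" here. */
                      fd = openat(dot_lustre_fd, fidstr, O_RDONLY);
                      if (fd != -1)
                              close(fd);

                      close(dot_lustre_fd);
                      return (0);
              }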

            I'm attaching a section from the Lustre server debug log showing how the enqueue for FID_SEQ_DOT_LUSTRE is handled. According to the log the request is properly handled, the object is successfully looked up, locks are created, the handles are hashed, and everything looks great. Then at the very end of mdt_reint_open() we call mdt_object_open_unlock(), which drops the only (and last) reference on the lock, causing it to be destroyed. Therefore, when we return to mdt_intent_reint() there is no longer a valid lock handle and we return success to the client with an ELDLM_LOCK_ABORTED status (lustre/mdt/mdt_handler.c:3834).

            Unfortunately for me, since there's no documented protocol I'm aware of describing how the Lustre server is supposed to behave in this case, it's hard for me to say how this should be handled. It's certainly inefficient, since it forces a round-trip enqueue RPC for every open of FID_SEQ_DOT_LUSTRE, which I'm sure was not intended. Can we get whoever is familiar with this area of the code to weigh in with the expected behavior (and better yet a fix)?

            behlendorf Brian Behlendorf added a comment
            pjones Peter Jones added a comment -

            John

            Could you please look into this one?

            Thanks

            Peter


            People

              jhammond John Hammond
              behlendorf Brian Behlendorf
              Votes: 0
              Watchers: 8
