LU-6679: ASSERTION( !ext->oe_hp ) failed with group lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.7.0, Lustre 2.8.0
    • Environment: CentOS 6 + Lustre master tree 2.7+

    Description

      We can crash the Lustre client using the simple program below:

      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <sys/ioctl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <stdio.h>
      
      #include <lustre/lustre_user.h>
      
      int main(int argc, char *argv[])
      {
              int fd;
              int rc;
              int gid;
              char *buf;
      
              fd = open("/mnt/testfile.00000000", O_RDWR);
              gid = atoi(argv[1]);
              /* take a group lock on the file with the given group id */
              rc = ioctl(fd, LL_IOC_GROUP_LOCK, gid);
              if (rc) {
                      printf("ioctl %d\n", rc);
                      return rc;
              }
              buf = malloc(1<<20);
              memset(buf, 1, 1<<20);
              /* keep dirtying pages while holding the group lock */
              while (1)
                      write(fd, buf, 1<<20);
              return 0;
      }
      

      For demonstration purposes only; error handling is omitted.

      Run the program with a gid, then run another instance of the program with a different gid; the client will crash right after.

      crash> bt
      PID: 1989   TASK: ffff88013ca86ae0  CPU: 1   COMMAND: "ldlm_bl_02"
       #0 [ffff88011fe0d8c8] machine_kexec at ffffffff8103b5bb
       #1 [ffff88011fe0d928] crash_kexec at ffffffff810c9852
       #2 [ffff88011fe0d9f8] panic at ffffffff81529343
       #3 [ffff88011fe0da78] lbug_with_loc at ffffffffa0251ecb [libcfs]
       #4 [ffff88011fe0da98] osc_cache_writeback_range at ffffffffa0958e35 [osc]
       #5 [ffff88011fe0dc28] osc_lock_flush at ffffffffa0941495 [osc]
       #6 [ffff88011fe0dca8] osc_ldlm_blocking_ast at ffffffffa0941828 [osc]
       #7 [ffff88011fe0dd18] ldlm_cancel_callback at ffffffffa05404cc [ptlrpc]
       #8 [ffff88011fe0dd38] ldlm_cli_cancel_local at ffffffffa05525da [ptlrpc]
       #9 [ffff88011fe0dd68] ldlm_cli_cancel at ffffffffa0557200 [ptlrpc]
      #10 [ffff88011fe0dda8] osc_ldlm_blocking_ast at ffffffffa094165b [osc]
      #11 [ffff88011fe0de18] ldlm_handle_bl_callback at ffffffffa055ac40 [ptlrpc]
      #12 [ffff88011fe0de48] ldlm_bl_thread_main at ffffffffa055b1a1 [ptlrpc]
      #13 [ffff88011fe0dee8] kthread at ffffffff8109e66e
      #14 [ffff88011fe0df48] kernel_thread at ffffffff8100c20a
      

      Activity

            OK, do you mean the server won't send a blocking AST to the client, and will only issue an AST when it's possible to grant the lock (i.e. when the previous lock has been released)?
            If so, do we still need patch http://review.whamcloud.com/15119, since we won't handle blocking ASTs for the group lock on the client anyway?

            lidongyang Li Dongyang (Inactive) added a comment
            jay Jinshan Xiong (Inactive) added a comment (edited)

            My point is: now that clients won't cache group locks, why do servers bother sending a blocking AST at all? It's simply wasting RPCs.

            Therefore a proper fix is your patch plus a corresponding fix on the server side.

            I don't think the server has anything to do with it.
            Say we have process A, which holds a group lock with some gid, and process B on the same client, which requests a group lock on the same file but with a different gid.

            The server sends a blocking AST, which is all right: when process A releases the lock, process B can get it.
            The problem is on the client: when we get the blocking AST for the original group lock, CBPENDING gets set on the lock, which shouldn't happen.
            I reckon this is a problem on the client, and this patch really should be a part of http://review.whamcloud.com/14093

            lidongyang Li Dongyang (Inactive) added a comment

            After checking the code, I now understand the problem. I agree that your patch will fix it, but I would like to enhance the patch so that the server does not try to revoke group locks in the first place. What do you think?

            jay Jinshan Xiong (Inactive) added a comment

            Hi Jinshan,
            Yes, I have http://review.whamcloud.com/#/c/14093/ and I'm using master.
            Like I said in http://review.whamcloud.com/#/c/15119/, the patch covers a case missed by 14093.

            lidongyang Li Dongyang (Inactive) added a comment
            jay Jinshan Xiong (Inactive) added a comment (edited)

            You probably need this patch http://review.whamcloud.com/14093 from master.

            Can you reproduce it on master?


            I can still reproduce the problem with patch 13934. Besides, do we really need 13934? mode is passed in from osc_dlm_blocking_ast0, and it will be set to CLM_WRITE there for group locks as well.

            lidongyang Li Dongyang (Inactive) added a comment
            Will patch http://review.whamcloud.com/13934 help this case?

            jay Jinshan Xiong (Inactive) added a comment

            Li Dongyang (dongyang.li@anu.edu.au) uploaded a new patch: http://review.whamcloud.com/15119
            Subject: LU-6679 ldlm: do not set cbpending for group locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2c02927b9858c4e7858f569a304d6b2024a186a5

            gerrit Gerrit Updater added a comment

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: lidongyang Li Dongyang (Inactive)
              Votes: 0
              Watchers: 7
