[LU-6679] ASSERTION( !ext->oe_hp ) failed with group lock Created: 03/Jun/15  Updated: 13/Jan/17  Resolved: 25/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Li Dongyang (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

CentOS 6 + Lustre master tree 2.7+


Issue Links:
Related
is related to LU-6368 ASSERTION( cur->oe_dlmlock == victim-... Resolved
Severity: 3

 Description   

We can crash the lustre client using the simple code below:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

#include <lustre/lustre_user.h>

int main(int argc, char *argv[])
{
        int fd;
        int rc;
        int gid;
        char *buf;

        /* the test file must already exist on the Lustre mount */
        fd = open("/mnt/testfile.00000000", O_RDWR);
        gid = atoi(argv[1]);
        /* take a group lock on the file with the given group id */
        rc = ioctl(fd, LL_IOC_GROUP_LOCK, gid);
        if (rc) {
                printf("ioctl %d\n", rc);
                return rc;
        }
        buf = malloc(1<<20);
        memset(buf, 1, 1<<20);
        /* keep dirtying pages; the group lock is never released */
        while (1)
                write(fd, buf, 1<<20);
        return 0;
}

Demo purposes only; no error handling.

Run the program with a gid, then run another instance of the program with a different gid; the client crashes right after.
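For contrast, a well-behaved group-lock user releases the lock with LL_IOC_GROUP_UNLOCK before exiting, so a holder of a different gid can eventually be granted. A minimal sketch along the lines of the reproducer (the mount path and gid are placeholders for this example):

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <unistd.h>

#include <lustre/lustre_user.h>

int main(void)
{
        int gid = 1;    /* arbitrary group id for this example */
        int fd, rc;

        fd = open("/mnt/testfile.00000000", O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        rc = ioctl(fd, LL_IOC_GROUP_LOCK, gid);
        if (rc) {
                perror("LL_IOC_GROUP_LOCK");
                return 1;
        }

        /* ... I/O under the group lock ... */

        /* drop the lock so a request with another gid can be granted */
        rc = ioctl(fd, LL_IOC_GROUP_UNLOCK, gid);
        if (rc)
                perror("LL_IOC_GROUP_UNLOCK");

        close(fd);
        return rc;
}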

crash> bt
PID: 1989   TASK: ffff88013ca86ae0  CPU: 1   COMMAND: "ldlm_bl_02"
 #0 [ffff88011fe0d8c8] machine_kexec at ffffffff8103b5bb
 #1 [ffff88011fe0d928] crash_kexec at ffffffff810c9852
 #2 [ffff88011fe0d9f8] panic at ffffffff81529343
 #3 [ffff88011fe0da78] lbug_with_loc at ffffffffa0251ecb [libcfs]
 #4 [ffff88011fe0da98] osc_cache_writeback_range at ffffffffa0958e35 [osc]
 #5 [ffff88011fe0dc28] osc_lock_flush at ffffffffa0941495 [osc]
 #6 [ffff88011fe0dca8] osc_ldlm_blocking_ast at ffffffffa0941828 [osc]
 #7 [ffff88011fe0dd18] ldlm_cancel_callback at ffffffffa05404cc [ptlrpc]
 #8 [ffff88011fe0dd38] ldlm_cli_cancel_local at ffffffffa05525da [ptlrpc]
 #9 [ffff88011fe0dd68] ldlm_cli_cancel at ffffffffa0557200 [ptlrpc]
#10 [ffff88011fe0dda8] osc_ldlm_blocking_ast at ffffffffa094165b [osc]
#11 [ffff88011fe0de18] ldlm_handle_bl_callback at ffffffffa055ac40 [ptlrpc]
#12 [ffff88011fe0de48] ldlm_bl_thread_main at ffffffffa055b1a1 [ptlrpc]
#13 [ffff88011fe0dee8] kthread at ffffffff8109e66e
#14 [ffff88011fe0df48] kernel_thread at ffffffff8100c20a


 Comments   
Comment by Gerrit Updater [ 03/Jun/15 ]

Li Dongyang (dongyang.li@anu.edu.au) uploaded a new patch: http://review.whamcloud.com/15119
Subject: LU-6679 ldlm: do not set cbpending for group locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c02927b9858c4e7858f569a304d6b2024a186a5

Comment by Jinshan Xiong (Inactive) [ 03/Jun/15 ]

Will patch http://review.whamcloud.com/13934 help this case?

Comment by Li Dongyang (Inactive) [ 04/Jun/15 ]

I can still reproduce the problem with patch 13934. Besides, do we really need 13934? The mode is passed in from osc_dlm_blocking_ast0, and it is set to CLM_WRITE there for group locks as well.

Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ]

You probably need this patch http://review.whamcloud.com/14093 from master.

Can you reproduce it on master?

Comment by Li Dongyang (Inactive) [ 04/Jun/15 ]

Hi Jinshan,
Yes, I have http://review.whamcloud.com/#/c/14093/ and I'm using master.
As I said in http://review.whamcloud.com/#/c/15119/, that patch covers a case missed by 14093

Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ]

After checking the code, I now understand the problem. I agree that your patch will fix it, but I would like to enhance it by not trying to revoke the group lock on the server side at all. What do you think?

Comment by Li Dongyang (Inactive) [ 04/Jun/15 ]

I don't think the server has anything to do with it.
Say process A holds a group lock with some gid, and
process B on the same client requests a group lock on the same file with a different gid.

The server sends a blocking AST, which is fine: when process A releases the lock, process B can get it.
The problem is on the client: when we get the blocking AST for the original group lock, cbpending gets set on the lock, which shouldn't happen.
I reckon this is a client-side problem, and this patch really should be part of http://review.whamcloud.com/14093
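
For illustration only, a minimal sketch of that client-side idea against simplified stand-in types (LCK_GROUP and LDLM_FL_CBPENDING are real Lustre names; the struct layout, flag value, helper, and its placement are hypothetical, not the actual patch):

/* stand-ins for the relevant bits of the real struct ldlm_lock */
enum ldlm_mode { LCK_EX, LCK_PW, LCK_GROUP };

struct ldlm_lock {
        enum ldlm_mode l_granted_mode;
        unsigned long  l_flags;
};

#define LDLM_FL_CBPENDING 0x1UL         /* illustrative value */

/* On a blocking AST, mark the lock "cancel pending" only if it is
 * not a group lock: a group lock is dropped explicitly by the
 * application, and flushing it from the blocking-AST path is what
 * trips the ASSERTION( !ext->oe_hp ) in osc_cache_writeback_range(). */
static void bl_ast_mark_lock(struct ldlm_lock *lock)
{
        if (lock->l_granted_mode == LCK_GROUP)
                return;                 /* leave group locks alone */
        lock->l_flags |= LDLM_FL_CBPENDING;
}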

Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ]

My point is: now that clients won't cache group locks, why should servers bother sending blocking ASTs? It simply wastes RPCs.

Therefore the proper fix is your patch plus a corresponding fix on the server side.
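
A sketch of that server-side idea, again with simplified stand-ins (the helper and its placement are hypothetical, not the landed change):

#include <stdbool.h>

enum ldlm_mode { LCK_EX, LCK_PW, LCK_GROUP };

struct ldlm_lock {
        enum ldlm_mode l_granted_mode;
};

/* A blocking AST asks a client to cancel a cached lock, but a group
 * lock is never cached: the holder drops it explicitly with
 * LL_IOC_GROUP_UNLOCK.  Skipping the AST for group locks saves the
 * RPC; the conflicting request simply waits on the resource. */
static bool conflict_needs_blocking_ast(const struct ldlm_lock *granted)
{
        return granted->l_granted_mode != LCK_GROUP;
}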

Comment by Li Dongyang (Inactive) [ 04/Jun/15 ]

OK, so do you mean the server won't send a blocking AST to the client, and will only issue an AST once it is possible to grant the lock (i.e. the previous lock has been released)?
If so, do we still need patch http://review.whamcloud.com/15119, since we won't be handling blocking ASTs for group locks on the client anyway?

Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ]

At present we still need patch 15119 to handle the case where the client is talking with old servers, but in the long run we won't need it.

Comment by Gerrit Updater [ 25/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15119/
Subject: LU-6679 ldlm: do not send blocking ast for group locks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3016d3eba34f1718278e40d16d1e7e62e7c7abfa

Comment by Peter Jones [ 25/Jul/15 ]

Landed for 2.8.
