LU-6679: ASSERTION( !ext->oe_hp ) failed with group lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.7.0, Lustre 2.8.0
    • Environment: CentOS 6 + Lustre master tree 2.7+

    Description

      We can crash the Lustre client using the simple program below:

      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <sys/ioctl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <stdio.h>
      
      #include <lustre/lustre_user.h>
      
      int main(int argc, char *argv[])
      {
              int fd;
              int rc;
              int gid;
              char *buf;
      
              fd = open("/mnt/testfile.00000000", O_RDWR);
              gid = atoi(argv[1]);
              /* take a group lock on the file with the given group id */
              rc = ioctl(fd, LL_IOC_GROUP_LOCK, gid);
              if (rc) {
                      printf("ioctl %d\n", rc);
                      return rc;
              }
              buf = malloc(1<<20);
              memset(buf, 1, 1<<20);
              /* keep dirtying pages while holding the group lock */
              while (1)
                      write(fd, buf, 1<<20);
              return 0;
      }
      

      For demonstration purposes only; error handling is omitted.

      Run the program with a gid, then run another instance of the program with a different gid; the client will crash right after.

      crash> bt
      PID: 1989   TASK: ffff88013ca86ae0  CPU: 1   COMMAND: "ldlm_bl_02"
       #0 [ffff88011fe0d8c8] machine_kexec at ffffffff8103b5bb
       #1 [ffff88011fe0d928] crash_kexec at ffffffff810c9852
       #2 [ffff88011fe0d9f8] panic at ffffffff81529343
       #3 [ffff88011fe0da78] lbug_with_loc at ffffffffa0251ecb [libcfs]
       #4 [ffff88011fe0da98] osc_cache_writeback_range at ffffffffa0958e35 [osc]
       #5 [ffff88011fe0dc28] osc_lock_flush at ffffffffa0941495 [osc]
       #6 [ffff88011fe0dca8] osc_ldlm_blocking_ast at ffffffffa0941828 [osc]
       #7 [ffff88011fe0dd18] ldlm_cancel_callback at ffffffffa05404cc [ptlrpc]
       #8 [ffff88011fe0dd38] ldlm_cli_cancel_local at ffffffffa05525da [ptlrpc]
       #9 [ffff88011fe0dd68] ldlm_cli_cancel at ffffffffa0557200 [ptlrpc]
      #10 [ffff88011fe0dda8] osc_ldlm_blocking_ast at ffffffffa094165b [osc]
      #11 [ffff88011fe0de18] ldlm_handle_bl_callback at ffffffffa055ac40 [ptlrpc]
      #12 [ffff88011fe0de48] ldlm_bl_thread_main at ffffffffa055b1a1 [ptlrpc]
      #13 [ffff88011fe0dee8] kthread at ffffffff8109e66e
      #14 [ffff88011fe0df48] kernel_thread at ffffffff8100c20a
      

      Activity

            OK, do you mean the server won't send a blocking AST to the client, and will only issue an AST when it's possible to grant the lock (i.e. when the previous lock has been released)?
            If so, do we still need patch http://review.whamcloud.com/15119, since we won't handle blocking ASTs for the group lock on the client anyway?

            lidongyang Li Dongyang (Inactive) added a comment
            jay Jinshan Xiong (Inactive) added a comment (edited)

            My point is: now that clients won't cache group locks, why do servers bother sending a blocking AST at all? It's simply wasting RPCs.

            Therefore a proper fix is your patch plus a corresponding fix on the server side.

            I don't think the server has anything to do with it.
            Say we have process A, which holds a group lock with some gid, and process B on the same client, which requests a group lock on the same file but with a different gid.

            The server sends a blocking AST, which is all right: when process A releases the lock, process B can get it.
            The problem is on the client: when we get the blocking AST for the original group lock, CBPENDING gets set on the lock, which shouldn't happen.
            I reckon this is a problem on the client, and this patch really should be a part of http://review.whamcloud.com/14093

            lidongyang Li Dongyang (Inactive) added a comment

            After checking the code, I now understand the problem. I agree that your patch will fix it, but I would like to enhance the patch so that the server does not try to revoke group locks in the first place. What do you think?

            jay Jinshan Xiong (Inactive) added a comment

            Hi Jinshan,
            Yes, I have http://review.whamcloud.com/#/c/14093/ and I'm using master.
            Like I said in http://review.whamcloud.com/#/c/15119/, the patch covers a case missed by 14093.

            lidongyang Li Dongyang (Inactive) added a comment
            jay Jinshan Xiong (Inactive) added a comment (edited)

            You probably need this patch http://review.whamcloud.com/14093 from master.

            Can you reproduce it on master?


            I can still reproduce the problem with patch 13934. Besides, do we really need 13934? mode is passed in from osc_dlm_blocking_ast0, and it will be set to CLM_WRITE there for group locks as well.

            lidongyang Li Dongyang (Inactive) added a comment
            Will patch http://review.whamcloud.com/13934 help this case?

            jay Jinshan Xiong (Inactive) added a comment

            Li Dongyang (dongyang.li@anu.edu.au) uploaded a new patch: http://review.whamcloud.com/15119
            Subject: LU-6679 ldlm: do not set cbpending for group locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2c02927b9858c4e7858f569a304d6b2024a186a5

            gerrit Gerrit Updater added a comment

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: lidongyang Li Dongyang (Inactive)
              Votes: 0
              Watchers: 7
