[LU-6679] ASSERTION( !ext->oe_hp ) failed with group lock Created: 03/Jun/15 Updated: 13/Jan/17 Resolved: 25/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Li Dongyang (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
CentOS 6 + Lustre master tree 2.7+ |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We can crash the Lustre client using the simple code below:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <lustre/lustre_user.h>

int main(int argc, char *argv[])
{
    int fd;
    int rc;
    int gid;
    char *buf;

    fd = open("/mnt/testfile.00000000", O_RDWR);
    gid = atoi(argv[1]);
    /* take a group lock using the gid given on the command line */
    rc = ioctl(fd, LL_IOC_GROUP_LOCK, gid);
    if (rc) {
        printf("ioctl %d\n", rc);
        return rc;
    }
    buf = malloc(1<<20);
    memset(buf, 1, 1<<20);
    /* write 1 MiB chunks forever while holding the group lock */
    while (1)
        write(fd, buf, 1<<20);
    return 0;
}

For demo purposes only; there is no error handling. Run the program with a gid, then run another instance of the program with a different gid. The client crashes right after.

crash> bt
PID: 1989 TASK: ffff88013ca86ae0 CPU: 1 COMMAND: "ldlm_bl_02"
 #0 [ffff88011fe0d8c8] machine_kexec at ffffffff8103b5bb
 #1 [ffff88011fe0d928] crash_kexec at ffffffff810c9852
 #2 [ffff88011fe0d9f8] panic at ffffffff81529343
 #3 [ffff88011fe0da78] lbug_with_loc at ffffffffa0251ecb [libcfs]
 #4 [ffff88011fe0da98] osc_cache_writeback_range at ffffffffa0958e35 [osc]
 #5 [ffff88011fe0dc28] osc_lock_flush at ffffffffa0941495 [osc]
 #6 [ffff88011fe0dca8] osc_ldlm_blocking_ast at ffffffffa0941828 [osc]
 #7 [ffff88011fe0dd18] ldlm_cancel_callback at ffffffffa05404cc [ptlrpc]
 #8 [ffff88011fe0dd38] ldlm_cli_cancel_local at ffffffffa05525da [ptlrpc]
 #9 [ffff88011fe0dd68] ldlm_cli_cancel at ffffffffa0557200 [ptlrpc]
#10 [ffff88011fe0dda8] osc_ldlm_blocking_ast at ffffffffa094165b [osc]
#11 [ffff88011fe0de18] ldlm_handle_bl_callback at ffffffffa055ac40 [ptlrpc]
#12 [ffff88011fe0de48] ldlm_bl_thread_main at ffffffffa055b1a1 [ptlrpc]
#13 [ffff88011fe0dee8] kthread at ffffffff8109e66e
#14 [ffff88011fe0df48] kernel_thread at ffffffff8100c20a |
| Comments |
| Comment by Gerrit Updater [ 03/Jun/15 ] |
|
Li Dongyang (dongyang.li@anu.edu.au) uploaded a new patch: http://review.whamcloud.com/15119 |
| Comment by Jinshan Xiong (Inactive) [ 03/Jun/15 ] |
|
Will patch http://review.whamcloud.com/13934 help in this case? |
| Comment by Li Dongyang (Inactive) [ 04/Jun/15 ] |
|
I can still reproduce the problem with patch 13934. Besides, do we really need 13934? mode is passed in from osc_dlm_blocking_ast0 and |
| Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ] |
|
You probably need this patch http://review.whamcloud.com/14093 from master. Can you reproduce it on master? |
| Comment by Li Dongyang (Inactive) [ 04/Jun/15 ] |
|
Hi Jinshan, |
| Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ] |
|
After checking the code, I now understand the problem. I agree that your patch will fix it, but I would like to enhance the patch by not trying to revoke the group lock on the server side. What do you think? |
| Comment by Li Dongyang (Inactive) [ 04/Jun/15 ] |
|
I don't think the server has anything to do with it. The server sends a blocking AST, which is all right, so that when process A releases the lock, process B can get it. |
| Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ] |
|
My point is: now that clients won't cache group locks, why should servers bother sending a blocking AST? It simply wastes RPCs. Therefore a proper fix is your patch plus a corresponding fix on the server side. |
| Comment by Li Dongyang (Inactive) [ 04/Jun/15 ] |
|
OK, do you mean the server won't send a blocking AST to the client, and will only issue an AST to the client once it is possible to grant the lock (the previous lock has been released)? |
| Comment by Jinshan Xiong (Inactive) [ 04/Jun/15 ] |
|
At present we still need patch 15119 to handle the case where the client is talking to old servers. But in the long run, we won't need patch 15119. |
| Comment by Gerrit Updater [ 25/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15119/ |
| Comment by Peter Jones [ 25/Jul/15 ] |
|
landed for 2.8 |