[LU-5282] (mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed Created: 01/Jul/14 Updated: 04/Aug/17 Resolved: 08/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llite, malloc, mdc, patch |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14743 |
| Description |
|
In ll_close_inode_openhandle(), if mdc_close() is not called, or if it fails to clear mod->mod_open_req->rq_replay, then we hit the following LBUG:

[ 2759.567725] LustreError: 23179:0:(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
[ 2759.568559] LustreError: 23203:0:(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
[ 2759.568563] LustreError: 23203:0:(mdc_request.c:791:mdc_free_open()) LBUG
[ 2759.568564] Pid: 23203, comm: ls
[ 2759.568566]
[ 2759.568566] Call Trace:
[ 2759.568584] [<ffffffffa0ef18c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 2759.568594] [<ffffffffa0ef1ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 2759.568601] [<ffffffffa055ce60>] mdc_clear_open_replay_data+0x0/0x240 [mdc]
[ 2759.568607] [<ffffffffa055ceb8>] mdc_clear_open_replay_data+0x58/0x240 [mdc]
[ 2759.568630] [<ffffffffa0523fce>] lmv_clear_open_replay_data+0x20e/0x6f0 [lmv]
[ 2759.568655] [<ffffffffa0b13375>] ll_close_inode_openhandle+0x485/0xfe0 [lustre]
[ 2759.568670] [<ffffffffa0b1463f>] ll_md_real_close+0x1af/0x220 [lustre]
[ 2759.568685] [<ffffffffa0b14d15>] ll_file_release+0x665/0xb90 [lustre]
[ 2759.568698] [<ffffffffa0afbaab>] ll_dir_release+0xdb/0xf0 [lustre]
[ 2759.568703] [<ffffffff811a4318>] __fput+0x108/0x230
[ 2759.568706] [<ffffffff811a4465>] fput+0x25/0x30
[ 2759.568708] [<ffffffff8119f64d>] filp_close+0x5d/0x90

The easiest way to see it is to inject a failure to allocate op_data in ll_close_inode_openhandle(), but early returns from lmv_close() or mdc_close() will likely trigger the same assertion. |
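For illustration only, here is a minimal user-space sketch of the failure mode described above. The names echo Lustre's mdc/llite code, but every type and function here (the *_model identifiers, the inject_enomem flag) is a hypothetical simplification, not the real implementation: a close that returns early on -ENOMEM leaves rq_replay set, and the later free path asserts.

/* Minimal stand-in model of the failure path; names echo Lustre's
 * mdc/llite code but are hypothetical simplifications. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct req_model {
	int rq_replay;          /* set while the open must remain replayable */
};

struct mod_model {
	struct req_model *mod_open_req;
};

/* Stands in for mdc_close(): a successful close clears rq_replay, but an
 * allocation failure returns early and leaves the flag set. */
static int mdc_close_model(struct mod_model *mod, int inject_enomem)
{
	if (inject_enomem)
		return -12;     /* -ENOMEM, as in the reports below */
	mod->mod_open_req->rq_replay = 0;
	return 0;
}

/* Stands in for mdc_free_open(): the pre-patch code asserts that a
 * successful close already cleared the replay flag. */
static void mdc_free_open_model(struct mod_model *mod)
{
	assert(mod->mod_open_req->rq_replay == 0);      /* the LBUG */
	free(mod->mod_open_req);
}

int main(void)
{
	struct mod_model mod = {
		.mod_open_req = calloc(1, sizeof(struct req_model)),
	};

	mod.mod_open_req->rq_replay = 1;                /* open kept for replay */

	if (mdc_close_model(&mod, 1))                   /* inject the allocation failure */
		fprintf(stderr, "mdc close failed: rc = -12\n");

	mdc_free_open_model(&mod);                      /* aborts here, like the LBUG */
	return 0;
}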
| Comments |
| Comment by Andreas Dilger [ 02/Jul/14 ] |
|
John, how did you hit this originally, and how easily is it reproduced? Just trying to decide whether this is a regression that needs to be addressed for 2.6.0. |
| Comment by John Hammond [ 02/Jul/14 ] |
|
After |
| Comment by Ann Koehler (Inactive) [ 13/Nov/15 ] |
|
A couple of Cray customers have hit this assertion using Lustre 2.5. One case suggests that the bug may be reproduced by running du against a large file system. The du causes Lustre to run the node out of memory (millions of unused ldlm_locks being canceled). An ldlm_bl thread calls into mdc_close but cannot allocate memory for the rpc. The mdc_close does not reset the rq_replay flag, and the subsequent call to mdc_free_open asserts.

The ldlm_locks were held primarily by 2 processes, one of which had finished by the time of the LBUG. So there may have been at least 2 processes running du that triggered the LBUG.

> c0-0c1s1n2 LustreError: 28818:0:(file.c:172:ll_close_inode_openhandle()) inode 4035227916085244137 mdc close failed: rc = -12
> c0-0c1s1n2 LustreError: 28818:0:(mdc_request.c:816:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
> c0-0c1s1n2 LustreError: 28818:0:(mdc_request.c:816:mdc_free_open()) LBUG
> c0-0c1s1n2 Pid: 28818, comm: ldlm_bl_51
> c0-0c1s1n2 Call Trace:
> c0-0c1s1n2 [<ffffffff81005f31>] try_stack_unwind+0x191/0x1a0
> c0-0c1s1n2 [<ffffffff810047eb>] dump_trace+0x8b/0x350
> c0-0c1s1n2 [<ffffffffa059e813>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
> c0-0c1s1n2 [<ffffffffa059eda5>] lbug_with_loc+0x45/0xc0 [libcfs]
> c0-0c1s1n2 [<ffffffffa0b5912a>] mdc_free_open+0xda/0x110 [mdc]
> c0-0c1s1n2 [<ffffffffa0b5b9e8>] mdc_clear_open_replay_data+0x58/0x280 [mdc]
> c0-0c1s1n2 [<ffffffffa0269c3d>] lmv_clear_open_replay_data+0x17d/0x690 [lmv]
> c0-0c1s1n2 [<ffffffffa0bb3a57>] ll_close_inode_openhandle+0x437/0xf20 [lustre]
> c0-0c1s1n2 [<ffffffffa0bb5d0a>] ll_md_real_close+0x11a/0x210 [lustre]
> c0-0c1s1n2 [<ffffffffa0be5635>] ll_md_blocking_ast+0x665/0x780 [lustre]
> c0-0c1s1n2 [<ffffffffa09355cd>] ldlm_cancel_callback+0x6d/0x150 [ptlrpc]
> c0-0c1s1n2 [<ffffffffa0940ca0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
> c0-0c1s1n2 [<ffffffffa094575a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
> c0-0c1s1n2 [<ffffffffa094abe7>] ldlm_bl_thread_main+0xd7/0x3f0 [ptlrpc]
> c0-0c1s1n2 [<ffffffff81069a70>] kthread+0xc0/0xd0
> c0-0c1s1n2 [<ffffffff815a0a88>] ret_from_fork+0x58/0x90

> PID: 11210 TASK: ffff880c5835f200 CPU: 4 COMMAND: "du"
> #0 [ffff880cfba2bbc8] schedule at ffffffff81595e45
> #1 [ffff880cfba2bc48] cl_lock_state_wait at ffffffffa0825025 [obdclass]
> #2 [ffff880cfba2bcc0] cl_enqueue_locked at ffffffffa0829003 [obdclass]
> #3 [ffff880cfba2bcf8] cl_lock_request at ffffffffa0829d9b [obdclass]
> #4 [ffff880cfba2bd50] cl_glimpse_lock at ffffffffa0c03785 [lustre]
> #5 [ffff880cfba2bda0] cl_glimpse_size0 at ffffffffa0c03c25 [lustre]
> #6 [ffff880cfba2bde0] ll_inode_revalidate_it at ffffffffa0bbf416 [lustre]
> #7 [ffff880cfba2be10] ll_getattr_it at ffffffffa0bbf496 [lustre]
> #8 [ffff880cfba2be40] ll_getattr at ffffffffa0bbf5d4 [lustre]
> #9 [ffff880cfba2be98] vfs_getattr_nosec at ffffffff8117f4a9
> #10 [ffff880cfba2bea8] vfs_fstatat at ffffffff8117f622
> #11 [ffff880cfba2bef0] SYSC_newfstatat at ffffffff8117fb15
> #12 [ffff880cfba2bf70] sys_newfstatat at ffffffff8117fcfe
> #13 [ffff880cfba2bf80] system_call_fastpath at ffffffff815a0b32
> RIP: 00007f1ab40b383a RSP: 00007ffc0061e630 RFLAGS: 00000246
> RAX: 0000000000000106 RBX: ffffffff815a0b32 RCX: 0000000000000000
> RDX: 0000000000e695e8 RSI: 0000000000e69678 RDI: 0000000000000005
> RBP: 0000000000e695e8 R8: 0000000000000100 R9: 0000000000000002
> R10: 0000000000000100 R11: 0000000000000246 R12: ffffffff8117fcfe
> R13: ffff880cfba2bf78 R14: 0000000000e69570 R15: 0000000000000000
> ORIG_RAX: 0000000000000106 CS: 0033 SS: 002b |
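To mirror the sequence in this report within the simplified model sketched in the description above (again a hypothetical stand-in, not the real llite code): the caller logs the -ENOMEM from the close, as in the "mdc close failed: rc = -12" line, but still tears down the open replay data, which is where the assertion fires.

/* Builds on the mdc_close_model()/mdc_free_open_model() sketch above;
 * hypothetical stand-in for ll_close_inode_openhandle(), not real code. */
static void ll_close_inode_openhandle_model(struct mod_model *mod,
					     int inject_enomem)
{
	int rc = mdc_close_model(mod, inject_enomem);

	if (rc)
		fprintf(stderr,
			"ll_close_inode_openhandle: mdc close failed: rc = %d\n",
			rc);

	/* Stands in for mdc_clear_open_replay_data(), which reaches
	 * mdc_free_open() regardless of rc -- so the assert still fires. */
	mdc_free_open_model(mod);
}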
| Comment by Gerrit Updater [ 03/Dec/15 ] |
|
Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/17442 |
| Comment by Peter Jones [ 07/Dec/15 ] |
|
Fix pushed to gerrit - http://review.whamcloud.com/#/c/17495/ |
| Comment by Alexander Boyko [ 18/Dec/15 ] |
|
The patch changes the assertion to a warning. It is not a fatal error if the client doesn't send a close request to the MDS. In the worst case, the MDS evicts the client and the client doesn't send cookies to the OST. With the reproducer, I see that the mdc export was not fully released. |
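In terms of the simplified model above, the change amounts to replacing the hard assertion with a logged warning so the client can survive a close that was never sent. The actual fix (http://review.whamcloud.com/17495) does this inside mdc_free_open() using Lustre's own logging, so the function below is only an illustrative stand-in.

/* Hypothetical stand-in showing the assertion-to-warning idea; the real
 * change lands in mdc_free_open() via http://review.whamcloud.com/17495. */
static void mdc_free_open_model_patched(struct mod_model *mod)
{
	if (mod->mod_open_req->rq_replay != 0)
		fprintf(stderr, "mdc_free_open: close request was not sent; "
			"MDS may evict the client and OST cookies are not sent\n");
	free(mod->mod_open_req);
}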
| Comment by Gerrit Updater [ 05/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17495/ |
| Comment by Joseph Gmitter (Inactive) [ 08/Jul/16 ] |
|
Landed to master for 2.9.0 |