[LU-5282] (mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed Created: 01/Jul/14  Updated: 04/Aug/17  Resolved: 08/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: llite, malloc, mdc, patch

Attachments: File alloc-fail.patch    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 14743

 Description   

In ll_close_inode_openhandle(), if mdc_close() is not called, or if it fails to clear mod->mod_open_req->rq_replay, then we hit the following LBUG:

[ 2759.567725] LustreError: 23179:0:(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
[ 2759.568559] LustreError: 23203:0:(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
[ 2759.568563] LustreError: 23203:0:(mdc_request.c:791:mdc_free_open()) LBUG
[ 2759.568564] Pid: 23203, comm: ls
[ 2759.568566]
[ 2759.568566] Call Trace:
[ 2759.568584]  [<ffffffffa0ef18c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 2759.568594]  [<ffffffffa0ef1ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 2759.568601]  [<ffffffffa055ce60>] mdc_clear_open_replay_data+0x0/0x240 [mdc]
[ 2759.568607]  [<ffffffffa055ceb8>] mdc_clear_open_replay_data+0x58/0x240 [mdc]
[ 2759.568630]  [<ffffffffa0523fce>] lmv_clear_open_replay_data+0x20e/0x6f0 [lmv]
[ 2759.568655]  [<ffffffffa0b13375>] ll_close_inode_openhandle+0x485/0xfe0 [lustre]
[ 2759.568670]  [<ffffffffa0b1463f>] ll_md_real_close+0x1af/0x220 [lustre]
[ 2759.568685]  [<ffffffffa0b14d15>] ll_file_release+0x665/0xb90 [lustre]
[ 2759.568698]  [<ffffffffa0afbaab>] ll_dir_release+0xdb/0xf0 [lustre]
[ 2759.568703]  [<ffffffff811a4318>] __fput+0x108/0x230
[ 2759.568706]  [<ffffffff811a4465>] fput+0x25/0x30
[ 2759.568708]  [<ffffffff8119f64d>] filp_close+0x5d/0x90

The easiest way to see it is to inject a failure to allocate op_data in ll_close_inode_openhandle(), but early returns from lmv_close() or mdc_close() will likely trigger the same assertion.
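
A minimal, self-contained model of the bookkeeping that the assertion guards. The fake_* names below are hypothetical stand-ins, not the real ptlrpc/mdc structures; only the rq_replay and mod_open_req field names are borrowed from the ticket.

#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the request and the open data. */
struct fake_request {
        int rq_replay;                  /* request kept around for open replay */
};

struct fake_open_data {
        struct fake_request *mod_open_req;
};

/* Models the close step that is supposed to clear the replay flag. */
static void fake_close(struct fake_open_data *mod)
{
        mod->mod_open_req->rq_replay = 0;
}

/* Models mdc_free_open(): it assumes the close step already ran. */
static void fake_free_open(struct fake_open_data *mod)
{
        /* This is the check that LBUGs when the close was skipped or failed. */
        assert(mod->mod_open_req->rq_replay == 0);
        printf("open data freed\n");
}

int main(void)
{
        struct fake_request req = { .rq_replay = 1 };
        struct fake_open_data mod = { .mod_open_req = &req };

        /* Skipping fake_close() here mimics a failed op_data allocation in
         * ll_close_inode_openhandle(); fake_free_open() then aborts. */
        fake_close(&mod);
        fake_free_open(&mod);
        return 0;
}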



 Comments   
Comment by Andreas Dilger [ 02/Jul/14 ]

John, how did you hit this originally, and how easily is it reproduced? I'm just trying to decide whether this is a regression that needs to be addressed for 2.6.0.

Comment by John Hammond [ 02/Jul/14 ]

After LU-5274 I dug up my OBD_ALLOC() fault injection patch and ran racer with it applied.
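
The attached alloc-fail.patch is not reproduced here. As a purely illustrative sketch of the idea (a hypothetical alloc_maybe_fail() wrapper, not OBD_ALLOC() or the real patch), failing a small fraction of allocations is enough to push callers down rarely tested error paths like the one in this ticket:

#include <stdio.h>
#include <stdlib.h>

static unsigned int fail_one_in = 100;  /* fail roughly 1 in 100 allocations */

static void *alloc_maybe_fail(size_t size)
{
        if (fail_one_in && rand() % fail_one_in == 0)
                return NULL;            /* injected allocation failure */
        return calloc(1, size);
}

int main(void)
{
        int failures = 0;

        for (int i = 0; i < 10000; i++) {
                void *p = alloc_maybe_fail(128);

                if (p == NULL)
                        failures++;     /* the caller must cope with this */
                else
                        free(p);
        }
        printf("injected %d allocation failures\n", failures);
        return 0;
}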

Comment by Ann Koehler (Inactive) [ 13/Nov/15 ]

A couple of Cray customers have hit this assertion using Lustre 2.5. One case suggests that the bug may be reproduced by running du against a large file system. The du causes Lustre to run the node out of memory (millions of unused ldlm_locks being canceled). An ldlm_bl thread calls into mdc_close but cannot allocate memory for the RPC. mdc_close does not reset the rq_replay flag, and the subsequent call to mdc_free_open asserts. The ldlm_locks were held primarily by two processes, one of which had finished by the time of the LBUG, so there may have been at least two processes running du that triggered the LBUG.

> c0-0c1s1n2 LustreError: 28818:0:(file.c:172:ll_close_inode_openhandle()) inode 4035227916085244137 mdc close failed: rc = -12
> c0-0c1s1n2 LustreError: 28818:0:(mdc_request.c:816:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
> c0-0c1s1n2 LustreError: 28818:0:(mdc_request.c:816:mdc_free_open()) LBUG
> c0-0c1s1n2 Pid: 28818, comm: ldlm_bl_51
> c0-0c1s1n2 Call Trace:
> c0-0c1s1n2 [<ffffffff81005f31>] try_stack_unwind+0x191/0x1a0
> c0-0c1s1n2 [<ffffffff810047eb>] dump_trace+0x8b/0x350
> c0-0c1s1n2 [<ffffffffa059e813>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
> c0-0c1s1n2 [<ffffffffa059eda5>] lbug_with_loc+0x45/0xc0 [libcfs]
> c0-0c1s1n2 [<ffffffffa0b5912a>] mdc_free_open+0xda/0x110 [mdc]
> c0-0c1s1n2 [<ffffffffa0b5b9e8>] mdc_clear_open_replay_data+0x58/0x280 [mdc]
> c0-0c1s1n2 [<ffffffffa0269c3d>] lmv_clear_open_replay_data+0x17d/0x690 [lmv]
> c0-0c1s1n2 [<ffffffffa0bb3a57>] ll_close_inode_openhandle+0x437/0xf20 [lustre]
> c0-0c1s1n2 [<ffffffffa0bb5d0a>] ll_md_real_close+0x11a/0x210 [lustre]
> c0-0c1s1n2 [<ffffffffa0be5635>] ll_md_blocking_ast+0x665/0x780 [lustre]
> c0-0c1s1n2 [<ffffffffa09355cd>] ldlm_cancel_callback+0x6d/0x150 [ptlrpc]
> c0-0c1s1n2 [<ffffffffa0940ca0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
> c0-0c1s1n2 [<ffffffffa094575a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
> c0-0c1s1n2 [<ffffffffa094abe7>] ldlm_bl_thread_main+0xd7/0x3f0 [ptlrpc]
> c0-0c1s1n2 [<ffffffff81069a70>] kthread+0xc0/0xd0
> c0-0c1s1n2 [<ffffffff815a0a88>] ret_from_fork+0x58/0x90

> PID: 11210  TASK: ffff880c5835f200  CPU: 4   COMMAND: "du"
>  #0 [ffff880cfba2bbc8] schedule at ffffffff81595e45
>  #1 [ffff880cfba2bc48] cl_lock_state_wait at ffffffffa0825025 [obdclass]
>  #2 [ffff880cfba2bcc0] cl_enqueue_locked at ffffffffa0829003 [obdclass]
>  #3 [ffff880cfba2bcf8] cl_lock_request at ffffffffa0829d9b [obdclass]
>  #4 [ffff880cfba2bd50] cl_glimpse_lock at ffffffffa0c03785 [lustre]
>  #5 [ffff880cfba2bda0] cl_glimpse_size0 at ffffffffa0c03c25 [lustre]
>  #6 [ffff880cfba2bde0] ll_inode_revalidate_it at ffffffffa0bbf416 [lustre]
>  #7 [ffff880cfba2be10] ll_getattr_it at ffffffffa0bbf496 [lustre]
>  #8 [ffff880cfba2be40] ll_getattr at ffffffffa0bbf5d4 [lustre]
>  #9 [ffff880cfba2be98] vfs_getattr_nosec at ffffffff8117f4a9
> #10 [ffff880cfba2bea8] vfs_fstatat at ffffffff8117f622
> #11 [ffff880cfba2bef0] SYSC_newfstatat at ffffffff8117fb15
> #12 [ffff880cfba2bf70] sys_newfstatat at ffffffff8117fcfe
> #13 [ffff880cfba2bf80] system_call_fastpath at ffffffff815a0b32
>     RIP: 00007f1ab40b383a  RSP: 00007ffc0061e630  RFLAGS: 00000246
>     RAX: 0000000000000106  RBX: ffffffff815a0b32  RCX: 0000000000000000
>     RDX: 0000000000e695e8  RSI: 0000000000e69678  RDI: 0000000000000005
>     RBP: 0000000000e695e8   R8: 0000000000000100   R9: 0000000000000002
>     R10: 0000000000000100  R11: 0000000000000246  R12: ffffffff8117fcfe
>     R13: ffff880cfba2bf78  R14: 0000000000e69570  R15: 0000000000000000
>     ORIG_RAX: 0000000000000106  CS: 0033  SS: 002b
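
A caller-side sketch of that sequence, again using hypothetical fake_* stand-ins rather than the real llite/mdc code: the close returns -ENOMEM before it can clear rq_replay, the error is only logged (the "mdc close failed: rc = -12" line above), and the teardown still reaches the fatal check.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_request { int rq_replay; };
struct fake_open_data { struct fake_request *mod_open_req; };

/* The close cannot allocate its RPC under memory pressure, so it returns
 * -ENOMEM without ever resetting rq_replay. */
static int fake_md_close(struct fake_open_data *mod)
{
        (void)mod;
        return -ENOMEM;
}

/* Pre-fix teardown: a still-set replay flag is treated as fatal. */
static void fake_clear_open_replay_data(struct fake_open_data *mod)
{
        if (mod->mod_open_req->rq_replay != 0) {
                fprintf(stderr, "LBUG: rq_replay still set\n");
                abort();
        }
}

int main(void)
{
        struct fake_request req = { .rq_replay = 1 };
        struct fake_open_data mod = { .mod_open_req = &req };
        int rc = fake_md_close(&mod);

        if (rc)         /* mirrors "mdc close failed: rc = -12" */
                fprintf(stderr, "mdc close failed: rc = %d\n", rc);
        /* The failure does not stop the teardown, so the fatal check fires. */
        fake_clear_open_replay_data(&mod);
        return 0;
}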
Comment by Gerrit Updater [ 03/Dec/15 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/17442
Subject: LU-5282 test: check error during mdc_close
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5f71f93bac5472e2835ee1544770c243de1f95a0

Comment by Peter Jones [ 07/Dec/15 ]

Fix pushed to gerrit - http://review.whamcloud.com/#/c/17495/

Comment by Alexander Boyko [ 18/Dec/15 ]

The patch changes the assertion to a warning. It is not a fatal error if the client doesn't send a close request to the MDS; in the worst case the MDS evicts the client and the client doesn't send cookies to the OST. With the reproducer I see that the mdc export was not fully released:
18 ST mdc lustre-MDT0000-mdc-ffff88005be2d800 18902755-d63e-aa43-5b3f-5e81f3be1b56 2
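
In the spirit of that change, a minimal sketch of downgrading the hard assertion to a warning plus cleanup; the fake_* names are hypothetical stand-ins, not the actual change in http://review.whamcloud.com/17495:

#include <stdio.h>

struct fake_request { int rq_replay; };
struct fake_open_data { struct fake_request *mod_open_req; };

/* Post-fix flavour: a close request that never reached the MDS is reported
 * and cleaned up instead of panicking the node; the worst case is that the
 * MDS eventually evicts the client. */
static void fake_free_open(struct fake_open_data *mod)
{
        if (mod->mod_open_req->rq_replay != 0) {
                fprintf(stderr, "close request was not sent, dropping it\n");
                mod->mod_open_req->rq_replay = 0;
        }
        /* ...release the request and the open data here... */
}

int main(void)
{
        struct fake_request req = { .rq_replay = 1 };   /* failed close */
        struct fake_open_data mod = { .mod_open_req = &req };

        fake_free_open(&mod);   /* warns instead of LBUGging */
        return 0;
}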

Comment by Gerrit Updater [ 05/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17495/
Subject: LU-5282 mdc: fix panic at mdc_free_open()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 62713a8530349a75b7202c9bfc6be121409a0203

Comment by Joseph Gmitter (Inactive) [ 08/Jul/16 ]

Landed to master for 2.9.0
