  Lustre / LU-5282

(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed

Details


    Description

      In ll_close_inode_openhandle(), if mdc_close() is not called, or if it is called but fails to clear mod->mod_open_req->rq_replay, we hit the following LBUG:

      [ 2759.567725] LustreError: 23179:0:(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
      [ 2759.568559] LustreError: 23203:0:(mdc_request.c:791:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
      [ 2759.568563] LustreError: 23203:0:(mdc_request.c:791:mdc_free_open()) LBUG
      [ 2759.568564] Pid: 23203, comm: ls
      [ 2759.568566]
      [ 2759.568566] Call Trace:
      [ 2759.568584]  [<ffffffffa0ef18c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [ 2759.568594]  [<ffffffffa0ef1ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [ 2759.568601]  [<ffffffffa055ce60>] mdc_clear_open_replay_data+0x0/0x240 [mdc]
      [ 2759.568607]  [<ffffffffa055ceb8>] mdc_clear_open_replay_data+0x58/0x240 [mdc]
      [ 2759.568630]  [<ffffffffa0523fce>] lmv_clear_open_replay_data+0x20e/0x6f0 [lmv]
      [ 2759.568655]  [<ffffffffa0b13375>] ll_close_inode_openhandle+0x485/0xfe0 [lustre]
      [ 2759.568670]  [<ffffffffa0b1463f>] ll_md_real_close+0x1af/0x220 [lustre]
      [ 2759.568685]  [<ffffffffa0b14d15>] ll_file_release+0x665/0xb90 [lustre]
      [ 2759.568698]  [<ffffffffa0afbaab>] ll_dir_release+0xdb/0xf0 [lustre]
      [ 2759.568703]  [<ffffffff811a4318>] __fput+0x108/0x230
      [ 2759.568706]  [<ffffffff811a4465>] fput+0x25/0x30
      [ 2759.568708]  [<ffffffff8119f64d>] filp_close+0x5d/0x90
      

      The easiest way to see this is to inject a failure to allocate op_data in ll_close_inode_openhandle(), but early returns from lmv_close() or mdc_close() will likely trigger the same assertion.
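
      For illustration, a minimal sketch of that kind of injected failure is below. It assumes the op_data allocation near the top of ll_close_inode_openhandle(); OBD_FAIL_LLITE_CLOSE_OPDATA is a hypothetical fail_loc id invented for the sketch, while OBD_ALLOC_PTR(), OBD_FAIL_CHECK() and the early-return shape are the standard Lustre idioms the description refers to.

      /* Sketch only -- not the verbatim llite code. */
      OBD_ALLOC_PTR(op_data);
      if (op_data == NULL || OBD_FAIL_CHECK(OBD_FAIL_LLITE_CLOSE_OPDATA)) {
              /* No close RPC is ever sent on this path, so
               * mod->mod_open_req->rq_replay is never cleared and the later
               * mdc_clear_open_replay_data() -> mdc_free_open() call LBUGs. */
              GOTO(out, rc = -ENOMEM);
      }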


          Activity


            jgmitter Joseph Gmitter (Inactive) added a comment - Landed to master for 2.9.0

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17495/
            Subject: LU-5282 mdc: fix panic at mdc_free_open()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 62713a8530349a75b7202c9bfc6be121409a0203

            aboyko Alexander Boyko added a comment -

            The patch changes the assertion to a warning (sketched below). It is not a fatal error if the client does not send a close request to the MDS: in the worst case the MDS evicts the client and the client does not send the cookies to the OST. With the reproducer I can see that the mdc export was not fully released:
            18 ST mdc lustre-MDT0000-mdc-ffff88005be2d800 18902755-d63e-aa43-5b3f-5e81f3be1b56 2
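
            A minimal sketch of the change being described, assuming it lands in mdc_free_open() in lustre/mdc/mdc_request.c; this is not the verbatim patch from http://review.whamcloud.com/17495/, only the assertion-to-warning idea behind it:

            /* Before: a close that failed (e.g. with -ENOMEM) left rq_replay set
             * on the open request and the client LBUGed here. */
            /* LASSERT(mod->mod_open_req->rq_replay == 0); */

            /* After (sketch): warn instead of asserting.  A skipped close is not
             * fatal to the client; at worst the MDS evicts it. */
            if (mod->mod_open_req->rq_replay != 0)
                    CDEBUG(D_ERROR, "freeing open request with rq_replay still set\n");
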
            pjones Peter Jones added a comment - Fix pushed to gerrit - http://review.whamcloud.com/#/c/17495/

            gerrit Gerrit Updater added a comment -

            Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/17442
            Subject: LU-5282 test: check error during mdc_close
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5f71f93bac5472e2835ee1544770c243de1f95a0

            amk Ann Koehler (Inactive) added a comment - - edited

            A couple of Cray customers have hit this assertion on Lustre 2.5. One case suggests that the bug can be reproduced by running du against a large file system. The du causes Lustre to run the node out of memory (millions of unused ldlm_locks being canceled). An ldlm_bl thread calls into mdc_close but cannot allocate memory for the RPC. mdc_close therefore does not reset the rq_replay flag, and the subsequent call to mdc_free_open asserts. The ldlm_locks were held primarily by two processes, one of which had finished by the time of the LBUG, so at least two du processes may have contributed to triggering it.

            >c0-0c1s1n2 LustreError: 28818:0:(file.c:172:ll_close_inode_openhandle()) inode 4035227916085244137 mdc close failed: rc = -12
            >c0-0c1s1n2 LustreError: 28818:0:(mdc_request.c:816:mdc_free_open()) ASSERTION( mod->mod_open_req->rq_replay == 0 ) failed:
            > c0-0c1s1n2 LustreError: 28818:0:(mdc_request.c:816:mdc_free_open()) LBUG
            > c0-0c1s1n2 Pid: 28818, comm: ldlm_bl_51
            > c0-0c1s1n2 Call Trace:
            > c0-0c1s1n2 [<ffffffff81005f31>] try_stack_unwind+0x191/0x1a0
            > c0-0c1s1n2 [<ffffffff810047eb>] dump_trace+0x8b/0x350
            > c0-0c1s1n2 [<ffffffffa059e813>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
            > c0-0c1s1n2 [<ffffffffa059eda5>] lbug_with_loc+0x45/0xc0 [libcfs]
            > c0-0c1s1n2 [<ffffffffa0b5912a>] mdc_free_open+0xda/0x110 [mdc]
            > c0-0c1s1n2 [<ffffffffa0b5b9e8>] mdc_clear_open_replay_data+0x58/0x280 [mdc]
            > c0-0c1s1n2 [<ffffffffa0269c3d>] lmv_clear_open_replay_data+0x17d/0x690 [lmv]
            > c0-0c1s1n2 [<ffffffffa0bb3a57>] ll_close_inode_openhandle+0x437/0xf20 [lustre]
            > c0-0c1s1n2 [<ffffffffa0bb5d0a>] ll_md_real_close+0x11a/0x210 [lustre]
            > c0-0c1s1n2 [<ffffffffa0be5635>] ll_md_blocking_ast+0x665/0x780 [lustre]
            > c0-0c1s1n2 [<ffffffffa09355cd>] ldlm_cancel_callback+0x6d/0x150 [ptlrpc]
            > c0-0c1s1n2 [<ffffffffa0940ca0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
            > c0-0c1s1n2 [<ffffffffa094575a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
            > c0-0c1s1n2 [<ffffffffa094abe7>] ldlm_bl_thread_main+0xd7/0x3f0 [ptlrpc]
            > c0-0c1s1n2 [<ffffffff81069a70>] kthread+0xc0/0xd0
            > c0-0c1s1n2 [<ffffffff815a0a88>] ret_from_fork+0x58/0x90
            
            > PID: 11210  TASK: ffff880c5835f200  CPU: 4   COMMAND: "du"
            >  #0 [ffff880cfba2bbc8] schedule at ffffffff81595e45
            >  #1 [ffff880cfba2bc48] cl_lock_state_wait at ffffffffa0825025 [obdclass]
            >  #2 [ffff880cfba2bcc0] cl_enqueue_locked at ffffffffa0829003 [obdclass]
            >  #3 [ffff880cfba2bcf8] cl_lock_request at ffffffffa0829d9b [obdclass]
            >  #4 [ffff880cfba2bd50] cl_glimpse_lock at ffffffffa0c03785 [lustre]
            >  #5 [ffff880cfba2bda0] cl_glimpse_size0 at ffffffffa0c03c25 [lustre]
            >  #6 [ffff880cfba2bde0] ll_inode_revalidate_it at ffffffffa0bbf416 [lustre]
            >  #7 [ffff880cfba2be10] ll_getattr_it at ffffffffa0bbf496 [lustre]
            >  #8 [ffff880cfba2be40] ll_getattr at ffffffffa0bbf5d4 [lustre]
            >  #9 [ffff880cfba2be98] vfs_getattr_nosec at ffffffff8117f4a9
            > #10 [ffff880cfba2bea8] vfs_fstatat at ffffffff8117f622
            > #11 [ffff880cfba2bef0] SYSC_newfstatat at ffffffff8117fb15
            > #12 [ffff880cfba2bf70] sys_newfstatat at ffffffff8117fcfe
            > #13 [ffff880cfba2bf80] system_call_fastpath at ffffffff815a0b32
            >     RIP: 00007f1ab40b383a  RSP: 00007ffc0061e630  RFLAGS: 00000246
            >     RAX: 0000000000000106  RBX: ffffffff815a0b32  RCX: 0000000000000000
            >     RDX: 0000000000e695e8  RSI: 0000000000e69678  RDI: 0000000000000005
            >     RBP: 0000000000e695e8   R8: 0000000000000100   R9: 0000000000000002
            >     R10: 0000000000000100  R11: 0000000000000246  R12: ffffffff8117fcfe
            >     R13: ffff880cfba2bf78  R14: 0000000000e69570  R15: 0000000000000000
            >     ORIG_RAX: 0000000000000106  CS: 0033  SS: 002b
            
            jhammond John Hammond added a comment -

            After LU-5274 I dug up my OBD_ALLOC() fault injection patch and ran racer with it applied.

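            For context, a minimal sketch of that kind of OBD_ALLOC() fault injection; the actual patch John refers to is not attached to this ticket, and OBD_FAIL_GENERIC_ALLOC plus the wrapper name are hypothetical, only OBD_ALLOC() and OBD_FAIL_CHECK() are the real Lustre macros.

            /* Hypothetical wrapper: occasionally make OBD_ALLOC() yield NULL so that
             * racer exercises -ENOMEM error paths, such as the failed allocation in
             * ll_close_inode_openhandle()/mdc_close() that triggers this LBUG. */
            #define OBD_ALLOC_MAYBE_FAIL(ptr, size)                         \
            do {                                                            \
                    if (OBD_FAIL_CHECK(OBD_FAIL_GENERIC_ALLOC))             \
                            (ptr) = NULL;                                   \
                    else                                                    \
                            OBD_ALLOC((ptr), (size));                       \
            } while (0)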

            adilger Andreas Dilger added a comment -

            John, how did you hit this originally, and how easily is it reproduced? I am just trying to decide whether this is a regression that needs to be addressed for 2.6.0.

            People

              Assignee: wc-triage WC Triage
              Reporter: jhammond John Hammond
              Votes: 0
              Watchers: 10
