Lustre / LU-3421

(ost_handler.c:1762:ost_blocking_ast()) Error -2 syncing data on lock cancel causes first ENOSPC client issues then MDS server locks up

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: RHEL 6.4 servers running Lustre 2.4-rc2, with Cray clients running 2.4.0-rc1 (SLES11 SP1) and 2.4.0-rc2 (SLES11 SP2)
    • Severity: 3
    • 8486

    Description

      Several times during our test shot we would encounter a situation where OSTs would report ENOSPC even though there was enough space and there were inodes available. Over time the MDS would become pinned and had to be rebooted to get a working file system again. I have attached the console logs for the MDS and OSS.

      Attachments

        Issue Links

          Activity


            simmonsja James A Simmons added a comment -

            The only OSS logs I have are like the one I posted here.
            bobijam Zhenyu Xu added a comment -

            James,

            Do you have an OSS debug log? I want to check why it reports ENOSPC while there seems to be plenty of space.


            bzzz Alex Zhuravlev added a comment -

            I agree that osp_pre_update_status() should not be setting the status to 0 first. Though I think the check that limits the time we spend in osp_precreate_reserve() does not depend on the status; it's a hard limit. And in the worst case, when osp_precreate_ready_condition() misses -ENOSPC due to the race, we should still leave osp_pre_update_status() upon expiration? Notice that the expiration point is not reset in the loop.
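            For reference, the hard limit Alex describes can be sketched in isolation (plain C with stand-in names, not the real osp structures): the expiration time is taken once before the wait loop and never reset inside it, so however often the status flips between 0 and -ENOSPC, the caller still gets out when the deadline passes.

            #include <errno.h>
            #include <stdio.h>
            #include <time.h>
            #include <unistd.h>

            /* Stand-in for the precreate status normally kept in the osp device. */
            static int pre_status = -ENOSPC;

            /*
             * Illustrative reserve loop: the deadline is computed once, before the
             * loop, so the total time spent here is bounded regardless of how the
             * status changes while we wait.
             */
            static int reserve_with_hard_limit(int timeout_seconds)
            {
                    time_t expire = time(NULL) + timeout_seconds;   /* not reset below */
                    int rc;

                    while ((rc = pre_status) == 0 || rc == -ENOSPC || rc == -EAGAIN) {
                            if (time(NULL) >= expire)
                                    return -ETIMEDOUT;      /* hard limit reached */

                            /* the real code would wait for the precreate thread here */
                            sleep(1);
                    }
                    return rc;
            }

            int main(void)
            {
                    printf("rc = %d\n", reserve_with_hard_limit(3));
                    return 0;
            }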

            simmonsja James A Simmons added a comment -

            Andreas, the lfs df -i output was pretty much what you see below. The lfs df results were basically the same. In the last test shot we saw issues with 4 OSTs becoming very full, but that was not the case this time.

            UUID                         Inodes   IUsed   IFree IUse% Mounted on
            /lustre/routed1-OST0130_UUID 3872000  86598 3785402    2% /lustre/routed1[OST:304]
            /lustre/routed1-OST00ac_UUID 3872000  86584 3785416    2% /lustre/routed1[OST:172]
            /lustre/routed1-OST0047_UUID 3872000  86539 3785461    2% /lustre/routed1[OST:71]
            /lustre/routed1-OST01c2_UUID 3872000  85067 3786933    2% /lustre/routed1[OST:450]

            Talking with Oleg, he didn't believe it was due to grants. Each OST is 250 GB in size. I also expect the real issue is happening on the OSS rather than the MDS, but the admins only gathered a vmcore from the MDS.

            simmonsja James A Simmons added a comment -

            I uploaded the vmcore to ftp.whamcloud.com/uploads/LU-3421

            adilger Andreas Dilger added a comment -

            James, I expect the stack trace you pointed out is not the primary cause of this problem. That looks to be just an MDS thread that is blocked waiting for a DLM lock held by another thread. My guess is that there is something wrong in the precreate code at ENOSPC time that is keeping the MDS threads blocked for too long, and that this is caused either by the OSTs really being out of space (inodes or blocks, it makes no difference, since there is no point allocating objects on an OST with no space) or by a false ENOSPC caused by a problem with the space grants or similar.

            It does point out that something is wrong with the usage of the DLM locks, or similar, because we should never be blocked for such a long time holding a spinlock.

            adilger Andreas Dilger added a comment -

            It looks like there are a couple of different bugs being hit here:

            • the OSTs are apparently reporting -ENOSPC when there is space available on them. Is this possibly a case where the OST is reporting ENOSPC during precreate when there are free inodes, but no free blocks? What is the lfs df and lfs df -i output from the filesystem? If this state hits again, please also collect the output of lctl get_param osp.*.sync_* osp.*.create_count on the MDS.
            • the MDS goes into a busy loop trying to create objects on the OSTs, rather than returning -ENOSPC to the client. While it is good to ensure that the OSTs have been tried for creates, it doesn't make sense to retry precreate so often if the OST is already out of space (a throttling sketch follows the log excerpt below).
            May 29 08:18:40 widow-mds1 kernel: [76013.652071] Lustre: routed1-OST0015-osc-MDT0000: slow creates, last=[0x100150000:0x22041:0x0], next=[0x100150000:0x22041:0x0], reserved=0, syn_changes=2061, syn_rpc_in_progress=1, status=-19
            May 29 08:19:32 widow-mds1 kernel: [76065.440057] Lustre: routed1-OST00f0-osc-MDT0000: slow creates, last=[0x100f00000:0x25e93:0x0], next=[0x100f00000:0x25e93:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=1, status=-19
            May 29 08:23:37 widow-mds1 kernel: [76310.325026] LustreError: 6065:0:(osp_precreate.c:484:osp_precreate_send()) routed1-OST002a-osc-MDT0000: can't precreate: rc = -28
            May 29 08:23:37 widow-mds1 kernel: [76310.367372] LustreError: 6065:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 70657 previous similar messages
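            One way to avoid hammering a full OST, shown only as a stand-alone illustration (hypothetical names, not the actual osp code), is to remember when -ENOSPC was last returned and skip further precreate attempts until a backoff interval has passed:

            #include <errno.h>
            #include <stdbool.h>
            #include <time.h>

            /* Hypothetical per-OST precreate bookkeeping (not the osp structures). */
            struct ost_precreate_state {
                    int    last_status;     /* last precreate result, e.g. -ENOSPC */
                    time_t next_retry;      /* earliest time to probe the OST again */
            };

            #define ENOSPC_RETRY_INTERVAL   30      /* seconds between probes when full */

            /* Decide whether sending another precreate RPC is worthwhile right now. */
            static bool should_send_precreate(const struct ost_precreate_state *st)
            {
                    if (st->last_status != -ENOSPC)
                            return true;                    /* not known to be full */
                    return time(NULL) >= st->next_retry;    /* probe once backoff expires */
            }

            /* Record the result of a precreate attempt and arm the backoff if needed. */
            static void record_precreate_result(struct ost_precreate_state *st, int rc)
            {
                    st->last_status = rc;
                    if (rc == -ENOSPC)
                            st->next_retry = time(NULL) + ENOSPC_RETRY_INTERVAL;
            }

            int main(void)
            {
                    struct ost_precreate_state st = { 0, 0 };

                    record_precreate_result(&st, -ENOSPC);
                    return should_send_precreate(&st) ? 1 : 0;      /* 0: backing off */
            }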
            

            There was previously debugging code in osp_precreate_reserve() because I was worried about exactly this kind of situation, but my recent patch http://review.whamcloud.com/6219 (commit dc2bcafd2a0b) removed it before the 2.4.0 release because it would have LBUG'd in this case:

            @@ -1062,17 +1061,6 @@ int osp_precreate_reserve(const struct lu_env *env, struct osp_device *d)
                    while ((rc = d->opd_pre_status) == 0 || rc == -ENOSPC ||
                            rc == -ENODEV || rc == -EAGAIN) {
             
            -#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 3, 90, 0)
            -               /*
            -                * to address Andreas's concern on possible busy-loop
            -                * between this thread and osp_precreate_send()
            -                */
            -               if (unlikely(count++ == 1000)) {
            -                       osp_precreate_timeout_condition(d);
            -                       LBUG();
            -               }
            -#endif
            -
                            /*
                             * increase number of precreations
                             */
            

            It probably makes sense to reinstate this check, but have it give up after it loops only, say, twice for -ENOSPC, since there are other OSTs that could be used and it doesn't make sense to block the MDS thread for such a long time.
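            A minimal sketch of that idea (assumed names, not a patch against the real osp_precreate_reserve()): count the -ENOSPC passes and bail out after the second one, so the MDS thread can fall back to another OST instead of hitting LBUG() or spinning:

            #include <errno.h>
            #include <stdio.h>

            /* Stand-in for d->opd_pre_status; left at -ENOSPC to show the early exit. */
            static int opd_pre_status = -ENOSPC;

            /* Stub: the real code would poke osp_precreate_send() and wait. */
            static void precreate_poke(void)
            {
            }

            static int reserve_sketch(void)
            {
                    int enospc_loops = 0;
                    int rc;

                    while ((rc = opd_pre_status) == 0 || rc == -ENOSPC ||
                           rc == -ENODEV || rc == -EAGAIN) {
                            /*
                             * Instead of LBUG() after 1000 iterations, give up after
                             * the second -ENOSPC pass and let the caller try another
                             * OST.
                             */
                            if (rc == -ENOSPC && ++enospc_loops > 2)
                                    return -ENOSPC;

                            precreate_poke();
                    }
                    return rc;
            }

            int main(void)
            {
                    printf("rc = %d\n", reserve_sketch());   /* prints rc = -28 */
                    return 0;
            }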

            It seems to me that the update of opd_pre_status at the start of osp_pre_update_status() is racy. If rc == 0 (i.e. the statfs succeeded) then opd_pre_status is set to 0, even though it is set to -ENOSPC again later on. It would be better to have something like:

                    if (rc != 0) {
                            d->opd_pre_status = rc;
                            goto out;
                    }
            

            Otherwise, opd_pre_status can transiently become 0 and break the -ENOSPC check in osp_precreate_ready_condition().
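            To spell the race out (hypothetical stand-ins below, not the actual code): a waiter that tests the status between the initial "status = rc" assignment and the later free-space re-check sees 0 and concludes the OST is healthy. The ordering suggested above never publishes 0 before the space check, e.g.:

            #include <errno.h>
            #include <stdbool.h>

            /* Hypothetical stand-ins for the statfs result and device state. */
            struct space_info { unsigned long long bavail, blocks; };
            struct dev_state  { int pre_status; };

            static bool out_of_space(const struct space_info *s)
            {
                    return s->bavail * 1000 < s->blocks;    /* e.g. < 0.1% left */
            }

            /* Racy ordering: rc == 0 briefly publishes "OK" even on a full OST. */
            static void update_status_racy(struct dev_state *d, int rc,
                                           const struct space_info *s)
            {
                    d->pre_status = rc;                     /* window where status == 0 */
                    if (rc == 0 && out_of_space(s))
                            d->pre_status = -ENOSPC;        /* ENOSPC restored too late */
            }

            /* Suggested ordering: 0 is only published once the space check passed. */
            static void update_status_fixed(struct dev_state *d, int rc,
                                            const struct space_info *s)
            {
                    if (rc != 0) {
                            d->pre_status = rc;
                            return;
                    }
                    d->pre_status = out_of_space(s) ? -ENOSPC : 0;
            }

            int main(void)
            {
                    struct space_info full = { .bavail = 1, .blocks = 1000000 };
                    struct dev_state d = { .pre_status = -ENOSPC };

                    update_status_racy(&d, 0, &full);       /* ends at -ENOSPC, but ...   */
                    update_status_fixed(&d, 0, &full);      /* ... never passes through 0 */
                    return d.pre_status == -ENOSPC ? 0 : 1;
            }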

            It also seems that the -ENOSPC hysteresis so carefully described in the comment in osp_pre_update_status() is not actually implemented: when the free space is < 0.1%, opd_pre_status = -ENOSPC, but it is immediately cleared again as soon as free space is >= 0.1%. There also appears to be a longstanding bug in the code with the min(used blocks vs. 1 GByte) calculation. Something like the following is needed:

                    /*
                     * On a very large disk, say 16TB, 0.1% will be 16 GB. We don't
                     * want to lose that amount of space, so in those cases we report
                     * no space left if there is less than 1 GB left, and clear it
                     * at 2 GB.
                     */
                    if (likely(msfs->os_type != 0)) { /* msfs contains valid data */
                            used = min((msfs->os_blocks - msfs->os_bfree) >> 10,
                                       1ULL << (30 - msfs->os_bsize));
                            if (d->opd_pre_status == 0 &&
                                (msfs->os_ffree < 32 || msfs->os_bavail < used)) {
                                    d->opd_pre_status = -ENOSPC;
                                    :
                                    :
                            } else if (d->opd_pre_status == -ENOSPC &&
                                       msfs->os_ffree > 64 && msfs->os_bavail > used * 2) {
                                    d->opd_pre_status = 0;
                                    :
                                    :
                            }
                    }


            The opd_pre_status = 0 state should only be set if there is actually enough space, not merely because the STATFS RPC succeeded.
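            For concreteness (my numbers, not from the ticket): on a 16 TB OST the 0.1% watermark is about 16 GB, which is exactly the space the comment does not want to write off, so with the cap the OSP would only report -ENOSPC below roughly 1 GB free and clear it again above roughly 2 GB. On the 250 GB OSTs in this report, 0.1% is only about 250 MB, well under the 1 GB cap, so the plain 0.1% watermark would still apply there.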


            simmonsja James A Simmons added a comment -

            Backtrace from the vmcore, which will be uploaded soon:

            PID: 4814 TASK: ffff880e23333500 CPU: 15 COMMAND: "mdt03_140"
            #0 [ffff8800607e7e90] crash_nmi_callback at ffffffff8102d316
            #1 [ffff8800607e7ea0] notifier_call_chain at ffffffff81513a85
            #2 [ffff8800607e7ee0] atomic_notifier_call_chain at ffffffff81513aea
            #3 [ffff8800607e7ef0] notify_die at ffffffff8109cc1e
            #4 [ffff8800607e7f20] do_nmi at ffffffff8151174b
            #5 [ffff8800607e7f50] nmi at ffffffff81511010
            [exception RIP: _spin_lock+30]
            RIP: ffffffff8151087e RSP: ffff880e23335c40 RFLAGS: 00000293
            RAX: 00000000000014a6 RBX: ffff88099a82c200 RCX: ffff880e23335cf8
            RDX: 00000000000014a3 RSI: ffff880e23335cf0 RDI: ffff880ac63c35d8
            RBP: ffff880e23335c40 R8: 0000000000000000 R9: ffff8800b6c1dc00
            R10: 0000000000000009 R11: ffffffffa08dd5a0 R12: ffff88099a82c200
            R13: ffff880ac63c35c0 R14: ffff880e23335cf8 R15: 0000000000000000
            ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
            — <NMI exception stack> —
            #6 [ffff880e23335c40] _spin_lock at ffffffff8151087e
            #7 [ffff880e23335c48] lock_res_and_lock at ffffffffa07e3070 [ptlrpc]
            #8 [ffff880e23335c68] ldlm_lock_enqueue at ffffffffa07e859d [ptlrpc]
            #9 [ffff880e23335cc8] ldlm_handle_enqueue0 at ffffffffa080f1bf [ptlrpc]
            #10 [ffff880e23335d38] mdt_enqueue at ffffffffa0e6e3c6 [mdt]
            #11 [ffff880e23335d58]
            #12 [ffff880e23335da8] mds_regular_handle at ffffffffa0eae165 [mdt]
            #13 [ffff880e23335db8] ptlrpc_server_handle_request at ffffffffa0841388 [ptlrpc]
            #14 [ffff880e23335eb8] ptlrpc_main at ffffffffa084271e [ptlrpc]
            #15 [ffff880e23335f48] kernel_thread at ffffffff8100c0ca

            pjones Peter Jones added a comment -

            Bobijam

            Could you please look into this one?

            Thanks

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 7
