
LustreError: 15069:0:(tgt_grant.c:561:tgt_grant_incoming()) LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.10.5
    • Labels: None
    • Severity: 3

    Description

      This is the first time we have seen this LBUG. We have a crash dump.

      [11168227.609396] Lustre: Skipped 4149 previous similar messages
      [11168546.559600] LustreError: 15069:0:(tgt_grant.c:559:tgt_grant_incoming()) nbp2-OST0121: cli 811dc111-6b22-b447-9144-bcd1173a571d/ffff881d57bff000 dirty -610055168 pend 0 grant -643806208
      [11168546.609562] LustreError: 15069:0:(tgt_grant.c:561:tgt_grant_incoming()) LBUG
      [11168546.631425] Pid: 15069, comm: ll_ost_io01_084 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      [11168546.631428] Call Trace:
      [11168546.631439]  [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
      [11168546.650223]  [<ffffffffa08597cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [11168546.650228]  [<ffffffffa085987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [11168546.650292]  [<ffffffffa0e8c270>] tgt_grant_prepare_read+0x0/0x3b0 [ptlrpc]
      [11168546.650325]  [<ffffffffa0e8c37b>] tgt_grant_prepare_read+0x10b/0x3b0 [ptlrpc]
      [11168546.650335]  [<ffffffffa0c86590>] ofd_preprw+0x450/0x1170 [ofd]
      [11168546.650365]  [<ffffffffa0e702a5>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [11168546.650394]  [<ffffffffa0e6deca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [11168546.650420]  [<ffffffffa0e164bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [11168546.650445]  [<ffffffffa0e1a4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [11168546.650449]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      [11168546.650452]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      [11168546.650468]  [<ffffffffffffffff>] 0xffffffffffffffff
      [11168546.650469] Kernel panic - not syncing: LBUG 
       

      Attachments

        Issue Links

          Activity

            [LU-12120] LustreError: 15069:0:(tgt_grant.c:561:tgt_grant_incoming()) LBUG

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34996/
            Subject: LU-12120 grants: prevent negative ted_grant value
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7e08317ef5cbed5cd587017cbe343eb4cc52822c

            gerrit Gerrit Updater added a comment

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34996
            Subject: LU-12120 grants: prevent negative ted_grant value
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 296ac6b910880dc65c9cee8be2552bba759cea57

            gerrit Gerrit Updater added a comment

            Ok, thanks Mikhail!

            From what I understand, clients and servers maintain separate views of grants, and this LU relates to (client, server) pairs whose views drift apart from each other. At the next shrink request, that results in the client asking to give back more space than the server remembers granting.

            Does anyone know if there is a valid reason why, in tgt_grant_incoming():

            static void tgt_grant_incoming(...)
            {
                    ...
                    CDEBUG(D_CACHE, "%s: cli %s/%p reports grant %llu dropped %u, local %lu\n",
                           obd->obd_name, exp->exp_client_uuid.uuid, exp, oa->o_grant,
                           oa->o_dropped, ted->ted_grant);
                    ...
            }
            

            the value printed after "reports grant" would not be the same as the value printed after "local"?
            Can this be used to track the desynchronization?
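
            For illustration, here is a rough standalone sketch of the comparison that CDEBUG line suggests (a toy model with simplified stand-in structs and hypothetical names, not Lustre source), flagging an export whose server-side view of the grant no longer matches what the client reports:

            /*
             * Toy model, not Lustre code: compare the grant a client reports in an
             * incoming request (oa->o_grant) with the grant the server has on record
             * for that export (ted->ted_grant) and log when the two views diverge.
             */
            #include <stdio.h>
            #include <stdint.h>

            struct obdo_model   { uint64_t o_grant;   };  /* client-reported grant */
            struct export_model { int64_t  ted_grant; };  /* server-side bookkeeping */

            /* Return non-zero (and log) when the two views of the grant disagree. */
            static int grant_drift(const struct obdo_model *oa,
                                   const struct export_model *ted)
            {
                    if ((int64_t)oa->o_grant == ted->ted_grant)
                            return 0;
                    fprintf(stderr, "grant drift: client reports %llu, server has %lld\n",
                            (unsigned long long)oa->o_grant, (long long)ted->ted_grant);
                    return 1;
            }

            int main(void)
            {
                    struct obdo_model   oa  = { .o_grant   = 8u << 20 };  /* 8 MiB */
                    struct export_model ted = { .ted_grant = 4  << 20 };  /* 4 MiB */

                    return grant_drift(&oa, &ted);
            }

            Whether the two values are expected to match at that exact point of the RPC depends on the request type, so this is only meant to show how the existing debug line could be turned into a drift detector rather than what the real code should do.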

            bougetq Quentin Bouget (Inactive) added a comment
            tappro Mikhail Pershin added a comment - - edited

            Quentin, I agree with your analysis and also think that tgt_grant_shrink() should not decrease ted_grant blindly but should do sanity checks there. I will prepare a patch today.
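
            For what it's worth, a minimal standalone sketch of what such a sanity check could look like (an assumption on my part, with simplified stand-in types and hypothetical names, not the actual LU-12120 patch):

            /*
             * Toy model, not the merged patch: clamp how much grant a shrink
             * request may take back so the server-side counter can never go
             * negative, which is the invariant the LBUG in tgt_grant_incoming()
             * enforces.
             */
            #include <assert.h>
            #include <stdio.h>

            struct export_grant { long ted_grant; };  /* bytes currently granted */

            /* Shrink by up to 'requested' bytes; return how much was actually taken. */
            static long grant_shrink_clamped(struct export_grant *ted, long requested)
            {
                    long take = requested;

                    if (take < 0)                  /* ignore nonsensical requests */
                            take = 0;
                    if (take > ted->ted_grant)     /* never take back more than was granted */
                            take = ted->ted_grant;

                    ted->ted_grant -= take;
                    assert(ted->ted_grant >= 0);   /* invariant: grant never negative */
                    return take;
            }

            int main(void)
            {
                    struct export_grant ted = { .ted_grant = 1 << 20 };  /* 1 MiB granted */

                    /* Client asks to give back 4 MiB it was never granted. */
                    long taken = grant_shrink_clamped(&ted, 4 << 20);
                    printf("took %ld, remaining %ld\n", taken, ted.ted_grant);
                    return 0;
            }

            This only illustrates the direction; the real shrink path would also have to keep the client's view of the grant consistent with what the server actually released.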


            I am fairly certain that the code that generates the negative value is in tgt_grant_shrink().

            @Stephane, as a temporary fix, you can try deactivating grant shrinking on clients with:

            lctl set_param osc.*.grant_shrink=0
            

            I do not know how much of an impact this has on OSSes, but considering the feature was disabled by default up until July 2018 (cf. LU-8708), I assume it should be bearable.

            bougetq Quentin Bouget (Inactive) added a comment - edited
            pjones Peter Jones added a comment -

            Note that some comments that may assist in the analysis have been added to the resolved unassigned ticket LU-12322.


            Patrick, I think this may actually be due to something else, likely an OSS deadlock. Please see LU-12215.

            sthiell Stephane Thiell added a comment

            Stephane,

            If true, that would seem to put you in exactly the same situation as NASA, who are also hitting that LBUG because of the LU-11644 evictions.

            By the way, I meant to close this bug; I was just waiting on a reply from Mahmoud - the LBUG is tracked under LU-11939 if you want to follow that.

            pfarrell Patrick Farrell (Inactive) added a comment - edited

            We're actually investigating whether this could be caused by LU-11644, reported by NASA, as I noticed a few similar call traces:

            Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 78496:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.9.101.28@o2ib4) failed to reply to blocking AST (req@ffff981d18ba8c00 x1631526698598368 status 0 rc -110), evict it ns: filter-fir-OST0023_UUID lock: ffff98016d218b40/0xd80148300515f99f lrc: 4/0,0 mo
            Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 138-a: fir-OST001b: A client on nid 10.9.101.28@o2ib4 was evicted due to a lock blocking callback time out: rc -110
            Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 77026:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 105s: evicting client at 10.9.101.28@o2ib4  ns: filter-fir-OST001b_UUID lock: ffff97f619e7af40/0xd80148300516127a lrc: 3/0,0 mode: PR/PR res: [0xd00000401:0x91e8cf:0x0].0x0 rrc: 3 type
            Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 78496:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) Skipped 1 previous similar message
            Apr 22 10:49:49 fir-io3-s2 kernel: LNet: Service thread pid 77058 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Apr 22 10:49:49 fir-io3-s2 kernel: Pid: 77058, comm: ll_ost00_000 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Apr 22 10:49:49 fir-io3-s2 kernel: Call Trace:
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Apr 22 10:49:49 fir-io3-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555955389.77058
            Apr 22 10:49:49 fir-io3-s2 kernel: Pid: 78291, comm: ll_ost00_019 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Apr 22 10:49:49 fir-io3-s2 kernel: Call Trace:
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
            Apr 22 10:49:49 fir-io3-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Apr 22 10:49:49 fir-io3-s2 kernel: Lustre: fir-OST0019: Connection restored to 708f469b-f504-7275-d236-e90aa8bcec67 (at 10.9.0.64@o2ib4)
            Apr 22 10:49:49 fir-io3-s2 kernel: Lustre: Skipped 24 previous similar messages
            Apr 22 10:49:55 fir-io3-s2 kernel: LNet: Service thread pid 78318 was inactive for 200.37s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Apr 22 10:49:55 fir-io3-s2 kernel: LNet: Skipped 1 previous similar message
            Apr 22 10:49:55 fir-io3-s2 kernel: Pid: 78318, comm: ll_ost02_032 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Apr 22 10:49:55 fir-io3-s2 kernel: Call Trace:
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Apr 22 10:49:55 fir-io3-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555955395.78318
            Apr 22 10:49:55 fir-io3-s2 kernel: Pid: 78868, comm: ll_ost03_105 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Apr 22 10:49:55 fir-io3-s2 kernel: Call Trace:
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Apr 22 10:49:55 fir-io3-s2 kernel:  [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Apr 22 10:49:56 fir-io3-s2 kernel:  [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Apr 22 10:49:56 fir-io3-s2 kernel:  [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
            Apr 22 10:49:56 fir-io3-s2 kernel:  [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
            Apr 22 10:49:56 fir-io3-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Apr 22 10:50:12 fir-io3-s2 kernel: Lustre: fir-OST0021: haven't heard from client 83316b44-aad9-bbe2-1580-9c149ecd135e (at 10.9.101.28@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff97fd659af000, cur 1555955412 expire 1555955262 last 1555955185
            Apr 22 10:50:12 fir-io3-s2 kernel: Lustre: Skipped 19 previous similar messages
            Apr 22 10:50:12 fir-io3-s2 kernel: LNet: Service thread pid 77058 completed after 223.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            Apr 22 10:50:12 fir-io3-s2 kernel: LustreError: 78624:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff981b3d9bad00 x1631526704122896/t0(0) o104->fir-OST0021@10.9.101.28@o2ib4:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            

            I have enabled +lnet debug on all servers to see how it goes.

            sthiell Stephane Thiell added a comment

            Do you think disabling OST quota enforcement could help avoid these LBUGs (for now)?

            sthiell Stephane Thiell added a comment

            Hit again after about 1 hour of uptime:

            [Mon Apr 22 10:20:58 2019][ 6643.068873] LustreError: 81562:0:(tgt_grant.c:563:tgt_grant_incoming()) fir-OST0011: cli b214bbaa-dfa8-5550-6c23-a4a7f34b2d1a/ffff970be7f44800 dirty 4169728 pend 0 grant -12582912^M
            [Mon Apr 22 10:20:58 2019][ 6643.084869] LustreError: 81562:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG^M
            [Mon Apr 22 10:20:58 2019][ 6643.091935] Pid: 81562, comm: ll_ost_io01_025 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018^M
            [Mon Apr 22 10:20:58 2019][ 6643.102364] Call Trace:^M
            [Mon Apr 22 10:20:58 2019][ 6643.104866]  [<ffffffffc09947cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]^M
            [Mon Apr 22 10:20:58 2019][ 6643.111440]  [<ffffffffc099487c>] lbug_with_loc+0x4c/0xa0 [libcfs]^M
            [Mon Apr 22 10:20:58 2019][ 6643.117678]  [<ffffffffc0dd66f0>] tgt_grant_prepare_read+0x0/0x3b0 [ptlrpc]^M
            [Mon Apr 22 10:20:58 2019][ 6643.124732]  [<ffffffffc0dd81d4>] tgt_grant_prepare_write+0x2f4/0xf70 [ptlrpc]^M
            [Mon Apr 22 10:20:58 2019][ 6643.132032]  [<ffffffffc13aea9c>] ofd_preprw_write.isra.31+0x14c/0xe60 [ofd]^M
            [Mon Apr 22 10:20:58 2019][ 6643.139121]  [<ffffffffc13afbd2>] ofd_preprw+0x422/0x1160 [ofd]^M
            [Mon Apr 22 10:20:58 2019][ 6643.145075]  [<ffffffffc0db54be>] tgt_brw_write+0xc7e/0x1a90 [ptlrpc]^M
            [Mon Apr 22 10:20:58 2019][ 6643.151584]  [<ffffffffc0db910a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
            [Mon Apr 22 10:20:58 2019][ 6643.158566]  [<ffffffffc0d5d6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
            [Mon Apr 22 10:20:58 2019][ 6643.166284]  [<ffffffffc0d6100c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]^M
            [Mon Apr 22 10:20:58 2019][ 6643.172657]  [<ffffffff8aac1c31>] kthread+0xd1/0xe0^M
            [Mon Apr 22 10:20:58 2019][ 6643.177576]  [<ffffffff8b174c24>] ret_from_fork_nospec_begin+0xe/0x21^M
            [Mon Apr 22 10:20:58 2019][ 6643.184081]  [<ffffffffffffffff>] 0xffffffffffffffff^M
            [Mon Apr 22 10:20:58 2019][ 6643.189112] Kernel panic - not syncing: LBUG^M
            [Mon Apr 22 10:20:58 2019][ 6643.193392] CPU: 13 PID: 81562 Comm: ll_ost_io01_025 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1^M
            [Mon Apr 22 10:20:58 2019][ 6643.206155] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.6.7 10/29/2018^M
            [Mon Apr 22 10:20:58 2019][ 6643.213723] Call Trace:^M
            [Mon Apr 22 10:20:58 2019][ 6643.216183]  [<ffffffff8b161e41>] dump_stack+0x19/0x1b^M
            [Mon Apr 22 10:20:58 2019][ 6643.221326]  [<ffffffff8b15b550>] panic+0xe8/0x21f^M
            [Mon Apr 22 10:20:59 2019][ 6643.226124]  [<ffffffffc09948cb>] lbug_with_loc+0x9b/0xa0 [libcfs]^M
            [Mon Apr 22 10:20:59 2019][ 6643.232358]  [<ffffffffc0dd66f0>] tgt_grant_incoming.isra.6+0x540/0x540 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.239788]  [<ffffffffc0dd81d4>] tgt_grant_prepare_write+0x2f4/0xf70 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.247038]  [<ffffffffc0b18423>] ? lu_object_find_at+0x1b3/0x280 [obdclass]^M
            [Mon Apr 22 10:20:59 2019][ 6643.254091]  [<ffffffffc13aea9c>] ofd_preprw_write.isra.31+0x14c/0xe60 [ofd]^M
            [Mon Apr 22 10:20:59 2019][ 6643.261144]  [<ffffffffc13afbd2>] ofd_preprw+0x422/0x1160 [ofd]^M
            [Mon Apr 22 10:20:59 2019][ 6643.267102]  [<ffffffffc0db54be>] tgt_brw_write+0xc7e/0x1a90 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.273571]  [<ffffffffc0d4f0a7>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.280047]  [<ffffffffc0d76b4f>] ? __req_capsule_get+0x15f/0x740 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.286957]  [<ffffffffc0af8279>] ? class_handle2object+0xb9/0x1c0 [obdclass]^M
            [Mon Apr 22 10:20:59 2019][ 6643.294095]  [<ffffffff8aae011c>] ? update_curr+0x14c/0x1e0^M
            [Mon Apr 22 10:20:59 2019][ 6643.299673]  [<ffffffff8aadca7e>] ? account_entity_dequeue+0xae/0xd0^M
            [Mon Apr 22 10:20:59 2019][ 6643.306025]  [<ffffffff8aadb738>] ? __enqueue_entity+0x78/0x80^M
            [Mon Apr 22 10:20:59 2019][ 6643.311859]  [<ffffffff8b165e92>] ? mutex_lock+0x12/0x2f^M
            [Mon Apr 22 10:20:59 2019][ 6643.317212]  [<ffffffffc0db910a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.324118]  [<ffffffffc0d92801>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.331694]  [<ffffffffc0994bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
            [Mon Apr 22 10:20:59 2019][ 6643.338772]  [<ffffffffc0d5d6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.346462]  [<ffffffffc0d5a565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.353251]  [<ffffffff8aad67c2>] ? default_wake_function+0x12/0x20^M
            [Mon Apr 22 10:20:59 2019][ 6643.359519]  [<ffffffff8aacba9b>] ? __wake_up_common+0x5b/0x90^M
            [Mon Apr 22 10:20:59 2019][ 6643.365380]  [<ffffffffc0d6100c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.371679]  [<ffffffffc0d60510>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]^M
            [Mon Apr 22 10:20:59 2019][ 6643.379080]  [<ffffffff8aac1c31>] kthread+0xd1/0xe0^M
            [Mon Apr 22 10:20:59 2019][ 6643.383957]  [<ffffffff8aac1b60>] ? insert_kthread_work+0x40/0x40^M
            [Mon Apr 22 10:20:59 2019][ 6643.390052]  [<ffffffff8b174c24>] ret_from_fork_nospec_begin+0xe/0x21^M
            [Mon Apr 22 10:20:59 2019][ 6643.396497]  [<ffffffff8aac1b60>] ? insert_kthread_work+0x40/0x40^M
            

            The filesystem was down last night due to this LBUG. In the morning we ran an fsck on this OST (with e2fsprogs-1.44.5.wc1-0.el7.x86_64), which fixed the following:

            Apr 22 08:32:31 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Starting fsck
            Apr 22 08:32:32 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      MMP interval is 10 seconds and total wait time is 42 seconds. Please wait...
            Apr 22 08:33:15 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      fir-OST0011: recovering journal
            Apr 22 08:33:42 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Pass 1: Checking inodes, blocks, and sizes
            Apr 22 08:36:28 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Pass 2: Checking directory structure
            Apr 22 08:37:14 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Pass 3: Checking directory connectivity
            Apr 22 08:37:17 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Pass 4: Checking reference counts
            Apr 22 08:37:25 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Pass 5: Checking group summary information
            Apr 22 08:37:38 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Free blocks count wrong (10628562260, counted=10628555100).
            Apr 22 08:37:38 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Fix? yes
            Apr 22 08:37:38 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011
            Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Free inodes count wrong (53346944, counted=53345769).
            Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Fix? yes
            Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011
            Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Update quota info for quota type 0? yes
            Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011
            Apr 22 08:37:47 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      Update quota info for quota type 1? yes
            Apr 22 08:37:47 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011
            Apr 22 08:38:13 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011
            Apr 22 08:38:13 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      fir-OST0011: ***** FILE SYSTEM WAS MODIFIED *****
            Apr 22 08:38:13 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO  fir-OST0011      fir-OST0011: 7700375/61046144 files (4.1% non-contiguous), 4999233699/15627788800 blocks
            

            But it just crashed anyway. Any recommendations are welcome. We're running 2.12.0 plus the following patches on all servers:

            2eec4f8 LU-12018 quota: do not start a thread under memory pressure
            1819063 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
            3ed10f4 LU-11964 mdc: prevent glimpse lock count grow
            565011c LU-12037 mdt: add option for cross-MDT rename
            b6be1d9 LU-12065 lnd: increase CQ entries
            6b2c97b LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
            

            Those crashes seem to have started only after we applied "LU-11359 mdt: fix mdt_dom_discard_data() timeouts", which seems very weird... I am not sure how it could be related.

            sthiell Stephane Thiell added a comment

            People

              Assignee: pfarrell Patrick Farrell (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: