Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.10.5
-
None
-
3
Description
This is the first time we have seen this LBUG. We have a crash dump.
[11168227.609396] Lustre: Skipped 4149 previous similar messages
[11168546.559600] LustreError: 15069:0:(tgt_grant.c:559:tgt_grant_incoming()) nbp2-OST0121: cli 811dc111-6b22-b447-9144-bcd1173a571d/ffff881d57bff000 dirty -610055168 pend 0 grant -643806208
[11168546.609562] LustreError: 15069:0:(tgt_grant.c:561:tgt_grant_incoming()) LBUG
[11168546.631425] Pid: 15069, comm: ll_ost_io01_084 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[11168546.631428] Call Trace:
[11168546.631439] [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
[11168546.650223] [<ffffffffa08597cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[11168546.650228] [<ffffffffa085987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[11168546.650292] [<ffffffffa0e8c270>] tgt_grant_prepare_read+0x0/0x3b0 [ptlrpc]
[11168546.650325] [<ffffffffa0e8c37b>] tgt_grant_prepare_read+0x10b/0x3b0 [ptlrpc]
[11168546.650335] [<ffffffffa0c86590>] ofd_preprw+0x450/0x1170 [ofd]
[11168546.650365] [<ffffffffa0e702a5>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[11168546.650394] [<ffffffffa0e6deca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[11168546.650420] [<ffffffffa0e164bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[11168546.650445] [<ffffffffa0e1a4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[11168546.650449] [<ffffffff810b1131>] kthread+0xd1/0xe0
[11168546.650452] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[11168546.650468] [<ffffffffffffffff>] 0xffffffffffffffff
[11168546.650469] Kernel panic - not syncing: LBUG
Attachments
Issue Links
Activity
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34996
Subject: LU-12120 grants: prevent negative ted_grant value
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 296ac6b910880dc65c9cee8be2552bba759cea57
Ok, thanks Mikhail!
From what I understand, clients and servers maintain separate views of grants, and this LU relates to (client, server) pairs whose views drift apart from each other. At the next shrink request, that results in the client asking to give back more space than the server remembers granting.
—
Does anyone know if there is a valid reason why, in tgt_grant_incoming():
static void tgt_grant_incoming(...)
{
        ...
        CDEBUG(D_CACHE, "%s: cli %s/%p reports grant %llu dropped %u, local %lu\n",
               obd->obd_name, exp->exp_client_uuid.uuid, exp,
               oa->o_grant, oa->o_dropped, ted->ted_grant);
        ...
}
the value printed after "reports grant" would not be the same as the value printed after "local"?
Can this be used to track the desynchronization?
Quentin, I agree with your analysis and also think that tgt_grant_shrink() should not decrease ted_grant blindly but should do sanity checks there. I will prepare a patch today.
I am fairly certain that the code that generates the negative value is in tgt_grant_shrink().
@Stephane, as a temporary fix, you can try deactivating grant shrinking on clients with:
lctl set_param osc.*.grant_shrink=0
I do not know how much of an impact this has on OSSes, but considering the feature was disabled by default until July 2018 (cf. LU-8708), I assume it should be bearable.
Note that some comments that may assist in the analysis have been added to the resolved unassigned ticket LU-12322.
Patrick, I think this may actually be due to another thing, likely an OSS deadlock. Please see LU-12215
Stephane,
If true, that would seem to put you in exactly the same situation as NASA, who are also hitting that LBUG because of the LU-11644 evictions.
By the way, I meant to close this bug, was just waiting on a reply from Mahmoud - the LBUG is tracked under LU-11939 if you want to follow that.
We're actually investigating if this could be caused by LU-11644 reported by NASA, as I noticed a few similar call traces:
Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 78496:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.9.101.28@o2ib4) failed to reply to blocking AST (req@ffff981d18ba8c00 x1631526698598368 status 0 rc -110), evict it ns: filter-fir-OST0023_UUID lock: ffff98016d218b40/0xd80148300515f99f lrc: 4/0,0 mo
Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 138-a: fir-OST001b: A client on nid 10.9.101.28@o2ib4 was evicted due to a lock blocking callback time out: rc -110
Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 77026:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 105s: evicting client at 10.9.101.28@o2ib4 ns: filter-fir-OST001b_UUID lock: ffff97f619e7af40/0xd80148300516127a lrc: 3/0,0 mode: PR/PR res: [0xd00000401:0x91e8cf:0x0].0x0 rrc: 3 type
Apr 22 10:48:49 fir-io3-s2 kernel: LustreError: 78496:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) Skipped 1 previous similar message
Apr 22 10:49:49 fir-io3-s2 kernel: LNet: Service thread pid 77058 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 22 10:49:49 fir-io3-s2 kernel: Pid: 77058, comm: ll_ost00_000 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 22 10:49:49 fir-io3-s2 kernel: Call Trace:
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 22 10:49:49 fir-io3-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555955389.77058
Apr 22 10:49:49 fir-io3-s2 kernel: Pid: 78291, comm: ll_ost00_019 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 22 10:49:49 fir-io3-s2 kernel: Call Trace:
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 22 10:49:49 fir-io3-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 22 10:49:49 fir-io3-s2 kernel: Lustre: fir-OST0019: Connection restored to 708f469b-f504-7275-d236-e90aa8bcec67 (at 10.9.0.64@o2ib4)
Apr 22 10:49:49 fir-io3-s2 kernel: Lustre: Skipped 24 previous similar messages
Apr 22 10:49:55 fir-io3-s2 kernel: LNet: Service thread pid 78318 was inactive for 200.37s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 22 10:49:55 fir-io3-s2 kernel: LNet: Skipped 1 previous similar message
Apr 22 10:49:55 fir-io3-s2 kernel: Pid: 78318, comm: ll_ost02_032 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 22 10:49:55 fir-io3-s2 kernel: Call Trace:
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 22 10:49:55 fir-io3-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555955395.78318
Apr 22 10:49:55 fir-io3-s2 kernel: Pid: 78868, comm: ll_ost03_105 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 22 10:49:55 fir-io3-s2 kernel: Call Trace:
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f9b640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f58fe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f7a69b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc162910b>] ofd_intent_policy+0x69b/0x920 [ofd]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f59d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0f826d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc10090b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc101010a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 22 10:49:55 fir-io3-s2 kernel: [<ffffffffc0fb46db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 22 10:49:56 fir-io3-s2 kernel: [<ffffffffc0fb800c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 22 10:49:56 fir-io3-s2 kernel: [<ffffffff8b8c1c31>] kthread+0xd1/0xe0
Apr 22 10:49:56 fir-io3-s2 kernel: [<ffffffff8bf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 22 10:49:56 fir-io3-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 22 10:50:12 fir-io3-s2 kernel: Lustre: fir-OST0021: haven't heard from client 83316b44-aad9-bbe2-1580-9c149ecd135e (at 10.9.101.28@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff97fd659af000, cur 1555955412 expire 1555955262 last 1555955185
Apr 22 10:50:12 fir-io3-s2 kernel: Lustre: Skipped 19 previous similar messages
Apr 22 10:50:12 fir-io3-s2 kernel: LNet: Service thread pid 77058 completed after 223.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 22 10:50:12 fir-io3-s2 kernel: LustreError: 78624:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff981b3d9bad00 x1631526704122896/t0(0) o104->fir-OST0021@10.9.101.28@o2ib4:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
I have enabled +lnet debug on all servers to see how it goes.
Do you think disabling OST quota enforcement could help avoid these LBUGs (for now)?
Hit again after about 1 hour uptime:
[Mon Apr 22 10:20:58 2019][ 6643.068873] LustreError: 81562:0:(tgt_grant.c:563:tgt_grant_incoming()) fir-OST0011: cli b214bbaa-dfa8-5550-6c23-a4a7f34b2d1a/ffff970be7f44800 dirty 4169728 pend 0 grant -12582912
[Mon Apr 22 10:20:58 2019][ 6643.084869] LustreError: 81562:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
[Mon Apr 22 10:20:58 2019][ 6643.091935] Pid: 81562, comm: ll_ost_io01_025 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[Mon Apr 22 10:20:58 2019][ 6643.102364] Call Trace:
[Mon Apr 22 10:20:58 2019][ 6643.104866] [<ffffffffc09947cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Mon Apr 22 10:20:58 2019][ 6643.111440] [<ffffffffc099487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Mon Apr 22 10:20:58 2019][ 6643.117678] [<ffffffffc0dd66f0>] tgt_grant_prepare_read+0x0/0x3b0 [ptlrpc]
[Mon Apr 22 10:20:58 2019][ 6643.124732] [<ffffffffc0dd81d4>] tgt_grant_prepare_write+0x2f4/0xf70 [ptlrpc]
[Mon Apr 22 10:20:58 2019][ 6643.132032] [<ffffffffc13aea9c>] ofd_preprw_write.isra.31+0x14c/0xe60 [ofd]
[Mon Apr 22 10:20:58 2019][ 6643.139121] [<ffffffffc13afbd2>] ofd_preprw+0x422/0x1160 [ofd]
[Mon Apr 22 10:20:58 2019][ 6643.145075] [<ffffffffc0db54be>] tgt_brw_write+0xc7e/0x1a90 [ptlrpc]
[Mon Apr 22 10:20:58 2019][ 6643.151584] [<ffffffffc0db910a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Mon Apr 22 10:20:58 2019][ 6643.158566] [<ffffffffc0d5d6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Mon Apr 22 10:20:58 2019][ 6643.166284] [<ffffffffc0d6100c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Mon Apr 22 10:20:58 2019][ 6643.172657] [<ffffffff8aac1c31>] kthread+0xd1/0xe0
[Mon Apr 22 10:20:58 2019][ 6643.177576] [<ffffffff8b174c24>] ret_from_fork_nospec_begin+0xe/0x21
[Mon Apr 22 10:20:58 2019][ 6643.184081] [<ffffffffffffffff>] 0xffffffffffffffff
[Mon Apr 22 10:20:58 2019][ 6643.189112] Kernel panic - not syncing: LBUG
[Mon Apr 22 10:20:58 2019][ 6643.193392] CPU: 13 PID: 81562 Comm: ll_ost_io01_025 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1
[Mon Apr 22 10:20:58 2019][ 6643.206155] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.6.7 10/29/2018
[Mon Apr 22 10:20:58 2019][ 6643.213723] Call Trace:
[Mon Apr 22 10:20:58 2019][ 6643.216183] [<ffffffff8b161e41>] dump_stack+0x19/0x1b
[Mon Apr 22 10:20:58 2019][ 6643.221326] [<ffffffff8b15b550>] panic+0xe8/0x21f
[Mon Apr 22 10:20:59 2019][ 6643.226124] [<ffffffffc09948cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[Mon Apr 22 10:20:59 2019][ 6643.232358] [<ffffffffc0dd66f0>] tgt_grant_incoming.isra.6+0x540/0x540 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.239788] [<ffffffffc0dd81d4>] tgt_grant_prepare_write+0x2f4/0xf70 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.247038] [<ffffffffc0b18423>] ? lu_object_find_at+0x1b3/0x280 [obdclass]
[Mon Apr 22 10:20:59 2019][ 6643.254091] [<ffffffffc13aea9c>] ofd_preprw_write.isra.31+0x14c/0xe60 [ofd]
[Mon Apr 22 10:20:59 2019][ 6643.261144] [<ffffffffc13afbd2>] ofd_preprw+0x422/0x1160 [ofd]
[Mon Apr 22 10:20:59 2019][ 6643.267102] [<ffffffffc0db54be>] tgt_brw_write+0xc7e/0x1a90 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.273571] [<ffffffffc0d4f0a7>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.280047] [<ffffffffc0d76b4f>] ? __req_capsule_get+0x15f/0x740 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.286957] [<ffffffffc0af8279>] ? class_handle2object+0xb9/0x1c0 [obdclass]
[Mon Apr 22 10:20:59 2019][ 6643.294095] [<ffffffff8aae011c>] ? update_curr+0x14c/0x1e0
[Mon Apr 22 10:20:59 2019][ 6643.299673] [<ffffffff8aadca7e>] ? account_entity_dequeue+0xae/0xd0
[Mon Apr 22 10:20:59 2019][ 6643.306025] [<ffffffff8aadb738>] ? __enqueue_entity+0x78/0x80
[Mon Apr 22 10:20:59 2019][ 6643.311859] [<ffffffff8b165e92>] ? mutex_lock+0x12/0x2f
[Mon Apr 22 10:20:59 2019][ 6643.317212] [<ffffffffc0db910a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.324118] [<ffffffffc0d92801>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.331694] [<ffffffffc0994bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[Mon Apr 22 10:20:59 2019][ 6643.338772] [<ffffffffc0d5d6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.346462] [<ffffffffc0d5a565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.353251] [<ffffffff8aad67c2>] ? default_wake_function+0x12/0x20
[Mon Apr 22 10:20:59 2019][ 6643.359519] [<ffffffff8aacba9b>] ? __wake_up_common+0x5b/0x90
[Mon Apr 22 10:20:59 2019][ 6643.365380] [<ffffffffc0d6100c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.371679] [<ffffffffc0d60510>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[Mon Apr 22 10:20:59 2019][ 6643.379080] [<ffffffff8aac1c31>] kthread+0xd1/0xe0
[Mon Apr 22 10:20:59 2019][ 6643.383957] [<ffffffff8aac1b60>] ? insert_kthread_work+0x40/0x40
[Mon Apr 22 10:20:59 2019][ 6643.390052] [<ffffffff8b174c24>] ret_from_fork_nospec_begin+0xe/0x21
[Mon Apr 22 10:20:59 2019][ 6643.396497] [<ffffffff8aac1b60>] ? insert_kthread_work+0x40/0x40
The filesystem was down last night due to this LBUG. In the morning we ran fsck on this OST (with e2fsprogs-1.44.5.wc1-0.el7.x86_64), which fixed the following:
Apr 22 08:32:31 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Starting fsck
Apr 22 08:32:32 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 MMP interval is 10 seconds and total wait time is 42 seconds. Please wait...
Apr 22 08:33:15 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 fir-OST0011: recovering journal
Apr 22 08:33:42 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Pass 1: Checking inodes, blocks, and sizes
Apr 22 08:36:28 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Pass 2: Checking directory structure
Apr 22 08:37:14 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Pass 3: Checking directory connectivity
Apr 22 08:37:17 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Pass 4: Checking reference counts
Apr 22 08:37:25 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Pass 5: Checking group summary information
Apr 22 08:37:38 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Free blocks count wrong (10628562260, counted=10628555100).
Apr 22 08:37:38 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Fix? yes
Apr 22 08:37:38 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011
Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Free inodes count wrong (53346944, counted=53345769).
Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Fix? yes
Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011
Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Update quota info for quota type 0? yes
Apr 22 08:37:46 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011
Apr 22 08:37:47 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 Update quota info for quota type 1? yes
Apr 22 08:37:47 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011
Apr 22 08:38:13 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011
Apr 22 08:38:13 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 fir-OST0011: ***** FILE SYSTEM WAS MODIFIED *****
Apr 22 08:38:13 fir-io2-s2 journal: Shine.Lustre.Actions.Fsck INFO fir-OST0011 fir-OST0011: 7700375/61046144 files (4.1% non-contiguous), 4999233699/15627788800 blocks
But it just crashed anyway. Any recommendations are welcome. We're running 2.12.0 plus the following patches on all servers:
2eec4f8 LU-12018 quota: do not start a thread under memory pressure
1819063 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
3ed10f4 LU-11964 mdc: prevent glimpse lock count grow
565011c LU-12037 mdt: add option for cross-MDT rename
b6be1d9 LU-12065 lnd: increase CQ entries
6b2c97b LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
Those crashes seem to have started only after we applied "LU-11359 mdt: fix mdt_dom_discard_data() timeouts", which seems very weird... not sure how it could be related.
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34996/
Subject: LU-12120 grants: prevent negative ted_grant value
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7e08317ef5cbed5cd587017cbe343eb4cc52822c