[LU-12592] sanity 134a hits a panic from time to time. Created: 26/Jul/19 Updated: 29/Jul/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It looks like patch aa82cc8361 ("obdclass: put all service's env on the list") introduces one more regression.
[84431.019155] Lustre: DEBUG MARKER: == sanity test 134a: Server reclaims locks when reaching lock_reclaim_threshold ====================== 12:42:06 (1564134126)
[84431.770714] Lustre: *** cfs_fail_loc=327, val=0***
[84431.787957] LustreError: 10728:0:(ofd_internal.h:410:ofd_info()) ASSERTION( info ) failed:
[84431.788969] LustreError: 10728:0:(ofd_internal.h:410:ofd_info()) LBUG
[84431.790039] Pid: 10728, comm: mdt00_005 3.10.0-neo7.4.x86_64 #0 SMP Thu Nov 15 06:30:59 EST 2018
[84431.791221] Call Trace:
[84431.792331] [<ffffffff810434f2>] save_stack_trace_tsk+0x22/0x40
[84431.793382] [<ffffffffc06c47ec>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[84431.794333] [<ffffffffc06c489c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[84431.795260] [<ffffffffc1394e71>] ofd_exit+0x0/0x18f [ofd]
[84431.796169] [<ffffffffc1393a3b>] ofd_lvbo_update+0xd5b/0xe60 [ofd]
[84431.797248] [<ffffffffc0b1d6b5>] ldlm_handle_ast_error+0x475/0x860 [ptlrpc]
[84431.798655] [<ffffffffc0b1f32a>] ldlm_cb_interpret+0x19a/0x750 [ptlrpc]
[84431.800193] [<ffffffffc0b3a954>] ptlrpc_check_set.part.22+0x494/0x1e90 [ptlrpc]
[84431.801494] [<ffffffffc0b3c3ab>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[84431.802831] [<ffffffffc0b3c774>] ptlrpc_set_wait+0x344/0x7c0 [ptlrpc]
[84431.803902] [<ffffffffc0af8475>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[84431.805074] [<ffffffffc0b31235>] ldlm_reclaim_full+0x425/0x7a0 [ptlrpc]
[84431.806060] [<ffffffffc0b2243b>] ldlm_handle_enqueue0+0x13b/0x1650 [ptlrpc]
[84431.807019] [<ffffffffc0bae3a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[84431.807903] [<ffffffffc0bb70a8>] tgt_request_handle+0x998/0x1610 [ptlrpc]
[84431.808731] [<ffffffffc0b56966>] ptlrpc_server_handle_request+0x266/0xb30 [ptlrpc]
[84431.809561] [<ffffffffc0b5afc0>] ptlrpc_main+0xd20/0x1cf0 [ptlrpc]
[84431.810355] [<ffffffff810ce2df>] kthread+0xef/0x100
[84431.811077] [<ffffffff8178bedd>] ret_from_fork+0x5d/0xb0
[84431.811755] [<ffffffffffffffff>] 0xffffffffffffffff |
| Comments |
| Comment by Alex Zhuravlev [ 26/Jul/19 ] |
|
It would be good to have logs. There is already a ticket for this issue - |
| Comment by Alexey Lyashkov [ 26/Jul/19 ] |
|
I have several crash dumps (two?), but both with the default sanity debug logs.
00000100:00100000:0.0:1564134127.527799:0:10728:0:(service.c:2227:ptlrpc_server_handle_request()) Handling RPC req@ffff8800369e2c00 pname:cluuid+ref :pid:xid:nid:opc:job mdt00_005:29092c9d-0c24-a754-6f34-44253c4359c6+3009:9120:x1640112644356192:12345-0@lo:101:jobid_name=touch.0
00010000:00010000:0.0:1564134127.527807:0:10728:0:(ldlm_lockd.c:1186:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:02000000:0.0:1564134127.527809:0:10728:0:(libcfs_fail.h:92:cfs_fail_check_set()) *** cfs_fail_loc=327, val=0***
00010000:00010000:0.0:1564134127.531157:0:10728:0:(ldlm_reclaim.c:214:ldlm_reclaim_res()) NS(filter-lustre-OST0000_UUID): 512 locks to be reclaimed, found 512/512 locks.
00000100:00100000:0.0:1564134127.544998:0:10728:0:(client.c:1675:ptlrpc_send_new_req()) Sending RPC req@ffff88010fc91b80 pname:cluuid:pid:xid:nid:opc:job mdt00_005:lustre-OST0000_UUID:10728:1640112644364384:0@lo:104:(null)
00000100:00100000:0.0:1564134127.545004:0:10728:0:(events.c:354:request_in_callback()) peer: 12345-0@lo (source: 12345-0@lo)
00000100:00100000:0.0:1564134127.545036:0:10728:0:(client.c:2408:ptlrpc_set_wait()) set ffff880008306300 going to sleep for 6 seconds
00010000:00010000:0.0:1564134127.545045:0:10728:0:(ldlm_lockd.c:696:ldlm_handle_ast_error()) ### client (nid 0@lo) returned -22 from blocking AST (req@ffff8800369e3180 x1640112644356208) - normal race ns: filter-lustre-OST0000_UUID lock: ffff8800506a1800/0xe918d2a075fe22a8 lrc: 4/0,0 mode: PR/PR res: [0xc279:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 0@lo remote: 0xe918d2a075fe229a expref: 583 pid: 14303 timeout: 84529 lvb_type: 1
00002000:00040000:0.0:1564134127.545051:0:10728:0:(ofd_internal.h:410:ofd_info()) ASSERTION( info ) failed:
thr ffff8800763dde00 - pid 10728
req ffff8800369e2c00
request ffff8800369e2c00
from 0.0.0.0@9:0
to 0.0.0.0@9:0
xid 1640112644356192:0
transno 0
time 18446612134297984512:1564134133:0
flags 3
lens 184 104 8 136 0 0 2 240 17 37 0
ptlrpc body(v3) ffff880137d56598
type 4711
opc 101 LDLM_ENQUEUE
Intent: OPEN+CREATE
So an open+create request triggered ldlm_reclaim_full(), which also flushes extent locks, but the MDT thread has no DT_THREAD env for ofd_info(). It looks like this bug can be triggered even in a single-node setup; see the sketch below. |
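For reference, a paraphrased sketch of the assertion site named in the stack trace (ofd_info() in ofd_internal.h); it follows the usual lu_context_key_get() pattern, and the exact code in the tested tree may differ. The OFD per-thread info is only populated in envs tagged for DT (server storage) threads, so when the blocking-AST error path reaches ofd_lvbo_update() from an MDT service thread's env, the lookup returns NULL and the LASSERT fires:

static inline struct ofd_thread_info *ofd_info(const struct lu_env *env)
{
	struct ofd_thread_info *info;

	/* ofd_thread_key holds OFD per-thread data and is only set up
	 * for DT-tagged contexts; an env created for an mdt00_* service
	 * thread carries no value for this key */
	info = lu_context_key_get(&env->le_ctx, &ofd_thread_key);
	LASSERT(info);	/* <-- the ASSERTION( info ) failed in the panic */
	return info;
}

This matches the reported path: ldlm_reclaim_full() runs the blocking-AST work directly from the enqueue-handling MDT thread, and the AST error path calls into ofd_lvbo_update() with that thread's env.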
| Comment by Alexey Lyashkov [ 26/Jul/19 ] |
|
So the regression is from:
commit fe60e0135ee2334440247cde167b707b223cf11d
Author: Niu Yawei <yawei.niu@intel.com>
Date: Thu May 21 11:07:54 2015 -0400
LU-6529 ldlm: reclaim granted locks defensively
I don't understand this patch at all. The server lock volume (SLV) was created for the same purpose, so reducing the SLV will reduce ldlm lock memory consumption. My suggestion: just revert this patch and use the SLV for the same purpose. |
| Comment by Oleg Drokin [ 26/Jul/19 ] |
|
Same crashes with SSK at a 100% rate. |
| Comment by Alexey Lyashkov [ 29/Jul/19 ] |
|
Based on a discussion with Oleg, it looks like two ways to fix this exist. The current patch needs to interact with LRU resize so that the hard limit cannot be set above the SLV, otherwise a nonsensical situation arises; see the sketch below. |
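To make that constraint concrete, a minimal hypothetical sketch (the helper name and both parameters are invented for illustration and are not actual Lustre code): whichever tunable establishes the hard reclaim limit would be clamped so it never exceeds the limit implied by the server lock volume.

/* hypothetical helper, not actual Lustre code; __u64 as in <linux/types.h> */
static __u64 reclaim_limit_clamp(__u64 requested_hard_limit, __u64 slv_limit)
{
	/* never let the hard lock-reclaim limit exceed what the
	 * SLV-driven LRU resize already allows, so the two
	 * mechanisms cannot contradict each other */
	return requested_hard_limit > slv_limit ? slv_limit :
						  requested_hard_limit;
}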