[LU-12592] sanity 134a hits a panic from time to time. Created: 26/Jul/19  Updated: 29/Jul/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12570 sanity test 134a crash with SSK in use Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like patch aa82cc8361 ("obdclass: put all service's env on the list") introduced one more regression.
The sanity test hits a panic sometimes.

[84431.019155] Lustre: DEBUG MARKER: == sanity test 134a: Server reclaims locks when reaching lock_reclaim_threshold ====================== 12:42:06 (1564134126)
[84431.770714] Lustre: *** cfs_fail_loc=327, val=0***
[84431.787957] LustreError: 10728:0:(ofd_internal.h:410:ofd_info()) ASSERTION( info ) failed:
[84431.788969] LustreError: 10728:0:(ofd_internal.h:410:ofd_info()) LBUG
[84431.790039] Pid: 10728, comm: mdt00_005 3.10.0-neo7.4.x86_64 #0 SMP Thu Nov 15 06:30:59 EST 2018
[84431.791221] Call Trace:
[84431.792331]  [<ffffffff810434f2>] save_stack_trace_tsk+0x22/0x40
[84431.793382]  [<ffffffffc06c47ec>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[84431.794333]  [<ffffffffc06c489c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[84431.795260]  [<ffffffffc1394e71>] ofd_exit+0x0/0x18f [ofd]
[84431.796169]  [<ffffffffc1393a3b>] ofd_lvbo_update+0xd5b/0xe60 [ofd]
[84431.797248]  [<ffffffffc0b1d6b5>] ldlm_handle_ast_error+0x475/0x860 [ptlrpc]
[84431.798655]  [<ffffffffc0b1f32a>] ldlm_cb_interpret+0x19a/0x750 [ptlrpc]
[84431.800193]  [<ffffffffc0b3a954>] ptlrpc_check_set.part.22+0x494/0x1e90 [ptlrpc]
[84431.801494]  [<ffffffffc0b3c3ab>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[84431.802831]  [<ffffffffc0b3c774>] ptlrpc_set_wait+0x344/0x7c0 [ptlrpc]
[84431.803902]  [<ffffffffc0af8475>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[84431.805074]  [<ffffffffc0b31235>] ldlm_reclaim_full+0x425/0x7a0 [ptlrpc]
[84431.806060]  [<ffffffffc0b2243b>] ldlm_handle_enqueue0+0x13b/0x1650 [ptlrpc]
[84431.807019]  [<ffffffffc0bae3a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[84431.807903]  [<ffffffffc0bb70a8>] tgt_request_handle+0x998/0x1610 [ptlrpc]
[84431.808731]  [<ffffffffc0b56966>] ptlrpc_server_handle_request+0x266/0xb30 [ptlrpc]
[84431.809561]  [<ffffffffc0b5afc0>] ptlrpc_main+0xd20/0x1cf0 [ptlrpc]
[84431.810355]  [<ffffffff810ce2df>] kthread+0xef/0x100
[84431.811077]  [<ffffffff8178bedd>] ret_from_fork+0x5d/0xb0
[84431.811755]  [<ffffffffffffffff>] 0xffffffffffffffff
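
For reference, the assertion that fires is the per-thread info lookup in ofd_info(). Roughly (paraphrased from ofd_internal.h, not an exact quote), it does the following:

static inline struct ofd_thread_info *ofd_info(const struct lu_env *env)
{
        struct ofd_thread_info *info;

        /* per-thread OFD info is stored in the env under a context key */
        info = lu_context_key_get(&env->le_ctx, &ofd_thread_key);
        LASSERT(info);  /* <-- the ASSERTION( info ) in the trace above */
        return info;
}

If the calling thread's env was never set up with the OFD/DT context key, lu_context_key_get() returns NULL and the LASSERT() turns into the LBUG above.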


 Comments   
Comment by Alex Zhuravlev [ 26/Jul/19 ]

It would be good to have logs.
The very first "confusion" is that an mdt thread is handling extent locks (ofd_lvbo_update()).

There is already a ticket for this issue - LU-12570.

Comment by Alexey Lyashkov [ 26/Jul/19 ]

I have several crash dumps (two?), but both with only the default sanity debug logs.
Based on the logs, the thread started to cancel a batch of extent locks.

00000100:00100000:0.0:1564134127.527799:0:10728:0:(service.c:2227:ptlrpc_server_handle_request()) Handling RPC req@ffff8800369e2c00 pname:cluuid+ref:pid:xid:nid:opc:job mdt00_005:29092c9d-0c24-a754-6f34-44253c4359c6+3009:9120:x1640112644356192:12345-0@lo:101:jobid_name=touch.0
00010000:00010000:0.0:1564134127.527807:0:10728:0:(ldlm_lockd.c:1186:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:02000000:0.0:1564134127.527809:0:10728:0:(libcfs_fail.h:92:cfs_fail_check_set()) *** cfs_fail_loc=327, val=0***
00010000:00010000:0.0:1564134127.531157:0:10728:0:(ldlm_reclaim.c:214:ldlm_reclaim_res()) NS(filter-lustre-OST0000_UUID): 512 locks to be reclaimed, found 512/512 locks.
00000100:00100000:0.0:1564134127.544998:0:10728:0:(client.c:1675:ptlrpc_send_new_req()) Sending RPC req@ffff88010fc91b80 pname:cluuid:pid:xid:nid:opc:job mdt00_005:lustre-OST0000_UUID:10728:1640112644364384:0@lo:104:(null)
00000100:00100000:0.0:1564134127.545004:0:10728:0:(events.c:354:request_in_callback()) peer: 12345-0@lo (source: 12345-0@lo)
00000100:00100000:0.0:1564134127.545036:0:10728:0:(client.c:2408:ptlrpc_set_wait()) set ffff880008306300 going to sleep for 6 seconds
00010000:00010000:0.0:1564134127.545045:0:10728:0:(ldlm_lockd.c:696:ldlm_handle_ast_error()) ### client (nid 0@lo) returned -22 from blocking AST (req@ffff8800369e3180 x1640112644356208) - normal race ns: filter-lustre-OST0000_UUID lock: ffff8800506a1800/0xe918d2a075fe22a8 lrc: 4/0,0 mode: PR/PR res: [0xc279:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 0@lo remote: 0xe918d2a075fe229a expref: 583 pid: 14303 timeout: 84529 lvb_type: 1
00002000:00040000:0.0:1564134127.545051:0:10728:0:(ofd_internal.h:410:ofd_info()) ASSERTION( info ) failed:
                thr ffff8800763dde00 - pid 10728
                        req ffff8800369e2c00
request ffff8800369e2c00
from 0.0.0.0@9:0
to 0.0.0.0@9:0
xid 1640112644356192:0
transno 0
time 18446612134297984512:1564134133:0
flags 3
lens 184 104 8 136 0 0 2 240 17 37 0
ptlrpc body(v3) ffff880137d56598
        type 4711
        opc 101 LDLM_ENQUEUE
Intent: OPEN+CREATE

So the open+create request triggered ldlm_reclaim_full(), which flushes extent locks as well, but there is no DT_THREAD env for the mdt thread.

It looks like this bug can be triggered even in a single-node setup.
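
To make the failure mode concrete, here is a small stand-alone model of the lu_context key mechanism (plain C, compiles with gcc; all names here are illustrative, not the Lustre API): a key tagged for DT-side threads is never instantiated in an env created with only the MD-side tag, so the lookup returns NULL and the assertion trips.

#include <assert.h>
#include <stdio.h>

#define LCT_MD_THREAD 0x1                       /* metadata service thread */
#define LCT_DT_THREAD 0x2                       /* data (OST/OFD) service thread */

struct key { unsigned tags; void *value; };
struct ctx { unsigned tags; struct key *keys[4]; int nr; };

/* instantiate only the keys whose tags overlap the context tags */
static void ctx_init(struct ctx *c, unsigned tags, struct key **keys, int nr)
{
        c->tags = tags;
        c->nr = nr;
        for (int i = 0; i < nr; i++)
                c->keys[i] = (keys[i]->tags & tags) ? keys[i] : NULL;
}

static void *ctx_key_get(struct ctx *c, int idx)
{
        return c->keys[idx] ? c->keys[idx]->value : NULL;
}

int main(void)
{
        static int ofd_info_storage;
        struct key ofd_thread_key = { .tags = LCT_DT_THREAD,
                                      .value = &ofd_info_storage };
        struct key *all_keys[] = { &ofd_thread_key };
        struct ctx mdt_env;

        /* mdt00_* service thread: its env carries only the MD tag */
        ctx_init(&mdt_env, LCT_MD_THREAD, all_keys, 1);

        /* the reclaim path wanders into OFD code and asks for the DT-tagged key */
        void *info = ctx_key_get(&mdt_env, 0);
        printf("ofd_info from mdt env: %p\n", info);    /* prints (nil) */
        assert(info != NULL);                           /* models ASSERTION( info ) -> LBUG */
        return 0;
}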

Comment by Alexey Lyashkov [ 26/Jul/19 ]

So this is a regression from:

commit fe60e0135ee2334440247cde167b707b223cf11d
Author: Niu Yawei <yawei.niu@intel.com>
Date:   Thu May 21 11:07:54 2015 -0400

    LU-6529 ldlm: reclaim granted locks defensively

I don't understand this patch at all. The server lock volume (SLV) was created for the same purpose, so reducing the SLV will reduce ldlm lock memory consumption. So my suggestion is to just revert this patch and use the SLV for the same purpose.

Comment by Oleg Drokin [ 26/Jul/19 ]

The same crash happens with SSK at a 100% rate.

Comment by Alexey Lyashkov [ 29/Jul/19 ]

Based on a discussion with Oleg, it looks like two ways to fix this exist:
1) fast - just limit the flush to the same ldlm namespace type the request originated on (see the sketch below).
2) long - but better.

The current patch needs to interact with LRU resize so that the hard limit is not allowed to override the SLV, otherwise a silly situation arises: this tunable can be set lower than the LRU resize limit, and that blocks LRU resize from working at all.
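
A rough sketch of what option 1) could look like (the helper, the extra "origin" argument it implies, and the field accesses are assumptions for illustration, not the current ldlm_reclaim.c code):

/* Hypothetical helper: reclaim only from namespaces backed by the same
 * target type as the namespace the triggering enqueue arrived on, so an
 * MDT-side request never forces cancellation of OST extent locks from a
 * thread that has no DT env.  ldlm_reclaim_full() would skip any
 * namespace for which this returns false. */
static bool ldlm_reclaim_ns_compatible(struct ldlm_namespace *ns,
                                       struct ldlm_namespace *origin)
{
        /* assumes ns_obd/obd_type are reachable from the namespace */
        return ns->ns_obd->obd_type == origin->ns_obd->obd_type;
}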
