Lustre / LU-12592

sanity 134a hits a panic from time to time.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.13.0
    • None
    • Severity: 3
    • 9223372036854775807

    Description

      It looks like patch aa82cc8361 ("obdclass: put all service's env on the list") introduced one more regression:
      the sanity test hits a panic sometimes.

      [84431.019155] Lustre: DEBUG MARKER: == sanity test 134a: Server reclaims locks when reaching lock_reclaim_threshold ====================== 12:42:06
       (1564134126)
      [84431.770714] Lustre: *** cfs_fail_loc=327, val=0***
      [84431.787957] LustreError: 10728:0:(ofd_internal.h:410:ofd_info()) ASSERTION( info ) failed:
      [84431.788969] LustreError: 10728:0:(ofd_internal.h:410:ofd_info()) LBUG
      [84431.790039] Pid: 10728, comm: mdt00_005 3.10.0-neo7.4.x86_64 #0 SMP Thu Nov 15 06:30:59 EST 2018
      [84431.791221] Call Trace:
      [84431.792331]  [<ffffffff810434f2>] save_stack_trace_tsk+0x22/0x40
      [84431.793382]  [<ffffffffc06c47ec>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [84431.794333]  [<ffffffffc06c489c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [84431.795260]  [<ffffffffc1394e71>] ofd_exit+0x0/0x18f [ofd]
      [84431.796169]  [<ffffffffc1393a3b>] ofd_lvbo_update+0xd5b/0xe60 [ofd]
      [84431.797248]  [<ffffffffc0b1d6b5>] ldlm_handle_ast_error+0x475/0x860 [ptlrpc]
      [84431.798655]  [<ffffffffc0b1f32a>] ldlm_cb_interpret+0x19a/0x750 [ptlrpc]
      [84431.800193]  [<ffffffffc0b3a954>] ptlrpc_check_set.part.22+0x494/0x1e90 [ptlrpc]
      [84431.801494]  [<ffffffffc0b3c3ab>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      [84431.802831]  [<ffffffffc0b3c774>] ptlrpc_set_wait+0x344/0x7c0 [ptlrpc]
      [84431.803902]  [<ffffffffc0af8475>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      [84431.805074]  [<ffffffffc0b31235>] ldlm_reclaim_full+0x425/0x7a0 [ptlrpc]
      [84431.806060]  [<ffffffffc0b2243b>] ldlm_handle_enqueue0+0x13b/0x1650 [ptlrpc]
      [84431.807019]  [<ffffffffc0bae3a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [84431.807903]  [<ffffffffc0bb70a8>] tgt_request_handle+0x998/0x1610 [ptlrpc]
      [84431.808731]  [<ffffffffc0b56966>] ptlrpc_server_handle_request+0x266/0xb30 [ptlrpc]
      [84431.809561]  [<ffffffffc0b5afc0>] ptlrpc_main+0xd20/0x1cf0 [ptlrpc]
      [84431.810355]  [<ffffffff810ce2df>] kthread+0xef/0x100
      [84431.811077]  [<ffffffff8178bedd>] ret_from_fork+0x5d/0xb0
      [84431.811755]  [<ffffffffffffffff>] 0xffffffffffffffff
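
      The assertion fires in ofd_info(), which is essentially the usual per-thread lu_context key lookup. A minimal sketch of that pattern (paraphrased, not quoted verbatim from the tree):

      static inline struct ofd_thread_info *ofd_info(const struct lu_env *env)
      {
              struct ofd_thread_info *info;

              /* per-thread data attached to the env; NULL if this env was
               * never set up with the tag the ofd key is registered under */
              info = lu_context_key_get(&env->le_ctx, &ofd_thread_key);
              LASSERT(info);
              return info;
      }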
      

      Attachments

        Issue Links

          Activity

            [LU-12592] sanity 134a hits a panic from time to time.

            shadow Alexey Lyashkov added a comment -

            Based on a discussion with Oleg, it looks like there are two ways to fix this:
            1) fast - just limit the flush to the same LDLM namespace type as the one the request originated in.
            2) longer - but better.

            The current patch needs to interact with LRU resize so that the hard limit cannot be set over the SLV, otherwise an absurd situation is possible:
            this tunable can be set lower than the LRU resize limit, which blocks LRU resize from working at all.
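
            For illustration only, option 1 could look roughly like the sketch below; the list/field and helper names (server_namespace_list, ns_list_chain, ns_is_extent_type(), ldlm_reclaim_one_ns()) are hypothetical, not existing Lustre symbols:

            /* Hypothetical sketch of option 1: only reclaim from namespaces of the
             * same type as the one the enqueue arrived on, so e.g. an MDT service
             * thread never ends up cancelling OST extent locks. */
            static void ldlm_reclaim_full_same_type(struct ldlm_namespace *req_ns)
            {
                    struct ldlm_namespace *ns;

                    list_for_each_entry(ns, &server_namespace_list, ns_list_chain) {
                            if (ns_is_extent_type(ns) != ns_is_extent_type(req_ns))
                                    continue;        /* different namespace type, skip */
                            ldlm_reclaim_one_ns(ns); /* reclaim within this namespace */
                    }
            }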
            green Oleg Drokin added a comment -

            same crashes with SSK at 100% rate

            shadow Alexey Lyashkov added a comment - - edited

            So this is a regression from:

            commit fe60e0135ee2334440247cde167b707b223cf11d
            Author: Niu Yawei <yawei.niu@intel.com>
            Date:   Thu May 21 11:07:54 2015 -0400
            
                LU-6529 ldlm: reclaim granted locks defensively
            

            I don't understand this patch at all. The server lock volume (SLV) was created for the same purpose, so reducing the SLV will reduce LDLM lock memory consumption. So my suggestion: just revert this patch and use the SLV for the same purpose.
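
            (For readers unfamiliar with SLV, the idea is roughly the feedback loop sketched below; all names are made up for illustration and are not Lustre symbols:)

            /* Toy model of the SLV feedback: the server lowers the advertised
             * server lock volume when it wants memory back, and each client keeps
             * the volume of locks it caches below that value by cancelling unused
             * LRU locks, so no separate hard reclaim threshold is needed. */
            static void toy_client_obey_slv(struct toy_client *cl, u64 server_lock_volume)
            {
                    while (cl->cached_lock_volume > server_lock_volume &&
                           !list_empty(&cl->lru_locks))
                            toy_cancel_oldest_unused_lock(cl);
            }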

            shadow Alexey Lyashkov added a comment - - edited

            I have several crash dumps (two?), but both with default sanity debug logs.
            Based on the logs, the thread started to cancel a set of extent locks.

            00000100:00100000:0.0:1564134127.527799:0:10728:0:(service.c:2227:ptlrpc_server_handle_request()) Handling RPC req@ffff8800369e2c00 pname:cluuid+ref
            :pid:xid:nid:opc:job mdt00_005:29092c9d-0c24-a754-6f34-44253c4359c6+3009:9120:x1640112644356192:12345-0@lo:101:jobid_name=touch.0
            00010000:00010000:0.0:1564134127.527807:0:10728:0:(ldlm_lockd.c:1186:ldlm_handle_enqueue0()) ### server-side enqueue handler START
            00010000:02000000:0.0:1564134127.527809:0:10728:0:(libcfs_fail.h:92:cfs_fail_check_set()) *** cfs_fail_loc=327, val=0***
            00010000:00010000:0.0:1564134127.531157:0:10728:0:(ldlm_reclaim.c:214:ldlm_reclaim_res()) NS(filter-lustre-OST0000_UUID): 512 locks to be reclaimed, found 512/512 locks.
            00000100:00100000:0.0:1564134127.544998:0:10728:0:(client.c:1675:ptlrpc_send_new_req()) Sending RPC req@ffff88010fc91b80 pname:cluuid:pid:xid:nid:opc:job mdt00_005:lustre-OST0000_UUID:10728:1640112644364384:0@lo:104:(null)
            00000100:00100000:0.0:1564134127.545004:0:10728:0:(events.c:354:request_in_callback()) peer: 12345-0@lo (source: 12345-0@lo)
            00000100:00100000:0.0:1564134127.545036:0:10728:0:(client.c:2408:ptlrpc_set_wait()) set ffff880008306300 going to sleep for 6 seconds
            00010000:00010000:0.0:1564134127.545045:0:10728:0:(ldlm_lockd.c:696:ldlm_handle_ast_error()) ### client (nid 0@lo) returned -22 from blocking AST (req@ffff8800369e3180 x1640112644356208) - normal race ns: filter-lustre-OST0000_UUID lock: ffff8800506a1800/0xe918d2a075fe22a8 lrc: 4/0,0 mode: PR/PR res: [0xc279:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 0@lo remote: 0xe918d2a075fe229a expref: 583 pid: 14303 timeout: 84529 lvb_type: 1
            00002000:00040000:0.0:1564134127.545051:0:10728:0:(ofd_internal.h:410:ofd_info()) ASSERTION( info ) failed:
            
                            thr ffff8800763dde00 - pid 10728
                                    req ffff8800369e2c00
            request ffff8800369e2c00
            from 0.0.0.0@9:0
            to 0.0.0.0@9:0
            xid 1640112644356192:0
            transno 0
            time 18446612134297984512:1564134133:0
            flags 3
            lens 184 104 8 136 0 0 2 240 17 37 0
            ptlrpc body(v3) ffff880137d56598
                    type 4711
                    opc 101 LDLM_ENQUEUE
            Intent: OPEN+CREATE
            

            So an open+create request triggered ldlm_reclaim_full(), which flushes extent locks as well, but there is no DT_THREAD env for the MDT thread.

            It looks like this bug can be triggered even on a single-node setup.
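
            A minimal illustration of the env mismatch, assuming ofd_thread_key is registered for DT threads only (I have not re-checked the exact tag mask):

            static void illustrate_missing_dt_env(void)
            {
                    struct lu_env env;
                    struct ofd_thread_info *info;
                    int rc;

                    /* an MDT service thread initialises its env with MD tags only ... */
                    rc = lu_env_init(&env, LCT_MD_THREAD);
                    if (rc != 0)
                            return;

                    /* ... so a key registered under LCT_DT_THREAD is never instantiated
                     * in this context; the lookup done by ofd_info() returns NULL here
                     * and trips the LASSERT seen in the crash */
                    info = lu_context_key_get(&env.le_ctx, &ofd_thread_key);

                    lu_env_fini(&env);
            }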


            bzzz Alex Zhuravlev added a comment -

            It would be good to have logs.
            The very first "confusion" is that an MDT thread is handling extent locks (ofd_lvbo_update()).

            There is already a ticket for this issue: LU-12570.

            People

              Assignee: wc-triage WC Triage
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: