  Lustre / LU-13089

ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.12.4

    Description

      I hit this in b2_12-next testing, but there's nothing I see in there that appears related, so I think this is just a super rare race that's pretty similar in nature to LU-4269,

      except this time it's a glimpse cb vs cancel cb race.

      console gives us:

      [19625.122592] LustreError: 8347:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed: 
      [19625.124370] LustreError: 8347:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
      [19625.124704] LustreError: 8367:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed: 
      [19625.124705] LustreError: 8367:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
      [19625.124706] Pid: 8367, comm: ldlm_bl_02 3.10.0-7.7-debug #1 SMP Wed Oct 30 09:43:00 EDT 2019
      [19625.124707] Call Trace:
      [19625.124739]  [<ffffffffa01f97dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [19625.124744]  [<ffffffffa01f988c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [19625.124796]  [<ffffffffa0638197>] ldlm_lock_put+0x557/0x730 [ptlrpc]
      [19625.124821]  [<ffffffffa0639980>] ldlm_lock_destroy_nolock+0x50/0x110 [ptlrpc]
      [19625.124839]  [<ffffffffa063febf>] ldlm_lock_cancel+0x6f/0x1f0 [ptlrpc]
      [19625.124866]  [<ffffffffa0656a3a>] ldlm_cli_cancel_local+0xca/0x3f0 [ptlrpc]
      [19625.124887]  [<ffffffffa065c767>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
      [19625.124917]  [<ffffffffa08a8b1a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
      [19625.124949]  [<ffffffffa0668835>] ldlm_handle_bl_callback+0xf5/0x4f0 [ptlrpc]
      [19625.124977]  [<ffffffffa06693e8>] ldlm_bl_thread_main+0x7b8/0x980 [ptlrpc]
      [19625.125040]  [<ffffffff810b8254>] kthread+0xe4/0xf0
      [19625.125053]  [<ffffffff817e0df7>] ret_from_fork_nospec_end+0x0/0x39
      [19625.125063]  [<ffffffffffffffff>] 0xffffffffffffffff
      [19625.125067] Kernel panic - not syncing: LBUG
      

      and the pid 8347 backtrace is

       #4 [ffff88003e313820] vt_console_print at ffffffff814dbe54
       #5 [ffff88003e313880] call_console_drivers.constprop.17 at ffffffff8108cf00
       #6 [ffff88003e3138a8] console_unlock at ffffffff8108e559
       #7 [ffff88003e3138e8] vprintk_emit at ffffffff8108e984
       #8 [ffff88003e313958] vprintk_default at ffffffff8108ed79
       #9 [ffff88003e313968] printk at ffffffff817c67d1
      #10 [ffff88003e3139c8] cfs_print_to_console at ffffffffa01f958a [libcfs]
      #11 [ffff88003e3139f8] libcfs_debug_vmsg2 at ffffffffa01ff9e3 [libcfs]
      #12 [ffff88003e313b48] libcfs_debug_msg at ffffffffa0200077 [libcfs]
      #13 [ffff88003e313ba8] lbug_with_loc at ffffffffa01f9866 [libcfs]
      #14 [ffff88003e313bc8] ldlm_lock_put at ffffffffa0638266 [ptlrpc]
      #15 [ffff88003e313be8] osc_ldlm_glimpse_ast at ffffffffa08a8f28 [osc]
      #16 [ffff88003e313ca0] ldlm_callback_handler at ffffffffa066a0a8 [ptlrpc]
      #17 [ffff88003e313d18] ldlm_callback_handler at ffffffffa066b3c7 [ptlrpc]
      #18 [ffff88003e313d30] ptlrpc_server_handle_request at ffffffffa0699266 [ptlrpc]
      #19 [ffff88003e313dd0] ptlrpc_main at ffffffffa069d261 [ptlrpc]
      #20 [ffff88003e313ea8] kthread at ffffffff810b8254
      

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Believed to be a duplicate of LU-11719

            gerrit Gerrit Updater added a comment - - edited

            Bobi Jam (bobijam@hotmail.com) abandoned the patch: https://review.whamcloud.com/40286

            gerrit Gerrit Updater added a comment - - edited

            Bobi Jam (bobijam@hotmail.com) uploaded a patch port: https://review.whamcloud.com/40399

            search_itree() was glitched and messed up data->lmd_lock.

            kim.sebo Kim Sebo added a comment -

            We at ANU/NCI are also seeing this assert, ~10 times per month across 3000 clients. Clients are CentOS 8 / 2.12.5 plus some backports.

            lflis Lukasz Flis added a comment - - edited

            Applying https://review.whamcloud.com/39819 fixed the problem at CYFRONET.
            Before going to production with this fix we need to test whether LU-11670 is a problem for us.

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39819
            Subject: LU-13089 osc: revert "glimpse - search for active lock"
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 37205c60bb2d99363a0c9dbf29d8f4fd684b6fab
            ofaaland Olaf Faaland added a comment - - edited

            We are seeing this regularly at LLNL with Lustre 2.12.4_5.chaos.

            lflis Lukasz Flis added a comment -

            CYFRONET here,

            Since migrating to 2.12 we have noted that some types of HPC jobs are crashing the nodes (a few crashes a month across 2200 compute nodes). Until today we were not able to reproduce the issue; now we believe we have a workload which triggers it every time it's run.

            [2195888.180647] LustreError: 3966:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:

            Client version is 2.12.4

            If you need a debug log for any subsystem or other debug data, please let us know. We'd be happy to provide more info which could help with diagnosis.

            Best Regards

            Lukasz Flis

            bobijam Zhenyu Xu added a comment -

            This looks more like a lock reference-count issue than a race in setting the lock's flags.

            From the log, the lock was requested to be cancelled, and process 8367 handled it and was trying to flush the data covered by it. The lock's reference counting looks OK up to this point.

            428786 00010000:00010000:3.0:1576584703.467448:0:8367:0:(ldlm_lockd.c:1730:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: ?? lock: ffff880012bd5d80/0x232daeb61003014b lrc: 3/0,0 mode: PW/PW res: ?? rrc=?? type: ??? flags: 0x420000020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
            428787 00010000:00010000:3.0:1576584703.467450:0:8367:0:(ldlm_lockd.c:1761:ldlm_handle_bl_callback()) Lock ffff880012bd5d80 already unused, calling callback (ffffffffa08a89a0)
            428788 00010000:00010000:3.0:1576584703.467452:0:8367:0:(ldlm_request.c:1268:ldlm_cli_cancel_local()) ### client-side cancel ns: ?? lock: ffff880012bd5d80/0x232daeb61003014b lrc: 4/0,0 mode: PW/PW res:?? rrc=?? type: ??? flags: 0x428400020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
            428790 00000008:00000020:3.0:1576584703.467457:0:8367:0:(osc_cache.c:3008:osc_cache_wait_range()) obj ffff88002b179e60 ready 0|-|- wr 0|-|- rd 0|- sync file range.
            428791 00000008:00000020:3.0:1576584703.467459:0:8367:0:(osc_cache.c:3140:osc_cache_writeback_range()) obj ffff88002b179e60 ready 0|-|- wr 0|-|- rd 0|- pageout [224, 479], 0.                            
            428792 00000008:00000020:3.0:1576584703.467461:0:8367:0:(osc_lock.c:375:osc_lock_flush()) object ffff88002b179e60: [224 -> 479] 0 pages were written back.
            428794 00000080:00200000:3.0:1576584703.467465:0:8367:0:(vvp_io.c:1469:vvp_io_init()) [0x200000402:0x70c0:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
            

            Meanwhile process 8347, handling the glimpse callback, tripped over the lock before the flush finished; its ref count had strangely dropped to 0 by then, even though the lock's destroy flag had certainly not been set yet.

            428898 00010000:00010000:1.0:1576584703.468909:0:8347:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0001-osc-ffff88003f2f8800 lock: ffff880012bd5d80/0x232daeb61003014b lrc: 0/0,0 mode: PW/PW res: [0x12cb:0x0:0x0].0x0 rrc: 5 type: EXT [917504->1966079] (req 917504->983039) flags: 0x429400020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
            428899 00010000:00040000:1.0:1576584703.468914:0:8347:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed: 
            

            Process 8367 then picked the flush IO back up and hit the lock's erroneous ref count.

            428913 00000080:00200000:3.0:1576584703.471016:0:8367:0:(vvp_io.c:312:vvp_io_fini()) [0x200000402:0x70c0:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0
            428914 00000008:00000020:3.0:1576584703.471021:0:8367:0:(osc_object.c:183:osc_attr_update()) set kms from 1966080to 0 
            428915 00010000:00040000:3.0:1576584703.471027:0:8367:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed: 
            428916 00010000:00040000:3.0:1576584703.471030:0:8367:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
            
            green Oleg Drokin added a comment - - edited

            Attached is lustre.txt, the Lustre debug log extracted from the core file, where you can clearly see the race unfolding.


            People

              Assignee: wc-triage WC Triage
              Reporter: green Oleg Drokin
              Votes: 2
              Watchers: 10