[LU-13908] ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.12.4
    • 2.12.4_5.chaos
      toss 3.6-2 (RHEL 7.8)
    • 3
    • 9223372036854775807

    Description

      On compute nodes, an ASSERT fails and the node crashes. One node reports two failed ASSERTs in the dumped log:

      LustreError: 13759:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
      LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
      Pid: 10188, comm: ldlm_bl_16 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
      Call Trace:
       [<ffffffffc0a637ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
       [<ffffffffc0a638ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
       [<ffffffffc16cb366>] ldlm_lock_put+0x616/0x7b0 [ptlrpc]
       [<ffffffffc0c5828b>] osc_extent_put+0x6b/0x320 [osc]
       [<ffffffffc0c645fb>] osc_cache_wait_range+0x30b/0x960 [osc]
       [<ffffffffc0c655ce>] osc_cache_writeback_range+0x97e/0x1000 [osc]
       [<ffffffffc0c51195>] osc_lock_flush+0x195/0x290 [osc]
       [<ffffffffc0c51653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
       [<ffffffffc16d2dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
       [<ffffffffc16ea620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
       [<ffffffffc16f03f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
       [<ffffffffc0c514ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
       [<ffffffffc16fc618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
       [<ffffffffc16fd230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
       [<ffffffffbaccca01>] kthread+0xd1/0xe0
       [<ffffffffbb3bff5d>] ret_from_fork_nospec_begin+0x7/0x21
       [<ffffffffffffffff>] 0xffffffffffffffff
      Kernel panic - not syncing: LBUG
      CPU: 53 PID: 10188 Comm: ldlm_bl_16 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
      Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018
      
      

      The other reports the same ASSERT twice:

      LustreError: 20571:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
      Pid: 36887, comm: ldlm_bl_62 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
      Call Trace:
       [<ffffffffc0a727ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
       [<ffffffffc0a728ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
       [<ffffffffc176f3ca>] ldlm_lock_put+0x67a/0x7b0 [ptlrpc]
       [<ffffffffc1773058>] ldlm_lock_match_with_skip+0x3b8/0x860 [ptlrpc]
       [<ffffffffc0d982d2>] osc_match_base+0x102/0x290 [osc]
       [<ffffffffc0da3dfc>] osc_obj_dlmlock_at_pgoff+0x14c/0x2c0 [osc]
       [<ffffffffc0d9c358>] osc_req_attr_set+0x128/0x610 [osc]
       [<ffffffffc1549b13>] cl_req_attr_set+0x63/0x160 [obdclass]
       [<ffffffffc0d969f3>] osc_build_rpc+0x483/0x1080 [osc]
       [<ffffffffc0db1cbd>] osc_io_unplug0+0xecd/0x19c0 [osc]
       [<ffffffffc0db6620>] osc_cache_writeback_range+0x9d0/0x1000 [osc]
       [<ffffffffc0da2195>] osc_lock_flush+0x195/0x290 [osc]
       [<ffffffffc0da2653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
       [<ffffffffc1776dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
       [<ffffffffc178e620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
       [<ffffffffc17943f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
       [<ffffffffc0da24ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
       [<ffffffffc17a0618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
       [<ffffffffc17a1230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
       [<ffffffffab4cca01>] kthread+0xd1/0xe0
       [<ffffffffabbbff5d>] ret_from_fork_nospec_begin+0x7/0x21
       [<ffffffffffffffff>] 0xffffffffffffffff
      Kernel panic - not syncing: LBUG
      CPU: 20 PID: 36887 Comm: ldlm_bl_62 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1

      From /tftpboot/dumps/192.168.64.82-2020-08-12-13:23:27/vmcore-dmesg.txt
      and /tftpboot/dumps/192.168.66.180-2020-08-12-16:39:36/vmcore-dmesg.txt
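
The two assertion messages from the first node can be illustrated with a toy model. The sketch below is not Lustre code: `ToyLock` and `run_unbalanced` are hypothetical names, and the order of the two checks (reference count first, flag check once the count reaches zero) is an assumption inferred from the line numbers 205 and 213 in the log. Bit 50 is taken directly from the assertion text; it appears to correspond to the "destroyed" lock flag, but that mapping should be checked against the Lustre headers. The model shows how one extra put on a lock that was never destroyed would trip the flag assertion first and the reference-count assertion on the next put.

```python
DESTROYED_BIT = 50                    # bit index taken from the assertion text
DESTROYED_MASK = 1 << DESTROYED_BIT   # same value as C's (1ULL << 50)


class ToyLock:
    """Hypothetical stand-in for a refcounted lock; not the Lustre struct."""

    def __init__(self):
        self.refc = 1      # reference held by the creator
        self.flags = 0

    def get(self):
        self.refc += 1

    def destroy(self):
        self.flags |= DESTROYED_MASK

    def put(self):
        # First check, modeled on the assertion reported at ldlm_lock.c:205.
        assert self.refc > 0, "refc > 0"
        self.refc -= 1
        if self.refc == 0:
            # Second check, modeled on the assertion reported at ldlm_lock.c:213.
            assert (self.flags & DESTROYED_MASK) != 0, "destroyed bit set"


def run_unbalanced():
    """Two references are held, but three puts happen and destroy() never runs."""
    lock = ToyLock()
    lock.get()                 # refc is now 2
    failures = []
    for _ in range(3):
        try:
            lock.put()
        except AssertionError as err:
            failures.append(str(err))
    return failures


if __name__ == "__main__":
    print(hex(DESTROYED_MASK))
    print(run_unbalanced())
```

In this model the second put drops the count to zero without the flag set, and the third put finds the count already at zero, which matches the order of the two messages in the first dump. It says nothing about which code path in the real client drops the extra reference.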

          Activity

            ofaaland Olaf Faaland added a comment -

            Next week we will finally get a build with the revert installed on the machine where we've seen the issue.

            pjones Peter Jones added a comment -

            This is believed to be a duplicate of LU-11719

            spiechurski Sebastien Piechurski added a comment -

            Hello,

            Do you have any feedback on whether reverting the targeted commit has had any effect in production?

            One of our customers has hit this 8 times in the past week.

            This would not be a problem if it affected only the crashed client, but in our case the OSS handling the lock never releases it until the crashed client remounts the filesystem (the OSS keeps retrying requests to the crashed client every 600 seconds, even hours after the crash), which sometimes results in an almost hung filesystem.

            I am surprised the client is never evicted. Is there a reason for this?

            ofaaland Olaf Faaland added a comment -

            For my tracking purposes: my internal ticket is TOSS4864
            pjones Peter Jones added a comment -

            I just noticed that Gerrit did not post an update to Jira for the revert on b2_12: https://review.whamcloud.com/#/c/39819/. Hopefully this is more convenient to test, ofaaland.

            ofaaland Olaf Faaland added a comment -

            Hi Zhenyu,
            Can you get the revert patch to pass Maloo, so I can more confidently test this on the production system where we see the error? We have not been able to reproduce this on our test systems.
            thanks

            ofaaland Olaf Faaland added a comment -

            Yes, I can try. It may take a while. Sorry for the delay answering your question.

            bobijam Zhenyu Xu added a comment -

            The commit mentioned looks like the culprit. Olaf, would it be possible to try the revert patch?


            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39693
            Subject: LU-13908 osc: revert "glimpse - search for active lock"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6caffe3423c184b9717c9a5307da503ae0fbba4e

            ofaaland Olaf Faaland added a comment -

            Peter,
            We have never seen this under an earlier version of Lustre 2.12.x, but this is the first 2.12 we deployed widely.
            We have never seen this under any 2.10.x version.
            thanks
            jhammond John Hammond added a comment -

            I agree that this is likely a duplicate of LU-13089. As Oleg notes there "except this time it's glimpse cb vs cancel cb race". Based on the functions changed and the time that this was first noticed I suspect that this was introduced by

            commit 2548cb9e32bfca897de577f88836629f72641369
            Author:     Patrick Farrell <pfarrell@whamcloud.com>
            AuthorDate: Mon Sep 9 11:56:07 2019 -0400
            Commit:     Oleg Drokin <green@whamcloud.com>
            CommitDate: Thu Dec 12 23:05:15 2019 +0000
            
                LU-11670 osc: glimpse - search for active lock
            

            Lustre-change: https://review.whamcloud.com/33660
            Reviewed-on: https://review.whamcloud.com/36406

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ofaaland Olaf Faaland