  1. Lustre
  2. LU-13908

ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.12.4
    • 2.12.4_5.chaos
      toss 3.6-2 (RHEL 7.8)
    • 3

    Description

      On a compute node, an ASSERT fails and the node crashes. One node reports two failed ASSERTs in the dumped log:

      LustreError: 13759:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
      LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
      Pid: 10188, comm: ldlm_bl_16 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
      Call Trace:
       [<ffffffffc0a637ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
       [<ffffffffc0a638ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
       [<ffffffffc16cb366>] ldlm_lock_put+0x616/0x7b0 [ptlrpc]
       [<ffffffffc0c5828b>] osc_extent_put+0x6b/0x320 [osc]
       [<ffffffffc0c645fb>] osc_cache_wait_range+0x30b/0x960 [osc]
       [<ffffffffc0c655ce>] osc_cache_writeback_range+0x97e/0x1000 [osc]
       [<ffffffffc0c51195>] osc_lock_flush+0x195/0x290 [osc]
       [<ffffffffc0c51653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
       [<ffffffffc16d2dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
       [<ffffffffc16ea620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
       [<ffffffffc16f03f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
       [<ffffffffc0c514ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
       [<ffffffffc16fc618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
       [<ffffffffc16fd230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
       [<ffffffffbaccca01>] kthread+0xd1/0xe0
       [<ffffffffbb3bff5d>] ret_from_fork_nospec_begin+0x7/0x21
       [<ffffffffffffffff>] 0xffffffffffffffff
      Kernel panic - not syncing: LBUG
      CPU: 53 PID: 10188 Comm: ldlm_bl_16 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
      Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018
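
Note on the two checks in the log above: the following is a toy Python model, not Lustre code, of a reference-counted put guarded by the two conditions named in the assertion messages. `ToyLock`, `toy_lock_put` and `DESTROYED_MASK` are hypothetical names. Reading bit 50 as a "lock destroyed" flag is an assumption based on Lustre's flag naming and is not stated in this ticket. The sketch only shows that unbalanced puts on one lock could trip the check at line 213 first and the check at line 205 next, which is the order seen in this log. It does not claim to be the root cause.

```python
# Toy model only: names and semantics are illustrative assumptions.
DESTROYED_MASK = 1 << 50  # the mask from the assertion text, (1ULL << 50)


class ToyLock:
    def __init__(self, refc=1, flags=0):
        self.refc = refc    # stands in for lock->l_refc
        self.flags = flags  # stands in for lock->l_flags


def toy_lock_put(lock):
    """Drop one reference; return the name of the failed check, or None."""
    # Counterpart of the check reported at ldlm_lock.c:205.
    if lock.refc <= 0:
        return "refc > 0"
    lock.refc -= 1
    # Counterpart of the check reported at ldlm_lock.c:213: the last
    # reference should only go away once the flag bit has been set.
    if lock.refc == 0 and (lock.flags & DESTROYED_MASK) == 0:
        return "flag bit 50 set"
    return None


# One remaining reference, flag not set, then two puts in a row.
lock = ToyLock(refc=1, flags=0)
results = [toy_lock_put(lock), toy_lock_put(lock)]
print(hex(DESTROYED_MASK))
print(results)
```

In this model the first extra put fails the flag check and the second fails the reference-count check. A real kernel would involve concurrent threads and atomics, which this sketch deliberately leaves out.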
      
      

      The other reports the same ASSERT twice:

      LustreError: 20571:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
      Pid: 36887, comm: ldlm_bl_62 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
      Call Trace:
       [<ffffffffc0a727ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
       [<ffffffffc0a728ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
       [<ffffffffc176f3ca>] ldlm_lock_put+0x67a/0x7b0 [ptlrpc]
       [<ffffffffc1773058>] ldlm_lock_match_with_skip+0x3b8/0x860 [ptlrpc]
       [<ffffffffc0d982d2>] osc_match_base+0x102/0x290 [osc]
       [<ffffffffc0da3dfc>] osc_obj_dlmlock_at_pgoff+0x14c/0x2c0 [osc]
       [<ffffffffc0d9c358>] osc_req_attr_set+0x128/0x610 [osc]
       [<ffffffffc1549b13>] cl_req_attr_set+0x63/0x160 [obdclass]
       [<ffffffffc0d969f3>] osc_build_rpc+0x483/0x1080 [osc]
       [<ffffffffc0db1cbd>] osc_io_unplug0+0xecd/0x19c0 [osc]
       [<ffffffffc0db6620>] osc_cache_writeback_range+0x9d0/0x1000 [osc]
       [<ffffffffc0da2195>] osc_lock_flush+0x195/0x290 [osc]
       [<ffffffffc0da2653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
       [<ffffffffc1776dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
       [<ffffffffc178e620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
       [<ffffffffc17943f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
       [<ffffffffc0da24ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
       [<ffffffffc17a0618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
       [<ffffffffc17a1230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
       [<ffffffffab4cca01>] kthread+0xd1/0xe0
       [<ffffffffabbbff5d>] ret_from_fork_nospec_begin+0x7/0x21
       [<ffffffffffffffff>] 0xffffffffffffffff
      Kernel panic - not syncing: LBUG
      CPU: 20 PID: 36887 Comm: ldlm_bl_62 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1

      From /tftpboot/dumps/192.168.64.82-2020-08-12-13:23:27/vmcore-dmesg.txt
      and /tftpboot/dumps/192.168.66.180-2020-08-12-16:39:36/vmcore-dmesg.txt
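
When comparing several vmcore-dmesg.txt files like the two above, it can help to group the LustreError lines by PID and see which thread actually hit the LBUG. The following is a minimal sketch that assumes the line layout shown in this ticket (`LustreError: PID:0:(file:line:func()) message`). `LINE_RE`, `summarize` and `lbug_pids` are hypothetical helper names, not part of any Lustre tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
LINE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def summarize(log_text):
    """Map each PID to a list of (source line number, message) tuples."""
    events = defaultdict(list)
    for raw in log_text.splitlines():
        m = LINE_RE.search(raw)
        if m:
            events[int(m["pid"])].append((int(m["line"]), m["msg"].strip()))
    return dict(events)


def lbug_pids(events):
    """Return the PIDs that logged an LBUG, sorted."""
    return sorted(
        pid for pid, evs in events.items()
        if any(msg == "LBUG" for _, msg in evs)
    )


# The three LustreError lines from the first node's log in this ticket.
SAMPLE = """\
LustreError: 13759:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
"""

events = summarize(SAMPLE)
print(lbug_pids(events))
```

To use it on a real dump, pass the contents of a vmcore-dmesg.txt file to `summarize` in place of `SAMPLE`.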

          Activity

            ofaaland Olaf Faaland added a comment -

            In about 3 weeks we will have 2.12.6_3.llnl, which includes the fix from LU-11719, on the clusters where we've seen this.

            ofaaland Olaf Faaland added a comment -

            Yes, but it's sadly complicated.

            pjones Peter Jones added a comment -

            Wouldn't you rather test the actual fix?

            ofaaland Olaf Faaland added a comment -

            Next week we will finally get a build with the revert installed on the machine where we've seen the issue.

            pjones Peter Jones added a comment -

            This is believed to be a duplicate of LU-11719

            spiechurski Sebastien Piechurski added a comment -

            Hello,

            Do you have any feedback on whether reverting the targeted commit has had any effect in your production environment?

            One of our customers has hit this 8 times in the past week.

            This would not be a problem if it affected only the crashed client, but in our case the OSS handling the lock never releases it until the crashed client remounts the filesystem (the OSS keeps retrying to send requests to the crashed client every 600 seconds, even hours after the crash), which sometimes results in an almost hung filesystem.

            I am surprised the client is never evicted. Is there a reason for this?

            ofaaland Olaf Faaland added a comment -

            For my tracking purposes: my internal ticket is TOSS4864

            pjones Peter Jones added a comment -

            I just noticed that Gerrit did not post an update to Jira for the revert on b2_12 - https://review.whamcloud.com/#/c/39819/ . Hopefully this is more convenient to test ofaaland

            ofaaland Olaf Faaland added a comment -

            Hi Zhenyu,
            Can you get the revert patch to pass Maloo, so I can more confidently test this on the production system where we see the error? We have not been able to reproduce this on our test systems.
            thanks

            ofaaland Olaf Faaland added a comment -

            Yes, I can try. It may take a while. Sorry for the delay answering your question.

            People

              bobijam Zhenyu Xu
              ofaaland Olaf Faaland