[LU-13908] ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed Created: 17/Aug/20  Updated: 25/Jan/21  Resolved: 25/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Zhenyu Xu
Resolution: Fixed Votes: 1
Labels: llnl
Environment:

2.12.4_5.chaos
toss 3.6-2 (RHEL 7.8)


Issue Links:
Related
is related to LU-13089 ASSERTION( (((( lock))->l_flags & (1U... Resolved
is related to LU-11719 Refactor search_itree Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On a compute node, the ASSERT fails and the node crashes. One node reports two failed ASSERTs in the dumped log:

LustreError: 13759:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
Pid: 10188, comm: ldlm_bl_16 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
Call Trace:
 [<ffffffffc0a637ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
 [<ffffffffc0a638ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
 [<ffffffffc16cb366>] ldlm_lock_put+0x616/0x7b0 [ptlrpc]
 [<ffffffffc0c5828b>] osc_extent_put+0x6b/0x320 [osc]
 [<ffffffffc0c645fb>] osc_cache_wait_range+0x30b/0x960 [osc]
 [<ffffffffc0c655ce>] osc_cache_writeback_range+0x97e/0x1000 [osc]
 [<ffffffffc0c51195>] osc_lock_flush+0x195/0x290 [osc]
 [<ffffffffc0c51653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
 [<ffffffffc16d2dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
 [<ffffffffc16ea620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
 [<ffffffffc16f03f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
 [<ffffffffc0c514ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
 [<ffffffffc16fc618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
 [<ffffffffc16fd230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
 [<ffffffffbaccca01>] kthread+0xd1/0xe0
 [<ffffffffbb3bff5d>] ret_from_fork_nospec_begin+0x7/0x21
 [<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 53 PID: 10188 Comm: ldlm_bl_16 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018

The other node reports the same ASSERT twice:

LustreError: 20571:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
Pid: 36887, comm: ldlm_bl_62 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
Call Trace:
 [<ffffffffc0a727ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
 [<ffffffffc0a728ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
 [<ffffffffc176f3ca>] ldlm_lock_put+0x67a/0x7b0 [ptlrpc]
 [<ffffffffc1773058>] ldlm_lock_match_with_skip+0x3b8/0x860 [ptlrpc]
 [<ffffffffc0d982d2>] osc_match_base+0x102/0x290 [osc]
 [<ffffffffc0da3dfc>] osc_obj_dlmlock_at_pgoff+0x14c/0x2c0 [osc]
 [<ffffffffc0d9c358>] osc_req_attr_set+0x128/0x610 [osc]
 [<ffffffffc1549b13>] cl_req_attr_set+0x63/0x160 [obdclass]
 [<ffffffffc0d969f3>] osc_build_rpc+0x483/0x1080 [osc]
 [<ffffffffc0db1cbd>] osc_io_unplug0+0xecd/0x19c0 [osc]
 [<ffffffffc0db6620>] osc_cache_writeback_range+0x9d0/0x1000 [osc]
 [<ffffffffc0da2195>] osc_lock_flush+0x195/0x290 [osc]
 [<ffffffffc0da2653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
 [<ffffffffc1776dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
 [<ffffffffc178e620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
 [<ffffffffc17943f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
 [<ffffffffc0da24ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
 [<ffffffffc17a0618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
 [<ffffffffc17a1230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
 [<ffffffffab4cca01>] kthread+0xd1/0xe0
 [<ffffffffabbbff5d>] ret_from_fork_nospec_begin+0x7/0x21
 [<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 20 PID: 36887 Comm: ldlm_bl_62 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1

From /tftpboot/dumps/192.168.64.82-2020-08-12-13:23:27/vmcore-dmesg.txt
and /tftpboot/dumps/192.168.66.180-2020-08-12-16:39:36/vmcore-dmesg.txt
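
For context, the two assertion messages correspond to the reference-count check and the destroyed-flag check in ldlm_lock_put(); bit 50 of l_flags is LDLM_FL_DESTROYED. A minimal sketch of those checks, paraphrased from the logged expressions rather than quoted verbatim from the 2.12.4 source:

/* Sketch only, paraphrased from the logged assertions; not verbatim source. */
void ldlm_lock_put(struct ldlm_lock *lock)
{
        /* ldlm_lock.c:205 - a put must never see a refcount that is already 0 */
        LASSERT(atomic_read(&lock->l_refc) > 0);

        if (atomic_dec_and_test(&lock->l_refc)) {
                /* ldlm_lock.c:213 - the final reference may only be dropped once
                 * the lock has been marked destroyed (LDLM_FL_DESTROYED, bit 50) */
                LASSERT((lock->l_flags & (1ULL << 50)) != 0);
                /* ... tear down and free the lock ... */
        }
}

Either assertion firing means a reference was dropped on a lock whose lifetime bookkeeping had already gone wrong, e.g. a concurrent final put, or a final put on a lock that was never marked destroyed.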



 Comments   
Comment by Olaf Faaland [ 17/Aug/20 ]

Looks like a dupe of LU-13089

Comment by Olaf Faaland [ 17/Aug/20 ]

For my reference, my local ticket is TOSS4864

Comment by Olaf Faaland [ 17/Aug/20 ]

We average about 10 crashes per week, although the rate varies widely. I do not know whether a particular workload triggers it.

Comment by Peter Jones [ 18/Aug/20 ]

ofaaland, has this issue ever been seen on an earlier version of 2.12.x or on 2.10.x?

Comment by Peter Jones [ 18/Aug/20 ]

Bobijam

Could you please investigate?

Thanks

Peter

Comment by John Hammond [ 18/Aug/20 ]

I agree that this is likely a duplicate of LU-13089. As Oleg notes there, "except this time it's glimpse cb vs cancel cb race". Based on the functions changed and the time that this was first noticed, I suspect that this was introduced by:

commit 2548cb9e32bfca897de577f88836629f72641369
Author:     Patrick Farrell <pfarrell@whamcloud.com>
AuthorDate: Mon Sep 9 11:56:07 2019 -0400
Commit:     Oleg Drokin <green@whamcloud.com>
CommitDate: Thu Dec 12 23:05:15 2019 +0000

    LU-11670 osc: glimpse - search for active lock

Lustre-change: https://review.whamcloud.com/33660
Reviewed-on: https://review.whamcloud.com/36406
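
To illustrate the kind of glimpse-vs-cancel interleaving Oleg describes, here is a hypothetical timeline, for illustration only; the thread roles and ordering are assumptions, not reconstructed from these dumps:

/*   match/glimpse side                        cancel side
 *   ------------------                        -----------
 *   ldlm_lock_match_with_skip()
 *     finds lock L in the resource and
 *     takes a reference
 *                                              blocking AST cancels L and drops
 *                                              what it believes is the final
 *                                              reference
 *   decides L is not usable and calls
 *   ldlm_lock_put(L)
 *     -> the refcount reaches zero on a lock
 *        not (yet) marked LDLM_FL_DESTROYED,
 *        or goes below zero, and one of the
 *        two LASSERTs above fires
 */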

Comment by Olaf Faaland [ 18/Aug/20 ]

Peter,
We have never seen this under an earlier version of Lustre 2.12.x, but this is the first 2.12 we deployed widely.
We have never seen this under any 2.10.x version.
thanks

Comment by Gerrit Updater [ 19/Aug/20 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39693
Subject: LU-13908 osc: revert "glimpse - search for active lock"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6caffe3423c184b9717c9a5307da503ae0fbba4e

Comment by Zhenyu Xu [ 19/Aug/20 ]

The commit mentioned looks like the culprit. Olaf, would it be possible for you to try the revert patch?

Comment by Olaf Faaland [ 25/Aug/20 ]

Yes, I can try. It may take a while. Sorry for the delay in answering your question.

Comment by Olaf Faaland [ 31/Aug/20 ]

Hi Zhenyu,
Can you get the revert patch to pass Maloo, so that I can test it more confidently on the production system where we see the error? We have not been able to reproduce this on our test systems.
thanks

Comment by Peter Jones [ 05/Sep/20 ]

I just noticed that Gerrit did not post an update to Jira for the revert on b2_12 - https://review.whamcloud.com/#/c/39819/ . Hopefully this is more convenient for you to test, ofaaland

Comment by Olaf Faaland [ 08/Sep/20 ]

For my tracking purposes: my internal ticket is TOSS4864

Comment by Sebastien Piechurski [ 29/Sep/20 ]

Hello, 

Do you have any feedback on whether the revert of the targeted commit has had any effect on your production systems?

One of our customers has hit this 8 times in the past week.

This would not be a problem if it affected only the crashed client, but in our case the OSS handling the lock never releases it until the crashed client remounts the filesystem (the OSS keeps retrying requests to the crashed client every 600 seconds, even hours after the crash), which sometimes results in an almost hung filesystem.

I am surprised the client is never evicted; is there a reason for this?

Comment by Peter Jones [ 30/Oct/20 ]

This is believed to be a duplicate of LU-11719

Comment by Olaf Faaland [ 05/Nov/20 ]

Next week we will finally get a build with the revert installed on the machine where we've seen the issue.

Comment by Peter Jones [ 05/Nov/20 ]

Wouldn't you rather test the actual fix?

Comment by Olaf Faaland [ 05/Nov/20 ]

Yes, but it's sadly complicated.

Comment by Olaf Faaland [ 25/Jan/21 ]

In about 3 weeks we will have 2.12.6_3.llnl, which includes the fix from LU-11719, on the clusters where we've seen this.
