[LU-13908] ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed Created: 17/Aug/20 Updated: 25/Jan/21 Resolved: 25/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | llnl | ||
| Environment: |
2.12.4_5.chaos |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Description |
|
On compute node, ASSERT fails and node crashes. One node reports two failed ASSERTs in the dumped log:

LustreError: 13759:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
LustreError: 10188:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
Pid: 10188, comm: ldlm_bl_16 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
Call Trace:
[<ffffffffc0a637ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
[<ffffffffc0a638ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffc16cb366>] ldlm_lock_put+0x616/0x7b0 [ptlrpc]
[<ffffffffc0c5828b>] osc_extent_put+0x6b/0x320 [osc]
[<ffffffffc0c645fb>] osc_cache_wait_range+0x30b/0x960 [osc]
[<ffffffffc0c655ce>] osc_cache_writeback_range+0x97e/0x1000 [osc]
[<ffffffffc0c51195>] osc_lock_flush+0x195/0x290 [osc]
[<ffffffffc0c51653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
[<ffffffffc16d2dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[<ffffffffc16ea620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
[<ffffffffc16f03f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[<ffffffffc0c514ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
[<ffffffffc16fc618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
[<ffffffffc16fd230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
[<ffffffffbaccca01>] kthread+0xd1/0xe0
[<ffffffffbb3bff5d>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 53 PID: 10188 Comm: ldlm_bl_16 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018

The other reports the same ASSERT twice:

LustreError: 20571:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
LustreError: 36887:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
Pid: 36887, comm: ldlm_bl_62 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1 SMP Fri Apr 3 08:56:52 PDT 2020
Call Trace:
[<ffffffffc0a727ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
[<ffffffffc0a728ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffc176f3ca>] ldlm_lock_put+0x67a/0x7b0 [ptlrpc]
[<ffffffffc1773058>] ldlm_lock_match_with_skip+0x3b8/0x860 [ptlrpc]
[<ffffffffc0d982d2>] osc_match_base+0x102/0x290 [osc]
[<ffffffffc0da3dfc>] osc_obj_dlmlock_at_pgoff+0x14c/0x2c0 [osc]
[<ffffffffc0d9c358>] osc_req_attr_set+0x128/0x610 [osc]
[<ffffffffc1549b13>] cl_req_attr_set+0x63/0x160 [obdclass]
[<ffffffffc0d969f3>] osc_build_rpc+0x483/0x1080 [osc]
[<ffffffffc0db1cbd>] osc_io_unplug0+0xecd/0x19c0 [osc]
[<ffffffffc0db6620>] osc_cache_writeback_range+0x9d0/0x1000 [osc]
[<ffffffffc0da2195>] osc_lock_flush+0x195/0x290 [osc]
[<ffffffffc0da2653>] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
[<ffffffffc1776dea>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[<ffffffffc178e620>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
[<ffffffffc17943f7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[<ffffffffc0da24ea>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
[<ffffffffc17a0618>] ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
[<ffffffffc17a1230>] ldlm_bl_thread_main+0x820/0xa60 [ptlrpc]
[<ffffffffab4cca01>] kthread+0xd1/0xe0
[<ffffffffabbbff5d>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 20 PID: 36887 Comm: ldlm_bl_62 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1

From /tftpboot/dumps/192.168.64.82-2020-08-12-13:23:27/vmcore-dmesg.txt |
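Editor's note: to make the two assertions in the description easier to read, here is a minimal user-space sketch of the invariant they appear to guard. It is not the Lustre source. All `demo_*` and `DEMO_*` names are hypothetical, the reference count is a plain `int` rather than an atomic, and the ordering (the count check first, the bit-50 check only when the last reference is dropped) is an assumption inferred from the line numbers in the log (`ldlm_lock.c:205` versus `ldlm_lock.c:213`). Bit 50 of `l_flags` appears to correspond to `LDLM_FL_DESTROYED` in Lustre's DLM flag definitions; treat that as an assumption as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the flag tested in the log: (1ULL << 50). */
#define DEMO_FL_BIT50 (1ULL << 50)

/* Toy model of a lock: only the two fields named in the assertions. */
struct demo_lock {
    int      refc;   /* models l_refc (not atomic in this sketch) */
    uint64_t flags;  /* models l_flags */
};

/* Outcome of dropping one reference in the toy model. */
enum demo_put_result {
    DEMO_PUT_OK,          /* reference dropped, others remain */
    DEMO_PUT_FREED,       /* last reference dropped and bit 50 was set */
    DEMO_PUT_BAD_REFC,    /* models: ASSERTION( l_refc > 0 ) failed */
    DEMO_PUT_BIT50_CLEAR  /* models: ASSERTION( l_flags & (1ULL << 50) ) failed */
};

/* Drop one reference and report which check, if any, would have fired. */
static enum demo_put_result demo_lock_put(struct demo_lock *lock)
{
    if (lock->refc <= 0)
        return DEMO_PUT_BAD_REFC;       /* put without a matching get */

    if (--lock->refc > 0)
        return DEMO_PUT_OK;             /* not the last reference */

    /* Last reference: the model expects the lock to be flagged already. */
    if ((lock->flags & DEMO_FL_BIT50) == 0)
        return DEMO_PUT_BIT50_CLEAR;

    return DEMO_PUT_FREED;
}
```

In this model, one put too many on a lock that still has bit 50 clear first yields `DEMO_PUT_BIT50_CLEAR` and then, on the next put, `DEMO_PUT_BAD_REFC`. That is the same order in which the two assertions appear in the first node's log, which is consistent with (but does not prove) an unbalanced reference on the lock.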
| Comments |
| Comment by Olaf Faaland [ 17/Aug/20 ] |
|
Looks like dupe of |
| Comment by Olaf Faaland [ 17/Aug/20 ] |
|
For my reference, my local ticket is TOSS4864 |
| Comment by Olaf Faaland [ 17/Aug/20 ] |
|
Average about 10 crashes per week, although it varies widely. I do not know whether a particular workload triggers it. |
| Comment by Peter Jones [ 18/Aug/20 ] |
|
ofaaland has this issue ever been seen on an earlier version of 2.12.x or on 2.10.x? |
| Comment by Peter Jones [ 18/Aug/20 ] |
|
Bobijam Could you please investigate? Thanks Peter |
| Comment by John Hammond [ 18/Aug/20 ] |
|
I agree that this is likely a duplicate of commit 2548cb9e32bfca897de577f88836629f72641369
Author: Patrick Farrell <pfarrell@whamcloud.com>
AuthorDate: Mon Sep 9 11:56:07 2019 -0400
Commit: Oleg Drokin <green@whamcloud.com>
CommitDate: Thu Dec 12 23:05:15 2019 +0000
LU-11670 osc: glimpse - search for active lock
Lustre-change: https://review.whamcloud.com/33660 |
| Comment by Olaf Faaland [ 18/Aug/20 ] |
|
Peter, |
| Comment by Gerrit Updater [ 19/Aug/20 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39693 |
| Comment by Zhenyu Xu [ 19/Aug/20 ] |
|
The commit mentioned looks like the culprit. Olaf, would it be possible to try the revert patch? |
| Comment by Olaf Faaland [ 25/Aug/20 ] |
|
Yes, I can try. It may take a while. Sorry for the delay answering your question. |
| Comment by Olaf Faaland [ 31/Aug/20 ] |
|
Hi Zhenyu, |
| Comment by Peter Jones [ 05/Sep/20 ] |
|
I just noticed that Gerrit did not post an update to Jira for the revert on b2_12 - https://review.whamcloud.com/#/c/39819/ . Hopefully this is more convenient to test ofaaland |
| Comment by Olaf Faaland [ 08/Sep/20 ] |
|
For my tracking purposes: my internal ticket is TOSS4864 |
| Comment by Sebastien Piechurski [ 29/Sep/20 ] |
|
Hello, Do you have any feedback on whether reverting the targeted commit has had any effect in your production environment? One of our customers has hit this 8 times in the past week. This would not be a problem if it affected only the crashed client, but in our case the OSS handling the lock never releases it until the crashed client remounts the filesystem (the OSS keeps retrying to send requests to the crashed client every 600 seconds, even hours after the crash), which sometimes results in an almost hung filesystem. I am surprised that the client is never evicted. Would there be a reason for this? |
| Comment by Peter Jones [ 30/Oct/20 ] |
|
This is believed to be a duplicate of |
| Comment by Olaf Faaland [ 05/Nov/20 ] |
|
Next week we will finally get a build with the revert installed on the machine where we've seen the issue. |
| Comment by Peter Jones [ 05/Nov/20 ] |
|
Wouldn't you rather test the actual fix? |
| Comment by Olaf Faaland [ 05/Nov/20 ] |
|
Yes, but it's sadly complicated. |
| Comment by Olaf Faaland [ 25/Jan/21 ] |
|
In about 3 weeks we will have 2.12.6_3.llnl, which includes the fix from |