[LU-13089] ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed Created: 19/Dec/19 Updated: 21/Oct/22 Resolved: 26/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | llnl |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I hit this in b2_12-next testing, but nothing I see in there appears related, so I think this is just a super rare race that's pretty similar in nature to an earlier one, except this time it's a glimpse cb vs cancel cb race. The console gives us:

[19625.122592] LustreError: 8347:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
[19625.124370] LustreError: 8347:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
[19625.124704] LustreError: 8367:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
[19625.124705] LustreError: 8367:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
[19625.124706] Pid: 8367, comm: ldlm_bl_02 3.10.0-7.7-debug #1 SMP Wed Oct 30 09:43:00 EDT 2019
[19625.124707] Call Trace:
[19625.124739] [<ffffffffa01f97dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[19625.124744] [<ffffffffa01f988c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[19625.124796] [<ffffffffa0638197>] ldlm_lock_put+0x557/0x730 [ptlrpc]
[19625.124821] [<ffffffffa0639980>] ldlm_lock_destroy_nolock+0x50/0x110 [ptlrpc]
[19625.124839] [<ffffffffa063febf>] ldlm_lock_cancel+0x6f/0x1f0 [ptlrpc]
[19625.124866] [<ffffffffa0656a3a>] ldlm_cli_cancel_local+0xca/0x3f0 [ptlrpc]
[19625.124887] [<ffffffffa065c767>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[19625.124917] [<ffffffffa08a8b1a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
[19625.124949] [<ffffffffa0668835>] ldlm_handle_bl_callback+0xf5/0x4f0 [ptlrpc]
[19625.124977] [<ffffffffa06693e8>] ldlm_bl_thread_main+0x7b8/0x980 [ptlrpc]
[19625.125040] [<ffffffff810b8254>] kthread+0xe4/0xf0
[19625.125053] [<ffffffff817e0df7>] ret_from_fork_nospec_end+0x0/0x39
[19625.125063] [<ffffffffffffffff>] 0xffffffffffffffff
[19625.125067] Kernel panic - not syncing: LBUG

and the pid 8347 backtrace is:

#4 [ffff88003e313820] vt_console_print at ffffffff814dbe54
#5 [ffff88003e313880] call_console_drivers.constprop.17 at ffffffff8108cf00
#6 [ffff88003e3138a8] console_unlock at ffffffff8108e559
#7 [ffff88003e3138e8] vprintk_emit at ffffffff8108e984
#8 [ffff88003e313958] vprintk_default at ffffffff8108ed79
#9 [ffff88003e313968] printk at ffffffff817c67d1
#10 [ffff88003e3139c8] cfs_print_to_console at ffffffffa01f958a [libcfs]
#11 [ffff88003e3139f8] libcfs_debug_vmsg2 at ffffffffa01ff9e3 [libcfs]
#12 [ffff88003e313b48] libcfs_debug_msg at ffffffffa0200077 [libcfs]
#13 [ffff88003e313ba8] lbug_with_loc at ffffffffa01f9866 [libcfs]
#14 [ffff88003e313bc8] ldlm_lock_put at ffffffffa0638266 [ptlrpc]
#15 [ffff88003e313be8] osc_ldlm_glimpse_ast at ffffffffa08a8f28 [osc]
#16 [ffff88003e313ca0] ldlm_callback_handler at ffffffffa066a0a8 [ptlrpc]
#17 [ffff88003e313d18] ldlm_callback_handler at ffffffffa066b3c7 [ptlrpc]
#18 [ffff88003e313d30] ptlrpc_server_handle_request at ffffffffa0699266 [ptlrpc]
#19 [ffff88003e313dd0] ptlrpc_main at ffffffffa069d261 [ptlrpc]
#20 [ffff88003e313ea8] kthread at ffffffff810b8254 |
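For reference when reading the raw assertion text: bit 50 of l_flags (the 1ULL << 50 in the LASSERT) corresponds to the lock's "destroyed" flag (LDLM_FL_DESTROYED), and the companion assertion at ldlm_lock.c:205 checks that the reference count is still positive. Below is a minimal user-space sketch of the two invariants the console shows being violated; it uses simplified, hypothetical names and is not the real ldlm_lock_put() code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define FL_DESTROYED (1ULL << 50)   /* the bit tested by the failed LASSERT */

struct demo_lock {
	atomic_int l_refc;          /* reference count (ldlm_lock.c:205 analogue) */
	uint64_t   l_flags;         /* flag word (ldlm_lock.c:213 analogue) */
};

/* Drop one reference; free the lock on the final put. */
static void demo_lock_put(struct demo_lock *lock)
{
	/* a put must never run on a lock whose refcount already hit 0 */
	assert(atomic_load(&lock->l_refc) > 0);
	if (atomic_fetch_sub(&lock->l_refc, 1) == 1) {
		/* the final put may only free a lock already marked destroyed */
		assert(lock->l_flags & FL_DESTROYED);
		free(lock);
	}
}
```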
| Comments |
| Comment by Oleg Drokin [ 19/Dec/19 ] |
|
Attached is lustre.txt, the Lustre debug log extracted from the core file, where you can clearly see the race unfolding. |
| Comment by Zhenyu Xu [ 20/Dec/19 ] |
|
This looks more like a lock reference-counting issue than a race over setting the lock's flags. From the log, the lock was requested to be cancelled; process 8367 handled the request and was trying to flush the data covered by the lock. At that point the lock's reference accounting looks OK:

428786 00010000:00010000:3.0:1576584703.467448:0:8367:0:(ldlm_lockd.c:1730:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: ?? lock: ffff880012bd5d80/0x232daeb61003014b lrc: 3/0,0 mode: PW/PW res: ?? rrc=?? type: ??? flags: 0x420000020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
428787 00010000:00010000:3.0:1576584703.467450:0:8367:0:(ldlm_lockd.c:1761:ldlm_handle_bl_callback()) Lock ffff880012bd5d80 already unused, calling callback (ffffffffa08a89a0)
428788 00010000:00010000:3.0:1576584703.467452:0:8367:0:(ldlm_request.c:1268:ldlm_cli_cancel_local()) ### client-side cancel ns: ?? lock: ffff880012bd5d80/0x232daeb61003014b lrc: 4/0,0 mode: PW/PW res:?? rrc=?? type: ??? flags: 0x428400020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
428790 00000008:00000020:3.0:1576584703.467457:0:8367:0:(osc_cache.c:3008:osc_cache_wait_range()) obj ffff88002b179e60 ready 0|-|- wr 0|-|- rd 0|- sync file range.
428791 00000008:00000020:3.0:1576584703.467459:0:8367:0:(osc_cache.c:3140:osc_cache_writeback_range()) obj ffff88002b179e60 ready 0|-|- wr 0|-|- rd 0|- pageout [224, 479], 0.
428792 00000008:00000020:3.0:1576584703.467461:0:8367:0:(osc_lock.c:375:osc_lock_flush()) object ffff88002b179e60: [224 -> 479] 0 pages were written back.
428794 00000080:00200000:3.0:1576584703.467465:0:8367:0:(vvp_io.c:1469:vvp_io_init()) [0x200000402:0x70c0:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0

Meanwhile process 8347, handling the glimpse callback, tripped over the lock before the flush finished, but strangely its refcount had already dropped to 0 by then, and the lock's destroy flag had certainly not been set yet:

428898 00010000:00010000:1.0:1576584703.468909:0:8347:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0001-osc-ffff88003f2f8800 lock: ffff880012bd5d80/0x232daeb61003014b lrc: 0/0,0 mode: PW/PW res: [0x12cb:0x0:0x0].0x0 rrc: 5 type: EXT [917504->1966079] (req 917504->983039) flags: 0x429400020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
428899 00010000:00040000:1.0:1576584703.468914:0:8347:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:

Process 8367 then resumed the flush I/O and ran into the lock's erroneous refcount:

428913 00000080:00200000:3.0:1576584703.471016:0:8367:0:(vvp_io.c:312:vvp_io_fini()) [0x200000402:0x70c0:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0
428914 00000008:00000020:3.0:1576584703.471021:0:8367:0:(osc_object.c:183:osc_attr_update()) set kms from 1966080to 0
428915 00010000:00040000:3.0:1576584703.471027:0:8367:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed:
428916 00010000:00040000:3.0:1576584703.471030:0:8367:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG |
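To make the interleaving concrete, here is a toy timeline built on the demo_lock_put() sketch shown after the description above. It assumes the simplest scenario consistent with these log lines, an unbalanced put (or missing get) on the glimpse path; that assumption is only illustrative, not an established root cause:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* struct demo_lock and demo_lock_put() as defined in the sketch after the
 * description above */

int main(void)
{
	struct demo_lock *lock = calloc(1, sizeof(*lock));

	/* "lrc: 3" in line 428786: roughly namespace + resource + bl handler */
	atomic_store(&lock->l_refc, 3);

	/* pid 8367 (ldlm_bl_02) starts the client-side cancel and goes off to
	 * flush the pages covered by the lock, still holding its reference */

	/* the glimpse path (pid 8347) plus earlier releases drive the count to
	 * 0 while pid 8367 is still flushing; the last of these puts is the
	 * "final lock_put" of line 428898, and because the lock was never
	 * marked DESTROYED it trips the bit-50 LASSERT (line 428899) */
	demo_lock_put(lock);
	demo_lock_put(lock);
	demo_lock_put(lock);   /* fires: l_flags has no FL_DESTROYED set */

	/* pid 8367 finishes the flush and does its own put on a lock whose
	 * refcount is already 0, tripping ldlm_lock.c:205 (line 428915) */
	demo_lock_put(lock);   /* would fire: l_refc > 0 */
	return 0;
}
```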
| Comment by Lukasz Flis [ 20/Jul/20 ] |
|
CYFRONET here. Since migrating to 2.12 we have noticed that some types of HPC jobs crash the nodes (a few crashes a month across 2200 compute nodes). Until today we were not able to reproduce the issue; now we believe we have a workload which triggers it every time it is run.

[2195888.180647] LustreError: 3966:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:

The client version is 2.12.4. If you need a debug log for any subsystem or other debug data, please let us know; we'd be happy to provide more information that could help with the diagnosis.
Best Regards – Lukasz Flis |
| Comment by Olaf Faaland [ 13/Aug/20 ] |
|
We are seeing this regularly at LLNL with Lustre 2.12.4_5.chaos. |
| Comment by Gerrit Updater [ 03/Sep/20 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39819 |
| Comment by Lukasz Flis [ 11/Sep/20 ] |
|
Applying https://review.whamcloud.com/39819 fixed the problem at CYFRONET.
|
| Comment by Kim Sebo [ 13/Sep/20 ] |
|
We at ANU/NCI are also seeing this assert, ~10 times per month on 3000 clients. Clients are CentOS 8 / 2.12.5 + some backports. |
| Comment by Gerrit Updater [ 17/Oct/20 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a patch port: https://review.whamcloud.com/40399
search_itree() had a glitch that messed up data->lmd_lock. |
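Purely as an illustration of the class of bug that phrase describes (hypothetical names, not the actual search_itree()/lmd_lock code; the linked patch is the authoritative fix): a search callback that publishes its candidate into the shared match data before every check has passed lets a later, rejected candidate overwrite the real match.

```c
#include <stdbool.h>
#include <stddef.h>

struct lock { int mode; };          /* hypothetical, illustration only */

struct match_data {
	int          wanted_mode;
	struct lock *found;         /* plays the role of data->lmd_lock */
};

/* Buggy shape: records the candidate before knowing it qualifies, so a
 * later non-matching candidate clobbers an earlier valid match. */
static bool match_cb_buggy(struct lock *cand, struct match_data *d)
{
	d->found = cand;
	return cand->mode == d->wanted_mode;
}

/* Safer shape: only publish the candidate once it fully matches. */
static bool match_cb(struct lock *cand, struct match_data *d)
{
	if (cand->mode != d->wanted_mode)
		return false;
	d->found = cand;
	return true;
}
```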
| Comment by Gerrit Updater [ 18/Oct/20 ] |
|
Bobi Jam (bobijam@hotmail.com) abandoned the patch: https://review.whamcloud.com/40286 |
| Comment by Peter Jones [ 26/Oct/20 ] |
|
Believed to be a duplicate of |