[LU-13089] ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed Created: 19/Dec/19  Updated: 21/Oct/22  Resolved: 26/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Duplicate Votes: 2
Labels: llnl

Attachments: Text File lustre.txt    
Issue Links:
Related
is related to LU-11670 Incorrect size when using lockahead Resolved
is related to LU-13908 ldlm_lock_put()) ASSERTION( (((( lock... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I hit this in b2_12-next testing, but nothing I see in there appears related, so I think this is just a super rare race that's pretty similar in nature to LU-4269,

except this time it's a glimpse callback vs. cancel callback race.

console gives us:

[19625.122592] LustreError: 8347:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed: 
[19625.124370] LustreError: 8347:0:(ldlm_lock.c:213:ldlm_lock_put()) LBUG
[19625.124704] LustreError: 8367:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed: 
[19625.124705] LustreError: 8367:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
[19625.124706] Pid: 8367, comm: ldlm_bl_02 3.10.0-7.7-debug #1 SMP Wed Oct 30 09:43:00 EDT 2019
[19625.124707] Call Trace:
[19625.124739]  [<ffffffffa01f97dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[19625.124744]  [<ffffffffa01f988c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[19625.124796]  [<ffffffffa0638197>] ldlm_lock_put+0x557/0x730 [ptlrpc]
[19625.124821]  [<ffffffffa0639980>] ldlm_lock_destroy_nolock+0x50/0x110 [ptlrpc]
[19625.124839]  [<ffffffffa063febf>] ldlm_lock_cancel+0x6f/0x1f0 [ptlrpc]
[19625.124866]  [<ffffffffa0656a3a>] ldlm_cli_cancel_local+0xca/0x3f0 [ptlrpc]
[19625.124887]  [<ffffffffa065c767>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[19625.124917]  [<ffffffffa08a8b1a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
[19625.124949]  [<ffffffffa0668835>] ldlm_handle_bl_callback+0xf5/0x4f0 [ptlrpc]
[19625.124977]  [<ffffffffa06693e8>] ldlm_bl_thread_main+0x7b8/0x980 [ptlrpc]
[19625.125040]  [<ffffffff810b8254>] kthread+0xe4/0xf0
[19625.125053]  [<ffffffff817e0df7>] ret_from_fork_nospec_end+0x0/0x39
[19625.125063]  [<ffffffffffffffff>] 0xffffffffffffffff
[19625.125067] Kernel panic - not syncing: LBUG

and the pid 8347 backtrace is

 #4 [ffff88003e313820] vt_console_print at ffffffff814dbe54
 #5 [ffff88003e313880] call_console_drivers.constprop.17 at ffffffff8108cf00
 #6 [ffff88003e3138a8] console_unlock at ffffffff8108e559
 #7 [ffff88003e3138e8] vprintk_emit at ffffffff8108e984
 #8 [ffff88003e313958] vprintk_default at ffffffff8108ed79
 #9 [ffff88003e313968] printk at ffffffff817c67d1
#10 [ffff88003e3139c8] cfs_print_to_console at ffffffffa01f958a [libcfs]
#11 [ffff88003e3139f8] libcfs_debug_vmsg2 at ffffffffa01ff9e3 [libcfs]
#12 [ffff88003e313b48] libcfs_debug_msg at ffffffffa0200077 [libcfs]
#13 [ffff88003e313ba8] lbug_with_loc at ffffffffa01f9866 [libcfs]
#14 [ffff88003e313bc8] ldlm_lock_put at ffffffffa0638266 [ptlrpc]
#15 [ffff88003e313be8] osc_ldlm_glimpse_ast at ffffffffa08a8f28 [osc]
#16 [ffff88003e313ca0] ldlm_callback_handler at ffffffffa066a0a8 [ptlrpc]
#17 [ffff88003e313d18] ldlm_callback_handler at ffffffffa066b3c7 [ptlrpc]
#18 [ffff88003e313d30] ptlrpc_server_handle_request at ffffffffa0699266 [ptlrpc]
#19 [ffff88003e313dd0] ptlrpc_main at ffffffffa069d261 [ptlrpc]
#20 [ffff88003e313ea8] kthread at ffffffff810b8254
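
For reference when reading the two LBUGs above: bit 50 of l_flags is the lock's "destroyed" flag (LDLM_FL_DESTROYED, assuming the stock b2_12 flag numbering), so the assertion at ldlm_lock.c:213 requires that the final reference only ever be dropped on a lock already marked destroyed, while the one at ldlm_lock.c:205 is the plain "refcount still positive" sanity check. A minimal standalone sketch of the two checks, paraphrased from the assertion text rather than copied from ldlm_lock.c:

/* Minimal standalone sketch of the two ldlm_lock_put() checks above.
 * Paraphrased from the assertion text, not copied from ldlm_lock.c.
 * Assumes bit 50 of l_flags is the lock's "destroyed" flag. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LDLM_FL_DESTROYED	(1ULL << 50)

struct lock_model {
	atomic_int l_refc;	/* stand-in for lock->l_refc */
	uint64_t   l_flags;	/* stand-in for lock->l_flags */
};

static void lock_put(struct lock_model *lock)
{
	/* ldlm_lock.c:205 -- the check pid 8367 (ldlm_bl_02) tripped:
	 * a reference was dropped when the count was already zero. */
	assert(atomic_load(&lock->l_refc) > 0);

	if (atomic_fetch_sub(&lock->l_refc, 1) == 1) {
		/* ldlm_lock.c:213 -- the check pid 8347 tripped: the last
		 * reference went away on a lock never marked destroyed. */
		assert(lock->l_flags & LDLM_FL_DESTROYED);
		/* ...free the lock... */
	}
}

int main(void)
{
	/* Correct life cycle: mark the lock destroyed before the final put,
	 * so neither assertion fires.  In the crash above, the lock reached
	 * its final put with the destroyed bit still clear. */
	struct lock_model lock = { .l_flags = 0 };
	atomic_init(&lock.l_refc, 1);
	lock.l_flags |= LDLM_FL_DESTROYED;
	lock_put(&lock);
	return 0;
}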


 Comments   
Comment by Oleg Drokin [ 19/Dec/19 ]

Attached is lustre.txt, the Lustre debug log extracted from the core file, where you can clearly see the race unfolding.

Comment by Zhenyu Xu [ 20/Dec/19 ]

This looks more like a lock reference counting issue than a race in setting the lock's flags.

From the log, the lock was requested to be cancelled; process 8367 handled that and was flushing the data covered by the lock. The lock's reference count looks okay at this point.

428786 00010000:00010000:3.0:1576584703.467448:0:8367:0:(ldlm_lockd.c:1730:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: ?? lock: ffff880012bd5d80/0x232daeb61003014b lrc: 3/0,0 mode: PW/PW res: ?? rrc=?? type: ??? flags: 0x420000020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
428787 00010000:00010000:3.0:1576584703.467450:0:8367:0:(ldlm_lockd.c:1761:ldlm_handle_bl_callback()) Lock ffff880012bd5d80 already unused, calling callback (ffffffffa08a89a0)
428788 00010000:00010000:3.0:1576584703.467452:0:8367:0:(ldlm_request.c:1268:ldlm_cli_cancel_local()) ### client-side cancel ns: ?? lock: ffff880012bd5d80/0x232daeb61003014b lrc: 4/0,0 mode: PW/PW res:?? rrc=?? type: ??? flags: 0x428400020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
428790 00000008:00000020:3.0:1576584703.467457:0:8367:0:(osc_cache.c:3008:osc_cache_wait_range()) obj ffff88002b179e60 ready 0|-|- wr 0|-|- rd 0|- sync file range.
428791 00000008:00000020:3.0:1576584703.467459:0:8367:0:(osc_cache.c:3140:osc_cache_writeback_range()) obj ffff88002b179e60 ready 0|-|- wr 0|-|- rd 0|- pageout [224, 479], 0.                            
428792 00000008:00000020:3.0:1576584703.467461:0:8367:0:(osc_lock.c:375:osc_lock_flush()) object ffff88002b179e60: [224 -> 479] 0 pages were written back.
428794 00000080:00200000:3.0:1576584703.467465:0:8367:0:(vvp_io.c:1469:vvp_io_init()) [0x200000402:0x70c0:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0

Meanwhile process 8347, handling the glimpse callback, tripped over the lock before the flush finished; its ref count had strangely already dropped to 0 by then, even though the lock's destroyed flag had certainly not been set yet.

428898 00010000:00010000:1.0:1576584703.468909:0:8347:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0001-osc-ffff88003f2f8800 lock: ffff880012bd5d80/0x232daeb61003014b lrc: 0/0,0 mode: PW/PW res: [0x12cb:0x0:0x0].0x0 rrc: 5 type: EXT [917504->1966079] (req 917504->983039) flags: 0x429400020000 nid: local remote: 0x232daeb610030278 expref: -99 pid: 22687 timeout: 0 lvb_type: 1
428899 00010000:00040000:1.0:1576584703.468914:0:8347:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed: 

Process 8367 then completed the flush IO and hit the lock's erroneous ref count.

428913 00000080:00200000:3.0:1576584703.471016:0:8367:0:(vvp_io.c:312:vvp_io_fini()) [0x200000402:0x70c0:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0
428914 00000008:00000020:3.0:1576584703.471021:0:8367:0:(osc_object.c:183:osc_attr_update()) set kms from 1966080to 0 
428915 00010000:00040000:3.0:1576584703.471027:0:8367:0:(ldlm_lock.c:205:ldlm_lock_put()) ASSERTION( atomic_read(&lock->l_refc) > 0 ) failed: 
428916 00010000:00040000:3.0:1576584703.471030:0:8367:0:(ldlm_lock.c:205:ldlm_lock_put()) LBUG
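
That pattern (one thread seeing the count reach zero on a lock that was never marked destroyed, and another thread, still legitimately holding a reference, then underflowing the count) is what a single stray or misdirected reference drop would produce, and the two LBUGs then fire in exactly the order seen in the log. A small illustration of that sequence, using generic C with hypothetical actors rather than a claim about the actual root cause:

/* Illustration only: how one stray reference drop on a live lock yields
 * both LBUGs in the observed order.  The "stray put" actor is
 * hypothetical; this does not claim to identify the real culprit. */
#include <stdint.h>
#include <stdio.h>

#define FL_DESTROYED	(1ULL << 50)

struct obj {
	int	 refc;
	uint64_t flags;
};

static void obj_put(struct obj *o, const char *who)
{
	if (o->refc <= 0) {
		/* ldlm_lock.c:205 analogue */
		printf("%-30s ASSERTION( refc > 0 ) failed\n", who);
		return;
	}
	if (--o->refc == 0 && !(o->flags & FL_DESTROYED))
		/* ldlm_lock.c:213 analogue */
		printf("%-30s ASSERTION( flags & (1ULL << 50) ) failed\n",
		       who);
}

int main(void)
{
	/* Both AST handlers hold one reference each on a live lock. */
	struct obj lock = { .refc = 2, .flags = 0 };

	obj_put(&lock, "stray/unbalanced put:");	/* 2 -> 1 */
	obj_put(&lock, "pid 8347 glimpse AST put:");	/* 1 -> 0: first LBUG */
	obj_put(&lock, "pid 8367 blocking AST put:");	/* already 0: second LBUG */
	return 0;
}
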
Comment by Lukasz Flis [ 20/Jul/20 ]

CYFRONET here,

Since migrating to 2.12 we have noticed that some types of HPC jobs crash the nodes (a few crashes a month across 2200 compute nodes). Until today we were not able to reproduce the issue; now we believe we have a workload which triggers it every time it's run.

 

[2195888.180647] LustreError: 3966:0:(ldlm_lock.c:213:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:

Client version is 2.12.4

If you need a debug log for any subsystem or other debug data, please let us know. We'd be happy to provide more info which could help with the diagnosis.

 

Best Regards

Lukasz Flis

Comment by Olaf Faaland [ 13/Aug/20 ]

We are seeing this regularly at LLNL with Lustre 2.12.4_5.chaos.

Comment by Gerrit Updater [ 03/Sep/20 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39819
Subject: LU-13089 osc: revert "glimpse - search for active lock"
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 37205c60bb2d99363a0c9dbf29d8f4fd684b6fab

Comment by Lukasz Flis [ 11/Sep/20 ]

Applying https://review.whamcloud.com/39819 fixed the problem at CYFRONET.
Before going to production with this fix we need to test whether LU-11670 is a problem for us.

 

Comment by Kim Sebo [ 13/Sep/20 ]

We at ANU/NCI are also seeing this assert, ~10 times per month across 3000 clients. Clients are CentOS 8 / 2.12.5 + some backports.

Comment by Gerrit Updater [ 17/Oct/20 ]

Bobi Jam (bobijam@hotmail.com) uploaded a patch port: https://review.whamcloud.com/40399

 

search_itree() was glitched and messed up data->lmd_lock.

Comment by Gerrit Updater [ 18/Oct/20 ]

Bobi Jam (bobijam@hotmail.com) abandoned the patch: https://review.whamcloud.com/40286

Comment by Peter Jones [ 26/Oct/20 ]

Believed to be a duplicate of LU-11719
