[LU-13263] "(lu_ref.c:257:lu_ref_del()) ASSERTION( 0 ) failed" triggered by lu_ref_del() in osc_ldlm_glimpse_ast(), because no corresponding lu_ref_link posted, with recent master configured with USE_LU_REF defined Created: 19/Feb/20  Updated: 25/Feb/20  Resolved: 25/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This problem occurred while running "auster sanity" against the latest master configured with USE_LU_REF defined ("configure --enable-lu_ref").
Here are the log and the backtrace for this crash/LBUG:

crash> dmesg | less
………………………………….
[73238.236340] Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 11:58:03 (1582027083)
[73240.207988] LustreError: 132905:0:(lu_ref.c:96:lu_ref_print()) lu_ref: ffff9757ff247ca0 4 0 ldlm_lock_new:495
[73240.208120] LustreError: 146472:0:(lu_ref.c:98:lu_ref_print())      link: hash ffff9757ff242d00
[73240.235163] LustreError: 132905:0:(lu_ref.c:96:lu_ref_print()) Skipped 11 previous similar messages
[73240.294633] LustreError: 146472:0:(lu_ref.c:257:lu_ref_del()) ASSERTION( 0 ) failed:
[73240.294636] LustreError: 146472:0:(lu_ref.c:257:lu_ref_del()) LBUG
[73240.294639] Pid: 146472, comm: ldlm_cb01_009 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1 SMP Thu Oct 17 10:54:24 UTC 2019
[73240.294639] Call Trace:
[73240.294674]  [<ffffffffc0d670ec>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[73240.294685]  [<ffffffffc0d6719c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[73240.294764]  [<ffffffffc0aceab0>] lu_ref_set_at+0x0/0x160 [obdclass]
[73240.294787]  [<ffffffffc104bae8>] osc_ldlm_glimpse_ast+0x128/0x510 [osc]
[73240.294872]  [<ffffffffc0e36dcb>] ldlm_callback_handler.part.27+0xb0b/0x1e30 [ptlrpc]
[73240.294934]  [<ffffffffc0e38127>] ldlm_callback_handler+0x37/0xd0 [ptlrpc]
[73240.294994]  [<ffffffffc0e67d96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[73240.295056]  [<ffffffffc0e6be24>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[73240.295062]  [<ffffffffbdcbdf21>] kthread+0xd1/0xe0
[73240.295068]  [<ffffffffbe3255f7>] ret_from_fork_nospec_end+0x0/0x39
[73240.295095]  [<ffffffffffffffff>] 0xffffffffffffffff
[73240.295096] Kernel panic - not syncing: LBUG
[73240.295101] CPU: 26 PID: 146472 Comm: ldlm_cb01_009 Kdump: loaded Tainted: G        W IOE  ------------   3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1
[73240.295102] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[73240.295103] Call Trace:
[73240.295109]  [<ffffffffbe313754>] dump_stack+0x19/0x1b
[73240.295114]  [<ffffffffbe30d29f>] panic+0xe8/0x21f
[73240.295127]  [<ffffffffc0d671eb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[73240.295159]  [<ffffffffc0aceab0>] lu_ref_del+0x230/0x230 [obdclass]
[73240.295172]  [<ffffffffc104bae8>] osc_ldlm_glimpse_ast+0x128/0x510 [osc]
[73240.295216]  [<ffffffffc0e36dcb>] ldlm_callback_handler.part.27+0xb0b/0x1e30 [ptlrpc]
[73240.295284]  [<ffffffffc0e38127>] ldlm_callback_handler+0x37/0xd0 [ptlrpc]
[73240.295338]  [<ffffffffc0e67d96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[73240.295388]  [<ffffffffc0e6be24>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[73240.295440]  [<ffffffffc0e6b270>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[73240.295443]  [<ffffffffbdcbdf21>] kthread+0xd1/0xe0
[73240.295446]  [<ffffffffbdcbde50>] ? insert_kthread_work+0x40/0x40
[73240.295465]  [<ffffffffbe3255f7>] ret_from_fork_nospec_begin+0x21/0x21
[73240.295467]  [<ffffffffbdcbde50>] ? insert_kthread_work+0x40/0x40
(END)
crash> bt
PID: 146472  TASK: ffff9758562c4f10  CPU: 26  COMMAND: "ldlm_cb01_009"
 #0 [ffff9757d6a279e8] machine_kexec at ffffffffbdc62a0a
 #1 [ffff9757d6a27a48] __crash_kexec at ffffffffbdd166c2
 #2 [ffff9757d6a27b18] panic at ffffffffbe30d2aa
 #3 [ffff9757d6a27b98] lbug_with_loc at ffffffffc0d671eb [libcfs]
 #4 [ffff9757d6a27bf0] osc_ldlm_glimpse_ast at ffffffffc104bae8 [osc]
 #5 [ffff9757d6a27ca8] ldlm_callback_handler at ffffffffc0e36dcb [ptlrpc]
 #6 [ffff9757d6a27d20] ldlm_callback_handler at ffffffffc0e38127 [ptlrpc]
 #7 [ffff9757d6a27d38] ptlrpc_server_handle_request at ffffffffc0e67d96 [ptlrpc]
 #8 [ffff9757d6a27df0] ptlrpc_main at ffffffffc0e6be24 [ptlrpc]
 #9 [ffff9757d6a27ec8] kthread at ffffffffbdcbdf21
crash> 

Crash-dump analysis, together with a review of the relevant source code, suggests that this problem may have been introduced by commit b3461d11dcb from LU-11670. In osc_ldlm_glimpse_ast() (in lustre/osc/osc_lock.c), LDLM_LOCK_PUT() is called after LDLM_LOCK_GET(), where either LDLM_LOCK_RELEASE() should have been used or a lu_ref_add() should have preceded the put.
I will push a patch soon as a tentative fix for this problem.
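
To illustrate the mismatch described above, here is a minimal user-space sketch in C. It is not Lustre code: every toy_* name is hypothetical, and the model only assumes what this report states, namely that the "put" path deletes an lu_ref link while the "get" path never posts one. Where the real lu_ref_del() hits ASSERTION( 0 ) and LBUGs, the toy version simply returns -1.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for lu_ref tracking; all names here are hypothetical. */
#define TOY_MAX_LINKS 8

struct toy_ref {
	const char *scope[TOY_MAX_LINKS];	/* posted links, by scope name */
	int nr_links;
};

struct toy_lock {
	int refcount;
	struct toy_ref reference;
};

enum toy_mode {
	TOY_GET_PUT,		/* get, then put: the pattern reported here */
	TOY_GET_RELEASE,	/* get, then release: no link touched */
	TOY_GET_ADD_PUT,	/* get, post a link, then put */
};

/* Post a link, as lu_ref_add() is assumed to do. */
static void toy_ref_add(struct toy_ref *ref, const char *scope)
{
	assert(ref->nr_links < TOY_MAX_LINKS);
	ref->scope[ref->nr_links++] = scope;
}

/* Remove a matching link; -1 marks the case where the real code LBUGs. */
static int toy_ref_del(struct toy_ref *ref, const char *scope)
{
	for (int i = 0; i < ref->nr_links; i++) {
		if (strcmp(ref->scope[i], scope) == 0) {
			ref->scope[i] = ref->scope[--ref->nr_links];
			return 0;
		}
	}
	return -1;
}

/* Plain refcount get: posts no link. */
static void toy_lock_get(struct toy_lock *lock)
{
	lock->refcount++;
}

/* Plain refcount drop: touches no link. */
static void toy_lock_release(struct toy_lock *lock)
{
	lock->refcount--;
}

/* Drop that also deletes a "handle" link, which must have been posted. */
static int toy_lock_put(struct toy_lock *lock)
{
	int rc = toy_ref_del(&lock->reference, "handle");

	lock->refcount--;
	return rc;
}

/* Run one get/drop sequence; returns 0 if the link accounting balanced. */
static int toy_glimpse_path(enum toy_mode mode)
{
	struct toy_lock lock = { .refcount = 1 };

	toy_lock_get(&lock);
	if (mode == TOY_GET_RELEASE) {
		toy_lock_release(&lock);
		return 0;
	}
	if (mode == TOY_GET_ADD_PUT)
		toy_ref_add(&lock.reference, "handle");
	return toy_lock_put(&lock);
}
```

In this model only TOY_GET_PUT fails, because the delete finds no link to remove. The two balanced variants correspond to the two remedies named above: using the plain release, or posting a link before the put.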



 Comments   
Comment by Gerrit Updater [ 19/Feb/20 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/37625
Subject: LU-13263 osc: use LDLM_LOCK_RELEASE() if no lu_ref added
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 482707b3d9fc3cbef0f2ee4cd2547aec962e9f03

Comment by Gerrit Updater [ 25/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37625/
Subject: LU-13263 osc: use LDLM_LOCK_RELEASE() if no lu_ref added
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7bde4a104485662d70a578c056cc39ef46b22a10

Comment by Peter Jones [ 25/Feb/20 ]

Landed for 2.14

Generated at Sat Feb 10 02:59:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.