Details
- Bug
- Resolution: Fixed
- Minor
Description
This problem occurred while running "auster sanity" on the latest master configured with USE_LU_REF defined ("configure --enable-lu_ref").
Here are the log and the backtrace for this crash/LBUG:
crash> dmesg | less
………………………………….
[73238.236340] Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 11:58:03 (1582027083)
[73240.207988] LustreError: 132905:0:(lu_ref.c:96:lu_ref_print()) lu_ref: ffff9757ff247ca0 4 0 ldlm_lock_new:495
[73240.208120] LustreError: 146472:0:(lu_ref.c:98:lu_ref_print()) link: hash ffff9757ff242d00
[73240.235163] LustreError: 132905:0:(lu_ref.c:96:lu_ref_print()) Skipped 11 previous similar messages
[73240.294633] LustreError: 146472:0:(lu_ref.c:257:lu_ref_del()) ASSERTION( 0 ) failed:
[73240.294636] LustreError: 146472:0:(lu_ref.c:257:lu_ref_del()) LBUG
[73240.294639] Pid: 146472, comm: ldlm_cb01_009 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1 SMP Thu Oct 17 10:54:24 UTC 2019
[73240.294639] Call Trace:
[73240.294674] [<ffffffffc0d670ec>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[73240.294685] [<ffffffffc0d6719c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[73240.294764] [<ffffffffc0aceab0>] lu_ref_set_at+0x0/0x160 [obdclass]
[73240.294787] [<ffffffffc104bae8>] osc_ldlm_glimpse_ast+0x128/0x510 [osc]
[73240.294872] [<ffffffffc0e36dcb>] ldlm_callback_handler.part.27+0xb0b/0x1e30 [ptlrpc]
[73240.294934] [<ffffffffc0e38127>] ldlm_callback_handler+0x37/0xd0 [ptlrpc]
[73240.294994] [<ffffffffc0e67d96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[73240.295056] [<ffffffffc0e6be24>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[73240.295062] [<ffffffffbdcbdf21>] kthread+0xd1/0xe0
[73240.295068] [<ffffffffbe3255f7>] ret_from_fork_nospec_end+0x0/0x39
[73240.295095] [<ffffffffffffffff>] 0xffffffffffffffff
[73240.295096] Kernel panic - not syncing: LBUG
[73240.295101] CPU: 26 PID: 146472 Comm: ldlm_cb01_009 Kdump: loaded Tainted: G W IOE ------------ 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1
[73240.295102] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[73240.295103] Call Trace:
[73240.295109] [<ffffffffbe313754>] dump_stack+0x19/0x1b
[73240.295114] [<ffffffffbe30d29f>] panic+0xe8/0x21f
[73240.295127] [<ffffffffc0d671eb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[73240.295159] [<ffffffffc0aceab0>] lu_ref_del+0x230/0x230 [obdclass]
[73240.295172] [<ffffffffc104bae8>] osc_ldlm_glimpse_ast+0x128/0x510 [osc]
[73240.295216] [<ffffffffc0e36dcb>] ldlm_callback_handler.part.27+0xb0b/0x1e30 [ptlrpc]
[73240.295284] [<ffffffffc0e38127>] ldlm_callback_handler+0x37/0xd0 [ptlrpc]
[73240.295338] [<ffffffffc0e67d96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[73240.295388] [<ffffffffc0e6be24>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[73240.295440] [<ffffffffc0e6b270>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[73240.295443] [<ffffffffbdcbdf21>] kthread+0xd1/0xe0
[73240.295446] [<ffffffffbdcbde50>] ? insert_kthread_work+0x40/0x40
[73240.295465] [<ffffffffbe3255f7>] ret_from_fork_nospec_begin+0x21/0x21
[73240.295467] [<ffffffffbdcbde50>] ? insert_kthread_work+0x40/0x40
(END)
crash> bt
PID: 146472 TASK: ffff9758562c4f10 CPU: 26 COMMAND: "ldlm_cb01_009"
#0 [ffff9757d6a279e8] machine_kexec at ffffffffbdc62a0a
#1 [ffff9757d6a27a48] __crash_kexec at ffffffffbdd166c2
#2 [ffff9757d6a27b18] panic at ffffffffbe30d2aa
#3 [ffff9757d6a27b98] lbug_with_loc at ffffffffc0d671eb [libcfs]
#4 [ffff9757d6a27bf0] osc_ldlm_glimpse_ast at ffffffffc104bae8 [osc]
#5 [ffff9757d6a27ca8] ldlm_callback_handler at ffffffffc0e36dcb [ptlrpc]
#6 [ffff9757d6a27d20] ldlm_callback_handler at ffffffffc0e38127 [ptlrpc]
#7 [ffff9757d6a27d38] ptlrpc_server_handle_request at ffffffffc0e67d96 [ptlrpc]
#8 [ffff9757d6a27df0] ptlrpc_main at ffffffffc0e6be24 [ptlrpc]
#9 [ffff9757d6a27ec8] kthread at ffffffffbdcbdf21
crash>
Crash-dump analysis, together with a reading of the relevant source code, indicates that this problem was probably introduced by commit b3461d11dcb from LU-11670: in osc_ldlm_glimpse_ast() (in lustre/osc/osc_lock.c), LDLM_LOCK_PUT() is called after LDLM_LOCK_GET(), instead of LDLM_LOCK_RELEASE(), or without a preceding lu_ref_add().
I will soon push a patch as a tentative fix for this problem.
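To illustrate the suspected mismatch, here is a minimal toy model in C. It is not Lustre code: toy_lock, toy_get(), toy_put(), toy_release(), toy_ref_add() and toy_ref_del() are hypothetical names, and the model assumes, as the analysis above implies, that the PUT-style call removes a tracked debug reference while the GET-style call records none. The real tracker matches references by scope and source rather than by a simple counter, and it asserts (LBUG) where this sketch returns false.

```c
#include <stdbool.h>

/* Toy stand-in for a lock: a plain refcount plus a count of
 * references recorded by a debug tracker (hypothetical model). */
struct toy_lock {
    int refcount;   /* ordinary reference count */
    int tracked;    /* references known to the debug tracker */
};

/* Record a tracked reference (models a lu_ref_add()-style call). */
void toy_ref_add(struct toy_lock *lk)
{
    lk->tracked++;
}

/* Remove a tracked reference. Returns false when there is nothing
 * to remove, which is where the real code would assert. */
bool toy_ref_del(struct toy_lock *lk)
{
    if (lk->tracked == 0)
        return false;
    lk->tracked--;
    return true;
}

/* Untracked get: bumps the refcount only (assumed GET behaviour). */
void toy_get(struct toy_lock *lk)
{
    lk->refcount++;
}

/* Untracked put: drops the refcount only (assumed RELEASE behaviour). */
bool toy_release(struct toy_lock *lk)
{
    lk->refcount--;
    return true;
}

/* Tracked put: removes a tracked reference, then drops the refcount
 * (assumed PUT behaviour). Fails if no tracked reference exists. */
bool toy_put(struct toy_lock *lk)
{
    bool ok = toy_ref_del(lk);
    lk->refcount--;
    return ok;
}
```

In this model, toy_get() followed by toy_put() fails because no tracked reference was ever recorded, whereas toy_get() followed by toy_release(), or toy_get() plus toy_ref_add() followed by toy_put(), both succeed. These correspond to the two possible fixes named in the analysis.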