Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
These problems have occurred when running "auster sanity", with latest master configured with USE_LU_REF defined ("configure --enable-lu_ref").
Here are the logs and the backtraces for these slightly different cases/backtraces with the same crash/LBUG.
FIRST ONE :
crash> dmesg | less ………………………………… [16713.715520] Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count ======================================================= 16:39:08 (1582043948) [16714.015089] Lustre: 14985:0:(osd_internal.h:1247:osd_trans_exec_op()) lustre-MDT0000: opcode 7: before 512 < left 2010, rollback = 7 [16714.032739] Lustre: 14985:0:(osd_handler.c:1807:osd_trans_dump_creds()) create: 0/0/0, destroy: 0/0/0 [16714.045410] Lustre: 14985:0:(osd_handler.c:1807:osd_trans_dump_creds()) Skipped 5 previous similar messages [16714.058317] Lustre: 14985:0:(osd_handler.c:1814:osd_trans_dump_creds()) attr_set: 0/0/0, xattr_set: 1/89/0 [16714.071167] Lustre: 14985:0:(osd_handler.c:1814:osd_trans_dump_creds()) Skipped 5 previous similar messages [16714.083794] Lustre: 14985:0:(osd_handler.c:1824:osd_trans_dump_creds()) write: 2001/2010/0, punch: 0/0/0, quota 3/3/0 [16714.097491] Lustre: 14985:0:(osd_handler.c:1824:osd_trans_dump_creds()) Skipped 5 previous similar messages [16714.109917] Lustre: 14985:0:(osd_handler.c:1831:osd_trans_dump_creds()) insert: 0/0/0, delete: 0/0/0 [16714.121753] Lustre: 14985:0:(osd_handler.c:1831:osd_trans_dump_creds()) Skipped 5 previous similar messages [16714.133993] Lustre: 14985:0:(osd_handler.c:1838:osd_trans_dump_creds()) ref_add: 0/0/0, ref_del: 0/0/0 [16714.145902] Lustre: 14985:0:(osd_handler.c:1838:osd_trans_dump_creds()) Skipped 5 previous similar messages [16714.691028] LustreError: 146480:0:(cl_object.c:629:cl_env_init0()) ASSERTION( cle->ce_ref == 0 ) failed: [16714.694227] LustreError: 146479:0:(cl_object.c:962:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed: [16714.694230] LustreError: 146479:0:(cl_object.c:962:cl_env_percpu_put()) LBUG [16714.694234] Pid: 146479, comm: ptlrpcd_01_29 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1 SMP Thu Oct 17 10:54:24 UTC 2019 [16714.694234] Call Trace: [16714.694276] [<ffffffffc09a20ec>] libcfs_call_trace+0x8c/0xc0 [libcfs] [16714.694289] [<ffffffffc09a219c>] lbug_with_loc+0x4c/0xa0 [libcfs] [16714.694350] [<ffffffffc0ad1ade>] cl_env_percpu_put+0x6e/0xa0 [obdclass] [16714.694371] [<ffffffffc0fa816c>] osc_lock_upcall+0x9c/0x5f0 [osc] [16714.694384] [<ffffffffc0f9c432>] osc_enqueue_fini+0xb2/0x210 [osc] [16714.694396] [<ffffffffc0f9c6a0>] osc_enqueue_interpret+0x110/0x340 [osc] [16714.694477] [<ffffffffc0da6e70>] ptlrpc_check_set.part.23+0x480/0x1de0 [ptlrpc] [16714.694537] [<ffffffffc0da882b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] [16714.694606] [<ffffffffc0dd55eb>] ptlrpcd_check+0x4ab/0x590 [ptlrpc] [16714.694669] [<ffffffffc0dd596b>] ptlrpcd+0x29b/0x550 [ptlrpc] [16714.694676] [<ffffffff844bdf21>] kthread+0xd1/0xe0 [16714.694683] [<ffffffff84b255f7>] ret_from_fork_nospec_end+0x0/0x39 [16714.694696] [<ffffffffffffffff>] 0xffffffffffffffff [16714.694698] Kernel panic - not syncing: LBUG [16714.694703] CPU: 29 PID: 146479 Comm: ptlrpcd_01_29 Kdump: loaded Tainted: G IOE ------------ 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1 [16714.694705] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [16714.694706] Call Trace: [16714.694712] [<ffffffff84b13754>] dump_stack+0x19/0x1b [16714.694718] [<ffffffff84b0d29f>] panic+0xe8/0x21f [16714.694734] [<ffffffffc09a21eb>] lbug_with_loc+0x9b/0xa0 [libcfs] [16714.694776] [<ffffffffc0ad1ade>] cl_env_percpu_put+0x6e/0xa0 [obdclass] [16714.694791] [<ffffffffc0fa816c>] osc_lock_upcall+0x9c/0x5f0 [osc] [16714.694806] [<ffffffffc0fa80d0>] ? osc_lock_upcall_speculative+0x2e0/0x2e0 [osc] [16714.694818] [<ffffffffc0f9c432>] osc_enqueue_fini+0xb2/0x210 [osc] [16714.694831] [<ffffffffc0f9c6a0>] osc_enqueue_interpret+0x110/0x340 [osc] [16714.694888] [<ffffffffc0da6e70>] ptlrpc_check_set.part.23+0x480/0x1de0 [ptlrpc] [16714.694946] [<ffffffffc0da882b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] [16714.695010] [<ffffffffc0dd55eb>] ptlrpcd_check+0x4ab/0x590 [ptlrpc] [16714.695072] [<ffffffffc0dd596b>] ptlrpcd+0x29b/0x550 [ptlrpc] [16714.695077] [<ffffffff844d2010>] ? wake_up_state+0x20/0x20 [16714.695138] [<ffffffffc0dd56d0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc] [16714.695141] [<ffffffff844bdf21>] kthread+0xd1/0xe0 [16714.695145] [<ffffffff844bde50>] ? insert_kthread_work+0x40/0x40 [16714.695150] [<ffffffff84b255f7>] ret_from_fork_nospec_begin+0x21/0x21 [16714.695153] [<ffffffff844bde50>] ? insert_kthread_work+0x40/0x40 (END) crash> bt PID: 146479 TASK: ffff8a320a248fd0 CPU: 29 COMMAND: "ptlrpcd_01_29" #0 [ffff8a325c5039a8] machine_kexec at ffffffff84462a0a #1 [ffff8a325c503a08] __crash_kexec at ffffffff845166c2 #2 [ffff8a325c503ad8] panic at ffffffff84b0d2aa #3 [ffff8a325c503b58] lbug_with_loc at ffffffffc09a21eb [libcfs] #4 [ffff8a325c503b78] cl_env_percpu_put at ffffffffc0ad1ade [obdclass] #5 [ffff8a325c503b88] osc_lock_upcall at ffffffffc0fa816c [osc] #6 [ffff8a325c503bd0] osc_enqueue_fini at ffffffffc0f9c432 [osc] #7 [ffff8a325c503c20] osc_enqueue_interpret at ffffffffc0f9c6a0 [osc] #8 [ffff8a325c503c98] ptlrpc_check_set at ffffffffc0da6e70 [ptlrpc] #9 [ffff8a325c503d48] ptlrpc_check_set at ffffffffc0da882b [ptlrpc] #10 [ffff8a325c503d68] ptlrpcd_check at ffffffffc0dd55eb [ptlrpc] #11 [ffff8a325c503db8] ptlrpcd at ffffffffc0dd596b [ptlrpc] #12 [ffff8a325c503ec8] kthread at ffffffff844bdf21 crash> crash> bt 146480 PID: 146480 TASK: ffff8a320a249fa0 CPU: 65 COMMAND: "ptlrpcd_01_30" #0 [ffff8a3a5eb48e48] crash_nmi_callback at ffffffff844551b7 #1 [ffff8a3a5eb48e58] nmi_handle at ffffffff84b1d8dc #2 [ffff8a3a5eb48eb0] do_nmi at ffffffff84b1dafd #3 [ffff8a3a5eb48ef0] end_repeat_nmi at ffffffff84b1cd79 [exception RIP: io_serial_in+21] RIP: ffffffff8484aab5 RSP: ffff8a31c2a1f7d8 RFLAGS: 00000002 RAX: ffffffff8484aa00 RBX: ffffffff85614400 RCX: 0000000000000000 RDX: 00000000000003fd RSI: 00000000000003fd RDI: ffffffff85614400 RBP: ffff8a31c2a1f7d8 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: ffff8a31c2a1f6ee R12: 0000000000002700 R13: 0000000000000020 R14: ffffffff853c1c4d R15: 000000000000006d ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #4 [ffff8a31c2a1f7d8] io_serial_in at ffffffff8484aab5 #5 [ffff8a31c2a1f7e0] wait_for_xmitr at ffffffff8484b746 #6 [ffff8a31c2a1f808] serial8250_console_putchar at ffffffff8484b92c #7 [ffff8a31c2a1f828] uart_console_write at ffffffff8484632c #8 [ffff8a31c2a1f858] serial8250_console_write at ffffffff8484b86f #9 [ffff8a31c2a1f8a0] call_console_drivers.constprop.17 at ffffffff84495843 #10 [ffff8a31c2a1f8c8] console_unlock at ffffffff84496fe8 #11 [ffff8a31c2a1f908] vprintk_emit at ffffffff844973f4 #12 [ffff8a31c2a1f978] vprintk_default at ffffffff84497769 #13 [ffff8a31c2a1f988] printk at ffffffff84b0d436 #14 [ffff8a31c2a1f9e8] cfs_print_to_console at ffffffffc09a2bed [libcfs] #15 [ffff8a31c2a1fa30] libcfs_debug_msg at ffffffffc098da3c [libcfs] #16 [ffff8a31c2a1fb60] cl_env_init0 at ffffffffc0ad19b3 [obdclass] #17 [ffff8a31c2a1fb70] cl_env_percpu_get at ffffffffc0ad1a5e [obdclass] #18 [ffff8a31c2a1fb88] osc_lock_upcall at ffffffffc0fa810a [osc] #19 [ffff8a31c2a1fbd0] osc_enqueue_fini at ffffffffc0f9c432 [osc] #20 [ffff8a31c2a1fc20] osc_enqueue_interpret at ffffffffc0f9c6a0 [osc] #21 [ffff8a31c2a1fc98] ptlrpc_check_set at ffffffffc0da6e70 [ptlrpc] #22 [ffff8a31c2a1fd48] ptlrpc_check_set at ffffffffc0da882b [ptlrpc] #23 [ffff8a31c2a1fd68] ptlrpcd_check at ffffffffc0dd55eb [ptlrpc] #24 [ffff8a31c2a1fdb8] ptlrpcd at ffffffffc0dd596b [ptlrpc] #25 [ffff8a31c2a1fec8] kthread at ffffffff844bdf21 crash>
And last Lustre debug traces for both PIDs 146479 and 146480 clearly show that 146479 it has changed from CPU #65 to #29 during its execution within osc_lock_upcall() context, and particularly in lu_ref_add() because of might_sleep(), hence the “cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed” in cl_env_percpu_put(). When 146480 triggers “(cl_object.c:629:cl_env_init0()) ASSERTION( cle->ce_ref == 0 ) failed“ on CPU #65 due to previous stuff left by 146479 before being switched to CPU #29 !! :
00000008:00000001:65.0:1582043949.184487:0:146479:0:(osc_lock.c:274:osc_lock_upcall()) Process entered 00010000:00000001:65.0:1582043949.184489:0:146479:0:(ldlm_lock.c:599:__ldlm_handle2lock()) Process entered 00000020:00000001:65.0:1582043949.184490:0:146479:0:(lustre_handles.c:141:class_handle2object()) Process entered 00000020:00000040:65.0:1582043949.184492:0:146479:0:(lustre_handles.c:159:class_handle2object()) GET ldlm ffff8a30e1ad8500 refcount=5 00000020:00000001:65.0:1582043949.184494:0:146479:0:(lustre_handles.c:166:class_handle2object()) Process leaving (rc=18446614541282149632 : -129532427401984 : ffff8a30e1ad8500) …………………………. 00000008:00000001:65.0:1582043949.184758:0:146480:0:(osc_request.c:2395:osc_enqueue_fini()) Process entered 00000008:00000001:65.0:1582043949.184759:0:146480:0:(osc_lock.c:274:osc_lock_upcall()) Process entered 00000020:00040000:65.0:1582043949.184760:0:146480:0:(cl_object.c:629:cl_env_init0()) ASSERTION( cle->ce_ref == 0 ) failed: 00000020:00000010:29.0:1582043949.187924:0:146479:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff8a31c26a8f50. 00010000:00000001:29.0:1582043949.187926:0:146479:0:(ldlm_lock.c:621:__ldlm_handle2lock()) Process leaving (rc=18446614541282149632 : -129532427401984 : ffff8a30e1ad8500) 00000020:00000010:29.0:1582043949.187929:0:146479:0:(lu_ref.c:255:lu_ref_del()) slab-freed 'link': 40 at ffff8a3a5bff5528. 00000020:00000010:29.0:1582043949.187931:0:146479:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff8a31c26a82f8. 00000008:00000001:29.0:1582043949.187933:0:146479:0:(osc_lock.c:159:osc_lock_lvb_update()) Process entered 00000020:00000001:29.0:1582043949.187935:0:146479:0:(cl_object.c:149:cl_object_top()) ffff8a30e8a22fd0 -> ffff8a31cdbe7c18 00000008:00010000:29.0:1582043949.187937:0:146479:0:(osc_lock.c:197:osc_lock_lvb_update()) ### acquired size 0, setting rss=0; kms=0, end=18446744073709551615 ns: ?? lock: ffff8a30e1ad8500/0x55623c61b9c3f430 lrc: 5/2,0 mode: PR/PR res: ?? rrc=?? type: ??? flags: 0x10020000000000 nid: local remote: 0x55623c61b9c3f4a0 expref: -99 pid: 22620 timeout: 0 lvb_type: 1 00000020:00000001:29.0:1582043949.187944:0:146479:0:(cl_object.c:149:cl_object_top()) ffff8a30e8a22fd0 -> ffff8a31cdbe7c18 00000020:00000001:29.0:1582043949.187946:0:146479:0:(cl_object.c:238:cl_object_attr_update()) Process entered 00000008:00000020:29.0:1582043949.187947:0:146479:0:(osc_object.c:183:osc_attr_update()) set kms from 0to 0 00020000:00000001:29.0:1582043949.187949:0:146479:0:(lovsub_object.c:116:lovsub_attr_update()) Process entered 00020000:00000001:29.0:1582043949.187950:0:146479:0:(lovsub_object.c:118:lovsub_attr_update()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:29.0:1582043949.187952:0:146479:0:(cl_object.c:253:cl_object_attr_update()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:29.0:1582043949.187954:0:146479:0:(cl_object.c:149:cl_object_top()) ffff8a30e8a22fd0 -> ffff8a31cdbe7c18 00000008:00000001:29.0:1582043949.187956:0:146479:0:(osc_lock.c:202:osc_lock_lvb_update()) Process leaving 00000020:00000001:29.0:1582043949.187957:0:146479:0:(cl_io.c:1180:cl_sync_io_note()) Process entered 00000020:00000001:29.0:1582043949.187958:0:146479:0:(cl_io.c:1208:cl_sync_io_note()) Process leaving 00000020:00040000:29.0:1582043949.187959:0:146479:0:(cl_object.c:962:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed: 00000020:00040000:29.0:1582043949.187965:0:146479:0:(cl_object.c:962:cl_env_percpu_put()) LBUG
SECOND ONE :
crash> sys KERNEL: /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64/vmlinux DUMPFILE: /scratch/dumps/wolf-5.wolf.hpdd.intel.com/10.8.1.5-2020-02-19-09:31:14/vmcore [PARTIAL DUMP] CPUS: 72 DATE: Wed Feb 19 09:30:56 2020 UPTIME: 16:50:03 LOAD AVERAGE: 3.98, 2.34, 1.38 TASKS: 1117 NODENAME: wolf-5.wolf.hpdd.intel.com RELEASE: 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 VERSION: #1 SMP Thu Oct 17 10:54:24 UTC 2019 MACHINE: x86_64 (2294 Mhz) MEMORY: 63.9 GB PANIC: "Kernel panic - not syncing: LBUG" crash> crash> dmesg | less …………………………………….. [60598.486899] Lustre: DEBUG MARKER: == sanity test 49: Change max_pages_per_rpc won't break osc extent =================================== 09:30:54 (1582104654) [60599.708080] LustreError: 100699:0:(cl_object.c:962:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed: [60599.723309] LustreError: 100699:0:(cl_object.c:962:cl_env_percpu_put()) LBUG [60599.734409] Pid: 100699, comm: dd 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1 SMP Thu Oct 17 10:54:24 UTC 2019 [60599.750031] Call Trace: [60599.755487] [<ffffffffc0b820ec>] libcfs_call_trace+0x8c/0xc0 [libcfs] [60599.765526] [<ffffffffc0b8219c>] lbug_with_loc+0x4c/0xa0 [libcfs] [60599.774953] [<ffffffffc0da6ade>] cl_env_percpu_put+0x6e/0xa0 [obdclass] [60599.784833] [<ffffffffc0cba16c>] osc_lock_upcall+0x9c/0x5f0 [osc] [60599.794046] [<ffffffffc0caee93>] osc_enqueue_base+0x5c3/0x6e0 [osc] [60599.803288] [<ffffffffc0cb956c>] osc_lock_enqueue+0x37c/0x830 [osc] [60599.812424] [<ffffffffc0dad745>] cl_lock_enqueue+0x65/0x120 [obdclass] [60599.821884] [<ffffffffc125a9d5>] lov_lock_enqueue+0x95/0x150 [lov] [60599.830916] [<ffffffffc0dad745>] cl_lock_enqueue+0x65/0x120 [obdclass] [60599.840327] [<ffffffffc0dadcd7>] cl_lock_request+0x67/0x1f0 [obdclass] [60599.849696] [<ffffffffc0db194b>] cl_io_lock+0xeb/0x270 [obdclass] [60599.858545] [<ffffffffc0db1cea>] cl_io_loop+0xba/0x1c0 [obdclass] [60599.867366] [<ffffffffc12df49a>] ll_file_io_generic+0x5da/0xb00 [lustre] [60599.876880] [<ffffffffc12e00e1>] ll_file_aio_write+0x491/0x780 [lustre] [60599.886271] [<ffffffffc12e04d0>] ll_file_write+0x100/0x1c0 [lustre] [60599.895246] [<ffffffffa381f240>] vfs_write+0xc0/0x1f0 [60599.902819] [<ffffffffa382006f>] SyS_write+0x7f/0xf0 [60599.910247] [<ffffffffa3d2579b>] system_call_fastpath+0x22/0x27 [60599.918717] [<ffffffffffffffff>] 0xffffffffffffffff [60599.925994] Kernel panic - not syncing: LBUG [60599.932421] CPU: 18 PID: 100699 Comm: dd Kdump: loaded Tainted: G W IOE ------------ 3.10.0-862.14.4.el7_lustre_ClientSymlink_279c264.x86_64 #1 [60599.951075] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [60599.964398] Call Trace: [60599.968765] [<ffffffffa3d13754>] dump_stack+0x19/0x1b [60599.976118] [<ffffffffa3d0d29f>] panic+0xe8/0x21f [60599.983051] [<ffffffffc0b821eb>] lbug_with_loc+0x9b/0xa0 [libcfs] [60599.991533] [<ffffffffc0da6ade>] cl_env_percpu_put+0x6e/0xa0 [obdclass] [60600.000551] [<ffffffffc0cba16c>] osc_lock_upcall+0x9c/0x5f0 [osc] [60600.008973] [<ffffffffc0caee93>] osc_enqueue_base+0x5c3/0x6e0 [osc] [60600.017566] [<ffffffffc0cba0d0>] ? osc_lock_upcall_speculative+0x2e0/0x2e0 [osc] [60600.027395] [<ffffffffc0cb956c>] osc_lock_enqueue+0x37c/0x830 [osc] [60600.035953] [<ffffffffc0dad745>] cl_lock_enqueue+0x65/0x120 [obdclass] [60600.044755] [<ffffffffc125a9d5>] lov_lock_enqueue+0x95/0x150 [lov] [60600.053156] [<ffffffffc0dad745>] cl_lock_enqueue+0x65/0x120 [obdclass] [60600.061920] [<ffffffffc0dadcd7>] cl_lock_request+0x67/0x1f0 [obdclass] [60600.070642] [<ffffffffc0db194b>] cl_io_lock+0xeb/0x270 [obdclass] [60600.078846] [<ffffffffc0db1cea>] cl_io_loop+0xba/0x1c0 [obdclass] [60600.086993] [<ffffffffc12df49a>] ll_file_io_generic+0x5da/0xb00 [lustre] [60600.095791] [<ffffffffc12e00e1>] ll_file_aio_write+0x491/0x780 [lustre] [60600.104459] [<ffffffffc12e04d0>] ll_file_write+0x100/0x1c0 [lustre] [60600.112693] [<ffffffffa381f240>] vfs_write+0xc0/0x1f0 [60600.119532] [<ffffffffa382006f>] SyS_write+0x7f/0xf0 [60600.126232] [<ffffffffa3d2579b>] system_call_fastpath+0x22/0x27 (END) crash> crash> bt 100699 PID: 100699 TASK: ffff99f600b63f40 CPU: 18 COMMAND: "dd" #0 [ffff99f5e3897788] machine_kexec at ffffffffa3662a0a #1 [ffff99f5e38977e8] __crash_kexec at ffffffffa37166c2 #2 [ffff99f5e38978b8] panic at ffffffffa3d0d2aa #3 [ffff99f5e3897938] lbug_with_loc at ffffffffc0b821eb [libcfs] #4 [ffff99f5e3897958] cl_env_percpu_put at ffffffffc0da6ade [obdclass] #5 [ffff99f5e3897968] osc_lock_upcall at ffffffffc0cba16c [osc] #6 [ffff99f5e38979b0] osc_enqueue_base at ffffffffc0caee93 [osc] #7 [ffff99f5e3897a68] osc_lock_enqueue at ffffffffc0cb956c [osc] #8 [ffff99f5e3897af0] cl_lock_enqueue at ffffffffc0dad745 [obdclass] #9 [ffff99f5e3897b28] lov_lock_enqueue at ffffffffc125a9d5 [lov] #10 [ffff99f5e3897b68] cl_lock_enqueue at ffffffffc0dad745 [obdclass] #11 [ffff99f5e3897ba0] cl_lock_request at ffffffffc0dadcd7 [obdclass] #12 [ffff99f5e3897bd8] cl_io_lock at ffffffffc0db194b [obdclass] #13 [ffff99f5e3897c20] cl_io_loop at ffffffffc0db1cea [obdclass] #14 [ffff99f5e3897c50] ll_file_io_generic at ffffffffc12df49a [lustre] #15 [ffff99f5e3897d58] ll_file_aio_write at ffffffffc12e00e1 [lustre] #16 [ffff99f5e3897de0] ll_file_write at ffffffffc12e04d0 [lustre] #17 [ffff99f5e3897ec8] vfs_write at ffffffffa381f240 #18 [ffff99f5e3897f08] sys_write at ffffffffa382006f #19 [ffff99f5e3897f50] system_call_fastpath at ffffffffa3d2579b RIP: 00007fd19b2ce9b0 RSP: 00007ffcafbeda78 RFLAGS: 00000246 RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffffffffffffff RDX: 0000000000001000 RSI: 0000000000912000 RDI: 0000000000000001 RBP: 0000000000001000 R8: 0000000000000001 R9: 0000000000003003 R10: 00007ffcafbed4e0 R11: 0000000000000246 R12: 0000000000001000 R13: 0000000000912000 R14: 0000000000913000 R15: 0000000000000000 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b crash>
And last Lustre debug traces for PID 100699 clearly show that it has changed from CPU #0 to #18 during its execution within osc_lock_upcall() context, and particularly in __ldlm_handle2lock()->lu_ref_add() sequence because of might_sleep() !! :
00000008:00000001:0.0:1582104656.226704:0:100699:0:(osc_lock.c:274:osc_lock_upcall()) Process entered 00010000:00000001:0.0:1582104656.226704:0:100699:0:(ldlm_lock.c:599:__ldlm_handle2lock()) Process entered 00000020:00000001:0.0:1582104656.226705:0:100699:0:(lustre_handles.c:141:class_handle2object()) Process entered 00000020:00000040:0.0:1582104656.226705:0:100699:0:(lustre_handles.c:159:class_handle2object()) GET ldlm ffff99fd2cf91680 refcount=8 00000020:00000001:0.0:1582104656.226706:0:100699:0:(lustre_handles.c:166:class_handle2object()) Process leaving (rc=18446631911393138304 : -112162316413312 : ffff99fd2cf91680) 00000020:00000010:0.0:1582104656.226707:0:100699:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff99fd5e6c2e88. 00010000:00000001:0.0:1582104656.226708:0:100699:0:(ldlm_lock.c:621:__ldlm_handle2lock()) Process leaving (rc=18446631911393138304 : -112162316413312 : ffff99fd2cf91680) 00000020:00000010:0.0:1582104656.226709:0:100699:0:(lu_ref.c:255:lu_ref_del()) slab-freed 'link': 40 at ffff99fd5e6c2b90. 00000020:00000010:0.0:1582104656.226710:0:100699:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff99fd5e6c2b90. 00010000:00000001:0.0:1582104656.226711:0:100699:0:(ldlm_lock.c:599:__ldlm_handle2lock()) Process entered 00000020:00000001:0.0:1582104656.226711:0:100699:0:(lustre_handles.c:141:class_handle2object()) Process entered 00000020:00000040:0.0:1582104656.226712:0:100699:0:(lustre_handles.c:159:class_handle2object()) GET ldlm ffff99fd2cf91680 refcount=9 00000020:00000001:0.0:1582104656.226712:0:100699:0:(lustre_handles.c:166:class_handle2object()) Process leaving (rc=18446631911393138304 : -112162316413312 : ffff99fd2cf91680) 00000020:00000010:18.0:1582104656.226768:0:100699:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff99fda6ecf320. 00010000:00000001:18.0:1582104656.226773:0:100699:0:(ldlm_lock.c:621:__ldlm_handle2lock()) Process leaving (rc=18446631911393138304 : -112162316413312 : ffff99fd2cf91680) 00010000:00000001:18.0:1582104656.226777:0:100699:0:(ldlm_lock.c:289:ldlm_lock_remove_from_lru_check()) Process entered 00010000:00000001:18.0:1582104656.226779:0:100699:0:(ldlm_lock.c:301:ldlm_lock_remove_from_lru_check()) Process leaving (rc=0 : 0 : 0) 00000020:00000010:18.0:1582104656.226781:0:100699:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff99fda6ecf230. 00000020:00000010:18.0:1582104656.226783:0:100699:0:(lu_ref.c:157:lu_ref_add_context()) slab-alloced 'link': 40 at ffff99fda6ecf618. 00010000:00010000:18.0:1582104656.226785:0:100699:0:(ldlm_lock.c:767:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PW) ns: ?? lock: ffff99fd2cf91680/0x11f7c8e35cf3bf6 lrc: 10/0,2 mode: PW/PW res: ?? rrc=?? type: ??? flags: 0x810020080000000 nid: local remote: 0x11f7c8e35cf3bfd expref: -99 pid: 100699 timeout: 0 lvb_type: 1 00000020:00000010:18.0:1582104656.226794:0:100699:0:(lu_ref.c:255:lu_ref_del()) slab-freed 'link': 40 at ffff99fd5e6c2e88. 00010000:00000001:18.0:1582104656.226796:0:100699:0:(ldlm_lock.c:209:ldlm_lock_put()) Process entered 00010000:00000001:18.0:1582104656.226797:0:100699:0:(ldlm_lock.c:249:ldlm_lock_put()) Process leaving 00000020:00040000:18.0:1582104656.226799:0:100699:0:(cl_object.c:962:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed: 00000020:00040000:18.0:1582104656.242030:0:100699:0:(cl_object.c:962:cl_env_percpu_put()) LBUG
So all of this clearly indicate that all lu_ref related work within osc_lock_upcall() context needs to be done atomically and without scheduling/sleeping, and particularly in _ldlm_handle2lock() and osc_lock_granted() where lu_add_ref_atomic() should be used instead of lu_add_ref() (calling might_sleep() and likely to cause a scheduling/cpu-switch...).
Will push a patch soon to address and tentatively fix these cases.
Landed for 2.14