[LU-4591] Related cl_lock failures on master/2.5 Created: 05/Feb/14  Updated: 12/Nov/14  Resolved: 04/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Patrick Farrell (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: MB, mn4
Environment:

Master clients on SLES11SP3; server version irrelevant (tested against 2.1, 2.4, 2.4.1, 2.5).


Attachments: File 0002-LELUS-203-clio-no-recursive-closures.patch     File cl_lock_debug_patch.diff     File lelus-203-lock-hold-v1.patch     Text File locks.log     File mmstress.tar.gz     Text File osc.ko    
Issue Links:
Related
is related to LU-4692 (osc_lock.c:1204:osc_lock_enqueue()) ... Resolved
is related to LU-4797 ASSERTION( cl_lock_is_mutexed(slice->... Resolved
is related to LU-5910 LustreError: osc_lock_enqueue() Impo... Resolved
Severity: 3
Rank (Obsolete): 12543

 Description   

We've seen a number of different cl_lock bugs in master/2.4.1/2.5 that we believe are related.

We have seen these in master as recently as two weeks ago (current master cannot run mmstress at all due to a problem in the paging code).

These bugs are not present in Intel released 2.4.0, but we've seen them in Cray 2.4.1 and 2.5 (which do not track precisely with the Intel versions of 2.4.1 and 2.5).

We've seen all of these bugs during our general-purpose testing, but we believe they're related because all of them are easily reproduced by running multiple copies (on multiple nodes) of mmstress from the Linux Test Project (mtest05 - I will attach the source), and none of them seem to be present in 2.4.0. (At least, none of them are reproduced in that context.)

Not all of the stack traces below are from runs on master (sorry - it's not what I've got handy), but all of the listed bugs have been reproduced on master:

General protection fault in osc_lock_detach (this one seems to be the most common):

> 20:00:29 Pid: 3120, comm: ldlm_bl_04 Tainted: P 3.0.82-0.7.9_1.0000.7690-cray_gem_c #1
> 20:00:30 RIP: 0010:[<ffffffffa0635586>] [<ffffffffa0635586>] osc_lock_detach+0x46/0x180 [osc]
> 20:00:30 RSP: 0000:ffff880204065cc0 EFLAGS: 00010206
> 20:00:30 RAX: 0000000000005050 RBX: 5a5a5a5a5a5a5a5a RCX: ffff880204065cc0
> 20:00:30 RDX: ffff880204065cc0 RSI: ffff880203c38b50 RDI: ffffffffa066ba00
> 20:00:30 RBP: ffff880204065cf0 R08: 0000000000000020 R09: ffffffff8136fd38
> 20:00:30 R10: 0000000000000400 R11: 0000000000000009 R12: ffff880203c38b50
> 20:00:30 R13: 0000000000000000 R14: ffff88020e33ab40 R15: ffff880201fd23e0
> 20:00:30 FS: 00002aaaaec54700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
> 20:00:30 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> 20:00:30 CR2: 00002aaaad101000 CR3: 000000020439d000 CR4: 00000000000407e0
> 20:00:30 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 20:00:30 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> 20:00:30 Process ldlm_bl_04 (pid: 3120, threadinfo ffff880204064000, task ffff88020e73d7e0)
> 20:00:30 Stack:
> 20:00:30 ffff880204065cc0 0000000000000000 ffff880203c38b50 0000000000000000
> 20:00:30 ffff88020e33ab40 ffff880201fd23e0 ffff880204065d60 ffffffffa0635a7a
> 20:00:30 ffff880201fd23e0 ffff880150f57ed0 0000000001fd23e0 ffff880150f57ed0
> 20:00:30 Call Trace:
> 20:00:30 [<ffffffffa0635a7a>] osc_lock_cancel+0xca/0x410 [osc]
> 20:00:30 [<ffffffffa02b874d>] cl_lock_cancel0+0x6d/0x150 [obdclass]
> 20:00:30 [<ffffffffa02b948b>] cl_lock_cancel+0x13b/0x140 [obdclass]
> 20:00:30 [<ffffffffa0636e6c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
> 20:00:30 [<ffffffffa03c5334>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
> 20:00:30 [<ffffffffa03c588c>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
> 20:00:30 [<ffffffff8106610e>] kthread+0x9e/0xb0
> 20:00:30 [<ffffffff81363ff4>] kernel_thread_helper+0x4/0x10
> 20:00:30 Code: f8 66 66 66 66 90 49 89 f4 49 89 ff 48 c7 c7 00 ba 66 a0 e8 0d cb d2 e0 49 8b 5c 24 28 48 85 db 74 7b 49 c7 44 24 28 00 00 00 00
> 20:00:30 c7 83 60 01 00 00 00 00 00 00 49 c7 44 24 70 00 00 00 00 fe
> 20:00:30 RIP [<ffffffffa0635586>] osc_lock_detach+0x46/0x180 [osc]
> 20:00:30 RSP <ffff880204065cc0>
> 20:00:30 --[ end trace 317fa078a5344509 ]--

(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

> 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
> 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) LBUG
> 23:39:52 Pid: 6456, comm: mmstress
> 23:39:52 Call Trace:
> 23:39:52 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
> 23:39:52 [<ffffffff81004dd9>] dump_trace+0x89/0x440
> 23:39:52 [<ffffffffa016d897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
> 23:39:52 [<ffffffffa016dde7>] lbug_with_loc+0x47/0xc0 [libcfs]
> 23:39:52 [<ffffffffa063b8ef>] osc_lock_enqueue+0x74f/0x8d0 [osc]
> 23:39:52 [<ffffffffa02c0dab>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 23:39:52 [<ffffffffa06d13dd>] lov_lock_enqueue+0x1fd/0x880 [lov]
> 23:39:52 [<ffffffffa02c0dab>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 23:39:52 [<ffffffffa02c1c8f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
> 23:39:52 [<ffffffffa02c287e>] cl_lock_request+0x7e/0x270 [obdclass]
> 23:39:52 [<ffffffffa02c7db4>] cl_io_lock+0x394/0x5c0 [obdclass]
> 23:39:52 [<ffffffffa02c807a>] cl_io_loop+0x9a/0x1a0 [obdclass]
> 23:39:52 [<ffffffffa078bf78>] ll_fault+0x308/0x4e0 [lustre]
> 23:39:52 [<ffffffff81111ce6>] __do_fault+0x76/0x570
> 23:39:52 [<ffffffff81112284>] handle_pte_fault+0xa4/0xcc0
> 23:39:52 [<ffffffff8111304e>] handle_mm_fault+0x1ae/0x240
> 23:39:52 [<ffffffff81026c6f>] do_page_fault+0x18f/0x420
> 23:39:52 [<ffffffff813628cf>] page_fault+0x1f/0x30
> 23:39:52 [<00000000200007ea>] 0x200007ea
> 23:39:52 Kernel panic - not syncing: LBUG

lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:

2013-12-06T21:36:44.086298-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:
2013-12-06T21:36:44.111468-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) LBUG
2013-12-06T21:36:44.111497-06:00 c0-0c0s11n2 Pid: 9051, comm: mmstress
2013-12-06T21:36:44.111536-06:00 c0-0c0s11n2 Call Trace:
2013-12-06T21:36:44.136717-06:00 c0-0c0s11n2 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0
2013-12-06T21:36:44.136757-06:00 c0-0c0s11n2 [<ffffffff81004849>] dump_trace+0x89/0x450
2013-12-06T21:36:44.136787-06:00 c0-0c0s11n2 [<ffffffffa02128d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2013-12-06T21:36:44.169446-06:00 c0-0c0s11n2 [<ffffffffa0212e37>] lbug_with_loc+0x47/0xc0 [libcfs]
2013-12-06T21:36:44.169474-06:00 c0-0c0s11n2 [<ffffffffa0778f7a>] lov_lock_link_find+0x16a/0x170 [lov]
2013-12-06T21:36:44.169502-06:00 c0-0c0s11n2 [<ffffffffa07799cf>] lov_sublock_adopt+0x8f/0x370 [lov]
2013-12-06T21:36:44.194648-06:00 c0-0c0s11n2 [<ffffffffa077cba7>] lov_lock_init_raid0+0x637/0xd50 [lov]
2013-12-06T21:36:44.194687-06:00 c0-0c0s11n2 [<ffffffffa0773c0e>] lov_lock_init+0x1e/0x60 [lov]
2013-12-06T21:36:44.194704-06:00 c0-0c0s11n2 [<ffffffffa036961a>] cl_lock_hold_mutex+0x32a/0x640 [obdclass]
2013-12-06T21:36:44.219871-06:00 c0-0c0s11n2 [<ffffffffa0369ab2>] cl_lock_request+0x62/0x270 [obdclass]
2013-12-06T21:36:44.219901-06:00 c0-0c0s11n2 [<ffffffffa036f0ce>] cl_io_lock+0x39e/0x5d0 [obdclass]
2013-12-06T21:36:44.245175-06:00 c0-0c0s11n2 [<ffffffffa036f3a2>] cl_io_loop+0xa2/0x1b0 [obdclass]
2013-12-06T21:36:44.245216-06:00 c0-0c0s11n2 [<ffffffffa0838760>] ll_page_mkwrite+0x280/0x680 [lustre]
2013-12-06T21:36:44.245228-06:00 c0-0c0s11n2 [<ffffffff81118019>] __do_fault+0xf9/0x5b0
2013-12-06T21:36:44.270380-06:00 c0-0c0s11n2 [<ffffffff81118574>] handle_pte_fault+0xa4/0xcd0
2013-12-06T21:36:44.270420-06:00 c0-0c0s11n2 [<ffffffff8111934e>] handle_mm_fault+0x1ae/0x240
2013-12-06T21:36:44.270431-06:00 c0-0c0s11n2 [<ffffffff8138a915>] do_page_fault+0x1e5/0x4a0
2013-12-06T21:36:44.295606-06:00 c0-0c0s11n2 [<ffffffff8138778f>] page_fault+0x1f/0x30
2013-12-06T21:36:44.295648-06:00 c0-0c0s11n2 [<00000000200007fa>] 0x200007fa

General protection fault in cl_lock_put (possibly the same issue as LU-4558?):

> 19:55:11 general protection fault: 0000 [#1] SMP
> 19:55:11 CPU 19
> 19:55:11 Modules linked in: mic xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) bpmcdmod nic_compat cmsr lmv mgc lustre lov osc mdc fid fld kgnilnd ptlrpc obdclass lnet lvfs sha1_generic md5 libcfs ib_core pcie_link_bw_monitor kdreg gpcd_ari ipogif_ari kgni_ari hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_ari craytrace
> 19:55:11 Pid: 8720, comm: ldlm_bl_02 Tainted: P 3.0.80-0.5.1_1.0501.7664-cray_ari_c #1 Cray Inc. Cascade/Cascade
> 19:55:11 RIP: 0010:[<ffffffffa036b553>] [<ffffffffa036b553>] cl_lock_put+0x103/0x410 [obdclass]
> 19:55:11 RSP: 0018:ffff8807ef0fdc70 EFLAGS: 00010246
> 19:55:11 RAX: 0000000000000001 RBX: 5a5a5a5a5a5a5a5a RCX: ffff8807f30a7af8
> 19:55:11 RDX: ffffffffa038da5b RSI: 5a5a5a5a5a5a5a5a RDI: ffff8807f1d169c8
> 19:55:11 RBP: ffff8807ef0fdc90 R08: ffffffffa037ebc0 R09: 00000000000002f8
> 19:55:11 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> 19:55:11 R13: ffff8807f1d169c8 R14: ffff8807f30a7af8 R15: 0000000000000001
> 19:55:11 FS: 00007ffff7ff6700(0000) GS:ffff88087f6c0000(0000) knlGS:0000000000000000
> 19:55:11 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> 19:55:11 CR2: 00005555557af328 CR3: 0000000001661000 CR4: 00000000001406e0
> 19:55:11 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 19:55:11 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> 19:55:11 Process ldlm_bl_02 (pid: 8720, threadinfo ffff8807ef0fc000, task ffff88083c2ae0c0)
> 19:55:11 Stack:
> 19:55:11 ffff88083bf7e840 0000000000000000 ffff8807f1d169c8 ffff8807f30a7af8
> 19:55:11 ffff8807ef0fdcf0 ffffffffa07194ca ffff88083bf7e840 ffff88050caec000
> 19:55:11 0000000000000001 0000000000000000 ffff8807ef0fdcf0 ffff88050caec000
> 19:55:11 Call Trace:
> 19:55:11 [<ffffffffa07194ca>] osc_ldlm_blocking_ast+0xaa/0x330 [osc]
> 19:55:11 [<ffffffffa0460aeb>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
> 19:55:11 [<ffffffffa046f41a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
> 19:55:11 [<ffffffffa0473ec0>] ldlm_cli_cancel+0x60/0x370 [ptlrpc]
> 19:55:11 [<ffffffffa071819e>] osc_lock_cancel+0xfe/0x1c0 [osc]
> 19:55:11 [<ffffffffa0368145>] cl_lock_cancel0+0x75/0x160 [obdclass]
> 19:55:11 [<ffffffffa0368e8b>] cl_lock_cancel+0x13b/0x140 [obdclass]
> 19:55:11 [<ffffffffa071962c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
> 19:55:11 [<ffffffffa04777d4>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
> 19:55:11 [<ffffffffa0477d5c>] ldlm_bl_thread_main+0x22c/0x450 [ptlrpc]
> 19:55:11 [<ffffffff8138e894>] kernel_thread_helper+0x4/0x10
> 19:55:11 Code: 00 00 00 c7 05 23 e3 06 00 01 00 00 00 4c 8b 45 08 8b 13 48 89 d9 48 c7 c6 60 82 38 a0 48 c7 c7 40 98 3d a0 31 c0 e8 5d 6a eb ff <f0> ff 0b 0f 94 c0 84 c0 74 06 83 7b 50 06 74 7d f6 05 5a b4 ed
> 19:55:11 RIP [<ffffffffa036b553>] cl_lock_put+0x103/0x410 [obdclass]
> 19:55:11 RSP <ffff8807ef0fdc70>

lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:

> 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
> 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
> 04:59:03 Pid: 19083, comm: mmstress
> 04:59:03 Call Trace:
> 04:59:03 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
> 04:59:03 [<ffffffff81004dd9>] dump_trace+0x89/0x440
> 04:59:03 [<ffffffffa0168897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
> 04:59:03 [<ffffffffa0168de7>] lbug_with_loc+0x47/0xc0 [libcfs]
> 04:59:03 [<ffffffffa06d458e>] lovsub_lock_state+0x19e/0x1a0 [lov]
> 04:59:03 [<ffffffffa02b7b80>] cl_lock_state_signal+0x60/0x160 [obdclass]
> 04:59:03 [<ffffffffa02b7d4d>] cl_lock_state_set+0xcd/0x1a0 [obdclass]
> 04:59:03 [<ffffffffa02bbe0b>] cl_enqueue_try+0x14b/0x320 [obdclass]
> 04:59:03 [<ffffffffa06cc3dd>] lov_lock_enqueue+0x1fd/0x880 [lov]
> 04:59:03 [<ffffffffa02bbdbb>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 04:59:03 [<ffffffffa02bcc9f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
> 04:59:03 [<ffffffffa02bd88e>] cl_lock_request+0x7e/0x270 [obdclass]
> 04:59:03 [<ffffffffa02c2dc4>] cl_io_lock+0x394/0x5c0 [obdclass]
> 04:59:03 [<ffffffffa02c308a>] cl_io_loop+0x9a/0x1a0 [obdclass]
> 04:59:03 [<ffffffffa0786ee8>] ll_fault+0x308/0x4e0 [lustre]
> 04:59:03 [<ffffffff81111d16>] __do_fault+0x76/0x570
> 04:59:03 [<ffffffff811122b4>] handle_pte_fault+0xa4/0xcc0
> 04:59:03 [<ffffffff8136290f>] page_fault+0x1f/0x30
> 04:59:03 [<00000000200007ea>] 0x200007ea

General protection fault in cl_lock_delete:

> 22:37:45 Pid: 6333, comm: ldlm_bl_13 Tainted: P 3.0.82-0.7.9_1.0502.7742-cray_gem_c #1
> 22:37:45 RIP: 0010:[<ffffffffa02b8eb0>] [<ffffffffa02b8eb0>] cl_lock_delete0+0x190/0x1f0 [obdclass]
> 22:37:45 RSP: 0018:ffff8803d0403bc0 EFLAGS: 00010296
> 22:37:45 RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a42 RCX: ffff8803d7509ca8
> 22:37:45 RDX: ffff8803d54b6800 RSI: ffff8803fbfdceb8 RDI: ffff8803d54b67c0
> 22:37:45 RBP: ffff8803d0403be0 R08: ffff8803d55a8858 R09: 0000000000000000
> 22:37:45 R10: 0000000000000023 R11: 0000000000000000 R12: ffff8803d55a8818
> 22:37:45 R13: ffff8803d55a8810 R14: ffff880803ddcdb8 R15: 0000000000000001
> 22:37:45 FS: 000000004013d880(0000) GS:ffff88081fcc0000(0000) knlGS:0000000000000000
> 22:37:46 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> 22:37:46 CR2: 00002aaaaab38000 CR3: 0000000803db3000 CR4: 00000000000407e0
> 22:37:46 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 22:37:46 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> 22:37:46 Process ldlm_bl_13 (pid: 6333, threadinfo ffff8803d0402000, task ffff880403064040)
> 22:37:46 Stack:
> 22:37:46 ffff8803d55a8810 ffff880803ddcdb8 ffff880803ddcdb8 ffff8803d55a8810
> 22:37:46 ffff8803d0403c00 ffffffffa02b9063 ffff8801e05fbb50 0000000000000000
> 22:37:46 ffff8803d0403c60 ffffffffa0636e7a ffff8801e05fbb50 ffff8803d54b67c0
> 22:37:46 Call Trace:
> 22:37:46 [<ffffffffa02b9063>] cl_lock_delete+0x153/0x1a0 [obdclass]
> 22:37:46 [<ffffffffa0636e7a>] osc_ldlm_blocking_ast+0x21a/0x330 [osc]
> 22:37:46 [<ffffffffa03adabb>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
> 22:37:46 [<ffffffffa03bc7ca>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
> 22:37:46 [<ffffffffa03c11bb>] ldlm_cli_cancel+0x6b/0x380 [ptlrpc]
> 22:37:46 [<ffffffffa0635c47>] osc_lock_cancel+0x297/0x410 [osc]
> 22:37:46 [<ffffffffa02b770d>] cl_lock_cancel0+0x6d/0x150 [obdclass]
> 22:37:46 [<ffffffffa02b844b>] cl_lock_cancel+0x13b/0x140 [obdclass]
> 22:37:46 [<ffffffffa0636e6c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
> 22:37:46 [<ffffffffa03c4944>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
> 22:37:46 [<ffffffffa03c4e9c>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
> 22:37:46 [<ffffffff8106610e>] kthread+0x9e/0xb0
> 22:37:46 [<ffffffff81364034>] kernel_thread_helper+0x4/0x10
> 22:37:46 Code: 08 48 89 10 49 89 4d 18 49 89 4d 20 41 fe 44 24 5c 49 8b 45 10 4d 8d 65 08 49 39 c4 48 8d 58 e8 0f 84 be fe ff ff 0f 1f 44 00 00
> 22:37:46 8b 43 10 48 8b 40 50 48 85 c0 74 08 48 89 de 4c 89 f7 ff d0
> 22:37:46 RIP [<ffffffffa02b8eb0>] cl_lock_delete0+0x190/0x1f0 [obdclass]
> 22:37:46 RSP <ffff8803d0403bc0>
> 22:37:46 --[ end trace cfc07b184a378ec7 ]--

One more, which is slightly different, but still caused by the same tests, not found in 2.4.0, etc.

Lustre is getting stuck looping in cl_locks_prune. We have many cases of applications failing to exit, with processes stuck somewhere under cl_locks_prune. Here are two examples:
> 05:09:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
> 05:09:21 [<ffffffff81004919>] dump_trace+0x89/0x450
> 05:09:21 [<ffffffff8100591c>] show_trace_log_lvl+0x5c/0x80
> 05:09:21 [<ffffffff81005955>] show_trace+0x15/0x20
> 05:09:21 [<ffffffff813bd54b>] dump_stack+0x79/0x84
> 05:09:21 [<ffffffff810c9bf9>] __rcu_pending+0x199/0x410
> 05:09:21 [<ffffffff810c9eda>] rcu_check_callbacks+0x6a/0x120
> 05:09:21 [<ffffffff8105cb96>] update_process_times+0x46/0x90
> 05:09:21 [<ffffffff81080a16>] tick_sched_timer+0x66/0xc0
> 05:09:21 [<ffffffff8107386f>] __run_hrtimer+0xcf/0x1d0
> 05:09:21 [<ffffffff81073bb7>] hrtimer_interrupt+0xe7/0x220
> 05:09:21 [<ffffffff813c9dd9>] smp_apic_timer_interrupt+0x69/0x99
> 05:09:21 [<ffffffff813c8d13>] apic_timer_interrupt+0x13/0x20
> 05:09:21 [<ffffffffa0373a79>] cl_env_info+0x9/0x20 [obdclass]
> 05:09:21 [<ffffffffa037dfc6>] cl_lock_mutex_get+0x36/0xd0 [obdclass]
> 05:09:21 [<ffffffffa0380319>] cl_locks_prune+0xd9/0x330 [obdclass]
> 05:09:21 [<ffffffffa078dc91>] lov_delete_raid0+0xe1/0x3f0 [lov]
> 05:09:21 [<ffffffffa078d059>] lov_object_delete+0x69/0x180 [lov]
> 05:09:21 [<ffffffffa036bda5>] lu_object_free+0x85/0x1a0 [obdclass]
> 05:09:21 [<ffffffffa036c2de>] lu_object_put+0xbe/0x370 [obdclass]
> 05:09:21 [<ffffffffa0374d1e>] cl_object_put+0xe/0x10 [obdclass]
> 05:09:21 [<ffffffffa086803d>] cl_inode_fini+0xbd/0x2a0 [lustre]
> 05:09:21 [<ffffffffa082f160>] ll_clear_inode+0x2d0/0x930 [lustre]
> 05:09:21 [<ffffffffa082f84d>] ll_delete_inode+0x8d/0x1f0 [lustre]
> 05:09:21 [<ffffffff811662f1>] evict+0x91/0x170
> 05:09:21 [<ffffffff81166762>] iput+0xc2/0x180
> 05:09:21 [<ffffffffa07ffaf6>] ll_d_iput+0x2e6/0x830 [lustre]
> 05:09:21 [<ffffffff8116106b>] d_kill+0xcb/0x130
> 05:09:21 [<ffffffff81164401>] dput+0xc1/0x190
> 05:09:21 [<ffffffff8114ec27>] fput+0x167/0x200
> 05:09:21 [<ffffffff8115475f>] do_execve_common+0x16f/0x300
> 05:09:21 [<ffffffff8115497f>] do_execve+0x3f/0x50
> 05:09:21 [<ffffffff8100a89e>] sys_execve+0x4e/0x80
> 05:09:21 [<ffffffff813c87bc>] stub_execve+0x6c/0xc0
> 05:09:21 [<00002aaaabe582c7>] 0x2aaaabe582c6

> 05:18:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
> 05:18:21 [<ffffffff81004919>] dump_trace+0x89/0x450
> 05:18:21 [<ffffffff8100591c>] show_trace_log_lvl+0x5c/0x80
> 05:18:22 [<ffffffff81005955>] show_trace+0x15/0x20
> 05:18:22 [<ffffffff813bd54b>] dump_stack+0x79/0x84
> 05:18:22 [<ffffffff810c9bf9>] __rcu_pending+0x199/0x410
> 05:18:22 [<ffffffff810c9eda>] rcu_check_callbacks+0x6a/0x120
> 05:18:22 [<ffffffff8105cb96>] update_process_times+0x46/0x90
> 05:18:22 [<ffffffff81080a16>] tick_sched_timer+0x66/0xc0
> 05:18:22 [<ffffffff8107386f>] __run_hrtimer+0xcf/0x1d0
> 05:18:22 [<ffffffff81073bb7>] hrtimer_interrupt+0xe7/0x220
> 05:18:22 [<ffffffff813c9dd9>] smp_apic_timer_interrupt+0x69/0x99
> 05:18:22 [<ffffffff813c8d13>] apic_timer_interrupt+0x13/0x20
> 05:18:22 [<ffffffffa037e419>] cl_lock_get_trust+0x89/0x90 [obdclass]
> 05:18:22 [<ffffffffa0380307>] cl_locks_prune+0xc7/0x330 [obdclass]
> 05:18:22 [<ffffffffa078dc91>] lov_delete_raid0+0xe1/0x3f0 [lov]
> 05:18:22 [<ffffffffa078d059>] lov_object_delete+0x69/0x180 [lov]
> 05:18:22 [<ffffffffa036bda5>] lu_object_free+0x85/0x1a0 [obdclass]
> 05:18:22 [<ffffffffa036c2de>] lu_object_put+0xbe/0x370 [obdclass]
> 05:18:22 [<ffffffffa0374d1e>] cl_object_put+0xe/0x10 [obdclass]
> 05:18:22 [<ffffffffa086803d>] cl_inode_fini+0xbd/0x2a0 [lustre]
> 05:18:22 [<ffffffffa082f160>] ll_clear_inode+0x2d0/0x930 [lustre]
> 05:18:22 [<ffffffffa082f84d>] ll_delete_inode+0x8d/0x1f0 [lustre]
> 05:18:22 [<ffffffff811662f1>] evict+0x91/0x170
> 05:18:22 [<ffffffff81166762>] iput+0xc2/0x180
> 05:18:22 [<ffffffffa07ffaf6>] ll_d_iput+0x2e6/0x830 [lustre]
> 05:18:22 [<ffffffff8116106b>] d_kill+0xcb/0x130
> 05:18:22 [<ffffffff81164401>] dput+0xc1/0x190
> 05:18:22 [<ffffffff8114ec27>] fput+0x167/0x200
> 05:18:22 [<ffffffff8115475f>] do_execve_common+0x16f/0x300
> 05:18:22 [<ffffffff8115497f>] do_execve+0x3f/0x50
> 05:18:22 [<ffffffff8100a89e>] sys_execve+0x4e/0x80
> 05:18:22 [<ffffffff813c87bc>] stub_execve+0x6c/0xc0
> 05:18:22 [<00002aaaabe582c7>] 0x2aaaabe582c6

Sorry for the massive dump of information in one bug, but we strongly suspect these bugs have a single cause or several tightly related causes.

With assistance from Xyratex, we've singled out the following patches, landed between 2.4.0 and master, as possible culprits:

most suspicious:
18834a5 LU-3321 clio: remove stackable cl_page completely
0a259bd LU-3321 clio: collapse layer of cl_page
less suspicious:
13079de LU-3889 osc: Allow lock to be canceled at ENQ time
7168ea8 LU-3027 clio: Do not shrink sublock at cancel
521335c LU-3433 clio: wrong cl_lock usage

On my to-do list is testing master with some of these patches removed to see what effect, if any, this has on the bugs listed above.



 Comments   
Comment by Patrick Farrell (Inactive) [ 05/Feb/14 ]

mmstress

Comment by Jinshan Xiong (Inactive) [ 05/Feb/14 ]

Thanks for the work. I'd suggest you try removing these patches:

13079de LU-3889 osc: Allow lock to be canceled at ENQ time
7168ea8 LU-3027 clio: Do not shrink sublock at cancel
521335c LU-3433 clio: wrong cl_lock usage

and see what happens.

Yes, I agree with you that this really looks like a single cause. Will you please gather some logs so that our people can take a look?

Comment by Jinshan Xiong (Inactive) [ 05/Feb/14 ]

Can you please tell me how you ran the test program?

Comment by Patrick Farrell (Inactive) [ 05/Feb/14 ]

Jinshan,

Thanks for the quick response.

About logs: Unfortunately, these problems don't happen with dlmtrace (or any of the other large debug flags, such as trace or rpctrace) enabled. I created a debug patch that puts all calls to cl_lock_trace under a dedicated debug flag, and I was able to hit the problem with only that flag enabled.

I should be able to get those logs for you tomorrow morning. (Sorry I don't have them on hand, I had to clean out my old dumps/logs.)

Just a heads up, Vitaly Fertman of Xyratex has been looking into this with us.

About mmstress - It's executed with no arguments, but we start multiple copies with our workload manager, e.g.:
aprun -n 100 ./mmstress
would run it on 100 cores, which is enough that we see the problems pretty quickly (with debug at default). That core count is allocated by placing NUM_CPUS jobs on each node, so with 8-core nodes and 100 jobs we'd get 12 nodes with 8 jobs each and one node with 4 jobs. (A sketch of the kind of load this generates follows below.)
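For flavor, here is a minimal, hypothetical sketch (plain C; not the LTP mtest05 source, which is attached as mmstress.tar.gz) of the kind of load this generates: a shared file mapping driven through read and write faults. Reads take the ll_fault() -> cl_io_loop() path and first writes to clean pages take ll_page_mkwrite(), exactly the paths in the stack traces above; with many copies mapping the same file from many nodes, the servers fire blocking ASTs at the other clients, which is where the ldlm_bl threads in the GPF traces come in.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAPLEN (64UL * 1024 * 1024)

int main(void)
{
        /* all copies run in the same Lustre directory, so they all
         * contend on the same file and its DLM extent locks */
        int fd = open("mmstress.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, MAPLEN) < 0) {
                perror("setup");
                return 1;
        }
        for (int iter = 0; iter < 1000; iter++) {
                char *p = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                for (size_t off = 0; off < MAPLEN; off += 4096) {
                        char c = p[off]; /* read fault -> ll_fault() */
                        p[off] = c + 1;  /* write fault -> ll_page_mkwrite() */
                }
                munmap(p, MAPLEN);
        }
        close(fd);
        return 0;
}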

Comment by Patrick Farrell (Inactive) [ 05/Feb/14 ]

Also, I'll (hopefully, system problems may interfere) be testing removing those patches tomorrow as well.

Comment by Alexander Boyko [ 06/Feb/14 ]

I have added info from crash for one case.

Comment by Patrick Farrell (Inactive) [ 06/Feb/14 ]

[Edit] Sorry, I forgot a bit of background info.

There are actually two LU-3027 patches, rather than one (and there is also the earlier patch which was removed). I decided to look at them both.

Here's the list of patches I explored today:
13079de LU-3889 osc: Allow lock to be canceled at ENQ time (http://review.whamcloud.com/#/c/8405/)
7168ea8 LU-3027 clio: Do not shrink sublock at cancel (http://review.whamcloud.com/#/c/7569/)
521335c LU-3433 clio: wrong cl_lock usage (http://review.whamcloud.com/#/c/6709/)
I1ea629 LU-3027 lov: to not modify lov lock when sublock is canceled (http://review.whamcloud.com/#/c/7841/)

I think the problem is a bad interaction between:
I1ea629 LU-3027 lov: to not modify lov lock when sublock is canceled (http://review.whamcloud.com/#/c/7841/)
and LU-3433/LU-3889.

I started by removing all four of those patches I noted above from Cray 2.5, and confirmed there's no problem.

I tested the two LU-3027 patches together, no problem.

I tested LU-3433 and LU-3889 together, no problem.

I tested LU-3433 and both LU-3027 patches; I saw LELUS-203 and several of the related bugs.

I tested LU-3889 and both LU-3027 patches, and again I saw LELUS-203 and several of the related bugs.

I tested LU-3889, LU-3433, and 7168ea8 LU-3027 clio: Do not shrink sublock at cancel (http://review.whamcloud.com/#/c/7569/), no problem.

I tested LU-3889, LU-3433, and I1ea629 LU-3027 lov: to not modify lov lock when sublock is canceled (http://review.whamcloud.com/#/c/7841/), and I hit (in 10 minutes or so - my other runs here were 30 minutes each):
GPF in osc_lock_detach
GPF in cl_lock_put
GPF in cl_lock_delete
osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
cl_locks_prune (no exit)

I'm going to test the two LU-3027 patches together again, as the lack of failures with the lov patch in place surprises me.
I'm also going to do some general stress testing with the LU-3027 lov patch removed and the other three patches in place.

Comment by Patrick Farrell (Inactive) [ 06/Feb/14 ]

With further testing of the two LU-3027 patches together, I hit the GPF in osc_lock_detach.

Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ]

Hi Patrick,

Thank you for the information, I will take a look. Just to confirm, you still can't reproduce this problem with dlmtrace enabled, is that right?

Comment by Patrick Farrell (Inactive) [ 07/Feb/14 ]

Jinshan - Correct.

The logs Alex Boyko provided are from a special debug patch with the calls to cl_lock_trace moved to their own debug flags (two different ones, in fact). Both of those flags were on, but nothing else, so all calls to cl_lock_trace should be logged there.

Also, in further stress testing (notice it is mmstress again) with the LU-3027 lov patch removed, I did hit one of the assertions we've got as tentatively related - but so far, only this one, and after much longer testing than I did above (when I hit several bugs with that patch installed):
2014-02-06T20:33:06.582316-06:00 c0-0c2s0n0 LustreError: 4302:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
2014-02-06T20:33:06.582369-06:00 c0-0c2s0n0 LustreError: 4302:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
2014-02-06T20:33:06.582377-06:00 c0-0c2s0n0 Pid: 4302, comm: mmstress
2014-02-06T20:33:06.582383-06:00 c0-0c2s0n0 Call Trace:
2014-02-06T20:33:06.582398-06:00 c0-0c2s0n0 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
2014-02-06T20:33:06.582414-06:00 c0-0c2s0n0 [<ffffffff81004dd9>] dump_trace+0x89/0x440
2014-02-06T20:33:06.611611-06:00 c0-0c2s0n0 [<ffffffffa016b897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2014-02-06T20:33:06.611627-06:00 c0-0c2s0n0 [<ffffffffa016bde7>] lbug_with_loc+0x47/0xc0 [libcfs]
2014-02-06T20:33:06.611636-06:00 c0-0c2s0n0 [<ffffffffa06dc26e>] lovsub_lock_state+0x19e/0x1a0 [lov]
2014-02-06T20:33:06.611649-06:00 c0-0c2s0n0 [<ffffffffa02bbb80>] cl_lock_state_signal+0x60/0x160 [obdclass]
2014-02-06T20:33:06.611674-06:00 c0-0c2s0n0 [<ffffffffa02bbd4d>] cl_lock_state_set+0xcd/0x1a0 [obdclass]
2014-02-06T20:33:06.611685-06:00 c0-0c2s0n0 [<ffffffffa02bfe0b>] cl_enqueue_try+0x14b/0x320 [obdclass]
2014-02-06T20:33:06.641543-06:00 c0-0c2s0n0 [<ffffffffa06d46cb>] lov_lock_enqueue+0x1fb/0xf80 [lov]
2014-02-06T20:33:06.641565-06:00 c0-0c2s0n0 [<ffffffffa02bfdbb>] cl_enqueue_try+0xfb/0x320 [obdclass]
2014-02-06T20:33:06.641580-06:00 c0-0c2s0n0 [<ffffffffa02c0c9f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
2014-02-06T20:33:06.641587-06:00 c0-0c2s0n0 [<ffffffffa02c188e>] cl_lock_request+0x7e/0x270 [obdclass]
2014-02-06T20:33:06.641598-06:00 c0-0c2s0n0 [<ffffffffa02c6dc4>] cl_io_lock+0x394/0x5c0 [obdclass]
2014-02-06T20:33:06.641605-06:00 c0-0c2s0n0 [<ffffffffa02c708a>] cl_io_loop+0x9a/0x1a0 [obdclass]
2014-02-06T20:33:06.670989-06:00 c0-0c2s0n0 [<ffffffffa078f350>] ll_page_mkwrite+0x280/0x680 [lustre]
2014-02-06T20:33:06.671014-06:00 c0-0c2s0n0 [<ffffffff81112267>] __do_fault+0xe7/0x570
2014-02-06T20:33:06.671030-06:00 c0-0c2s0n0 [<ffffffff81112794>] handle_pte_fault+0xa4/0xcc0
2014-02-06T20:33:06.671049-06:00 c0-0c2s0n0 [<ffffffff8111355e>] handle_mm_fault+0x1ae/0x240
2014-02-06T20:33:06.671069-06:00 c0-0c2s0n0 [<ffffffff81026caf>] do_page_fault+0x18f/0x420
2014-02-06T20:33:06.671080-06:00 c0-0c2s0n0 [<ffffffff81367acf>] page_fault+0x1f/0x30
2014-02-06T20:33:06.671087-06:00 c0-0c2s0n0 [<00000000200007ea>] 0x200007ea
2014-02-06T20:33:06.701413-06:00 c0-0c2s0n0 Kernel panic - not syncing: LBUG
2014-02-06T20:33:06.701431-06:00 c0-0c2s0n0 Pid: 4302, comm: mmstress Tainted: P 3.0.93-0.8.2_1.0000.7755-cray_gem_c #1
2014-02-06T20:33:06.701445-06:00 c0-0c2s0n0 Call Trace:
2014-02-06T20:33:06.701453-06:00 c0-0c2s0n0 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
2014-02-06T20:33:06.701463-06:00 c0-0c2s0n0 [<ffffffff81004dd9>] dump_trace+0x89/0x440
2014-02-06T20:33:06.701481-06:00 c0-0c2s0n0 [<ffffffff8100601c>] show_trace_log_lvl+0x5c/0x80
2014-02-06T20:33:06.701494-06:00 c0-0c2s0n0 [<ffffffff81006055>] show_trace+0x15/0x20
2014-02-06T20:33:06.733566-06:00 c0-0c2s0n0 [<ffffffff81364432>] dump_stack+0x79/0x84
2014-02-06T20:33:06.733584-06:00 c0-0c2s0n0 [<ffffffff813644d1>] panic+0x94/0x1da
2014-02-06T20:33:06.733592-06:00 c0-0c2s0n0 [<ffffffffa016be4b>] lbug_with_loc+0xab/0xc0 [libcfs]
2014-02-06T20:33:06.733605-06:00 c0-0c2s0n0 [<ffffffffa06dc26e>] lovsub_lock_state+0x19e/0x1a0 [lov]
2014-02-06T20:33:06.733642-06:00 c0-0c2s0n0 [<ffffffffa02bbb80>] cl_lock_state_signal+0x60/0x160 [obdclass]
2014-02-06T20:33:06.733655-06:00 c0-0c2s0n0 [<ffffffffa02bbd4d>] cl_lock_state_set+0xcd/0x1a0 [obdclass]
2014-02-06T20:33:06.733663-06:00 c0-0c2s0n0 [<ffffffffa02bfe0b>] cl_enqueue_try+0x14b/0x320 [obdclass]
2014-02-06T20:33:06.761443-06:00 c0-0c2s0n0 [<ffffffffa06d46cb>] lov_lock_enqueue+0x1fb/0xf80 [lov]
2014-02-06T20:33:06.787133-06:00 c0-0c2s0n0 [<ffffffffa02bfdbb>] cl_enqueue_try+0xfb/0x320 [obdclass]
2014-02-06T20:33:06.787151-06:00 c0-0c2s0n0 [<ffffffffa02c0c9f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
2014-02-06T20:33:06.787181-06:00 c0-0c2s0n0 [<ffffffffa02c188e>] cl_lock_request+0x7e/0x270 [obdclass]
2014-02-06T20:33:06.787197-06:00 c0-0c2s0n0 [<ffffffffa02c6dc4>] cl_io_lock+0x394/0x5c0 [obdclass]
2014-02-06T20:33:06.812740-06:00 c0-0c2s0n0 [<ffffffffa02c708a>] cl_io_loop+0x9a/0x1a0 [obdclass]
2014-02-06T20:33:06.812757-06:00 c0-0c2s0n0 [<ffffffffa078f350>] ll_page_mkwrite+0x280/0x680 [lustre]
2014-02-06T20:33:06.812786-06:00 c0-0c2s0n0 [<ffffffff81112267>] __do_fault+0xe7/0x570

Comment by Patrick Farrell (Inactive) [ 07/Feb/14 ]

Jinshan - Here's something that came up in a Cray discussion of the history of LU-3889 and LU-3027.

When we opened LU-3889, Oleg pointed us at the first of the two current patches for LU-3027 (ignoring the one which was removed for causing a regression). That fixed our original reproducer that I opened LU-3889 with.

Then Oleg reported Intel was still seeing the assertion from LU-3889, and noted there was a second patch for LU-3027. We (and you) picked that one up, but then both Cray and Intel continued to see the LU-3889 assertion.

Then we found racer.sh could reproduce the problem, and you and several of the Xyratex guys worked out a patch for it, which was labeled with LU-3889. After getting that patch, we have not seen the assertion from LU-3889 again.

So I don't think we have any hard evidence that the second LU-3027 patch fixed any of the bugs we were looking at at the time.

Comment by Patrick Farrell (Inactive) [ 07/Feb/14 ]

One further thought...
As these bugs don't usually happen with any of the tracing options turned on, they're clearly timing sensitive. That raises the possibility that reverting the second LU-3027 patch just changes the timing, rather than the problem being in the patch itself. Just something else to chew on.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

working on this ...

Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ]

Jinshan - I know from experience these bugs can be hard to replicate without a larger system. If you've got something you'd like tested (including a debug patch), I can make time to test it on one of our in house systems here where we can replicate the problems.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

Patrick - will you please attach osc.ko here?

Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ]

Jinshan - Can you be more specific? Do you mean the osc.ko that goes with the logs Alex Boyko provided? If so, he'll have to provide it, unfortunately.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

I just want to know the source line at osc_lock_detach+0x46, so I will know which freed data it was trying to access.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

Most likely it implies the dlmlock was already freed; I just want to make sure.

Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ]

Ah, OK.

I'll attach another osc.ko in a moment where we get the crash at the same line...

In case it's enough, here's a disassembly of osc_lock_detach and a line number from the .ko I'm going to attach:
crash> gdb list *(osc_lock_detach+46)
0xffffffffa063e56e is in osc_lock_detach (/usr/src/linux-3.0.93-0.8.2_1.0000.7747/include/linux/spinlock.h:285).
280 /usr/src/linux-3.0.93-0.8.2_1.0000.7747/include/linux/spinlock.h: No such file or directory.
in /usr/src/linux-3.0.93-0.8.2_1.0000.7747/include/linux/spinlock.h

crash> disassemble osc_lock_detach
Dump of assembler code for function osc_lock_detach:
0xffffffffa063e540 <+0>: push %rbp
0xffffffffa063e541 <+1>: mov %rsp,%rbp
0xffffffffa063e544 <+4>: sub $0x30,%rsp
0xffffffffa063e548 <+8>: mov %rbx,-0x28(%rbp)
0xffffffffa063e54c <+12>: mov %r12,-0x20(%rbp)
0xffffffffa063e550 <+16>: mov %r13,-0x18(%rbp)
0xffffffffa063e554 <+20>: mov %r14,-0x10(%rbp)
0xffffffffa063e558 <+24>: mov %r15,-0x8(%rbp)
0xffffffffa063e55c <+28>: data32 data32 data32 xchg %ax,%ax
0xffffffffa063e561 <+33>: mov %rsi,%r12
0xffffffffa063e564 <+36>: mov %rdi,%r15
0xffffffffa063e567 <+39>: mov $0xffffffffa0674e00,%rdi
0xffffffffa063e56e <+46>: callq 0xffffffff81367280 <_raw_spin_lock>
0xffffffffa063e573 <+51>: mov 0x28(%r12),%rbx
0xffffffffa063e578 <+56>: test %rbx,%rbx
0xffffffffa063e57b <+59>: je 0xffffffffa063e5f8 <osc_lock_detach+184>
0xffffffffa063e57d <+61>: movq $0x0,0x28(%r12)
0xffffffffa063e586 <+70>: movq $0x0,0x160(%rbx)
0xffffffffa063e591 <+81>: movq $0x0,0x70(%r12)
0xffffffffa063e59a <+90>: incb 0x36860(%rip) # 0xffffffffa0674e00 <osc_ast_guard>
0xffffffffa063e5a0 <+96>: mov %rbx,%rdi
0xffffffffa063e5a3 <+99>: callq 0xffffffffa03ae030 <lock_res_and_lock>
0xffffffffa063e5a8 <+104>: mov 0x9c(%rbx),%eax
0xffffffffa063e5ae <+110>: cmp 0x98(%rbx),%eax
0xffffffffa063e5b4 <+116>: je 0xffffffffa063e600 <osc_lock_detach+192>
0xffffffffa063e5b6 <+118>: mov %rbx,%rdi
0xffffffffa063e5b9 <+121>: callq 0xffffffffa03ae000 <unlock_res_and_lock>
0xffffffffa063e5be <+126>: testb $0x2,0xa8(%r12)
0xffffffffa063e5c7 <+135>: je 0xffffffffa063e65c <osc_lock_detach+284>
0xffffffffa063e5cd <+141>: mov %rbx,%rdi
0xffffffffa063e5d0 <+144>: callq 0xffffffffa03b0450 <ldlm_lock_put>
0xffffffffa063e5d5 <+149>: andb $0xfd,0xa8(%r12)
0xffffffffa063e5de <+158>: mov -0x28(%rbp),%rbx
0xffffffffa063e5e2 <+162>: mov -0x20(%rbp),%r12
0xffffffffa063e5e6 <+166>: mov -0x18(%rbp),%r13
0xffffffffa063e5ea <+170>: mov -0x10(%rbp),%r14
0xffffffffa063e5ee <+174>: mov -0x8(%rbp),%r15
0xffffffffa063e5f2 <+178>: leaveq
0xffffffffa063e5f3 <+179>: retq
0xffffffffa063e5f4 <+180>: nopl 0x0(%rax)
0xffffffffa063e5f8 <+184>: incb 0x36802(%rip) # 0xffffffffa0674e00 <osc_ast_guard>
0xffffffffa063e5fe <+190>: jmp 0xffffffffa063e5de <osc_lock_detach+158>
0xffffffffa063e600 <+192>: mov $0xffffffffa066b440,%rsi
0xffffffffa063e607 <+199>: mov %r15,%rdi
0xffffffffa063e60a <+202>: mov 0x8(%r12),%r14
0xffffffffa063e60f <+207>: callq 0xffffffffa02a8a50 <lu_context_key_get>
0xffffffffa063e614 <+212>: test %rax,%rax
0xffffffffa063e617 <+215>: mov %rax,%r13
0xffffffffa063e61a <+218>: je 0xffffffffa063e68e <osc_lock_detach+334>
0xffffffffa063e61c <+220>: mov %r14,%rdi
0xffffffffa063e61f <+223>: callq 0xffffffffa02b3f50 <cl_object_attr_lock>
0xffffffffa063e624 <+228>: mov 0x38(%r14),%rax
0xffffffffa063e628 <+232>: mov %rbx,%rdi
0xffffffffa063e62b <+235>: mov 0x20(%rax),%rsi
0xffffffffa063e62f <+239>: callq 0xffffffffa03c0370 <ldlm_extent_shift_kms>
0xffffffffa063e634 <+244>: lea 0x78(%r13),%rdx
0xffffffffa063e638 <+248>: mov %r15,%rdi
0xffffffffa063e63b <+251>: mov %rax,0x80(%r13)
0xffffffffa063e642 <+258>: mov $0x2,%ecx
0xffffffffa063e647 <+263>: mov %r14,%rsi
0xffffffffa063e64a <+266>: callq 0xffffffffa02b3f70 <cl_object_attr_set>
0xffffffffa063e64f <+271>: mov %r14,%rdi
0xffffffffa063e652 <+274>: callq 0xffffffffa02b3f30 <cl_object_attr_unlock>
0xffffffffa063e657 <+279>: jmpq 0xffffffffa063e5b6 <osc_lock_detach+118>
0xffffffffa063e65c <+284>: mov $0xffffffffa066c660,%rdi
0xffffffffa063e663 <+291>: mov $0xffffffffa0661d4d,%rdx
0xffffffffa063e66a <+298>: mov $0xffffffffa0661ce4,%rsi
0xffffffffa063e671 <+305>: xor %eax,%eax
0xffffffffa063e673 <+307>: movl $0x40000,0x2dffb(%rip) # 0xffffffffa066c678 <__msg_data.72090+24>
0xffffffffa063e67d <+317>: callq 0xffffffffa017a280 <libcfs_debug_msg>
0xffffffffa063e682 <+322>: mov $0xffffffffa066c660,%rdi
0xffffffffa063e689 <+329>: callq 0xffffffffa016ada0 <lbug_with_loc>
0xffffffffa063e68e <+334>: mov $0xffffffffa066c6a0,%rdi
0xffffffffa063e695 <+341>: mov $0xffffffffa0661d39,%rdx
0xffffffffa063e69c <+348>: mov $0xffffffffa0661ce4,%rsi
0xffffffffa063e6a3 <+355>: xor %eax,%eax
0xffffffffa063e6a5 <+357>: movl $0x40000,0x2e009(%rip) # 0xffffffffa066c6b8 <__msg_data.71857+24>
0xffffffffa063e6af <+367>: callq 0xffffffffa017a280 <libcfs_debug_msg>
0xffffffffa063e6b4 <+372>: mov $0xffffffffa066c6a0,%rdi
0xffffffffa063e6bb <+379>: callq 0xffffffffa016ada0 <lbug_with_loc>
End of assembler dump.
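Reading that disassembly back into C gives roughly the following (a hedged reconstruction from the disassembly above and my reading of the 2.5-era osc_lock.c; field names and offsets may differ slightly in this exact build). One wrinkle worth noting: the RIP offset 0x46 in the GPF is hex (decimal 70), so the faulting instruction is the movq at <+70>, not the spin_lock call that the decimal gdb list *(osc_lock_detach+46) points at.

static void osc_lock_detach(const struct lu_env *env, struct osc_lock *olck)
{
        struct ldlm_lock *dlmlock;

        spin_lock(&osc_ast_guard);      /* <+46>: callq _raw_spin_lock */
        dlmlock = olck->ols_lock;       /* <+51>: mov 0x28(%r12),%rbx */
        if (dlmlock == NULL) {          /* <+56>: test %rbx,%rbx */
                spin_unlock(&osc_ast_guard);
                return;
        }
        olck->ols_lock = NULL;          /* <+61> */
        dlmlock->l_ast_data = NULL;     /* <+70>: movq $0x0,0x160(%rbx).
                                         * The GPF fires here: %rbx holds
                                         * 0x5a5a5a5a5a5a5a5a, the poison
                                         * pattern written into freed OBD
                                         * allocations, i.e. the ldlm_lock
                                         * was freed while still reachable
                                         * from the osc_lock. */
        olck->ols_handle.cookie = 0ULL; /* <+81> */
        spin_unlock(&osc_ast_guard);    /* <+90>: incb (spin unlock) */

        lock_res_and_lock(dlmlock);     /* <+99> */
        if (dlmlock->l_granted_mode == dlmlock->l_req_mode) {
                /* <+104..116>, body at <+192..279>: shift the KMS via
                 * ldlm_extent_shift_kms() and store it with
                 * cl_object_attr_set() */
        }
        unlock_res_and_lock(dlmlock);   /* <+118..121> */
        /* LASSERT-style ols_has_ref check at <+126>, then
         * ldlm_lock_put(dlmlock) at <+144> */
}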

Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ]

KO that goes with the disassembly in Paf's comment https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76645&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76645

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

Hi Patrick - do you have a crash dump in hand? If so, can you please show me the state of the corresponding cl_lock and osc_lock?

Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ]

Jinshan - Not right this second, but I'll try to get one uploaded for you so you can take a look. Sorry about not having one in hand.

Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ]

Dump is here, with KOs and console and messages log:
ftp.whamcloud.com

uploads/LU-4591/LU-4591.1402101702.tar.gz

The node which went down with osc_lock_detach is named c2-0c0s7n3.

This system was running Cray 2.5.

Comment by Jinshan Xiong (Inactive) [ 11/Feb/14 ]

Hi Patrick - what's the tip of your branch?

Comment by Patrick Farrell (Inactive) [ 11/Feb/14 ]

Jinshan - Sadly, we don't use git, so there's no answer to that question.

Our 2.5 is 2.5 as released by Intel plus a number of patches we've pulled in, but I can replicate the same problems in the same way on master or Intel's released 2.5. If it would help, I could do it with one of those code bases - That's just the dump I had handy.

Comment by Jinshan Xiong (Inactive) [ 11/Feb/14 ]

No worries, Patrick. I will take a look at the dump; sorry, I was interrupted by something else yesterday.

Comment by Jinshan Xiong (Inactive) [ 12/Feb/14 ]

I've taken a look at the dump. I suspect this issue is related to LU-3889; the patch number is 8405.

Patrick - can you please revert that patch and see what happens?

Comment by Patrick Farrell (Inactive) [ 12/Feb/14 ]

Sure. I just started testing master with the LU-3889 patch reverted and hit the GPF in cl_lock_delete I noted in my original message. I'm going to keep testing, but I suspect we'll see some of the other bugs.

Would you like the dump from that cl_lock_delete GPF? (No debugging enabled.)

Further update:

I've hit these three other bugs from the list above:
2014-02-12T12:32:04.309770-06:00 c0-0c0s5n2 LustreError: 6228:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
2014-02-12T12:32:04.309813-06:00 c0-0c0s5n2 LustreError: 6228:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG

GPF at osc_lock_detach+0x46

2014-02-12T12:19:01.577674-06:00 c0-0c2s2n3 LustreError: 3031:0:(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
2014-02-12T12:19:01.577718-06:00 c0-0c2s2n3 LustreError: 3031:0:(osc_lock.c:1204:osc_lock_enqueue()) LBUG

I also hit LU-4269:
2014-02-12T11:46:50.233799-06:00 c0-0c2s6n0 LustreError: 16995:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:

Comment by Li Xi (Inactive) [ 19/Feb/14 ]

Hi Patrick,

Would you please share how you reproduce these bugs? I've tried running multiple LTP mmstress processes on Lustre, but failed to reproduce them. I ran with "./mmstress -t 1". Is there anything I'm missing?

Thanks!

Comment by Patrick Farrell (Inactive) [ 19/Feb/14 ]

Li - Sure. I have never successfully reproduced these on a small system. My usual system has 70 nodes on it, though I expect something smaller could do it as well. But when I tried with two and three nodes, I wasn't able to reproduce the problem either.

I run - with no command line options - 4 copies of mmstress per node, on ~70 nodes. All copies of mmstress are executed in the same directory on the Lustre file system. Within a half hour, on master or 2.5 or 2.4.1, I've hit about 15-20 of these problems.

Comment by Zhenyu Xu [ 28/Feb/14 ]

Hi Patrick,

Would you mind giving http://review.whamcloud.com/9433 a try? It's a rewrite of the "LU-3889 osc: Allow lock to be canceled at ENQ time" patch.

Comment by Patrick Farrell (Inactive) [ 28/Feb/14 ]

Zhenyu - I tried this patch on master just now (master from today + your patch) with the mmstress reproducer.

I hit essentially all of the bugs from above, and I suspect if I kept running, I would see the others.

Here's the list of those I hit for sure:
LustreError: 4971:0:(lov_lock.c:216:lov_sublock_lock()) ASSERTION( cl_lock_is_mutexed(child) ) failed:

GPF in osc_lock_detach

No exit stuck in cl_locks_prune

LustreError: 12688:0:(osc_lock.c:1208:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

GPF in cl_lock_delete

Comment by Patrick Farrell (Inactive) [ 28/Feb/14 ]

cl_lock debugging patch

Comment by Patrick Farrell (Inactive) [ 28/Feb/14 ]

Zhenyu -

I just attached a debug patch which breaks out the cl_lock_trace calls under new debug flags. In the past, I've been able to hit some of these bugs with only the cllock and clfree debug flags that the patch adds enabled. (I can't hit them with any of the heavier debug flags, like dlmtrace or rpctrace, enabled.) A sketch of the gating follows below.
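For reference, the shape of that gating (a sketch only - the real flag names and mask values are whatever the attached cl_lock_debug_patch.diff defines; D_CLLOCK/D_CLFREE below are hypothetical stand-ins):

/* Move the cl_lock_trace() call sites off the heavyweight D_DLMTRACE
 * mask onto two dedicated, much lighter mask bits so they can be
 * enabled on their own without drowning the node in other traffic. */
#define D_CLLOCK 0x0000000400000000ULL  /* general cl_lock state traces */
#define D_CLFREE 0x0000000800000000ULL  /* traces on the cl_lock free path */

/* Call sites then pass the new masks instead of D_DLMTRACE, e.g.: */
cl_lock_trace(D_CLFREE, env, "free lock", lock);  /* in cl_lock_free() */
cl_lock_trace(D_CLLOCK, env, "got mutex", lock);  /* everywhere else */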

Would you be interested in a dump and logs of one of these crashes with your patch and that debug patch?

Comment by Zhenyu Xu [ 01/Mar/14 ]

yes please, do you want me to combine these two patches in a single patches for you to get a built image easily?

Comment by Zhenyu Xu [ 01/Mar/14 ]

FYI, I've pushed http://review.whamcloud.com/9441, the debug patch you provided, rebased on top of my patch.

Comment by Zhenyu Xu [ 03/Mar/14 ]

Patrick,

FYI, the errors you reported on Mar 1st are exactly the issues reported in LU-4692 and LU-4693.

Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ]

Thanks for pointing those out - Good that someone else has seen them.

I should be able to get you a dump with debugging (I hope) later today.

Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ]

Unfortunately, I've been unable to hit the bugs with debugging enabled. I'm trying with only the clfree debugging option on.

I did see something that may not be related... I have a number of threads stuck here, not exiting:
[<ffffffff811f8ad4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff810468db>] exit_mm+0x3b/0x160
[<ffffffff8104853e>] do_exit+0x18e/0x980
[<ffffffff81048d78>] do_group_exit+0x48/0xc0
[<ffffffff81059d93>] get_signal_to_deliver+0x243/0x480
[<ffffffff81002330>] do_notify_resume+0xe0/0x7f0
[<ffffffff81367d53>] retint_signal+0x46/0x83
[<00000000200007d1>] 0x200007d1
[<ffffffffffffffff>] 0xffffffffffffffff

I've seen this before, but only, I think, on master running these tests. It seems to happen when the tests are run for a long time. (Normally they aren't run very long, because nodes are dropping.)

Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ]

With debugging reduced to just the clfree flag (cl_lock_trace calls only in cl_lock_free), I started hitting the various bugs.

I grabbed three dumps.
Node c0-0c1s2n0: GPF in osc_lock_detach
Node c0-0c1s1n0: No exit stuck in cl_locks_prune
Node c0-0c0s5n2: ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

Dumps are uploading, will be here in about 5 minutes:
ftp.whamcloud.com:/uploads/LU-4591/LU-4591.1403031327.tar.gz

I'll go back to testing with the cllock and clfree flags on to see if I can hit the bug.

Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ]

I suspect if you need better logs, we'll have to adjust the debug further.

Before we discovered that removing one of the LU-3027 patches hides the problem for us, I did some analysis of which calls to cl_lock_trace were most common when running this test, hoping to identify which ones we could remove or make lighter.

Here's that data. The first number is the # of calls to that particular cl_lock_trace in the sample I gathered:
[Note: Data was gathered on Cray 2.5, rather than master. Line numbers are probably a bit different.]
162054 (cl_lock.c:150:cl_lock_trace0()) releasing ref: at cl_lock_put():308
160954 (cl_lock.c:150:cl_lock_trace0()) put mutex: at cl_lock_mutex_put():754
160913 (cl_lock.c:150:cl_lock_trace0()) got mutex: at cl_lock_mutex_tail():661
67146 (cl_lock.c:150:cl_lock_trace0()) acquiring trusted at cl_lock_get_trust():348
65822 (cl_lock.c:150:cl_lock_trace0()) acquiring ref: at cl_lock_get():332
37243 (cl_lock.c:150:cl_lock_trace0()) changing state: at cl_lock_state_set():1059
37243 (cl_lock.c:1058:cl_lock_state_set()
34207 (cl_lock.c:150:cl_lock_trace0()) delete lock: at cl_lock_delete():1795
34195 (cl_lock.c:150:cl_lock_trace0()) cancel lock: at cl_lock_cancel():1853
29408 (cl_lock.c:150:cl_lock_trace0()) changing holds: at cl_lock_hold_mod():868
29407 (cl_lock.c:867:cl_lock_hold_mod()
21250 (cl_lock.c:826:cl_lock_delete0()
17038 (cl_lock.c:802:cl_lock_cancel0()
12549 (cl_lock.c:150:cl_lock_trace0()) enclosure lock: at cl_lock_enclosure():1701
7672 (cl_lock.c:150:cl_lock_trace0()) free lock: at cl_lock_free():269
7491 (cl_lock.c:150:cl_lock_trace0()) enqueue lock: at cl_enqueue_try():1201
7272 (cl_lock.c:150:cl_lock_trace0()) disclosure lock: at cl_lock_disclosure():1744
6172 (cl_lock.c:888:cl_lock_used_mod()
6172 (cl_lock.c:150:cl_lock_trace0()) changing users: at cl_lock_used_mod():889
3022 (cl_lock.c:150:cl_lock_trace0()) unuse lock: at cl_unuse_try():1372
1882 (cl_lock.c:150:cl_lock_trace0()) alloc lock: at cl_lock_alloc():410
589 (cl_lock.c:150:cl_lock_trace0()) use lock: at cl_use_try():1108
386 (cl_lock.c:150:cl_lock_trace0()) enqueue failed: at cl_lock_request():2180
4 (cl_lock.c:1800:cl_lock_delete()
3 (cl_lock.c:1858:cl_lock_cancel()

Looking at this list, are there any of the top 10 or so we could do without, or could reduce significantly? I'm concerned that reducing the amount of data printed by cl_lock_trace won't really change how heavy it is - I would think most of the cost is in printing the message (though I could be wrong).


I was also considering trying a modified version of cl_lock_trace which prints less information. Here's the version I'm going to try:

static void cl_lock_trace_reduced0(int level, const struct lu_env *env,
                                   const char *prefix,
                                   const struct cl_lock *lock,
                                   const char *func, const int line)
{
        /* Reduced variant: prints fewer fields than cl_lock_trace0()
         * to keep the message cheaper to format. */
        CDEBUG(level, "%s: %p@(%d %d %d %d %lx) at %s():%d\n",
               prefix, lock, cfs_atomic_read(&lock->cll_ref),
               lock->cll_state, lock->cll_holds,
               lock->cll_users, lock->cll_flags,
               func, line);
}
#define cl_lock_trace_reduced(level, env, prefix, lock) \
        cl_lock_trace_reduced0(level, env, prefix, lock, \
                               __FUNCTION__, __LINE__)
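Call sites would then swap in the reduced variant unchanged apart from the name - for example, with the hypothetical D_CLFREE flag from the gating sketch earlier:

cl_lock_trace_reduced(D_CLFREE, env, "free lock", lock);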

So, are there any of the most common cl_lock_trace calls you don't think we need?
Does that modified version of cl_lock_trace have enough info in it? Is there more I could take out?

Once we've figured out how best to reduce the debug levels, I can test accordingly.

Comment by Jinshan Xiong (Inactive) [ 06/Mar/14 ]

Hi Patrick,

Will you please try this patch: http://review.whamcloud.com/9524

and see if it helps?

Jinshan

Comment by Patrick Farrell (Inactive) [ 06/Mar/14 ]

Sure, I'll test as soon as I can. Due to some poor planning on my part, that may not be until next week. I'll get results sooner if I can.

Comment by James A Simmons [ 12/Mar/14 ]

Please cherry pick this to b2_5

Comment by Peter Jones [ 12/Mar/14 ]

James

This will certainly be a candidate to back-port once we have confirmation that the fix works.

Peter

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

Jinshan - I'm sorry for the delay here (vacation, then system problems), but this patch doesn't fix the problem.

I ran with this patch + master from last week.

I wasn't able to hit the bugs with my cl_lock debugging enabled, unfortunately. I can provide dumps from one of these nodes if desired.

The bug set observed has changed somewhat...

Old bugs we're still seeing:
2014-03-12T11:45:47.984843-05:00 c0-0c1s1n1 LustreError: 8494:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

GPF in osc_lock_detach:
2014-03-12T11:41:56.661772-05:00 c0-0c0s5n2 RIP: 0010:[<ffffffffa0860866>] [<ffffffffa0860866>] osc_lock_detach+0x46/0x180 [osc]

2014-03-12T11:37:21.627399-05:00 c0-0c2s1n2 LustreError: 20950:0:(osc_lock.c:1208:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

GPF in cl_lock_delete:
2014-03-12T11:13:41.270193-05:00 c0-0c2s2n1 RIP: 0010:[<ffffffffa02c1550>] [<ffffffffa02c1550>] cl_lock_delete0+0x190/0x1f0 [obdclass]

No exit stuck in cl_locks_prune.

----------------------

Now for new things:

I'm seeing this error and related messages fairly often in the logs:
2014-03-12T11:44:04.990085-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) } lock@ffff8801e5189078
2014-03-12T11:44:04.990101-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) dlmlock returned -5

This one is new, and observed several times:
2014-03-12T11:47:12.039431-05:00 c0-0c2s3n1 BUG: unable to handle kernel paging request at fffffffffffffff8
2014-03-12T11:47:12.039481-05:00 c0-0c2s3n1 IP: [<ffffffffa02bfda8>] cl_lock_cancel0+0x58/0x150 [obdclass]
2014-03-12T11:47:12.039499-05:00 c0-0c2s3n1 PGD 15ff067 PUD 1600067 PMD 0
2014-03-12T11:47:12.039507-05:00 c0-0c2s3n1 Oops: 0000 [#1] SMP
2014-03-12T11:47:12.039514-05:00 c0-0c2s3n1 CPU 5
2014-03-12T11:47:12.079400-05:00 c0-0c2s3n1 Modules linked in: xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) nic_compat cmsr osc mgc lustre lov mdc fid lmv fld kgnilnd ptlrpc obdclass lnet sha1_generic md5 libcfs ib_core ipip krsip kdreg gpcd_gem ipogif_gem kgni_gem hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_gem cgm craytrace
2014-03-12T11:47:12.079425-05:00 c0-0c2s3n1 Pid: 3701, comm: ldlm_bl_08 Tainted: P 3.0.93-0.8.2_1.0000.7848-cray_gem_c #1
2014-03-12T11:47:12.079439-05:00 c0-0c2s3n1 RIP: 0010:[<ffffffffa02bfda8>] [<ffffffffa02bfda8>] cl_lock_cancel0+0x58/0x150 [obdclass]
2014-03-12T11:47:12.079453-05:00 c0-0c2s3n1 RSP: 0018:ffff88040f137d70 EFLAGS: 00010286
2014-03-12T11:47:12.079459-05:00 c0-0c2s3n1 RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffff88040f137cf0
2014-03-12T11:47:12.112232-05:00 c0-0c2s3n1 RDX: ffff88040f137cf0 RSI: ffff88040fde9c08 RDI: ffffffffa0898f20
2014-03-12T11:47:12.112256-05:00 c0-0c2s3n1 RBP: ffff88040f137d90 R08: 0000000000000020 R09: ffffffff81375ed8
2014-03-12T11:47:12.112268-05:00 c0-0c2s3n1 R10: 0000000000000000 R11: 0000000000000009 R12: ffff8804016c5590
2014-03-12T11:47:12.112290-05:00 c0-0c2s3n1 R13: ffff88040db84b18 R14: ffff8804016c5588 R15: 0000000000000000
2014-03-12T11:47:12.112299-05:00 c0-0c2s3n1 FS: 0000000040176880(0000) GS:ffff88021fd40000(0000) knlGS:0000000000000000
2014-03-12T11:47:12.138736-05:00 c0-0c2s3n1 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2014-03-12T11:47:12.138763-05:00 c0-0c2s3n1 CR2: fffffffffffffff8 CR3: 00000000015fd000 CR4: 00000000000407e0
2014-03-12T11:47:12.138887-05:00 c0-0c2s3n1 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2014-03-12T11:47:12.138911-05:00 c0-0c2s3n1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2014-03-12T11:47:12.138930-05:00 c0-0c2s3n1 Process ldlm_bl_08 (pid: 3701, threadinfo ffff88040f136000, task ffff88040db8d040)
2014-03-12T11:47:12.138962-05:00 c0-0c2s3n1 Stack:
2014-03-12T11:47:12.168889-05:00 c0-0c2s3n1 ffff88040db84b18 ffff8804016c5588 ffff88040db84b18 ffff88040db84b18
2014-03-12T11:47:12.168907-05:00 c0-0c2s3n1 ffff88040f137db0 ffffffffa02c0afb ffff88040fde9c08 ffff8804016c5588
2014-03-12T11:47:12.168921-05:00 c0-0c2s3n1 ffff88040f137e10 ffffffffa0862cec ffff88040fde9c08 ffff88040f099b40
2014-03-12T11:47:12.168930-05:00 c0-0c2s3n1 Call Trace:
2014-03-12T11:47:12.168976-05:00 c0-0c2s3n1 [<ffffffffa02c0afb>] cl_lock_cancel+0x13b/0x140 [obdclass]
2014-03-12T11:47:12.168988-05:00 c0-0c2s3n1 [<ffffffffa0862cec>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
2014-03-12T11:47:12.198730-05:00 c0-0c2s3n1 [<ffffffffa03cd354>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
2014-03-12T11:47:12.198752-05:00 c0-0c2s3n1 [<ffffffffa03cd8ac>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
2014-03-12T11:47:12.198779-05:00 c0-0c2s3n1 [<ffffffff8106637e>] kthread+0x9e/0xb0
2014-03-12T11:47:12.198791-05:00 c0-0c2s3n1 [<ffffffff81369ff4>] kernel_thread_helper+0x4/0x10
2014-03-12T11:47:12.198811-05:00 c0-0c2s3n1 Code: 00 49 8b 84 24 b8 00 00 00 a8 01 75 40 48 83 c8 01 49 89 84 24 b8 00 00 00 49 8b 44 24 10 49 83 c4 08 49 39 c4 48 8d 58 e8 74 22
2014-03-12T11:47:12.228894-05:00 c0-0c2s3n1 8b 43 10 48 8b 40 30 48 85 c0 74 08 48 89 de 4c 89 ef ff d0
2014-03-12T11:47:12.254426-05:00 c0-0c2s3n1 RIP [<ffffffffa02bfda8>] cl_lock_cancel0+0x58/0x150 [obdclass]
2014-03-12T11:47:12.254442-05:00 c0-0c2s3n1 RSP <ffff88040f137d70>
2014-03-12T11:47:12.254450-05:00 c0-0c2s3n1 CR2: fffffffffffffff8
2014-03-12T11:47:12.254457-05:00 c0-0c2s3n1 --[ end trace ac1164d8e2c40df9 ]--
2014-03-12T11:47:12.254509-05:00 c0-0c2s3n1 Kernel panic - not syncing: Fatal exception
2014-03-12T11:47:12.280053-05:00 c0-0c2s3n1 Pid: 3701, comm: ldlm_bl_08 Tainted: P D 3.0.93-0.8.2_1.0000.7848-cray_gem_c #1
2014-03-12T11:47:12.280088-05:00 c0-0c2s3n1 Call Trace:
2014-03-12T11:47:12.280099-05:00 c0-0c2s3n1 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
2014-03-12T11:47:12.280109-05:00 c0-0c2s3n1 [<ffffffff81004dd9>] dump_trace+0x89/0x440
2014-03-12T11:47:12.280144-05:00 c0-0c2s3n1 [<ffffffff8100601c>] show_trace_log_lvl+0x5c/0x80
2014-03-12T11:47:12.280151-05:00 c0-0c2s3n1 [<ffffffff81006055>] show_trace+0x15/0x20
2014-03-12T11:47:12.280170-05:00 c0-0c2s3n1 [<ffffffff81365182>] dump_stack+0x79/0x84
2014-03-12T11:47:12.280186-05:00 c0-0c2s3n1 [<ffffffff81365221>] panic+0x94/0x1da
2014-03-12T11:47:12.280206-05:00 c0-0c2s3n1 [<ffffffff81005e38>] oops_end+0xa8/0xe0
2014-03-12T11:47:12.305555-05:00 c0-0c2s3n1 [<ffffffff81026509>] no_context+0xf9/0x260
2014-03-12T11:47:12.305579-05:00 c0-0c2s3n1 [<ffffffff810267d5>] __bad_area_nosemaphore+0x165/0x1f0
2014-03-12T11:47:12.305613-05:00 c0-0c2s3n1 [<ffffffff81026873>] bad_area_nosemaphore+0x13/0x20
2014-03-12T11:47:12.305626-05:00 c0-0c2s3n1 [<ffffffff81026e1e>] do_page_fault+0x2fe/0x420
2014-03-12T11:47:12.331149-05:00 c0-0c2s3n1 [<ffffffff8136884f>] page_fault+0x1f/0x30
2014-03-12T11:47:12.331174-05:00 c0-0c2s3n1 [<ffffffffa02bfda8>] cl_lock_cancel0+0x58/0x150 [obdclass]
2014-03-12T11:47:12.331244-05:00 c0-0c2s3n1 [<ffffffffa02c0afb>] cl_lock_cancel+0x13b/0x140 [obdclass]
2014-03-12T11:47:12.331262-05:00 c0-0c2s3n1 [<ffffffffa0862cec>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
2014-03-12T11:47:12.331290-05:00 c0-0c2s3n1 [<ffffffffa03cd354>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
2014-03-12T11:47:12.331301-05:00 c0-0c2s3n1 [<ffffffffa03cd8ac>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
2014-03-12T11:47:12.356710-05:00 c0-0c2s3n1 [<ffffffff8106637e>] kthread+0x9e/0xb0
2014-03-12T11:47:12.356726-05:00 c0-0c2s3n1 [<ffffffff81369ff4>] kernel_thread_helper+0x4/0x10

This is also (sort of) new - I also saw it with one of Bobijam's patches.
It's a failure to exit, stuck here:
2014-03-12T13:10:59.311520-05:00 c0-0c1s5n3 <node_health:5.1> APID:4805011 (Application_Exited_Check) STACK: call_rwsem_down_read_failed+0x14/0x30; exit_mm+0x3b/0x160; do_exit+0x18e/0x980; do_group_exit+0x48/0xc0; get_signal_to_deliver+0x243/0x480; do_notify_resume+0xe0/0x7f0; retint_signal+0x46/0x83; 0x200007d1; 0xffffffffffffffff;

It's possible this isn't related to the patches, but I haven't seen it except in testing with fairly recent master and one of these patches. (I haven't carefully tested recent master by itself.)

I also saw several dropped connections to some of our OSTs:
2014-03-12T11:02:57.205561-05:00 c0-0c1s4n0 Lustre: snxb1-OST0004-osc-ffff88020e481c00: Connection to snxb1-OST0004 (at 10.10.100.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete

That seems likely to be a problem with our system rather than the Lustre client, but I haven't seen it before on this system, so I thought I'd mention it.

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

A quick note - the version of master I used does have the patch for LU-4300, so the stuck-in-cl_locks_prune issue above is NOT LU-4300.

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

One further note: I haven't examined all of the patches offered by Jinshan and Bobijam closely enough to be sure, but do they all conflict with one another? A number of different patches have failed to fix these bugs, and I'm starting to wonder whether more than one fix is needed - I know that's much less likely in general, but I thought I'd suggest it as something to consider.

There's also a pair of patches that were suggested at one point by Vitaly F. @ Xyratex. They were not successful, but I'll attach them for reference.

One is a patch to avoid recursive closures, the other is a tweak to the hold/get usage. Again, these did not resolve the issue; I'm just attaching them for reference.

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

Attempted patch from Vitaly

Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ]

2014-03-12T11:44:04.990085-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) } lock@ffff8801e5189078
2014-03-12T11:44:04.990101-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) dlmlock returned -5

I saw this; was the client being evicted at that time?

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

Other attempted patch from Xyratex

Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ]

Please share the core dump with us. Thanks.

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

Ah, yes Jinshan - It was. Sorry, the messages were a bit garbled and I missed that.

So that, and the lost connection to the OST, were both due to a client eviction by the OST:
2014-03-12T11:44:04.989967-05:00 c0-0c1s2n0 LustreError: 167-0: snxb1-OST0004-osc-ffff8801ec7ae400: This client was evicted by snxb1-OST0004; in progress operations using this service will fail.

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

Jinshan - Is there a particular failure you'd like a dump for?

Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ]

Please just provide the latest failure with my patch applied, and I will take a look.

BTW, have you ever seen the issue on your production system?

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

When you say the latest failure, do you just mean this one?
2014-03-12T11:47:12.039431-05:00 c0-0c2s3n1 BUG: unable to handle kernel paging request at fffffffffffffff8

All of the failures I listed were encountered during testing with your patch applied.

And yes, we've seen several of these on our production systems. I can't/shouldn't share the details, but until we found the workaround of removing the LU-3027 lov patch, this was a serious problem for us.

Note that LU-4692 and LU-4693 are non-Cray reports of two of these same bugs, and then there's also the older LU-4558.

Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ]

2014-03-12T11:47:12.039431-05:00 c0-0c2s3n1 BUG: unable to handle kernel paging request at fffffffffffffff8

Yes, let's start with this one for now.

Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ]

Jinshan,

Unfortunately, I had to give up the test system before I could get those dumps...
I've got it again, and testing hasn't turned up that paging-request panic again. (I'm sure it's still there, but I haven't seen it on this run...)

So, instead, I've got these six dumps for you; five of them are previously seen bugs and the sixth is a new one:
c0-0c2s5n3: GPF in cl_lock_delete (Also had a thread stuck in cl_locks_prune)
c0-0c2s7n1: lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed
c0-0c2s3n0: GPF in cl_lock_put
c0-0c1s6n2: GPF in osc_lock_detach

The first four are kernel panics; the fifth is a dump of a node with a thread stuck in cl_locks_prune (which was NMI'ed while running, rather than having a kernel panic):
c0-0c1s3n2
(Pid: 14569, mmstress)

And this is the new bug:
c0-0c0s4n2 LustreError: 19165:0:(cl_lock.c:313:cl_lock_put()) ASSERTION( list_empty(&lock->cll_linkage) ) failed:

Dumps will be in LU-4591-140312.tar.gz at ftp.whamcloud.com/uploads/LU-4591

The console log is also in there. There are many other nodes (10+) that went down for which I didn't provide dumps, because they were duplicates of the ones I picked. (You'll see their stack traces in the console log.)

Upload of the dumps should be done in ~10-15 minutes.

  • Patrick
Comment by Jinshan Xiong (Inactive) [ 13/Mar/14 ]

Hi Patrick,

I will take a look at the dump; please don't forget to copy the Lustre modules.

Anyway, you have provided an important clue with the error in the completion AST. If this error happened in every failing case, we can definitely get something from there.

Jinshan

Comment by Patrick Farrell (Inactive) [ 13/Mar/14 ]

Jinshan, Bobijam,

Just a general question: do you think breaking the cl_lock_trace calls out into their own debug flag (rather than being part of dlmtrace) is a good thing in general? My patch to do it is just a quick hack, but I'm wondering if a cleaned-up version of it - without the special treatment for cl_free - is something we might want to land on master?

It's been useful for me to have it separated, but only because enabling full dlmtrace always prevents me from seeing these cl_lock bugs (and some of the earlier ones as well).

If you think so, I'll submit a patch for it.
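
For concreteness, the core of the hack is just a dedicated libcfs mask bit for the cl_lock_trace() call sites. Something like the following (D_CLLOCK and its bit value are illustrative placeholders, not the final names - the real mask layout lives in libcfs_debug.h):

    /* libcfs_debug.h: claim a spare bit for cl_lock tracing
     * (D_CLLOCK and 0x04000000 are placeholders for whatever bit is free) */
    #define D_CLLOCK        0x04000000

    /* cl_lock.c: move the trace call sites off of D_DLMTRACE, e.g. */
    -       cl_lock_trace(D_DLMTRACE, env, "cancel lock", lock);
    +       cl_lock_trace(D_CLLOCK, env, "cancel lock", lock);

plus a matching entry in libcfs's debug-mask name table so the new flag can be toggled on its own, without the volume of full dlmtrace.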

Comment by Jinshan Xiong (Inactive) [ 13/Mar/14 ]

Hi Patrick,

cl_lock is dying, so please don't waste any time on it. A simplified version of cl_lock will be introduced in the CLIO simplification project.

Jinshan

Comment by Patrick Farrell (Inactive) [ 14/Mar/14 ]

OK. I like that answer.

Comment by Shuichi Ihara (Inactive) [ 29/Mar/14 ]

We need this fix on b2_5, so I have backported it: http://review.whamcloud.com/9851

Comment by Bruno Faccini (Inactive) [ 30/Mar/14 ]

After some investigation, correlating and cross-checking against the problems encountered at the CEA/Tera-100 site since they upgraded to Lustre 2.4.2, it appears that they also hit almost all of the [L]BUGs described in this ticket. Here is the list:
- "ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6", also reported in LU-4692.
- "ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed", also reported in LU-4693/LU-4797.
- GPF in cl_lock_delete0().
- GPF in cl_lock_put().
- GPF in osc_lock_detach(), also reported in LU-3614.

I have encouraged them to try the debug-trace settings (rpctrace+dlmtrace) to see if that helps avoid or reduce the frequency of the crashes; they enabled this last Friday night. We will see on Monday whether their very bad stats (about 8 of the listed crashes per day) have improved.
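
(For reference, assuming the usual client-side tooling, enabling those is something like "lctl set_param debug=+rpctrace" and "lctl set_param debug=+dlmtrace" on each client, or the equivalent write to /proc/sys/lnet/debug.)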

I have added Lustre 2.4.2 to the list of affected versions for this ticket.

What is unclear to me (and to the CEA people) about this ticket is:
- does the patch http://review.whamcloud.com/9524 really fix the problem?
- if not, can we consider reverting the patches from LU-3889 and LU-3027 (and more?) to be the fix?

Comment by Patrick Farrell (Inactive) [ 31/Mar/14 ]

Bruno - Here's a breakdown from the Cray perspective, where we/I've been looking at these for a while.

9524 does not fix any single specific assertion/GPF. In my own testing with mmstress on a ~70-node system, it did not significantly reduce the incidence of the bugs listed in this ticket. (I wouldn't have noticed anything less than about a 50% reduction, however, so it may improve things a bit.)

According to review comments on 9524, it improves the success rate with racer. I haven't checked that, as racer isn't part of our usual test suite. (We don't pass it often enough.)

Cray has found a change that seems to avoid the problems, though I don't believe it really fixes them, as I described in this comment:
https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76411

We identified these patches as relevant:

13079de LU-3889 osc: Allow lock to be canceled at ENQ time (http://review.whamcloud.com/#/c/8405/)
7168ea8 LU-3027 clio: Do not shrink sublock at cancel (http://review.whamcloud.com/#/c/7569/)
521335c LU-3433 clio: wrong cl_lock usage (http://review.whamcloud.com/#/c/6709/)
I1ea629 LU-3027 lov: to not modify lov lock when sublock is canceled (http://review.whamcloud.com/#/c/7841/)

I give the (lengthy) details in my original comment, but in essence, we removed:
I1ea629 LU-3027 lov: to not modify lov lock when sublock is canceled (http://review.whamcloud.com/#/c/7841/)

from our 2.4 and 2.5, and we have not seen any of the assertions/GPFs you listed in general testing since. In specific, focused testing with debug disabled, I've been able to hit one or two of them. So I don't believe pulling the LU-3027 lov patch has fixed any specific code flaw (also, no one has identified any problem with it); rather, I think it has probably just changed the timing.

Still, the LU-3027 lov patch was landed because it's believed to fix a flaw in the code, not because there was a specific crash it fixed. The LU-3027 clio patch fixed the original reproducer for the assertion reported in LU-3027.

So, we decided it was safe to pull the LU-3027 lov patch. We've been running that way for some time, with excellent results. (And as far as we can tell, no new bugs introduced.)

Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ]

Patrick, thanks for all these clarifications; they are very helpful!

Comment by Jinshan Xiong (Inactive) [ 02/Apr/14 ]

I'm working on this issue.

Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ]

Just a small comment to note that CEA has not had any crashes since they enabled dlmtrace last Friday!
Also, they have a full system maintenance shutdown planned for next Wednesday, April 9th, so that will be a good time for them either to run a Lustre version with the LU-3027 lov patch pulled, as suggested by Patrick, or to apply any other fix we may have identified or developed by then to avoid the crashes.

Comment by Jinshan Xiong (Inactive) [ 02/Apr/14 ]

Yes, it's recommended to revert the 2nd patch of LU-3027. I believe that it is not the root cause of this problem, but reverting it can reduce the chances significantly.

Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ]

I think all occurrences of the problems point to the same root cause: the sub-lock has already been freed. I think this is the same race as LU-4558; I will create a patch.
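
To illustrate the theory (a sketch of the suspected race, not the actual patch): the lov top-lock caches a pointer to its sub-lock, and if a blocking AST frees the sub-lock while another thread still holds that pointer, any later dereference lands in freed, slab-poisoned memory:

    /*
     * Suspected interleaving (illustrative only):
     *
     *   ldlm_bl thread                       application thread
     *   --------------                       ------------------
     *   osc_ldlm_blocking_ast()
     *     cl_lock_cancel(sub)
     *     cl_lock_delete(sub)
     *       -> last reference dropped;
     *          sub-lock freed and
     *          slab-poisoned (0x5a...)
     *                                        lov_lock_enqueue()
     *                                          dereferences the stale
     *                                          sub-lock pointer cached
     *                                          in the lov layer
     *                                          -> GPF or assertion,
     *                                             depending on which
     *                                             field is read first
     */

That would also explain why the failure shows up as different GPFs/assertions (osc_lock_detach, cl_lock_put, cl_lock_delete0, ...) depending on where the stale pointer is first used.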

Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ]

Patch is at http://review.whamcloud.com/9876.

Patrick, can you please give it a try since you can consistently reproduce it?

Jinshan

Comment by Patrick Farrell (Inactive) [ 03/Apr/14 ]

Jinshan - With this patch applied on top of master, I'm getting a GPF in cl_lock_delete0, on line 841, which is this:

                /*
                 * From now on, no new references to this lock can be acquired
                 * by cl_lock_lookup().
                 */
                cfs_list_for_each_entry_reverse(slice, &lock->cll_layers,
                                                cls_linkage) {
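                        /* (the slice->cls_ops dereference below is what
                         * faults - consistent with the lock's layer slices
                         * living in already-freed memory) */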
                        if (slice->cls_ops->clo_delete != NULL) <---- This line here.
                                slice->cls_ops->clo_delete(env, slice);
                }


[Edit: I had previously said this was happening in cl_lock_delete, at the call to cl_lock_delete0; I was misreading the stack trace - sorry about that.]

This happens swiftly when running mmstress, even with full debug enabled.

I'm going to take a quick look to see if I can understand why, but I'll probably upload a dump (with full dk logs enabled) shortly...

Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ]

Can you please give me the stack trace?

Comment by Patrick Farrell (Inactive) [ 03/Apr/14 ]

Oh, duh. Sorry Jinshan - I forgot:

[<ffffffff81006591>] try_stack_unwind+0x161/0x1a0
[<ffffffff81004de9>] dump_trace+0x89/0x440
[<ffffffff81005ffc>] show_trace_log_lvl+0x5c/0x80
[<ffffffff81006035>] show_trace+0x15/0x20
[<ffffffff81367a12>] dump_stack+0x79/0x84
[<ffffffff81367ab1>] panic+0x94/0x1da
[<ffffffff81005e18>] oops_end+0xa8/0xe0
[<ffffffff81005f4b>] die+0x5b/0x90
[<ffffffff8100362a>] do_general_protection+0x15a/0x160
[<ffffffff8136b11f>] general_protection+0x1f/0x30
[<ffffffffa02c2588>] cl_lock_delete0+0x198/0x200 [obdclass]
[<ffffffffa02c273b>] cl_lock_delete+0x14b/0x190 [obdclass]
[<ffffffffa02c2bd7>] cl_lock_finish+0x37/0x60 [obdclass]
[<ffffffffa02c508a>] cl_lock_hold_mutex+0x3ba/0x620 [obdclass]
[<ffffffffa02c5346>] cl_lock_hold+0x56/0x120 [obdclass]
[<ffffffffa069f133>] lov_lock_enqueue+0x8e3/0xf80 [lov]
[<ffffffffa02c3e9b>] cl_enqueue_try+0xfb/0x320 [obdclass]
[<ffffffffa02c489f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
[<ffffffffa02c548e>] cl_lock_request+0x7e/0x270 [obdclass]
[<ffffffffa02ca434>] cl_io_lock+0x394/0x5c0 [obdclass]
[<ffffffffa02ca6fa>] cl_io_loop+0x9a/0x1a0 [obdclass]
[<ffffffffa07634f8>] ll_fault+0x308/0x4e0 [lustre]
[<ffffffff811135e6>] __do_fault+0x76/0x570
[<ffffffff81113b84>] handle_pte_fault+0xa4/0xcc0
[<ffffffff8111494e>] handle_mm_fault+0x1ae/0x240
[<ffffffff81026caf>] do_page_fault+0x18f/0x420
[<ffffffff8136b14f>] page_fault+0x1f/0x30

Dump is here:

ftp.whamcloud.com
uploads/LU-4591/LU-4591_cl_lock_delete0_gpf_140403.tar.gz

Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ]

Please use version 2 of the patch at http://review.whamcloud.com/9881

Comment by Ann Koehler (Inactive) [ 03/Apr/14 ]

Jinshan, you might want to take a look at LU-4861. Cray is also seeing a deadlock in cl_lock_mutex_get along the cl_glimpse_lock path. This is independent of Patrick's testing. Seems like this different behavior could shed light on or confirm your suspicions about the root cause of LU-4591. I didn't want to add the deadlock info here in case it's an unrelated problem.

Comment by Patrick Farrell (Inactive) [ 03/Apr/14 ]

Jinshan - Wow. Finally some good news on these bugs - early testing results on master are perfect. I would expect to have seen 10-20 instances of these various bugs by now in my testing, and I have not seen any yet.

I'm adding this patch and LU-3027 lov to Cray 2.5, and will be doing some lengthy testing tonight, both with the mmstress reproducer and general Cray IO stress testing. I may also try racer.sh for a while as well.

Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ]

That's really good, Patrick. Thank you for your effort on this bug.

Hi Ann, I will take a look at it soon.

Jinshan

Comment by Patrick Farrell (Inactive) [ 04/Apr/14 ]

Jinshan - Testing last night completed without any problems. Thank you very much for this - It looks like we've probably finally fixed a bug we've been working on for a long time. (Much of the work on this bug on our side happened before we opened the bug with you.)

I've given a positive review to the mod as well. I think this ticket could probably be closed as a duplicate of LU-4558.

Again, thank you very much. I've been working on this in various forms since about October of last year.

Comment by Jinshan Xiong (Inactive) [ 04/Apr/14 ]

duplicate of LU-4558.

Comment by Patrick Valentin (Inactive) [ 04/Apr/14 ]

Patrick,

in your comment on 03/Apr/14 8:53 PM, you wrote:
> I'm adding this patch and LU-3027 lov to Cray 2.5, and ...
Does it mean that you applied the LU-4558 patch AND continued to revert "LU-3027 lov", or that you reapplied the "LU-3027 lov" patch you were previously reverting?

Today we built a Lustre 2.4.2 with LU-4558 patch set 2, without reverting "LU-3027 lov", for tests at CEA, and it seems they still have the issue. I should have confirmation of the test results tonight or next Monday.

Thanks in advance
Patrick

Comment by Patrick Farrell (Inactive) [ 04/Apr/14 ]

Patrick -

I ran Cray's 2.5 (which is very similar to Intel's 2.5.1) with LU-4558 and LU-3027 lov. So I reapplied that patch, which I was previously reverting.

It worries me (obviously) that CEA continued to see problems on 2.4.2. Can you share which specific assertions/GPFs they continued to hit? And, if possible, which codes were triggering the issue?

Thanks

  • Patrick
Comment by Jinshan Xiong (Inactive) [ 04/Apr/14 ]

Hi Patrick Valentin,

Please apply patch http://review.whamcloud.com/9881 to your branch and don't revert anything.

I was wondering why you were talking to yourself; it took me a while to figure out that you two have the same first name :-D

Jinshan

Comment by Patrick Valentin (Inactive) [ 09/Apr/14 ]

Hi Jinshan and Patrick,
in my comment last Friday, I said that tests were run at CEA and they still had the issue. In fact, they only ran a subset of the tests, and they did not have crashes. They had client evictions, but that is probably another problem (perhaps LU-4495).
Monday and Tuesday were the cluster maintenance window. Tests will restart today with Lustre 2.4.3 plus http://review.whamcloud.com/9881 and without reverting LU-3027. I hope to have news soon.

Patrick
