
[LU-4591] Related cl_lock failures on master/2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.2
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • Environment: Master clients on SLES11SP3; server version irrelevant (tested against 2.1, 2.4, 2.4.1, 2.5).
    • Severity: 3
    • 12543

    Description

      We've seen a number of different cl_lock bugs in master/2.4.1/2.5 that we believe are related.

      We have seen these in master as recently as two weeks ago (current master cannot run mmstress at all due to a problem in the paging code).

      These bugs are not present in Intel released 2.4.0, but we've seen them in Cray 2.4.1 and 2.5 (which do not track precisely with the Intel versions of 2.4.1 and 2.5).

      We've seen all of these bugs during our general-purpose testing, but we believe they're related because all of them are easily reproduced by running multiple copies (on multiple nodes) of mmstress from the Linux Test Project (mtest05; I will attach the source), and none of them seem to be present in 2.4.0 (at least, none of them are reproduced in that context).
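
      For reference, here is a minimal, illustrative sketch of the kind of concurrent mmap fault load that mmstress generates. It is not the attached mtest05 source, just a rough approximation (the file path, file size, thread count, and pass count are made-up illustration values), but a load of this shape drives the ll_fault/ll_page_mkwrite paths seen in the traces below:

      /* Rough approximation of an mmstress-style load: several threads
       * repeatedly fault pages of a shared file mapping on a Lustre mount.
       * Path, sizes, and thread count are arbitrary illustration values. */
      #include <fcntl.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define FILE_SIZE (64 * 1024 * 1024)   /* 64 MiB test file */
      #define NTHREADS  8
      #define PASSES    1000

      static const char *path = "/lus/testfs/mmstress.dat";  /* hypothetical mount */

      static void *stress(void *arg)
      {
              (void)arg;
              int fd = open(path, O_RDWR);
              if (fd < 0) { perror("open"); return NULL; }

              for (int pass = 0; pass < PASSES; pass++) {
                      char *map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
                      if (map == MAP_FAILED) { perror("mmap"); break; }

                      /* Touch every page: read faults go through ll_fault,
                       * writes to shared-mapped pages exercise ll_page_mkwrite. */
                      long pagesz = sysconf(_SC_PAGESIZE);
                      for (off_t off = 0; off < FILE_SIZE; off += pagesz)
                              map[off] = map[off] + 1;

                      munmap(map, FILE_SIZE);
              }
              close(fd);
              return NULL;
      }

      int main(void)
      {
              pthread_t tid[NTHREADS];
              int fd = open(path, O_RDWR | O_CREAT, 0644);

              if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0) {
                      perror("setup");
                      return 1;
              }
              close(fd);

              for (int i = 0; i < NTHREADS; i++)
                      pthread_create(&tid[i], NULL, stress, NULL);
              for (int i = 0; i < NTHREADS; i++)
                      pthread_join(tid[i], NULL);
              return 0;
      }

      Running one instance of this per node on several nodes concurrently against the same file system approximates the reproducer described above.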

      Not all of the stack traces below are from runs on master (sorry - it's not what I've got handy), but all of the listed bugs have been reproduced on master:

      General protection fault in osc_lock_detach (this one seems to be the most common):
      —
      > 20:00:29 Pid: 3120, comm: ldlm_bl_04 Tainted: P 3.0.82-0.7.9_1.0000.7690-cray_gem_c #1
      > 20:00:30 RIP: 0010:[<ffffffffa0635586>] [<ffffffffa0635586>] osc_lock_detach+0x46/0x180 [osc]
      > 20:00:30 RSP: 0000:ffff880204065cc0 EFLAGS: 00010206
      > 20:00:30 RAX: 0000000000005050 RBX: 5a5a5a5a5a5a5a5a RCX: ffff880204065cc0
      > 20:00:30 RDX: ffff880204065cc0 RSI: ffff880203c38b50 RDI: ffffffffa066ba00
      > 20:00:30 RBP: ffff880204065cf0 R08: 0000000000000020 R09: ffffffff8136fd38
      > 20:00:30 R10: 0000000000000400 R11: 0000000000000009 R12: ffff880203c38b50
      > 20:00:30 R13: 0000000000000000 R14: ffff88020e33ab40 R15: ffff880201fd23e0
      > 20:00:30 FS: 00002aaaaec54700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
      > 20:00:30 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      > 20:00:30 CR2: 00002aaaad101000 CR3: 000000020439d000 CR4: 00000000000407e0
      > 20:00:30 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 20:00:30 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > 20:00:30 Process ldlm_bl_04 (pid: 3120, threadinfo ffff880204064000, task ffff88020e73d7e0)
      > 20:00:30 Stack:
      > 20:00:30 ffff880204065cc0 0000000000000000 ffff880203c38b50 0000000000000000
      > 20:00:30 ffff88020e33ab40 ffff880201fd23e0 ffff880204065d60 ffffffffa0635a7a
      > 20:00:30 ffff880201fd23e0 ffff880150f57ed0 0000000001fd23e0 ffff880150f57ed0
      > 20:00:30 Call Trace:
      > 20:00:30 [<ffffffffa0635a7a>] osc_lock_cancel+0xca/0x410 [osc]
      > 20:00:30 [<ffffffffa02b874d>] cl_lock_cancel0+0x6d/0x150 [obdclass]
      > 20:00:30 [<ffffffffa02b948b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 20:00:30 [<ffffffffa0636e6c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 20:00:30 [<ffffffffa03c5334>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      > 20:00:30 [<ffffffffa03c588c>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
      > 20:00:30 [<ffffffff8106610e>] kthread+0x9e/0xb0
      > 20:00:30 [<ffffffff81363ff4>] kernel_thread_helper+0x4/0x10
      > 20:00:30 Code: f8 66 66 66 66 90 49 89 f4 49 89 ff 48 c7 c7 00 ba 66 a0 e8 0d cb d2 e0 49 8b 5c 24 28 48 85 db 74 7b 49 c7 44 24 28 00 00 00 00
      > 20:00:30 c7 83 60 01 00 00 00 00 00 00 49 c7 44 24 70 00 00 00 00 fe
      > 20:00:30 RIP [<ffffffffa0635586>] osc_lock_detach+0x46/0x180 [osc]
      > 20:00:30 RSP <ffff880204065cc0>
      > 20:00:30 --[ end trace 317fa078a5344509 ]--

      —

      (osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
      —
      > 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
      > 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) LBUG
      > 23:39:52 Pid: 6456, comm: mmstress
      > 23:39:52 Call Trace:
      > 23:39:52 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
      > 23:39:52 [<ffffffff81004dd9>] dump_trace+0x89/0x440
      > 23:39:52 [<ffffffffa016d897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      > 23:39:52 [<ffffffffa016dde7>] lbug_with_loc+0x47/0xc0 [libcfs]
      > 23:39:52 [<ffffffffa063b8ef>] osc_lock_enqueue+0x74f/0x8d0 [osc]
      > 23:39:52 [<ffffffffa02c0dab>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 23:39:52 [<ffffffffa06d13dd>] lov_lock_enqueue+0x1fd/0x880 [lov]
      > 23:39:52 [<ffffffffa02c0dab>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 23:39:52 [<ffffffffa02c1c8f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
      > 23:39:52 [<ffffffffa02c287e>] cl_lock_request+0x7e/0x270 [obdclass]
      > 23:39:52 [<ffffffffa02c7db4>] cl_io_lock+0x394/0x5c0 [obdclass]
      > 23:39:52 [<ffffffffa02c807a>] cl_io_loop+0x9a/0x1a0 [obdclass]
      > 23:39:52 [<ffffffffa078bf78>] ll_fault+0x308/0x4e0 [lustre]
      > 23:39:52 [<ffffffff81111ce6>] __do_fault+0x76/0x570
      > 23:39:52 [<ffffffff81112284>] handle_pte_fault+0xa4/0xcc0
      > 23:39:52 [<ffffffff8111304e>] handle_mm_fault+0x1ae/0x240
      > 23:39:52 [<ffffffff81026c6f>] do_page_fault+0x18f/0x420
      > 23:39:52 [<ffffffff813628cf>] page_fault+0x1f/0x30
      > 23:39:52 [<00000000200007ea>] 0x200007ea
      > 23:39:52 Kernel panic - not syncing: LBUG
      —

      (lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:
      —
      2013-12-06T21:36:44.086298-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:
      2013-12-06T21:36:44.111468-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) LBUG
      2013-12-06T21:36:44.111497-06:00 c0-0c0s11n2 Pid: 9051, comm: mmstress
      2013-12-06T21:36:44.111536-06:00 c0-0c0s11n2 Call Trace:
      2013-12-06T21:36:44.136717-06:00 c0-0c0s11n2 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0
      2013-12-06T21:36:44.136757-06:00 c0-0c0s11n2 [<ffffffff81004849>] dump_trace+0x89/0x450
      2013-12-06T21:36:44.136787-06:00 c0-0c0s11n2 [<ffffffffa02128d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2013-12-06T21:36:44.169446-06:00 c0-0c0s11n2 [<ffffffffa0212e37>] lbug_with_loc+0x47/0xc0 [libcfs]
      2013-12-06T21:36:44.169474-06:00 c0-0c0s11n2 [<ffffffffa0778f7a>] lov_lock_link_find+0x16a/0x170 [lov]
      2013-12-06T21:36:44.169502-06:00 c0-0c0s11n2 [<ffffffffa07799cf>] lov_sublock_adopt+0x8f/0x370 [lov]
      2013-12-06T21:36:44.194648-06:00 c0-0c0s11n2 [<ffffffffa077cba7>] lov_lock_init_raid0+0x637/0xd50 [lov]
      2013-12-06T21:36:44.194687-06:00 c0-0c0s11n2 [<ffffffffa0773c0e>] lov_lock_init+0x1e/0x60 [lov]
      2013-12-06T21:36:44.194704-06:00 c0-0c0s11n2 [<ffffffffa036961a>] cl_lock_hold_mutex+0x32a/0x640 [obdclass]
      2013-12-06T21:36:44.219871-06:00 c0-0c0s11n2 [<ffffffffa0369ab2>] cl_lock_request+0x62/0x270 [obdclass]
      2013-12-06T21:36:44.219901-06:00 c0-0c0s11n2 [<ffffffffa036f0ce>] cl_io_lock+0x39e/0x5d0 [obdclass]
      2013-12-06T21:36:44.245175-06:00 c0-0c0s11n2 [<ffffffffa036f3a2>] cl_io_loop+0xa2/0x1b0 [obdclass]
      2013-12-06T21:36:44.245216-06:00 c0-0c0s11n2 [<ffffffffa0838760>] ll_page_mkwrite+0x280/0x680 [lustre]
      2013-12-06T21:36:44.245228-06:00 c0-0c0s11n2 [<ffffffff81118019>] __do_fault+0xf9/0x5b0
      2013-12-06T21:36:44.270380-06:00 c0-0c0s11n2 [<ffffffff81118574>] handle_pte_fault+0xa4/0xcd0
      2013-12-06T21:36:44.270420-06:00 c0-0c0s11n2 [<ffffffff8111934e>] handle_mm_fault+0x1ae/0x240
      2013-12-06T21:36:44.270431-06:00 c0-0c0s11n2 [<ffffffff8138a915>] do_page_fault+0x1e5/0x4a0
      2013-12-06T21:36:44.295606-06:00 c0-0c0s11n2 [<ffffffff8138778f>] page_fault+0x1f/0x30
      2013-12-06T21:36:44.295648-06:00 c0-0c0s11n2 [<00000000200007fa>] 0x200007fa
      —

      General protection fault in cl_lock_put (possibly the same issue as LU-4558?):
      —
      > 19:55:11 general protection fault: 0000 1 SMP
      > 19:55:11 CPU 19
      > 19:55:11 Modules linked in: mic xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) bpmcdmod nic_compat cmsr lmv mgc lustre lov osc mdc fid fld kgnilnd ptlrpc obdclass lnet lvfs sha1_generic md5 libcfs ib_core pcie_link_bw_monitor kdreg gpcd_ari ipogif_ari kgni_ari hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_ari craytrace
      > 19:55:11 Pid: 8720, comm: ldlm_bl_02 Tainted: P 3.0.80-0.5.1_1.0501.7664-cray_ari_c #1 Cray Inc. Cascade/Cascade
      > 19:55:11 RIP: 0010:[<ffffffffa036b553>] [<ffffffffa036b553>] cl_lock_put+0x103/0x410 [obdclass]
      > 19:55:11 RSP: 0018:ffff8807ef0fdc70 EFLAGS: 00010246
      > 19:55:11 RAX: 0000000000000001 RBX: 5a5a5a5a5a5a5a5a RCX: ffff8807f30a7af8
      > 19:55:11 RDX: ffffffffa038da5b RSI: 5a5a5a5a5a5a5a5a RDI: ffff8807f1d169c8
      > 19:55:11 RBP: ffff8807ef0fdc90 R08: ffffffffa037ebc0 R09: 00000000000002f8
      > 19:55:11 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      > 19:55:11 R13: ffff8807f1d169c8 R14: ffff8807f30a7af8 R15: 0000000000000001
      > 19:55:11 FS: 00007ffff7ff6700(0000) GS:ffff88087f6c0000(0000) knlGS:0000000000000000
      > 19:55:11 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      > 19:55:11 CR2: 00005555557af328 CR3: 0000000001661000 CR4: 00000000001406e0
      > 19:55:11 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 19:55:11 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > 19:55:11 Process ldlm_bl_02 (pid: 8720, threadinfo ffff8807ef0fc000, task ffff88083c2ae0c0)
      > 19:55:11 Stack:
      > 19:55:11 ffff88083bf7e840 0000000000000000 ffff8807f1d169c8 ffff8807f30a7af8
      > 19:55:11 ffff8807ef0fdcf0 ffffffffa07194ca ffff88083bf7e840 ffff88050caec000
      > 19:55:11 0000000000000001 0000000000000000 ffff8807ef0fdcf0 ffff88050caec000
      > 19:55:11 Call Trace:
      > 19:55:11 [<ffffffffa07194ca>] osc_ldlm_blocking_ast+0xaa/0x330 [osc]
      > 19:55:11 [<ffffffffa0460aeb>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
      > 19:55:11 [<ffffffffa046f41a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      > 19:55:11 [<ffffffffa0473ec0>] ldlm_cli_cancel+0x60/0x370 [ptlrpc]
      > 19:55:11 [<ffffffffa071819e>] osc_lock_cancel+0xfe/0x1c0 [osc]
      > 19:55:11 [<ffffffffa0368145>] cl_lock_cancel0+0x75/0x160 [obdclass]
      > 19:55:11 [<ffffffffa0368e8b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 19:55:11 [<ffffffffa071962c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 19:55:11 [<ffffffffa04777d4>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      > 19:55:11 [<ffffffffa0477d5c>] ldlm_bl_thread_main+0x22c/0x450 [ptlrpc]
      > 19:55:11 [<ffffffff8138e894>] kernel_thread_helper+0x4/0x10
      > 19:55:11 Code: 00 00 00 c7 05 23 e3 06 00 01 00 00 00 4c 8b 45 08 8b 13 48 89 d9 48 c7 c6 60 82 38 a0 48 c7 c7 40 98 3d a0 31 c0 e8 5d 6a eb ff <f0> ff 0b 0f 94 c0 84 c0 74 06 83 7b 50 06 74 7d f6 05 5a b4 ed
      > 19:55:11 RIP [<ffffffffa036b553>] cl_lock_put+0x103/0x410 [obdclass]
      > 19:55:11 RSP <ffff8807ef0fdc70>
      —

      (lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      —
      > 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      > 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      > 04:59:03 Pid: 19083, comm: mmstress
      > 04:59:03 Call Trace:
      > 04:59:03 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
      > 04:59:03 [<ffffffff81004dd9>] dump_trace+0x89/0x440
      > 04:59:03 [<ffffffffa0168897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      > 04:59:03 [<ffffffffa0168de7>] lbug_with_loc+0x47/0xc0 [libcfs]
      > 04:59:03 [<ffffffffa06d458e>] lovsub_lock_state+0x19e/0x1a0 [lov]
      > 04:59:03 [<ffffffffa02b7b80>] cl_lock_state_signal+0x60/0x160 [obdclass]
      > 04:59:03 [<ffffffffa02b7d4d>] cl_lock_state_set+0xcd/0x1a0 [obdclass]
      > 04:59:03 [<ffffffffa02bbe0b>] cl_enqueue_try+0x14b/0x320 [obdclass]
      > 04:59:03 [<ffffffffa06cc3dd>] lov_lock_enqueue+0x1fd/0x880 [lov]
      > 04:59:03 [<ffffffffa02bbdbb>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 04:59:03 [<ffffffffa02bcc9f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
      > 04:59:03 [<ffffffffa02bd88e>] cl_lock_request+0x7e/0x270 [obdclass]
      > 04:59:03 [<ffffffffa02c2dc4>] cl_io_lock+0x394/0x5c0 [obdclass]
      > 04:59:03 [<ffffffffa02c308a>] cl_io_loop+0x9a/0x1a0 [obdclass]
      > 04:59:03 [<ffffffffa0786ee8>] ll_fault+0x308/0x4e0 [lustre]
      > 04:59:03 [<ffffffff81111d16>] __do_fault+0x76/0x570
      > 04:59:03 [<ffffffff811122b4>] handle_pte_fault+0xa4/0xcc0
      > 04:59:03 [<ffffffff8136290f>] page_fault+0x1f/0x30
      > 04:59:03 [<00000000200007ea>] 0x200007ea
      —

      General protection fault in cl_lock_delete:
      —
      > 22:37:45 Pid: 6333, comm: ldlm_bl_13 Tainted: P 3.0.82-0.7.9_1.0502.7742-cray_gem_c #1
      > 22:37:45 RIP: 0010:[<ffffffffa02b8eb0>] [<ffffffffa02b8eb0>] cl_lock_delete0+0x190/0x1f0 [obdclass]
      > 22:37:45 RSP: 0018:ffff8803d0403bc0 EFLAGS: 00010296
      > 22:37:45 RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a42 RCX: ffff8803d7509ca8
      > 22:37:45 RDX: ffff8803d54b6800 RSI: ffff8803fbfdceb8 RDI: ffff8803d54b67c0
      > 22:37:45 RBP: ffff8803d0403be0 R08: ffff8803d55a8858 R09: 0000000000000000
      > 22:37:45 R10: 0000000000000023 R11: 0000000000000000 R12: ffff8803d55a8818
      > 22:37:45 R13: ffff8803d55a8810 R14: ffff880803ddcdb8 R15: 0000000000000001
      > 22:37:45 FS: 000000004013d880(0000) GS:ffff88081fcc0000(0000) knlGS:0000000000000000
      > 22:37:46 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      > 22:37:46 CR2: 00002aaaaab38000 CR3: 0000000803db3000 CR4: 00000000000407e0
      > 22:37:46 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 22:37:46 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > 22:37:46 Process ldlm_bl_13 (pid: 6333, threadinfo ffff8803d0402000, task ffff880403064040)
      > 22:37:46 Stack:
      > 22:37:46 ffff8803d55a8810 ffff880803ddcdb8 ffff880803ddcdb8 ffff8803d55a8810
      > 22:37:46 ffff8803d0403c00 ffffffffa02b9063 ffff8801e05fbb50 0000000000000000
      > 22:37:46 ffff8803d0403c60 ffffffffa0636e7a ffff8801e05fbb50 ffff8803d54b67c0
      > 22:37:46 Call Trace:
      > 22:37:46 [<ffffffffa02b9063>] cl_lock_delete+0x153/0x1a0 [obdclass]
      > 22:37:46 [<ffffffffa0636e7a>] osc_ldlm_blocking_ast+0x21a/0x330 [osc]
      > 22:37:46 [<ffffffffa03adabb>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
      > 22:37:46 [<ffffffffa03bc7ca>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      > 22:37:46 [<ffffffffa03c11bb>] ldlm_cli_cancel+0x6b/0x380 [ptlrpc]
      > 22:37:46 [<ffffffffa0635c47>] osc_lock_cancel+0x297/0x410 [osc]
      > 22:37:46 [<ffffffffa02b770d>] cl_lock_cancel0+0x6d/0x150 [obdclass]
      > 22:37:46 [<ffffffffa02b844b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 22:37:46 [<ffffffffa0636e6c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 22:37:46 [<ffffffffa03c4944>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      > 22:37:46 [<ffffffffa03c4e9c>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
      > 22:37:46 [<ffffffff8106610e>] kthread+0x9e/0xb0
      > 22:37:46 [<ffffffff81364034>] kernel_thread_helper+0x4/0x10
      > 22:37:46 Code: 08 48 89 10 49 89 4d 18 49 89 4d 20 41 fe 44 24 5c 49 8b 45 10 4d 8d 65 08 49 39 c4 48 8d 58 e8 0f 84 be fe ff ff 0f 1f 44 00 00
      > 22:37:46 8b 43 10 48 8b 40 50 48 85 c0 74 08 48 89 de 4c 89 f7 ff d0
      > 22:37:46 RIP [<ffffffffa02b8eb0>] cl_lock_delete0+0x190/0x1f0 [obdclass]
      > 22:37:46 RSP <ffff8803d0403bc0>
      > 22:37:46 --[ end trace cfc07b184a378ec7 ]--
      —

      One more issue, slightly different, but still triggered by the same tests and, like the others, not seen in 2.4.0.

      Lustre gets stuck looping in cl_locks_prune. We have many cases of applications failing to exit, with processes stuck somewhere under cl_locks_prune. Here are two examples (a simplified sketch of the loop in question follows the traces):
      > 05:09:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      > 05:09:21 [<ffffffff81004919>] dump_trace+0x89/0x450
      > 05:09:21 [<ffffffff8100591c>] show_trace_log_lvl+0x5c/0x80
      > 05:09:21 [<ffffffff81005955>] show_trace+0x15/0x20
      > 05:09:21 [<ffffffff813bd54b>] dump_stack+0x79/0x84
      > 05:09:21 [<ffffffff810c9bf9>] __rcu_pending+0x199/0x410
      > 05:09:21 [<ffffffff810c9eda>] rcu_check_callbacks+0x6a/0x120
      > 05:09:21 [<ffffffff8105cb96>] update_process_times+0x46/0x90
      > 05:09:21 [<ffffffff81080a16>] tick_sched_timer+0x66/0xc0
      > 05:09:21 [<ffffffff8107386f>] __run_hrtimer+0xcf/0x1d0
      > 05:09:21 [<ffffffff81073bb7>] hrtimer_interrupt+0xe7/0x220
      > 05:09:21 [<ffffffff813c9dd9>] smp_apic_timer_interrupt+0x69/0x99
      > 05:09:21 [<ffffffff813c8d13>] apic_timer_interrupt+0x13/0x20
      > 05:09:21 [<ffffffffa0373a79>] cl_env_info+0x9/0x20 [obdclass]
      > 05:09:21 [<ffffffffa037dfc6>] cl_lock_mutex_get+0x36/0xd0 [obdclass]
      > 05:09:21 [<ffffffffa0380319>] cl_locks_prune+0xd9/0x330 [obdclass]
      > 05:09:21 [<ffffffffa078dc91>] lov_delete_raid0+0xe1/0x3f0 [lov]
      > 05:09:21 [<ffffffffa078d059>] lov_object_delete+0x69/0x180 [lov]
      > 05:09:21 [<ffffffffa036bda5>] lu_object_free+0x85/0x1a0 [obdclass]
      > 05:09:21 [<ffffffffa036c2de>] lu_object_put+0xbe/0x370 [obdclass]
      > 05:09:21 [<ffffffffa0374d1e>] cl_object_put+0xe/0x10 [obdclass]
      > 05:09:21 [<ffffffffa086803d>] cl_inode_fini+0xbd/0x2a0 [lustre]
      > 05:09:21 [<ffffffffa082f160>] ll_clear_inode+0x2d0/0x930 [lustre]
      > 05:09:21 [<ffffffffa082f84d>] ll_delete_inode+0x8d/0x1f0 [lustre]
      > 05:09:21 [<ffffffff811662f1>] evict+0x91/0x170
      > 05:09:21 [<ffffffff81166762>] iput+0xc2/0x180
      > 05:09:21 [<ffffffffa07ffaf6>] ll_d_iput+0x2e6/0x830 [lustre]
      > 05:09:21 [<ffffffff8116106b>] d_kill+0xcb/0x130
      > 05:09:21 [<ffffffff81164401>] dput+0xc1/0x190
      > 05:09:21 [<ffffffff8114ec27>] fput+0x167/0x200
      > 05:09:21 [<ffffffff8115475f>] do_execve_common+0x16f/0x300
      > 05:09:21 [<ffffffff8115497f>] do_execve+0x3f/0x50
      > 05:09:21 [<ffffffff8100a89e>] sys_execve+0x4e/0x80
      > 05:09:21 [<ffffffff813c87bc>] stub_execve+0x6c/0xc0
      > 05:09:21 [<00002aaaabe582c7>] 0x2aaaabe582c6

      > 05:18:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      > 05:18:21 [<ffffffff81004919>] dump_trace+0x89/0x450
      > 05:18:21 [<ffffffff8100591c>] show_trace_log_lvl+0x5c/0x80
      > 05:18:22 [<ffffffff81005955>] show_trace+0x15/0x20
      > 05:18:22 [<ffffffff813bd54b>] dump_stack+0x79/0x84
      > 05:18:22 [<ffffffff810c9bf9>] __rcu_pending+0x199/0x410
      > 05:18:22 [<ffffffff810c9eda>] rcu_check_callbacks+0x6a/0x120
      > 05:18:22 [<ffffffff8105cb96>] update_process_times+0x46/0x90
      > 05:18:22 [<ffffffff81080a16>] tick_sched_timer+0x66/0xc0
      > 05:18:22 [<ffffffff8107386f>] __run_hrtimer+0xcf/0x1d0
      > 05:18:22 [<ffffffff81073bb7>] hrtimer_interrupt+0xe7/0x220
      > 05:18:22 [<ffffffff813c9dd9>] smp_apic_timer_interrupt+0x69/0x99
      > 05:18:22 [<ffffffff813c8d13>] apic_timer_interrupt+0x13/0x20
      > 05:18:22 [<ffffffffa037e419>] cl_lock_get_trust+0x89/0x90 [obdclass]
      > 05:18:22 [<ffffffffa0380307>] cl_locks_prune+0xc7/0x330 [obdclass]
      > 05:18:22 [<ffffffffa078dc91>] lov_delete_raid0+0xe1/0x3f0 [lov]
      > 05:18:22 [<ffffffffa078d059>] lov_object_delete+0x69/0x180 [lov]
      > 05:18:22 [<ffffffffa036bda5>] lu_object_free+0x85/0x1a0 [obdclass]
      > 05:18:22 [<ffffffffa036c2de>] lu_object_put+0xbe/0x370 [obdclass]
      > 05:18:22 [<ffffffffa0374d1e>] cl_object_put+0xe/0x10 [obdclass]
      > 05:18:22 [<ffffffffa086803d>] cl_inode_fini+0xbd/0x2a0 [lustre]
      > 05:18:22 [<ffffffffa082f160>] ll_clear_inode+0x2d0/0x930 [lustre]
      > 05:18:22 [<ffffffffa082f84d>] ll_delete_inode+0x8d/0x1f0 [lustre]
      > 05:18:22 [<ffffffff811662f1>] evict+0x91/0x170
      > 05:18:22 [<ffffffff81166762>] iput+0xc2/0x180
      > 05:18:22 [<ffffffffa07ffaf6>] ll_d_iput+0x2e6/0x830 [lustre]
      > 05:18:22 [<ffffffff8116106b>] d_kill+0xcb/0x130
      > 05:18:22 [<ffffffff81164401>] dput+0xc1/0x190
      > 05:18:22 [<ffffffff8114ec27>] fput+0x167/0x200
      > 05:18:22 [<ffffffff8115475f>] do_execve_common+0x16f/0x300
      > 05:18:22 [<ffffffff8115497f>] do_execve+0x3f/0x50
      > 05:18:22 [<ffffffff8100a89e>] sys_execve+0x4e/0x80
      > 05:18:22 [<ffffffff813c87bc>] stub_execve+0x6c/0xc0
      > 05:18:22 [<00002aaaabe582c7>] 0x2aaaabe582c6
      —
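
      For readers unfamiliar with this path, the following is a heavily simplified sketch of the cl_locks_prune() loop as we read it in this code base. The helper names come from the stacks above, but reference bookkeeping and waiting on busy locks are omitted, so treat it as an approximation rather than the actual obdclass code. The relevant property is that the loop restarts from the head of the object's lock list on every iteration, so a lock that never gets removed from that list keeps the thread spinning between cl_lock_get_trust() and cl_lock_mutex_get(), which would match the two stacks above.

      /* Approximate shape of cl_locks_prune() (obdclass); simplified, not
       * verbatim. Reference bookkeeping and waiting for lock users omitted. */
      static void locks_prune_sketch(const struct lu_env *env,
                                     struct cl_object *obj, int cancel)
      {
              struct cl_object_header *head = cl_object_header(obj);
              struct cl_lock *lock;

              spin_lock(&head->coh_lock_guard);
              while (!list_empty(&head->coh_locks)) {
                      lock = list_entry(head->coh_locks.next,
                                        struct cl_lock, cll_linkage);
                      cl_lock_get_trust(lock);        /* second stack is here */
                      spin_unlock(&head->coh_lock_guard);

                      cl_lock_mutex_get(env, lock);   /* first stack is here */
                      if (lock->cll_state < CLS_FREEING) {
                              if (cancel)
                                      cl_lock_cancel(env, lock);
                              cl_lock_delete(env, lock);
                      }
                      cl_lock_mutex_put(env, lock);
                      cl_lock_put(env, lock);

                      /* Restart from the head: if the lock was never removed
                       * from coh_locks, it is picked up again and the thread
                       * spins here indefinitely. */
                      spin_lock(&head->coh_lock_guard);
              }
              spin_unlock(&head->coh_lock_guard);
      }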

      Sorry for the massive dump of information in one bug, but we strongly suspect these bugs have a single cause or several tightly related causes.

      With assistance from Xyratex, we've singled out the following patches, landed between 2.4.0 and master, as possible culprits:

      most suspicious:
      18834a5 LU-3321 clio: remove stackable cl_page completely
      0a259bd LU-3321 clio: collapse layer of cl_page
      less suspicious:
      13079de LU-3889 osc: Allow lock to be canceled at ENQ time
      7168ea8 LU-3027 clio: Do not shrink sublock at cancel
      521335c LU-3433 clio: wrong cl_lock usage

      On my to-do list is testing master with some of these patches reverted, to see what effect, if any, this has on the bugs listed above.

      Attachments

        1. 0002-LELUS-203-clio-no-recursive-closures.patch
          7 kB
        2. cl_lock_debug_patch.diff
          9 kB
        3. lelus-203-lock-hold-v1.patch
          0.5 kB
        4. locks.log
          10 kB
        5. mmstress.tar.gz
          39 kB
        6. osc.ko
          0.2 kB

        Issue Links

          Activity

            [LU-4591] Related cl_lock failures on master/2.5

            Hi Jinshan and Patrick,
            In my comment last Friday, I said that tests were run at CEA and they still had the issue. In fact, they only ran a subset of the tests, and they did not have crashes. They had client evictions, but that is probably another problem (perhaps LU-4495).
            Monday and Tuesday were the cluster maintenance window. Tests will restart today with Lustre 2.4.3 plus http://review.whamcloud.com/9881 and without reverting LU-3027. Hope to have news soon.

            Patrick

            patrick.valentin Patrick Valentin (Inactive) added a comment

            Hi Patrick Valentin,

            Please apply patch http://review.whamcloud.com/9881 to your branch and don't revert anything.

            I was wondering why you talked to yourself and it took me a while to figure out that you guys have the same first name :-D

            Jinshan

            jay Jinshan Xiong (Inactive) added a comment

            Patrick -

            I ran Cray's 2.5 (which is very similar to Intel's 2.5.1) with the LU-4558 patch and the LU-3027 lov patch applied. So yes, I reapplied the patch I was previously reverting.

            It worries me (obviously) that CEA continued to see problems in 2.4.2. Can you share which specific assertions/GPFs they continued to hit? And if possible, what codes were causing the issue?

            Thanks

            • Patrick
            paf Patrick Farrell (Inactive) added a comment

            Patrick,

            in your comment on 03/Apr/14 8:53 PM, you wrote:
            > I'm adding this patch and LU-3027 lov to Cray 2.5, and ...
            Does it mean you applied the LU-4558 patch AND continued to revert "LU-3027 lov", or does it mean that you reapplied the "LU-3027 lov" patch you were previously reverting?

            Today we built a Lustre 2.4.2 with LU-4558 patch set 2, without reverting "LU-3027 lov", for tests at CEA, and it seems they still have the issue. I should have confirmation of the test results tonight or next Monday.

            Thanks in advance
            Patrick

            patrick.valentin Patrick Valentin (Inactive) added a comment

            duplicate of LU-4558.

            jay Jinshan Xiong (Inactive) added a comment

            Jinshan - Testing last night completed without any problems. Thank you very much for this - It looks like we've probably finally fixed a bug we've been working on for a long time. (Much of the work on this bug on our side happened before we opened the bug with you.)

            I've given a positive review to the mod as well. I think this ticket could probably be closed as a duplicate of LU-4558.

            Again, thank you very much. I've been working on this in various forms since about October of last year.

            paf Patrick Farrell (Inactive) added a comment

            That's really good, Patrick. Thank you for your effort on this bug.

            Hi Ann, I will take a look at it soon.

            Jinshan

            jay Jinshan Xiong (Inactive) added a comment

            Jinshan - Wow. Finally some good news on these bugs - early testing results on master are perfect. I would expect to have seen 10-20 instances of these various bugs by now in my testing, and I have not seen any yet.

            I'm adding this patch and LU-3027 lov to Cray 2.5, and will be doing some lengthy testing tonight, both with the mmstress reproducer and general Cray I/O stress testing. I may also try racer.sh for a while.

            paf Patrick Farrell (Inactive) added a comment

            Jinshan, you might want to take a look at LU-4861. Cray is also seeing a deadlock in cl_lock_mutex_get along the cl_glimpse_lock path. This is independent of Patrick's testing. Seems like this different behavior could shed light on or confirm your suspicions about the root cause of LU-4591. I didn't want to add the deadlock info here in case it's an unrelated problem.

            amk Ann Koehler (Inactive) added a comment

            Please use patch version 2 of http://review.whamcloud.com/9881

            jay Jinshan Xiong (Inactive) added a comment

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 21

              Dates

                Created:
                Updated:
                Resolved: