Lustre / LU-4591

Related cl_lock failures on master/2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.2
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • Environment: Master clients on SLES11SP3; server version irrelevant (tested against 2.1, 2.4, 2.4.1, 2.5).
    • Severity: 3
    • 12543

    Description

      We've seen a number of different cl_lock bugs in master/2.4.1/2.5 that we believe are related.

      We have seen these in master as recently as two weeks ago (current master cannot run mmstress at all due to a problem in the paging code).

      These bugs are not present in Intel released 2.4.0, but we've seen them in Cray 2.4.1 and 2.5 (which do not track precisely with the Intel versions of 2.4.1 and 2.5).

      We've seen all of these bugs during our general-purpose testing, but we believe they're related: all of them are easily reproduced by running multiple copies of mmstress from the Linux Test Project (mtest05 - I will attach the source) on multiple nodes, and none of them seem to be present in 2.4.0. (At least, none of them are reproduced in that context.)
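
      (For orientation only: the workload is essentially many processes concurrently faulting pages of a shared mmap'ed file on Lustre. The fragment below is a deliberately simplified stand-in for that access pattern, not the attached mtest05 source - the path, sizes, and process count are made up, and all of mmstress's verification logic is omitted.)

          /* Illustrative sketch only, not the attached mmstress/mtest05 code.
           * Many processes dirty pages of one shared mapping, which drives the
           * ll_fault()/ll_page_mkwrite() and cl_lock enqueue/cancel paths seen
           * in the traces below.  Path and sizes are hypothetical. */
          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <sys/wait.h>
          #include <unistd.h>

          #define NPROC 8
          #define FSIZE (64UL * 1024 * 1024)

          int main(void)
          {
                  const char *path = "/lus/testfs/mmstress.dat";  /* hypothetical */
                  int fd = open(path, O_RDWR | O_CREAT, 0644);

                  if (fd < 0 || ftruncate(fd, FSIZE) < 0) {
                          perror("open/ftruncate");
                          return 1;
                  }
                  for (int i = 0; i < NPROC; i++) {
                          if (fork() != 0)
                                  continue;
                          /* child: repeatedly fault in and dirty every page */
                          char *map = mmap(NULL, FSIZE, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, 0);
                          if (map == MAP_FAILED)
                                  _exit(1);
                          for (int pass = 0; pass < 100; pass++)
                                  for (size_t off = 0; off < FSIZE; off += 4096)
                                          map[off]++;
                          munmap(map, FSIZE);
                          _exit(0);
                  }
                  while (wait(NULL) > 0)
                          ;
                  close(fd);
                  return 0;
          }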

      Not all of the stack traces below are from runs on master (sorry - it's not what I've got handy), but all of the listed bugs have been reproduced on master:

      General protection fault in osc_lock_detach (this one seems to be the most common):

      > 20:00:29 Pid: 3120, comm: ldlm_bl_04 Tainted: P 3.0.82-0.7.9_1.0000.7690-cray_gem_c #1
      > 20:00:30 RIP: 0010:[<ffffffffa0635586>] [<ffffffffa0635586>] osc_lock_detach+0x46/0x180 [osc]
      > 20:00:30 RSP: 0000:ffff880204065cc0 EFLAGS: 00010206
      > 20:00:30 RAX: 0000000000005050 RBX: 5a5a5a5a5a5a5a5a RCX: ffff880204065cc0
      > 20:00:30 RDX: ffff880204065cc0 RSI: ffff880203c38b50 RDI: ffffffffa066ba00
      > 20:00:30 RBP: ffff880204065cf0 R08: 0000000000000020 R09: ffffffff8136fd38
      > 20:00:30 R10: 0000000000000400 R11: 0000000000000009 R12: ffff880203c38b50
      > 20:00:30 R13: 0000000000000000 R14: ffff88020e33ab40 R15: ffff880201fd23e0
      > 20:00:30 FS: 00002aaaaec54700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
      > 20:00:30 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      > 20:00:30 CR2: 00002aaaad101000 CR3: 000000020439d000 CR4: 00000000000407e0
      > 20:00:30 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 20:00:30 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > 20:00:30 Process ldlm_bl_04 (pid: 3120, threadinfo ffff880204064000, task ffff88020e73d7e0)
      > 20:00:30 Stack:
      > 20:00:30 ffff880204065cc0 0000000000000000 ffff880203c38b50 0000000000000000
      > 20:00:30 ffff88020e33ab40 ffff880201fd23e0 ffff880204065d60 ffffffffa0635a7a
      > 20:00:30 ffff880201fd23e0 ffff880150f57ed0 0000000001fd23e0 ffff880150f57ed0
      > 20:00:30 Call Trace:
      > 20:00:30 [<ffffffffa0635a7a>] osc_lock_cancel+0xca/0x410 [osc]
      > 20:00:30 [<ffffffffa02b874d>] cl_lock_cancel0+0x6d/0x150 [obdclass]
      > 20:00:30 [<ffffffffa02b948b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 20:00:30 [<ffffffffa0636e6c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 20:00:30 [<ffffffffa03c5334>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      > 20:00:30 [<ffffffffa03c588c>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
      > 20:00:30 [<ffffffff8106610e>] kthread+0x9e/0xb0
      > 20:00:30 [<ffffffff81363ff4>] kernel_thread_helper+0x4/0x10
      > 20:00:30 Code: f8 66 66 66 66 90 49 89 f4 49 89 ff 48 c7 c7 00 ba 66 a0 e8 0d cb d2 e0 49 8b 5c 24 28 48 85 db 74 7b 49 c7 44 24 28 00 00 00 00
      > 20:00:30 c7 83 60 01 00 00 00 00 00 00 49 c7 44 24 70 00 00 00 00 fe
      > 20:00:30 RIP [<ffffffffa0635586>] osc_lock_detach+0x46/0x180 [osc]
      > 20:00:30 RSP <ffff880204065cc0>
      > 20:00:30 --[ end trace 317fa078a5344509 ]--
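
      A side note on interpreting the register dump above (our reading, not a confirmed diagnosis): RBX holds 0x5a5a5a5a5a5a5a5a, a typical poison-on-free fill pattern, so the crash in osc_lock_detach() looks like a use-after-free of an already-freed lock structure rather than random memory corruption. A minimal, generic sketch of why poisoned frees surface this way (made-up struct and poison value, not Lustre source):

          #include <stdlib.h>
          #include <string.h>

          /* Hypothetical stand-in for a lock object that other threads
           * may still hold stale pointers to. */
          struct fake_lock {
                  struct fake_lock *fl_peer;
                  int               fl_state;
          };

          #define POISON_BYTE 0x5a

          static void fake_lock_free(struct fake_lock *lk)
          {
                  /* Fill with a poison byte before freeing so any later
                   * use of a stale pointer is unmistakable in a crash. */
                  memset(lk, POISON_BYTE, sizeof(*lk));
                  free(lk);
          }

          /* A thread that dereferences lk->fl_peer after fake_lock_free()
           * loads 0x5a5a5a5a5a5a5a5a; in the kernel that shows up as the
           * general protection fault with RBX = 5a5a5a5a5a5a5a5a above. */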

      (osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

      > 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
      > 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) LBUG
      > 23:39:52 Pid: 6456, comm: mmstress
      > 23:39:52 Call Trace:
      > 23:39:52 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
      > 23:39:52 [<ffffffff81004dd9>] dump_trace+0x89/0x440
      > 23:39:52 [<ffffffffa016d897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      > 23:39:52 [<ffffffffa016dde7>] lbug_with_loc+0x47/0xc0 [libcfs]
      > 23:39:52 [<ffffffffa063b8ef>] osc_lock_enqueue+0x74f/0x8d0 [osc]
      > 23:39:52 [<ffffffffa02c0dab>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 23:39:52 [<ffffffffa06d13dd>] lov_lock_enqueue+0x1fd/0x880 [lov]
      > 23:39:52 [<ffffffffa02c0dab>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 23:39:52 [<ffffffffa02c1c8f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
      > 23:39:52 [<ffffffffa02c287e>] cl_lock_request+0x7e/0x270 [obdclass]
      > 23:39:52 [<ffffffffa02c7db4>] cl_io_lock+0x394/0x5c0 [obdclass]
      > 23:39:52 [<ffffffffa02c807a>] cl_io_loop+0x9a/0x1a0 [obdclass]
      > 23:39:52 [<ffffffffa078bf78>] ll_fault+0x308/0x4e0 [lustre]
      > 23:39:52 [<ffffffff81111ce6>] __do_fault+0x76/0x570
      > 23:39:52 [<ffffffff81112284>] handle_pte_fault+0xa4/0xcc0
      > 23:39:52 [<ffffffff8111304e>] handle_mm_fault+0x1ae/0x240
      > 23:39:52 [<ffffffff81026c6f>] do_page_fault+0x18f/0x420
      > 23:39:52 [<ffffffff813628cf>] page_fault+0x1f/0x30
      > 23:39:52 [<00000000200007ea>] 0x200007ea
      > 23:39:52 Kernel panic - not syncing: LBUG
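
      As best we can tell from the console message alone, the check that fires is an LASSERTF that the osc_lock slice is still in OLS_NEW when enqueue starts; hitting it with state 6 means the same cl_lock reached osc_lock_enqueue() after its osc slice had already advanced, which fits the same pattern of concurrent reuse/cancellation. A hedged reconstruction of the assertion site, from the message only (the exact source may differ):

          /* osc_lock.c:1134, approximate reconstruction from the log: */
          LASSERTF(ols->ols_state == OLS_NEW,
                   "Impossible state: %d\n", ols->ols_state);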

      (lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:

      2013-12-06T21:36:44.086298-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:
      2013-12-06T21:36:44.111468-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) LBUG
      2013-12-06T21:36:44.111497-06:00 c0-0c0s11n2 Pid: 9051, comm: mmstress
      2013-12-06T21:36:44.111536-06:00 c0-0c0s11n2 Call Trace:
      2013-12-06T21:36:44.136717-06:00 c0-0c0s11n2 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0
      2013-12-06T21:36:44.136757-06:00 c0-0c0s11n2 [<ffffffff81004849>] dump_trace+0x89/0x450
      2013-12-06T21:36:44.136787-06:00 c0-0c0s11n2 [<ffffffffa02128d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2013-12-06T21:36:44.169446-06:00 c0-0c0s11n2 [<ffffffffa0212e37>] lbug_with_loc+0x47/0xc0 [libcfs]
      2013-12-06T21:36:44.169474-06:00 c0-0c0s11n2 [<ffffffffa0778f7a>] lov_lock_link_find+0x16a/0x170 [lov]
      2013-12-06T21:36:44.169502-06:00 c0-0c0s11n2 [<ffffffffa07799cf>] lov_sublock_adopt+0x8f/0x370 [lov]
      2013-12-06T21:36:44.194648-06:00 c0-0c0s11n2 [<ffffffffa077cba7>] lov_lock_init_raid0+0x637/0xd50 [lov]
      2013-12-06T21:36:44.194687-06:00 c0-0c0s11n2 [<ffffffffa0773c0e>] lov_lock_init+0x1e/0x60 [lov]
      2013-12-06T21:36:44.194704-06:00 c0-0c0s11n2 [<ffffffffa036961a>] cl_lock_hold_mutex+0x32a/0x640 [obdclass]
      2013-12-06T21:36:44.219871-06:00 c0-0c0s11n2 [<ffffffffa0369ab2>] cl_lock_request+0x62/0x270 [obdclass]
      2013-12-06T21:36:44.219901-06:00 c0-0c0s11n2 [<ffffffffa036f0ce>] cl_io_lock+0x39e/0x5d0 [obdclass]
      2013-12-06T21:36:44.245175-06:00 c0-0c0s11n2 [<ffffffffa036f3a2>] cl_io_loop+0xa2/0x1b0 [obdclass]
      2013-12-06T21:36:44.245216-06:00 c0-0c0s11n2 [<ffffffffa0838760>] ll_page_mkwrite+0x280/0x680 [lustre]
      2013-12-06T21:36:44.245228-06:00 c0-0c0s11n2 [<ffffffff81118019>] __do_fault+0xf9/0x5b0
      2013-12-06T21:36:44.270380-06:00 c0-0c0s11n2 [<ffffffff81118574>] handle_pte_fault+0xa4/0xcd0
      2013-12-06T21:36:44.270420-06:00 c0-0c0s11n2 [<ffffffff8111934e>] handle_mm_fault+0x1ae/0x240
      2013-12-06T21:36:44.270431-06:00 c0-0c0s11n2 [<ffffffff8138a915>] do_page_fault+0x1e5/0x4a0
      2013-12-06T21:36:44.295606-06:00 c0-0c0s11n2 [<ffffffff8138778f>] page_fault+0x1f/0x30
      2013-12-06T21:36:44.295648-06:00 c0-0c0s11n2 [<00000000200007fa>] 0x200007fa

      General protection fault in cl_lock_put (possibly the same issue as LU-4558?):

      > 19:55:11 general protection fault: 0000 [#1] SMP
      > 19:55:11 CPU 19
      > 19:55:11 Modules linked in: mic xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) bpmcdmod nic_compat cmsr lmv mgc lustre lov osc mdc fid fld kgnilnd ptlrpc obdclass lnet lvfs sha1_generic md5 libcfs ib_core pcie_link_bw_monitor kdreg gpcd_ari ipogif_ari kgni_ari hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_ari craytrace
      > 19:55:11 Pid: 8720, comm: ldlm_bl_02 Tainted: P 3.0.80-0.5.1_1.0501.7664-cray_ari_c #1 Cray Inc. Cascade/Cascade
      > 19:55:11 RIP: 0010:[<ffffffffa036b553>] [<ffffffffa036b553>] cl_lock_put+0x103/0x410 [obdclass]
      > 19:55:11 RSP: 0018:ffff8807ef0fdc70 EFLAGS: 00010246
      > 19:55:11 RAX: 0000000000000001 RBX: 5a5a5a5a5a5a5a5a RCX: ffff8807f30a7af8
      > 19:55:11 RDX: ffffffffa038da5b RSI: 5a5a5a5a5a5a5a5a RDI: ffff8807f1d169c8
      > 19:55:11 RBP: ffff8807ef0fdc90 R08: ffffffffa037ebc0 R09: 00000000000002f8
      > 19:55:11 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      > 19:55:11 R13: ffff8807f1d169c8 R14: ffff8807f30a7af8 R15: 0000000000000001
      > 19:55:11 FS: 00007ffff7ff6700(0000) GS:ffff88087f6c0000(0000) knlGS:0000000000000000
      > 19:55:11 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      > 19:55:11 CR2: 00005555557af328 CR3: 0000000001661000 CR4: 00000000001406e0
      > 19:55:11 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 19:55:11 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > 19:55:11 Process ldlm_bl_02 (pid: 8720, threadinfo ffff8807ef0fc000, task ffff88083c2ae0c0)
      > 19:55:11 Stack:
      > 19:55:11 ffff88083bf7e840 0000000000000000 ffff8807f1d169c8 ffff8807f30a7af8
      > 19:55:11 ffff8807ef0fdcf0 ffffffffa07194ca ffff88083bf7e840 ffff88050caec000
      > 19:55:11 0000000000000001 0000000000000000 ffff8807ef0fdcf0 ffff88050caec000
      > 19:55:11 Call Trace:
      > 19:55:11 [<ffffffffa07194ca>] osc_ldlm_blocking_ast+0xaa/0x330 [osc]
      > 19:55:11 [<ffffffffa0460aeb>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
      > 19:55:11 [<ffffffffa046f41a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      > 19:55:11 [<ffffffffa0473ec0>] ldlm_cli_cancel+0x60/0x370 [ptlrpc]
      > 19:55:11 [<ffffffffa071819e>] osc_lock_cancel+0xfe/0x1c0 [osc]
      > 19:55:11 [<ffffffffa0368145>] cl_lock_cancel0+0x75/0x160 [obdclass]
      > 19:55:11 [<ffffffffa0368e8b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 19:55:11 [<ffffffffa071962c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 19:55:11 [<ffffffffa04777d4>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      > 19:55:11 [<ffffffffa0477d5c>] ldlm_bl_thread_main+0x22c/0x450 [ptlrpc]
      > 19:55:11 [<ffffffff8138e894>] kernel_thread_helper+0x4/0x10
      > 19:55:11 Code: 00 00 00 c7 05 23 e3 06 00 01 00 00 00 4c 8b 45 08 8b 13 48 89 d9 48 c7 c6 60 82 38 a0 48 c7 c7 40 98 3d a0 31 c0 e8 5d 6a eb ff <f0> ff 0b 0f 94 c0 84 c0 74 06 83 7b 50 06 74 7d f6 05 5a b4 ed
      > 19:55:11 RIP [<ffffffffa036b553>] cl_lock_put+0x103/0x410 [obdclass]
      > 19:55:11 RSP <ffff8807ef0fdc70>

      (lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:

      > 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      > 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      > 04:59:03 Pid: 19083, comm: mmstress
      > 04:59:03 Call Trace:
      > 04:59:03 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
      > 04:59:03 [<ffffffff81004dd9>] dump_trace+0x89/0x440
      > 04:59:03 [<ffffffffa0168897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      > 04:59:03 [<ffffffffa0168de7>] lbug_with_loc+0x47/0xc0 [libcfs]
      > 04:59:03 [<ffffffffa06d458e>] lovsub_lock_state+0x19e/0x1a0 [lov]
      > 04:59:03 [<ffffffffa02b7b80>] cl_lock_state_signal+0x60/0x160 [obdclass]
      > 04:59:03 [<ffffffffa02b7d4d>] cl_lock_state_set+0xcd/0x1a0 [obdclass]
      > 04:59:03 [<ffffffffa02bbe0b>] cl_enqueue_try+0x14b/0x320 [obdclass]
      > 04:59:03 [<ffffffffa06cc3dd>] lov_lock_enqueue+0x1fd/0x880 [lov]
      > 04:59:03 [<ffffffffa02bbdbb>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 04:59:03 [<ffffffffa02bcc9f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
      > 04:59:03 [<ffffffffa02bd88e>] cl_lock_request+0x7e/0x270 [obdclass]
      > 04:59:03 [<ffffffffa02c2dc4>] cl_io_lock+0x394/0x5c0 [obdclass]
      > 04:59:03 [<ffffffffa02c308a>] cl_io_loop+0x9a/0x1a0 [obdclass]
      > 04:59:03 [<ffffffffa0786ee8>] ll_fault+0x308/0x4e0 [lustre]
      > 04:59:03 [<ffffffff81111d16>] __do_fault+0x76/0x570
      > 04:59:03 [<ffffffff811122b4>] handle_pte_fault+0xa4/0xcc0
      > 04:59:03 [<ffffffff8136290f>] page_fault+0x1f/0x30
      > 04:59:03 [<00000000200007ea>] 0x200007ea

      General protection fault in cl_lock_delete:

      > 22:37:45 Pid: 6333, comm: ldlm_bl_13 Tainted: P 3.0.82-0.7.9_1.0502.7742-cray_gem_c #1
      > 22:37:45 RIP: 0010:[<ffffffffa02b8eb0>] [<ffffffffa02b8eb0>] cl_lock_delete0+0x190/0x1f0 [obdclass]
      > 22:37:45 RSP: 0018:ffff8803d0403bc0 EFLAGS: 00010296
      > 22:37:45 RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a42 RCX: ffff8803d7509ca8
      > 22:37:45 RDX: ffff8803d54b6800 RSI: ffff8803fbfdceb8 RDI: ffff8803d54b67c0
      > 22:37:45 RBP: ffff8803d0403be0 R08: ffff8803d55a8858 R09: 0000000000000000
      > 22:37:45 R10: 0000000000000023 R11: 0000000000000000 R12: ffff8803d55a8818
      > 22:37:45 R13: ffff8803d55a8810 R14: ffff880803ddcdb8 R15: 0000000000000001
      > 22:37:45 FS: 000000004013d880(0000) GS:ffff88081fcc0000(0000) knlGS:0000000000000000
      > 22:37:46 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      > 22:37:46 CR2: 00002aaaaab38000 CR3: 0000000803db3000 CR4: 00000000000407e0
      > 22:37:46 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 22:37:46 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > 22:37:46 Process ldlm_bl_13 (pid: 6333, threadinfo ffff8803d0402000, task ffff880403064040)
      > 22:37:46 Stack:
      > 22:37:46 ffff8803d55a8810 ffff880803ddcdb8 ffff880803ddcdb8 ffff8803d55a8810
      > 22:37:46 ffff8803d0403c00 ffffffffa02b9063 ffff8801e05fbb50 0000000000000000
      > 22:37:46 ffff8803d0403c60 ffffffffa0636e7a ffff8801e05fbb50 ffff8803d54b67c0
      > 22:37:46 Call Trace:
      > 22:37:46 [<ffffffffa02b9063>] cl_lock_delete+0x153/0x1a0 [obdclass]
      > 22:37:46 [<ffffffffa0636e7a>] osc_ldlm_blocking_ast+0x21a/0x330 [osc]
      > 22:37:46 [<ffffffffa03adabb>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
      > 22:37:46 [<ffffffffa03bc7ca>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      > 22:37:46 [<ffffffffa03c11bb>] ldlm_cli_cancel+0x6b/0x380 [ptlrpc]
      > 22:37:46 [<ffffffffa0635c47>] osc_lock_cancel+0x297/0x410 [osc]
      > 22:37:46 [<ffffffffa02b770d>] cl_lock_cancel0+0x6d/0x150 [obdclass]
      > 22:37:46 [<ffffffffa02b844b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 22:37:46 [<ffffffffa0636e6c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 22:37:46 [<ffffffffa03c4944>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      > 22:37:46 [<ffffffffa03c4e9c>] ldlm_bl_thread_main+0x1fc/0x420 [ptlrpc]
      > 22:37:46 [<ffffffff8106610e>] kthread+0x9e/0xb0
      > 22:37:46 [<ffffffff81364034>] kernel_thread_helper+0x4/0x10
      > 22:37:46 Code: 08 48 89 10 49 89 4d 18 49 89 4d 20 41 fe 44 24 5c 49 8b 45 10 4d 8d 65 08 49 39 c4 48 8d 58 e8 0f 84 be fe ff ff 0f 1f 44 00 00
      > 22:37:46 8b 43 10 48 8b 40 50 48 85 c0 74 08 48 89 de 4c 89 f7 ff d0
      > 22:37:46 RIP [<ffffffffa02b8eb0>] cl_lock_delete0+0x190/0x1f0 [obdclass]
      > 22:37:46 RSP <ffff8803d0403bc0>
      > 22:37:46 --[ end trace cfc07b184a378ec7 ]--

      One more issue, which is slightly different but is still triggered by the same tests, still not present in 2.4.0, and so on.

      Lustre is getting stuck looping in cl_locks_prune. We have many cases of applications failing to exit, with processes stuck somewhere under cl_locks_prune - here are two examples:
      > 05:09:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      > 05:09:21 [<ffffffff81004919>] dump_trace+0x89/0x450
      > 05:09:21 [<ffffffff8100591c>] show_trace_log_lvl+0x5c/0x80
      > 05:09:21 [<ffffffff81005955>] show_trace+0x15/0x20
      > 05:09:21 [<ffffffff813bd54b>] dump_stack+0x79/0x84
      > 05:09:21 [<ffffffff810c9bf9>] __rcu_pending+0x199/0x410
      > 05:09:21 [<ffffffff810c9eda>] rcu_check_callbacks+0x6a/0x120
      > 05:09:21 [<ffffffff8105cb96>] update_process_times+0x46/0x90
      > 05:09:21 [<ffffffff81080a16>] tick_sched_timer+0x66/0xc0
      > 05:09:21 [<ffffffff8107386f>] __run_hrtimer+0xcf/0x1d0
      > 05:09:21 [<ffffffff81073bb7>] hrtimer_interrupt+0xe7/0x220
      > 05:09:21 [<ffffffff813c9dd9>] smp_apic_timer_interrupt+0x69/0x99
      > 05:09:21 [<ffffffff813c8d13>] apic_timer_interrupt+0x13/0x20
      > 05:09:21 [<ffffffffa0373a79>] cl_env_info+0x9/0x20 [obdclass]
      > 05:09:21 [<ffffffffa037dfc6>] cl_lock_mutex_get+0x36/0xd0 [obdclass]
      > 05:09:21 [<ffffffffa0380319>] cl_locks_prune+0xd9/0x330 [obdclass]
      > 05:09:21 [<ffffffffa078dc91>] lov_delete_raid0+0xe1/0x3f0 [lov]
      > 05:09:21 [<ffffffffa078d059>] lov_object_delete+0x69/0x180 [lov]
      > 05:09:21 [<ffffffffa036bda5>] lu_object_free+0x85/0x1a0 [obdclass]
      > 05:09:21 [<ffffffffa036c2de>] lu_object_put+0xbe/0x370 [obdclass]
      > 05:09:21 [<ffffffffa0374d1e>] cl_object_put+0xe/0x10 [obdclass]
      > 05:09:21 [<ffffffffa086803d>] cl_inode_fini+0xbd/0x2a0 [lustre]
      > 05:09:21 [<ffffffffa082f160>] ll_clear_inode+0x2d0/0x930 [lustre]
      > 05:09:21 [<ffffffffa082f84d>] ll_delete_inode+0x8d/0x1f0 [lustre]
      > 05:09:21 [<ffffffff811662f1>] evict+0x91/0x170
      > 05:09:21 [<ffffffff81166762>] iput+0xc2/0x180
      > 05:09:21 [<ffffffffa07ffaf6>] ll_d_iput+0x2e6/0x830 [lustre]
      > 05:09:21 [<ffffffff8116106b>] d_kill+0xcb/0x130
      > 05:09:21 [<ffffffff81164401>] dput+0xc1/0x190
      > 05:09:21 [<ffffffff8114ec27>] fput+0x167/0x200
      > 05:09:21 [<ffffffff8115475f>] do_execve_common+0x16f/0x300
      > 05:09:21 [<ffffffff8115497f>] do_execve+0x3f/0x50
      > 05:09:21 [<ffffffff8100a89e>] sys_execve+0x4e/0x80
      > 05:09:21 [<ffffffff813c87bc>] stub_execve+0x6c/0xc0
      > 05:09:21 [<00002aaaabe582c7>] 0x2aaaabe582c6

      > 05:18:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      > 05:18:21 [<ffffffff81004919>] dump_trace+0x89/0x450
      > 05:18:21 [<ffffffff8100591c>] show_trace_log_lvl+0x5c/0x80
      > 05:18:22 [<ffffffff81005955>] show_trace+0x15/0x20
      > 05:18:22 [<ffffffff813bd54b>] dump_stack+0x79/0x84
      > 05:18:22 [<ffffffff810c9bf9>] __rcu_pending+0x199/0x410
      > 05:18:22 [<ffffffff810c9eda>] rcu_check_callbacks+0x6a/0x120
      > 05:18:22 [<ffffffff8105cb96>] update_process_times+0x46/0x90
      > 05:18:22 [<ffffffff81080a16>] tick_sched_timer+0x66/0xc0
      > 05:18:22 [<ffffffff8107386f>] __run_hrtimer+0xcf/0x1d0
      > 05:18:22 [<ffffffff81073bb7>] hrtimer_interrupt+0xe7/0x220
      > 05:18:22 [<ffffffff813c9dd9>] smp_apic_timer_interrupt+0x69/0x99
      > 05:18:22 [<ffffffff813c8d13>] apic_timer_interrupt+0x13/0x20
      > 05:18:22 [<ffffffffa037e419>] cl_lock_get_trust+0x89/0x90 [obdclass]
      > 05:18:22 [<ffffffffa0380307>] cl_locks_prune+0xc7/0x330 [obdclass]
      > 05:18:22 [<ffffffffa078dc91>] lov_delete_raid0+0xe1/0x3f0 [lov]
      > 05:18:22 [<ffffffffa078d059>] lov_object_delete+0x69/0x180 [lov]
      > 05:18:22 [<ffffffffa036bda5>] lu_object_free+0x85/0x1a0 [obdclass]
      > 05:18:22 [<ffffffffa036c2de>] lu_object_put+0xbe/0x370 [obdclass]
      > 05:18:22 [<ffffffffa0374d1e>] cl_object_put+0xe/0x10 [obdclass]
      > 05:18:22 [<ffffffffa086803d>] cl_inode_fini+0xbd/0x2a0 [lustre]
      > 05:18:22 [<ffffffffa082f160>] ll_clear_inode+0x2d0/0x930 [lustre]
      > 05:18:22 [<ffffffffa082f84d>] ll_delete_inode+0x8d/0x1f0 [lustre]
      > 05:18:22 [<ffffffff811662f1>] evict+0x91/0x170
      > 05:18:22 [<ffffffff81166762>] iput+0xc2/0x180
      > 05:18:22 [<ffffffffa07ffaf6>] ll_d_iput+0x2e6/0x830 [lustre]
      > 05:18:22 [<ffffffff8116106b>] d_kill+0xcb/0x130
      > 05:18:22 [<ffffffff81164401>] dput+0xc1/0x190
      > 05:18:22 [<ffffffff8114ec27>] fput+0x167/0x200
      > 05:18:22 [<ffffffff8115475f>] do_execve_common+0x16f/0x300
      > 05:18:22 [<ffffffff8115497f>] do_execve+0x3f/0x50
      > 05:18:22 [<ffffffff8100a89e>] sys_execve+0x4e/0x80
      > 05:18:22 [<ffffffff813c87bc>] stub_execve+0x6c/0xc0
      > 05:18:22 [<00002aaaabe582c7>] 0x2aaaabe582c6
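
      In both traces the timer interrupt keeps sampling the thread inside cl_locks_prune() (in cl_lock_mutex_get()/cl_lock_get_trust()), which is consistent with a prune loop that never makes progress: it keeps picking a lock off the object's list, but the lock is never actually removed, so the same iteration repeats forever. A deliberately simplified sketch of that failure shape (generic code, not Lustre source, names made up):

          #include <stddef.h>

          struct item {
                  struct item *next;
                  int          pinned;  /* stand-in for an elevated hold/use count */
          };

          /* Tear down every item on the list.  If an item stays pinned
           * forever (e.g. because a racing thread never drops its hold),
           * this loop spins without making progress - the same shape as
           * the stuck cl_locks_prune() threads above. */
          static void prune_all(struct item **head)
          {
                  while (*head != NULL) {
                          struct item *it = *head;

                          if (it->pinned)
                                  continue;       /* retry; may never succeed */

                          *head = it->next;
                          /* ...final cleanup and free of 'it' here... */
                  }
          }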

      Sorry for the massive dump of information in one bug, but we strongly suspect these bugs have a single cause or several tightly related causes.

      With assistance from Xyratex, we've singled out the following patches, landed between 2.4.0 and master, as possible sources of these bugs:

      most suspicious:
      18834a5 LU-3321 clio: remove stackable cl_page completely
      0a259bd LU-3321 clio: collapse layer of cl_page
      less suspicious:
      13079de LU-3889 osc: Allow lock to be canceled at ENQ time
      7168ea8 LU-3027 clio: Do not shrink sublock at cancel
      521335c LU-3433 clio: wrong cl_lock usage

      On my to-do list is testing master with some of these patches removed to see what effect, if any, this has on the bugs listed above.

      Attachments

        1. mmstress.tar.gz
          39 kB
        2. locks.log
          10 kB
        3. osc.ko
          0.2 kB
        4. cl_lock_debug_patch.diff
          9 kB
        5. 0002-LELUS-203-clio-no-recursive-closures.patch
          7 kB
        6. lelus-203-lock-hold-v1.patch
          0.5 kB

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 21
