Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.4.0
    • SWL hyperion
    • 3
    • 4389

    Description

      A client crashed with a general protection fault in the ptlrpcd_2 thread:

      2012-10-07 08:26:22 general protection fault: 0000 [#1] SMP
      2012-10-07 08:26:22 last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/irq
      2012-10-07 08:26:22 Pid: 4119, comm: ptlrpcd_2 Tainted: G        W  ---------------    2.6.32-279.5.1.el6.x86_64 #1 Dell        XS23-TY     /XS23-TY
      2012-10-07 08:26:22 RIP: 0010:[<ffffffff8104e2e1>]  [<ffffffff8104e2e1>] __wake_up_common+0x31/0x90
      2012-10-07 08:26:22 RSP: 0018:ffff880181f05970  EFLAGS: 00010096
      2012-10-07 08:26:22 RAX: 5a5a5a5a5a5a5a42 RBX: ffff880101760a18 RCX: 0000000000000000
      2012-10-07 08:26:22 RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000003 RDI: ffff880101760a18
      2012-10-07 08:26:22 RBP: ffff880181f059b0 R08: 0000000000000000 R09: 0000000000000000
      2012-10-07 08:26:22 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000286
      2012-10-07 08:26:22 R13: ffff880101760a20 R14: 0000000000000000 R15: 0000000000000000
      2012-10-07 08:26:22 FS:  00007ffff7fdd700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
      2012-10-07 08:26:22 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2012-10-07 08:26:22 CR2: 00002aaab800b078 CR3: 0000000151898000 CR4: 00000000000006f0
      2012-10-07 08:26:22 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2012-10-07 08:26:22 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2012-10-07 08:26:22 Process ptlrpcd_2 (pid: 4119, threadinfo ffff880181f04000, task ffff8801ba3f7500)
      2012-10-07 08:26:22 Stack:
      2012-10-07 08:26:22  ffff880181f05a00 0000000300000000 0000000000000001 ffff880101760a18
      2012-10-07 08:26:22 <d> 0000000000000286 0000000000000003 0000000000000000 0000000000000000
      2012-10-07 08:26:22 <d> ffff880181f059f0 ffffffff810533e8 ffffffffffffff8a ffff880101760a10
      2012-10-07 08:26:22 Call Trace:
      2012-10-07 08:26:22  [<ffffffff810533e8>] __wake_up+0x48/0x70
      2012-10-07 08:26:22  [<ffffffffa0372797>] cfs_waitq_broadcast+0x17/0x20 [libcfs]
      2012-10-07 08:26:22  [<ffffffffa0542f39>] cl_sync_io_note+0x139/0x180 [obdclass]
      2012-10-07 08:26:22  [<ffffffffa0538c45>] cl_page_completion+0x155/0x680 [obdclass]
      2012-10-07 08:26:22  [<ffffffff8105368d>] ? task_rq_lock+0x5d/0xa0
      2012-10-07 08:26:22  [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
      2012-10-07 08:26:22  [<ffffffffa08671d2>] osc_ap_completion+0x222/0x980 [osc]
      2012-10-07 08:26:22  [<ffffffff81060262>] ? default_wake_function+0x12/0x20
      2012-10-07 08:26:22  [<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
      2012-10-07 08:26:22  [<ffffffffa068942d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa0867cbd>] osc_extent_finish+0x38d/0xa30 [osc]
      2012-10-07 08:26:22  [<ffffffffa0688fbc>] ? lustre_msg_get_opc+0x9c/0x110 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa0847d0b>] ? osc_brw_fini_request+0x10b/0x1360 [osc]
      2012-10-07 08:26:22  [<ffffffffa0372bae>] ? cfs_free+0xe/0x10 [libcfs]
      2012-10-07 08:26:22  [<ffffffffa06a9b38>] ? at_measured+0x108/0x390 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa06a9b38>] ? at_measured+0x108/0x390 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa084a0ef>] brw_interpret+0x30f/0x11b0 [osc]
      2012-10-07 08:26:22  [<ffffffffa06850e9>] ? ptlrpc_unregister_bulk+0x99/0xad0 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa067ec2f>] ptlrpc_check_set+0x29f/0x1ae0 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa06b17eb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa06b1d1b>] ptlrpcd+0x22b/0x3a0 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      2012-10-07 08:26:22  [<ffffffffa06b1af0>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-10-07 08:26:22  [<ffffffffa06b1af0>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffffa06b1af0>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
      2012-10-07 08:26:22  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      2012-10-07 08:26:22 Code: 41 56 41 55 41 54 53 48 83 ec 18 0f 1f 44 00 00 89 75 cc 89 55 c8 4c 8d 6f 08 48 8b 57 08 41 89 cf 4d 89 c6 48 8d 42 e8 49 39 d5 <48> 8b 58 18 74 3f 48 83 eb 18 eb 0a 0f 1f 00 48 89 d8 48 8d 5a
      2012-10-07 08:26:22 RIP  [<ffffffff8104e2e1>] __wake_up_common+0x31/0x90
      2012-10-07 08:26:22  RSP <ffff880181f05970>
      2012-10-07 08:26:22 Initializing cgroup subsys cpuset
      

        Activity

          [LU-2101] Oops under cl_sync_io_note()

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to master
          liang Liang Zhen (Inactive) added a comment - rebase the patch to master: http://review.whamcloud.com/#change,5199

          We had a Sequoia client crash in what appears to be the same place.

          2013-01-25 12:53:21.119279 {DefaultControlEventListener} [mmcs]{499}.4.0: 2013-01-25 12:53:11.399 (INFO ) [0xfff80c07680] ibm.cios.jobctld.HwJobController: Job 54185 added with 128 compute nodes
          2013-01-25 16:18:31.874727 {DefaultControlEventListener} [mmcs]{499}.3.0: BUG: spinlock bad magic on CPU#12, ptlrpcd_8/3374 (Not tainted)
          2013-01-25 16:18:31.875471 {DefaultControlEventListener} [mmcs]{499}.3.0: Unable to handle kernel paging request for data at address 0x5a5a5a5a5a5a5be2
          2013-01-25 16:18:31.875750 {DefaultControlEventListener} [mmcs]{499}.3.0: Faulting instruction address: 0xc0000000002393f8
          2013-01-25 16:18:31.876087 {DefaultControlEventListener} [mmcs]{499}.3.0: Oops: Kernel access of bad area, sig: 11 [#1]
          2013-01-25 16:18:31.876366 {DefaultControlEventListener} [mmcs]{499}.3.0: SMP NR_CPUS=68 Blue Gene/Q
          2013-01-25 16:18:31.876736 {DefaultControlEventListener} [mmcs]{499}.3.0: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
          2013-01-25 16:18:31.877158 {DefaultControlEventListener} [mmcs]{499}.3.0: NIP: c0000000002393f8 LR: c0000000002393dc CTR: 0000000000000000
          2013-01-25 16:18:31.877615 {DefaultControlEventListener} [mmcs]{499}.3.0: REGS: c0000003ca61ef30 TRAP: 0300   Not tainted  (2.6.32-220.23.3.bgq.18llnl.V1R1M2.bgq62_16.ppc64)
          2013-01-25 16:18:31.878008 {DefaultControlEventListener} [mmcs]{499}.3.0: MSR: 0000000080029000 <EE,ME,CE>  CR: 22282484  XER: 20000000
          2013-01-25 16:18:31.878414 {DefaultControlEventListener} [mmcs]{499}.3.0: DEAR: 5a5a5a5a5a5a5be2, ESR: 0000000000000000
          2013-01-25 16:18:31.878875 {DefaultControlEventListener} [mmcs]{499}.3.0: TASK = c0000003ec7a4a60[3374] 'ptlrpcd_8' THREAD: c0000003ca61c000 CPU: 12
          2013-01-25 16:18:31.879357 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR00: c0000000002393dc c0000003ca61f1b0 c0000000006e3420 0000000000000046
          2013-01-25 16:18:31.879840 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR04: 0000000000000000 000000005a5a5a5a 61696e746564290a 3420284e6f742074
          2013-01-25 16:18:31.880323 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR08: 63645f382f333337 c0000000006214f8 c00000000079a7b0 0000000001da0000
          2013-01-25 16:18:31.880815 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR12: 0000000022282482 c00000000076c100 ffffffffebc0de04 0000000000001000
          2013-01-25 16:18:31.881317 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR16: c00000036822bc90 0000000000000000 00000000000037e0 0000000000000070
          2013-01-25 16:18:31.881827 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR20: 0000000000000000 c00000034a4fb370 8000000000af424c 0000000000003550
          2013-01-25 16:18:31.882333 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR24: c00000036eeeae20 c000000000574c80 000000000000000c c0000003ec7a4a60
          2013-01-25 16:18:31.882847 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR28: 0000000000000d2e 5a5a5a5a5a5a5a5a c000000000697da8 c00000036eeeae28
          2013-01-25 16:18:31.883351 {DefaultControlEventListener} [mmcs]{499}.3.0: NIP [c0000000002393f8] .spin_bug+0xac/0xfc
          2013-01-25 16:18:31.883868 {DefaultControlEventListener} [mmcs]{499}.3.0: LR [c0000000002393dc] .spin_bug+0x90/0xfc
          2013-01-25 16:18:31.884364 {DefaultControlEventListener} [mmcs]{499}.3.0: Call Trace:
          2013-01-25 16:18:31.884887 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f1b0] [c0000000002393dc] .spin_bug+0x90/0xfc (unreliable)
          2013-01-25 16:18:31.885438 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f260] [c000000000239564] ._raw_spin_lock+0x50/0x1a8
          2013-01-25 16:18:31.885966 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f310] [c00000000042d564] ._spin_lock_irqsave+0x20/0x3c
          2013-01-25 16:18:31.886515 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f390] [c0000000000273ac] .__wake_up+0x2c/0x78
          2013-01-25 16:18:31.887012 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f430] [8000000000ab20cc] .cfs_waitq_broadcast+0x1c/0x30 [libcfs]
          2013-01-25 16:18:31.887540 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f4a0] [80000000024d93e8] .cl_sync_io_note+0x1b8/0x2d0 [obdclass]
          2013-01-25 16:18:31.888074 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f550] [80000000024c8ecc] .cl_page_completion+0x2ac/0x9a0 [obdclass]
          2013-01-25 16:18:31.888551 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f690] [800000000468ad00] .osc_ap_completion+0x590/0xd80 [osc]
          2013-01-25 16:18:31.889100 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f7e0] [800000000468ce64] .osc_extent_finish+0x4d4/0xdc0 [osc]
          2013-01-25 16:18:31.889611 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fa00] [8000000004669b64] .brw_interpret+0x304/0x1800 [osc]
          2013-01-25 16:18:31.890136 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fb40] [8000000003b4a308] .ptlrpc_check_set+0x3c8/0x4e50 [ptlrpc]
          2013-01-25 16:18:31.890702 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fd20] [8000000003b9ffec] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
          2013-01-25 16:18:31.891163 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fe40] [8000000003ba054c] .ptlrpcd+0x35c/0x510 [ptlrpc]
          2013-01-25 16:18:31.891666 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ff90] [c00000000001b9a0] .kernel_thread+0x54/0x70
          2013-01-25 16:18:31.892174 {DefaultControlEventListener} [mmcs]{499}.3.0: Instruction dump:
          2013-01-25 16:18:31.892707 {DefaultControlEventListener} [mmcs]{499}.3.0: 7c681b78 e87e8030 38db0378 7f87e378 481f8c49 60000000 2fbd0000 80bf0004
          2013-01-25 16:18:31.893241 {DefaultControlEventListener} [mmcs]{499}.3.0: 409e0010 e8de8038 38e0ffff 4800000c <e8fd018a> 38dd0378 811f0008 e87e8040
          2013-01-25 16:18:31.893642 {DefaultControlEventListener} [mmcs]{499}.3.0: Kernel panic - not syncing: Fatal exception
          2013-01-25 16:18:31.894048 {DefaultControlEventListener} [mmcs]{499}.3.0: Call Trace:
          2013-01-25 16:18:31.894528 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ec60] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
          2013-01-25 16:18:31.894940 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ed10] [c000000000431ef4] .panic+0x80/0x1ac
          2013-01-25 16:18:31.895407 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61eda0] [c000000000019d40] .die+0x1a4/0x1bc
          2013-01-25 16:18:31.895946 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ee40] [c00000000001f95c] .bad_page_fault+0xb8/0xd4
          2013-01-25 16:18:31.896436 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61eec0] [c000000000014e4c] storage_fault_common+0x48/0x4c
          2013-01-25 16:18:31.896897 {DefaultControlEventListener} [mmcs]{499}.3.0: --- Exception: 300 at .spin_bug+0xac/0xfc
          2013-01-25 16:18:31.897369 {DefaultControlEventListener} [mmcs]{499}.3.0:     LR = .spin_bug+0x90/0xfc
          2013-01-25 16:18:31.897891 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f260] [c000000000239564] ._raw_spin_lock+0x50/0x1a8
          2013-01-25 16:18:31.898409 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f310] [c00000000042d564] ._spin_lock_irqsave+0x20/0x3c
          2013-01-25 16:18:31.898915 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f390] [c0000000000273ac] .__wake_up+0x2c/0x78
          2013-01-25 16:18:31.899379 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f430] [8000000000ab20cc] .cfs_waitq_broadcast+0x1c/0x30 [libcfs]
          2013-01-25 16:18:31.899824 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f4a0] [80000000024d93e8] .cl_sync_io_note+0x1b8/0x2d0 [obdclass]
          2013-01-25 16:18:31.900251 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f550] [80000000024c8ecc] .cl_page_completion+0x2ac/0x9a0 [obdclass]
          2013-01-25 16:18:31.900701 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f690] [800000000468ad00] .osc_ap_completion+0x590/0xd80 [osc]
          2013-01-25 16:18:31.901133 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f7e0] [800000000468ce64] .osc_extent_finish+0x4d4/0xdc0 [osc]
          2013-01-25 16:18:31.901544 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fa00] [8000000004669b64] .brw_interpret+0x304/0x1800 [osc]
          2013-01-25 16:18:31.901960 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fb40] [8000000003b4a308] .ptlrpc_check_set+0x3c8/0x4e50 [ptlrpc]
          2013-01-25 16:18:31.902374 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fd20] [8000000003b9ffec] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
          2013-01-25 16:18:31.902807 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fe40] [8000000003ba054c] .ptlrpcd+0x35c/0x510 [ptlrpc]
          2013-01-25 16:18:31.903197 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ff90] [c00000000001b9a0] .kernel_thread+0x54/0x70
          
          morrone Christopher Morrone (Inactive) added a comment - We had a Sequoia client crash in what appears to be the same place.
          pjones Peter Jones added a comment -

          Dropping priority as this seems to be a rare pre-existing race. We may include the fix if it is ready but we will not necessarily hold the release for it.

          liang Liang Zhen (Inactive) added a comment -

          Assume this sequence:

          • thread-A calls cl_sync_io_note() and sets anchor::csi_sync_nr to zero
          • thread-B calls cl_sync_io_wait(), finds that anchor::csi_sync_nr is already zero, and so does not sleep
          • thread-B poisons the anchor
          • thread-A then tries to wake up anchor::csi_waitq, which is already poisoned, and crashes

          The patch is here: http://review.whamcloud.com/4214
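
          The interleaving described above can be modeled in user space. What follows is a minimal sketch under stated assumptions, not the patch under review: struct sync_anchor, anchor_note(), anchor_wait(), run_rounds() and the barrier field are hypothetical names, and pthreads plus C11 atomics stand in for the kernel primitives. It assumes that one way to close the window is to make the waiter hold off until the notifier has finished its wakeup and will not touch the anchor again.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical userspace stand-in for the anchor; not the Lustre structure. */
struct sync_anchor {
    atomic_int      nr;      /* outstanding completions (models csi_sync_nr) */
    atomic_int      barrier; /* nonzero while a notifier may still use the anchor */
    pthread_mutex_t lock;
    pthread_cond_t  waitq;   /* models csi_waitq */
};

static void anchor_init(struct sync_anchor *a, int nr)
{
    atomic_init(&a->nr, nr);
    atomic_init(&a->barrier, nr > 0);
    pthread_mutex_init(&a->lock, NULL);
    pthread_cond_init(&a->waitq, NULL);
}

/* Completion side (thread-A in the description). */
static void anchor_note(struct sync_anchor *a)
{
    int prev = atomic_fetch_sub(&a->nr, 1);

    assert(prev > 0); /* more notes than registered completions is a caller bug */
    if (prev == 1) {
        pthread_mutex_lock(&a->lock);
        pthread_cond_broadcast(&a->waitq);
        pthread_mutex_unlock(&a->lock);
        /* Last access to the anchor: only now may the waiter reuse it. */
        atomic_store(&a->barrier, 0);
    }
}

/* Waiting side (thread-B). Returns once no notifier can touch the anchor. */
static void anchor_wait(struct sync_anchor *a)
{
    pthread_mutex_lock(&a->lock);
    while (atomic_load(&a->nr) != 0)
        pthread_cond_wait(&a->waitq, &a->lock);
    pthread_mutex_unlock(&a->lock);

    /*
     * Without this loop the caller could poison the anchor while the
     * notifier sits between its decrement and its broadcast.
     */
    while (atomic_load(&a->barrier) != 0)
        sched_yield();
}

/* Notifier thread body for the demo below. */
static void *notifier(void *arg)
{
    anchor_note(arg);
    return NULL;
}

/* Runs wait/poison cycles on a stack anchor; returns how many completed. */
static int run_rounds(int rounds)
{
    int done = 0;

    for (int i = 0; i < rounds; i++) {
        struct sync_anchor a;
        pthread_t t;

        anchor_init(&a, 1);
        if (pthread_create(&t, NULL, notifier, &a) != 0)
            break;
        anchor_wait(&a);
        /* Models the poisoning step, using the 0x5a pattern seen in the dumps. */
        pthread_cond_destroy(&a.waitq);
        pthread_mutex_destroy(&a.lock);
        memset(&a, 0x5a, sizeof(a));
        pthread_join(t, NULL);
        done++;
    }
    return done;
}
```

          Calling run_rounds() with a few hundred rounds exercises the decrement, wakeup and poison sequence repeatedly. Removing the final loop in anchor_wait() reintroduces the window from the list above, although a race this narrow may not show up on every run.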

          People

            liang Liang Zhen (Inactive)
            cliffw Cliff White (Inactive)