[LU-2101] Oops under cl_sync_io_note() Created: 07/Oct/12 Updated: 06/Feb/13 Resolved: 06/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sequoia | ||
| Environment: |
SWL hyperion |
||
| Severity: | 3 |
| Rank (Obsolete): | 4389 |
| Description |
|
Had a client crash:
2012-10-07 08:26:22 general protection fault: 0000 [#1] SMP
2012-10-07 08:26:22 last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/irq
2012-10-07 08:26:22 Pid: 4119, comm: ptlrpcd_2 Tainted: G W --------------- 2.6.32-279.5.1.el6.x86_64 #1 Dell XS23-TY /XS23-TY
2012-10-07 08:26:22 RIP: 0010:[<ffffffff8104e2e1>] [<ffffffff8104e2e1>] __wake_up_common+0x31/0x90
2012-10-07 08:26:22 RSP: 0018:ffff880181f05970 EFLAGS: 00010096
2012-10-07 08:26:22 RAX: 5a5a5a5a5a5a5a42 RBX: ffff880101760a18 RCX: 0000000000000000
2012-10-07 08:26:22 RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000003 RDI: ffff880101760a18
2012-10-07 08:26:22 RBP: ffff880181f059b0 R08: 0000000000000000 R09: 0000000000000000
2012-10-07 08:26:22 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000286
2012-10-07 08:26:22 R13: ffff880101760a20 R14: 0000000000000000 R15: 0000000000000000
2012-10-07 08:26:22 FS: 00007ffff7fdd700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
2012-10-07 08:26:22 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-10-07 08:26:22 CR2: 00002aaab800b078 CR3: 0000000151898000 CR4: 00000000000006f0
2012-10-07 08:26:22 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-10-07 08:26:22 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-10-07 08:26:22 Process ptlrpcd_2 (pid: 4119, threadinfo ffff880181f04000, task ffff8801ba3f7500)
2012-10-07 08:26:22 Stack:
2012-10-07 08:26:22 ffff880181f05a00 0000000300000000 0000000000000001 ffff880101760a18
2012-10-07 08:26:22 <d> 0000000000000286 0000000000000003 0000000000000000 0000000000000000
2012-10-07 08:26:22 <d> ffff880181f059f0 ffffffff810533e8 ffffffffffffff8a ffff880101760a10
2012-10-07 08:26:22 Call Trace:
2012-10-07 08:26:22 [<ffffffff810533e8>] __wake_up+0x48/0x70
2012-10-07 08:26:22 [<ffffffffa0372797>] cfs_waitq_broadcast+0x17/0x20 [libcfs]
2012-10-07 08:26:22 [<ffffffffa0542f39>] cl_sync_io_note+0x139/0x180 [obdclass]
2012-10-07 08:26:22 [<ffffffffa0538c45>] cl_page_completion+0x155/0x680 [obdclass]
2012-10-07 08:26:22 [<ffffffff8105368d>] ? task_rq_lock+0x5d/0xa0
2012-10-07 08:26:22 [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
2012-10-07 08:26:22 [<ffffffffa08671d2>] osc_ap_completion+0x222/0x980 [osc]
2012-10-07 08:26:22 [<ffffffff81060262>] ? default_wake_function+0x12/0x20
2012-10-07 08:26:22 [<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
2012-10-07 08:26:22 [<ffffffffa068942d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa0867cbd>] osc_extent_finish+0x38d/0xa30 [osc]
2012-10-07 08:26:22 [<ffffffffa0688fbc>] ? lustre_msg_get_opc+0x9c/0x110 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa0847d0b>] ? osc_brw_fini_request+0x10b/0x1360 [osc]
2012-10-07 08:26:22 [<ffffffffa0372bae>] ? cfs_free+0xe/0x10 [libcfs]
2012-10-07 08:26:22 [<ffffffffa06a9b38>] ? at_measured+0x108/0x390 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa06a9b38>] ? at_measured+0x108/0x390 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa084a0ef>] brw_interpret+0x30f/0x11b0 [osc]
2012-10-07 08:26:22 [<ffffffffa06850e9>] ? ptlrpc_unregister_bulk+0x99/0xad0 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa067ec2f>] ptlrpc_check_set+0x29f/0x1ae0 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa06b17eb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa06b1d1b>] ptlrpcd+0x22b/0x3a0 [ptlrpc]
2012-10-07 08:26:22 [<ffffffff81060250>] ? default_wake_function+0x0/0x20
2012-10-07 08:26:22 [<ffffffffa06b1af0>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
2012-10-07 08:26:22 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-10-07 08:26:22 [<ffffffffa06b1af0>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
2012-10-07 08:26:22 [<ffffffffa06b1af0>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
2012-10-07 08:26:22 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-10-07 08:26:22 Code: 41 56 41 55 41 54 53 48 83 ec 18 0f 1f 44 00 00 89 75 cc 89 55 c8 4c 8d 6f 08 48 8b 57 08 41 89 cf 4d 89 c6 48 8d 42 e8 49 39 d5 <48> 8b 58 18 74 3f 48 83 eb 18 eb 0a 0f 1f 00 48 89 d8 48 8d 5a
2012-10-07 08:26:22 RIP [<ffffffff8104e2e1>] __wake_up_common+0x31/0x90
2012-10-07 08:26:22 RSP <ffff880181f05970>
2012-10-07 08:26:22 Initializing cgroup subsys cpuset
|
| Comments |
| Comment by Liang Zhen (Inactive) [ 07/Oct/12 ] |
|
Assume this:
patch is here: http://review.whamcloud.com/4214 |
| Comment by Peter Jones [ 08/Oct/12 ] |
|
Dropping priority as this seems to be a rare pre-existing race. We may include the fix if it is ready but we will not necessarily hold the release for it. |
| Comment by Christopher Morrone [ 28/Jan/13 ] |
|
We had a Sequoia client crash in what appears to be the same place:
2013-01-25 12:53:21.119279 {DefaultControlEventListener} [mmcs]{499}.4.0: 2013-01-25 12:53:11.399 (INFO ) [0xfff80c07680] ibm.cios.jobctld.HwJobController: Job 54185 added with 128 compute nodes
2013-01-25 16:18:31.874727 {DefaultControlEventListener} [mmcs]{499}.3.0: BUG: spinlock bad magic on CPU#12, ptlrpcd_8/3374 (Not tainted)
2013-01-25 16:18:31.875471 {DefaultControlEventListener} [mmcs]{499}.3.0: Unable to handle kernel paging request for data at address 0x5a5a5a5a5a5a5be2
2013-01-25 16:18:31.875750 {DefaultControlEventListener} [mmcs]{499}.3.0: Faulting instruction address: 0xc0000000002393f8
2013-01-25 16:18:31.876087 {DefaultControlEventListener} [mmcs]{499}.3.0: Oops: Kernel access of bad area, sig: 11 [#1]
2013-01-25 16:18:31.876366 {DefaultControlEventListener} [mmcs]{499}.3.0: SMP NR_CPUS=68 Blue Gene/Q
2013-01-25 16:18:31.876736 {DefaultControlEventListener} [mmcs]{499}.3.0: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2013-01-25 16:18:31.877158 {DefaultControlEventListener} [mmcs]{499}.3.0: NIP: c0000000002393f8 LR: c0000000002393dc CTR: 0000000000000000
2013-01-25 16:18:31.877615 {DefaultControlEventListener} [mmcs]{499}.3.0: REGS: c0000003ca61ef30 TRAP: 0300 Not tainted (2.6.32-220.23.3.bgq.18llnl.V1R1M2.bgq62_16.ppc64)
2013-01-25 16:18:31.878008 {DefaultControlEventListener} [mmcs]{499}.3.0: MSR: 0000000080029000 <EE,ME,CE> CR: 22282484 XER: 20000000
2013-01-25 16:18:31.878414 {DefaultControlEventListener} [mmcs]{499}.3.0: DEAR: 5a5a5a5a5a5a5be2, ESR: 0000000000000000
2013-01-25 16:18:31.878875 {DefaultControlEventListener} [mmcs]{499}.3.0: TASK = c0000003ec7a4a60[3374] 'ptlrpcd_8' THREAD: c0000003ca61c000 CPU: 12
2013-01-25 16:18:31.879357 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR00: c0000000002393dc c0000003ca61f1b0 c0000000006e3420 0000000000000046
2013-01-25 16:18:31.879840 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR04: 0000000000000000 000000005a5a5a5a 61696e746564290a 3420284e6f742074
2013-01-25 16:18:31.880323 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR08: 63645f382f333337 c0000000006214f8 c00000000079a7b0 0000000001da0000
2013-01-25 16:18:31.880815 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR12: 0000000022282482 c00000000076c100 ffffffffebc0de04 0000000000001000
2013-01-25 16:18:31.881317 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR16: c00000036822bc90 0000000000000000 00000000000037e0 0000000000000070
2013-01-25 16:18:31.881827 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR20: 0000000000000000 c00000034a4fb370 8000000000af424c 0000000000003550
2013-01-25 16:18:31.882333 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR24: c00000036eeeae20 c000000000574c80 000000000000000c c0000003ec7a4a60
2013-01-25 16:18:31.882847 {DefaultControlEventListener} [mmcs]{499}.3.0: GPR28: 0000000000000d2e 5a5a5a5a5a5a5a5a c000000000697da8 c00000036eeeae28
2013-01-25 16:18:31.883351 {DefaultControlEventListener} [mmcs]{499}.3.0: NIP [c0000000002393f8] .spin_bug+0xac/0xfc
2013-01-25 16:18:31.883868 {DefaultControlEventListener} [mmcs]{499}.3.0: LR [c0000000002393dc] .spin_bug+0x90/0xfc
2013-01-25 16:18:31.884364 {DefaultControlEventListener} [mmcs]{499}.3.0: Call Trace:
2013-01-25 16:18:31.884887 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f1b0] [c0000000002393dc] .spin_bug+0x90/0xfc (unreliable)
2013-01-25 16:18:31.885438 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f260] [c000000000239564] ._raw_spin_lock+0x50/0x1a8
2013-01-25 16:18:31.885966 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f310] [c00000000042d564] ._spin_lock_irqsave+0x20/0x3c
2013-01-25 16:18:31.886515 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f390] [c0000000000273ac] .__wake_up+0x2c/0x78
2013-01-25 16:18:31.887012 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f430] [8000000000ab20cc] .cfs_waitq_broadcast+0x1c/0x30 [libcfs]
2013-01-25 16:18:31.887540 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f4a0] [80000000024d93e8] .cl_sync_io_note+0x1b8/0x2d0 [obdclass]
2013-01-25 16:18:31.888074 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f550] [80000000024c8ecc] .cl_page_completion+0x2ac/0x9a0 [obdclass]
2013-01-25 16:18:31.888551 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f690] [800000000468ad00] .osc_ap_completion+0x590/0xd80 [osc]
2013-01-25 16:18:31.889100 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f7e0] [800000000468ce64] .osc_extent_finish+0x4d4/0xdc0 [osc]
2013-01-25 16:18:31.889611 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fa00] [8000000004669b64] .brw_interpret+0x304/0x1800 [osc]
2013-01-25 16:18:31.890136 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fb40] [8000000003b4a308] .ptlrpc_check_set+0x3c8/0x4e50 [ptlrpc]
2013-01-25 16:18:31.890702 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fd20] [8000000003b9ffec] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
2013-01-25 16:18:31.891163 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fe40] [8000000003ba054c] .ptlrpcd+0x35c/0x510 [ptlrpc]
2013-01-25 16:18:31.891666 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ff90] [c00000000001b9a0] .kernel_thread+0x54/0x70
2013-01-25 16:18:31.892174 {DefaultControlEventListener} [mmcs]{499}.3.0: Instruction dump:
2013-01-25 16:18:31.892707 {DefaultControlEventListener} [mmcs]{499}.3.0: 7c681b78 e87e8030 38db0378 7f87e378 481f8c49 60000000 2fbd0000 80bf0004
2013-01-25 16:18:31.893241 {DefaultControlEventListener} [mmcs]{499}.3.0: 409e0010 e8de8038 38e0ffff 4800000c <e8fd018a> 38dd0378 811f0008 e87e8040
2013-01-25 16:18:31.893642 {DefaultControlEventListener} [mmcs]{499}.3.0: Kernel panic - not syncing: Fatal exception
2013-01-25 16:18:31.894048 {DefaultControlEventListener} [mmcs]{499}.3.0: Call Trace:
2013-01-25 16:18:31.894528 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ec60] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
2013-01-25 16:18:31.894940 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ed10] [c000000000431ef4] .panic+0x80/0x1ac
2013-01-25 16:18:31.895407 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61eda0] [c000000000019d40] .die+0x1a4/0x1bc
2013-01-25 16:18:31.895946 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ee40] [c00000000001f95c] .bad_page_fault+0xb8/0xd4
2013-01-25 16:18:31.896436 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61eec0] [c000000000014e4c] storage_fault_common+0x48/0x4c
2013-01-25 16:18:31.896897 {DefaultControlEventListener} [mmcs]{499}.3.0: --- Exception: 300 at .spin_bug+0xac/0xfc
2013-01-25 16:18:31.897369 {DefaultControlEventListener} [mmcs]{499}.3.0: LR = .spin_bug+0x90/0xfc
2013-01-25 16:18:31.897891 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f260] [c000000000239564] ._raw_spin_lock+0x50/0x1a8
2013-01-25 16:18:31.898409 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f310] [c00000000042d564] ._spin_lock_irqsave+0x20/0x3c
2013-01-25 16:18:31.898915 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f390] [c0000000000273ac] .__wake_up+0x2c/0x78
2013-01-25 16:18:31.899379 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f430] [8000000000ab20cc] .cfs_waitq_broadcast+0x1c/0x30 [libcfs]
2013-01-25 16:18:31.899824 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f4a0] [80000000024d93e8] .cl_sync_io_note+0x1b8/0x2d0 [obdclass]
2013-01-25 16:18:31.900251 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f550] [80000000024c8ecc] .cl_page_completion+0x2ac/0x9a0 [obdclass]
2013-01-25 16:18:31.900701 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f690] [800000000468ad00] .osc_ap_completion+0x590/0xd80 [osc]
2013-01-25 16:18:31.901133 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61f7e0] [800000000468ce64] .osc_extent_finish+0x4d4/0xdc0 [osc]
2013-01-25 16:18:31.901544 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fa00] [8000000004669b64] .brw_interpret+0x304/0x1800 [osc]
2013-01-25 16:18:31.901960 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fb40] [8000000003b4a308] .ptlrpc_check_set+0x3c8/0x4e50 [ptlrpc]
2013-01-25 16:18:31.902374 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fd20] [8000000003b9ffec] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
2013-01-25 16:18:31.902807 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61fe40] [8000000003ba054c] .ptlrpcd+0x35c/0x510 [ptlrpc]
2013-01-25 16:18:31.903197 {DefaultControlEventListener} [mmcs]{499}.3.0: [c0000003ca61ff90] [c00000000001b9a0] .kernel_thread+0x54/0x70
|
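Both traces fault on addresses built from a repeated 0x5a byte: RDX is 5a5a5a5a5a5a5a5a and RAX is 5a5a5a5a5a5a5a42 in the x86_64 oops, and DEAR is 5a5a5a5a5a5a5be2 on Blue Gene/Q. A minimal sketch of how such values can be checked is below. It assumes the 0x5a fill is a memory-poison pattern (the ticket does not say which allocator wrote it). The helper name `poison_offset` and the 0x1000 window are illustrative choices, not part of any Lustre or kernel tool.

```python
POISON_BYTE = 0x5A  # assumed poison fill byte, inferred from the register dumps


def poison_offset(value, max_offset=0x1000, width=8):
    """Return the signed offset of `value` from an all-0x5a word.

    Returns None when the value is further than `max_offset` bytes away,
    i.e. when it does not look like "poisoned pointer plus a field offset".
    """
    pattern = int.from_bytes(bytes([POISON_BYTE]) * width, "big")
    delta = value - pattern
    return delta if abs(delta) <= max_offset else None


# Values copied from the two traces in this ticket.
samples = {
    "x86_64 RAX": 0x5A5A5A5A5A5A5A42,
    "x86_64 RDX": 0x5A5A5A5A5A5A5A5A,
    "x86_64 RBX": 0xFFFF880101760A18,
    "ppc64 DEAR": 0x5A5A5A5A5A5A5BE2,
}

for name, value in samples.items():
    off = poison_offset(value)
    label = "not near the pattern" if off is None else f"pattern {off:+#x}"
    print(f"{name}: {label}")
```

Read this way, RAX sits 0x18 below the pattern, which matches the `48 8d 42 e8` bytes (`lea -0x18(%rdx),%rax`) just before the faulting instruction in the Code line, and DEAR sits 0x188 above it. Both are consistent with a pointer loaded from poisoned memory and then dereferenced at a small field offset, which is what a wake-up on an already released wait queue would produce.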
| Comment by Liang Zhen (Inactive) [ 29/Jan/13 ] |
|
Rebased the patch onto master: http://review.whamcloud.com/#change,5199 |
| Comment by Jodi Levi (Inactive) [ 06/Feb/13 ] |
|
Patch landed to master |