Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
Sequoia; Lustre 2.3.54-2chaos on the clients, Lustre 2.3.54-6chaos on the servers (github.com/chaos/lustre)
-
3
-
5564
Description
We hit the following kernel paging request failure and Oops on a Lustre client (a Sequoia I/O node) while running ior. It happened during a read phase.
2012-11-14 16:25:51.438302 {DefaultControlEventListener} [mmcs]{753}.3.0: Unable to handle kernel paging request for data at address 0x00000188
2012-11-14 16:25:51.478050 {DefaultControlEventListener} [mmcs]{753}.3.0: Unable to handle kernel paging request for data at address 0x00000188
2012-11-14 16:25:51.518060 {DefaultControlEventListener} [mmcs]{753}.3.0: Faulting instruction address: 0x8000000004766018
2012-11-14 16:25:51.557887 {DefaultControlEventListener} [mmcs]{753}.3.0: Oops: Kernel access of bad area, sig: 11 [#1]
2012-11-14 16:25:51.598089 {DefaultControlEventListener} [mmcs]{753}.3.0: SMP NR_CPUS=68 Blue Gene/Q
2012-11-14 16:25:51.637946 {DefaultControlEventListener} [mmcs]{753}.3.0: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2012-11-14 16:25:51.678349 {DefaultControlEventListener} [mmcs]{753}.3.0: NIP: 8000000004766018 LR: 8000000004765f98 CTR: c00000000042dd78
2012-11-14 16:25:51.718547 {DefaultControlEventListener} [mmcs]{753}.3.0: REGS: c0000003e04bacd0 TRAP: 0300 Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
2012-11-14 16:25:51.758561 {DefaultControlEventListener} [mmcs]{753}.3.0: MSR: 0000000080029000 <EE,ME,CE> CR: 24028488 XER: 20000000
2012-11-14 16:25:51.798747 {DefaultControlEventListener} [mmcs]{753}.3.0: DEAR: 0000000000000188, ESR: 0000000000000000
2012-11-14 16:25:51.838717 {DefaultControlEventListener} [mmcs]{753}.3.0: TASK = c0000003c0706f60[4778] 'sysiod' THREAD: c0000003e04b8000 CPU: 57
2012-11-14 16:25:51.878716 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR00: 8000000004791308 c0000003e04baf50 8000000004795e00 0000000000000000
2012-11-14 16:25:51.918673 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR04: c000000319261de0 c0000003c0706f60 0000000000000000 0000000000000000
2012-11-14 16:25:51.958294 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR08: c0000002de5bf840 c0000003c07072c0 0000000100117cf6 c00000000042dd78
2012-11-14 16:25:51.998721 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR12: 8000000004772710 c000000000770a00 0000000000000062 0000000000000060
2012-11-14 16:25:52.038665 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR16: 0000000000000000 8000000000c2f384 80000000047770f0 800000000477d7d0
2012-11-14 16:25:52.078590 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR20: 0000000002000400 00000000000010b0 0000000000000008 c0000003c37906c0
2012-11-14 16:25:52.118609 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR24: c000000000710380 8000000004791088 c000000319261de0 0000000100117b01
2012-11-14 16:25:52.158704 {DefaultControlEventListener} [mmcs]{753}.3.0: GPR28: c000000319261ec0 c000000000710380 80000000047945f8 c0000003e04baf50
2012-11-14 16:25:52.199096 {DefaultControlEventListener} [mmcs]{753}.3.0: NIP [8000000004766018] .osc_io_unplug0+0x138/0x6f0 [osc]
2012-11-14 16:25:52.239389 {DefaultControlEventListener} [mmcs]{753}.3.0: LR [8000000004765f98] .osc_io_unplug0+0xb8/0x6f0 [osc]
2012-11-14 16:25:52.278920 {DefaultControlEventListener} [mmcs]{753}.3.0: Call Trace:
2012-11-14 16:25:52.318772 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04baf50] [c0000003c37906c0] 0xc0000003c37906c0 (unreliable)
2012-11-14 16:25:52.358541 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb080] [800000000476713c] .osc_queue_sync_pages+0x21c/0x460 [osc]
2012-11-14 16:25:52.398292 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb160] [8000000004754198] .osc_io_submit+0x228/0x6b0 [osc]
2012-11-14 16:25:52.438255 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb290] [80000000025fa7d8] .cl_io_submit_rw+0xd8/0x270 [obdclass]
2012-11-14 16:25:52.478260 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb350] [80000000052d7160] .lov_io_submit+0x3b0/0x10b0 [lov]
2012-11-14 16:25:52.518467 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb450] [80000000025fa7d8] .cl_io_submit_rw+0xd8/0x270 [obdclass]
2012-11-14 16:25:52.558405 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb510] [80000000025fea24] .cl_io_read_page+0x124/0x280 [obdclass]
2012-11-14 16:25:52.598323 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb5d0] [8000000006b1d3fc] .ll_readpage+0xdc/0x2c0 [lustre]
2012-11-14 16:25:52.638412 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb680] [c000000000096924] .generic_file_aio_read+0x4d8/0x6ec
2012-11-14 16:25:52.678479 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb7c0] [8000000006b610f4] .vvp_io_read_start+0x274/0x640 [lustre]
2012-11-14 16:25:52.718337 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb8e0] [80000000025faa3c] .cl_io_start+0xcc/0x220 [obdclass]
2012-11-14 16:25:52.758244 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb980] [8000000002602854] .cl_io_loop+0x194/0x2c0 [obdclass]
2012-11-14 16:25:52.798116 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bba30] [8000000006ada390] .ll_file_io_generic+0x410/0x670 [lustre]
2012-11-14 16:25:52.838735 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbb30] [8000000006adb134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
2012-11-14 16:25:52.878194 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbc00] [8000000006adb450] .ll_file_read+0x150/0x320 [lustre]
2012-11-14 16:25:52.918073 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
2012-11-14 16:25:52.958634 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbd80] [c0000000000d2390] .SyS_read+0x54/0x98
2012-11-14 16:25:52.998257 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbe30] [c000000000000580] syscall_exit+0x0/0x2c
2012-11-14 16:25:53.038087 {DefaultControlEventListener} [mmcs]{753}.3.0: Instruction dump:
2012-11-14 16:25:53.078379 {DefaultControlEventListener} [mmcs]{753}.3.0: 393902a8 92d90290 91f90294 92990298 f93902a0 fa790280 fa590288 e95d0000
2012-11-14 16:25:53.118273 {DefaultControlEventListener} [mmcs]{753}.3.0: e8da00f0 e92d0c68 e8ad0c68 39290360 <e8e6018a> e89e8128 7c030378 399902e0
2012-11-14 16:25:53.158440 {DefaultControlEventListener} [mmcs]{753}.14.1: Kernel panic - not syncing: Fatal exception
2012-11-14 16:25:53.198523 {DefaultControlEventListener} [mmcs]{753}.14.1: Faulting instruction address: 0x800000000473aae8
2012-11-14 16:25:53.238587 {DefaultControlEventListener} [mmcs]{753}.14.1: Oops: Kernel access of bad area, sig: 11 [#2]
2012-11-14 16:25:53.278251 {DefaultControlEventListener} [mmcs]{753}.14.1: SMP NR_CPUS=68 Blue Gene/Q
2012-11-14 16:25:53.318109 {DefaultControlEventListener} [mmcs]{753}.14.1: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2012-11-14 16:25:53.358097 {DefaultControlEventListener} [mmcs]{753}.14.1: NIP: 800000000473aae8 LR: 800000000473aa58 CTR: c00000000042dd78
2012-11-14 16:25:53.398124 {DefaultControlEventListener} [mmcs]{753}.14.1: REGS: c000000313faf7c0 TRAP: 0300 Tainted: G D ---------------- (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
2012-11-14 16:25:53.438188 {DefaultControlEventListener} [mmcs]{753}.14.1: MSR: 0000000080029000 <EE,ME,CE> CR: 24282448 XER: 00000000
2012-11-14 16:25:53.478266 {DefaultControlEventListener} [mmcs]{753}.14.1: DEAR: 0000000000000188, ESR: 0000000000000000
2012-11-14 16:25:53.518102 {DefaultControlEventListener} [mmcs]{753}.14.1: TASK = c0000003e5052fc0[3692] 'ptlrpcd_28' THREAD: c000000313fac000 CPU: 12
2012-11-14 16:25:53.558305 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR00: 800000000478e120 c000000313fafa40 8000000004795e00 800000000478e0c0
2012-11-14 16:25:53.598106 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR04: 0000000000000000 c0000003e5053320 0000000000000000 c0000003c782e560
2012-11-14 16:25:53.638262 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR08: 8000000004775f80 800000000478e0e8 0000000100117cf6 0000000100117b01
2012-11-14 16:25:53.677971 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR12: c0000003e5052fc0 c00000000074c100 8000000004775a20 8000000004778168
2012-11-14 16:25:53.718276 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR16: 0000000002000400 0000000000000008 00000000000005c0 c000000000710380
2012-11-14 16:25:53.757954 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR20: 8000000000c2f380 8000000000c2f384 c0000002dc5cf800 800000000478e070
2012-11-14 16:25:53.798174 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR24: c000000313fafed8 c000000319261de0 0000000100117b01 0000000000000000
2012-11-14 16:25:53.838260 {DefaultControlEventListener} [mmcs]{753}.14.1: GPR28: c000000319261ec0 c000000000710380 8000000004792e88 c000000313fafa40
2012-11-14 16:25:53.878250 {DefaultControlEventListener} [mmcs]{753}.14.1: NIP [800000000473aae8] .brw_interpret+0x5b8/0x1880 [osc]
2012-11-14 16:25:53.918267 {DefaultControlEventListener} [mmcs]{753}.14.1: LR [800000000473aa58] .brw_interpret+0x528/0x1880 [osc]
2012-11-14 16:25:53.958639 {DefaultControlEventListener} [mmcs]{753}.14.1: Call Trace:
2012-11-14 16:25:53.998166 {DefaultControlEventListener} [mmcs]{753}.14.1: [c000000313fafa40] [800000000473aa40] .brw_interpret+0x510/0x1880 [osc] (unreliable)
2012-11-14 16:25:54.038257 {DefaultControlEventListener} [mmcs]{753}.14.1: [c000000313fafb80] [8000000003bb6964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc]
2012-11-14 16:25:54.078395 {DefaultControlEventListener} [mmcs]{753}.14.1: [c000000313fafd20] [8000000003c0d1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
2012-11-14 16:25:54.118285 {DefaultControlEventListener} [mmcs]{753}.14.1: [c000000313fafe40] [8000000003c0d708] .ptlrpcd+0x308/0x510 [ptlrpc]
2012-11-14 16:25:54.178067 {DefaultControlEventListener} [mmcs]{753}.14.1: [c000000313faff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
2012-11-14 16:25:54.218226 {DefaultControlEventListener} [mmcs]{753}.14.1: Instruction dump:
2012-11-14 16:25:54.258239 {DefaultControlEventListener} [mmcs]{753}.3.0: f9f70050 91770064 f9d70058 e95d0000 e8d900f0 e8ad0c68 e98d0c68 Call Trace:
2012-11-14 16:25:54.298259 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04baa00] [c000000000008160] .show_stack+0x7c/0x184 (unreliable)
2012-11-14 16:25:54.338096 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04baab0] [c000000000432c0c] .panic+0x80/0x1a8
2012-11-14 16:25:54.378365 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bab40] [c000000000018d58] .die+0x1a4/0x1bc
2012-11-14 16:25:54.418056 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04babe0] [c00000000001e9e0] .bad_page_fault+0xb8/0xd4
2012-11-14 16:25:54.458132 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bac60] [c000000000013e4c] storage_fault_common+0x48/0x4c
2012-11-14 16:25:54.498250 {DefaultControlEventListener} [mmcs]{753}.3.0: --- Exception: 300 at .osc_io_unplug0+0x138/0x6f0 [osc]
2012-11-14 16:25:54.538454 {DefaultControlEventListener} [mmcs]{753}.3.0: LR = .osc_io_unplug0+0xb8/0x6f0 [osc]
2012-11-14 16:25:54.578285 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04baf50] [c0000003c37906c0] 0xc0000003c37906c0 (unreliable)
2012-11-14 16:25:54.618099 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb080] [800000000476713c] .osc_queue_sync_pages+0x21c/0x460 [osc]
2012-11-14 16:25:54.658237 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb160] [8000000004754198] .osc_io_submit+0x228/0x6b0 [osc]
2012-11-14 16:25:54.698239 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb290] [80000000025fa7d8] .cl_io_submit_rw+0xd8/0x270 [obdclass]
2012-11-14 16:25:54.738085 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb350] [80000000052d7160] .lov_io_submit+0x3b0/0x10b0 [lov]
2012-11-14 16:25:54.778232 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb450] [80000000025fa7d8] .cl_io_submit_rw+0xd8/0x270 [obdclass]
2012-11-14 16:25:54.818624 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb510] [80000000025fea24] .cl_io_read_page+0x124/0x280 [obdclass]
2012-11-14 16:25:54.858242 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb5d0] [8000000006b1d3fc] .ll_readpage+0xdc/0x2c0 [lustre]
2012-11-14 16:25:54.898121 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb680] [c000000000096924] .generic_file_aio_read+0x4d8/0x6ec
2012-11-14 16:25:54.938238 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb7c0] [8000000006b610f4] .vvp_io_read_start+0x274/0x640 [lustre]
2012-11-14 16:25:54.978092 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb8e0] [80000000025faa3c] .cl_io_start+0xcc/0x220 [obdclass]
2012-11-14 16:25:55.018114 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bb980] [8000000002602854] .cl_io_loop+0x194/0x2c0 [obdclass]
2012-11-14 16:25:55.058242 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bba30] [8000000006ada390] .ll_file_io_generic+0x410/0x670 [lustre]
2012-11-14 16:25:55.098188 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbb30] [8000000006adb134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]38a50360
2012-11-14 16:25:55.178442 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbc00] [8000000006adb450] .ll_file_read+0x150/0x320 [lustre]
2012-11-14 16:25:55.218300 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
2012-11-14 16:25:55.258127 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbd80] [c0000000000d2390] .SyS_read+0x54/0x98
2012-11-14 16:25:55.298134 {DefaultControlEventListener} [mmcs]{753}.3.0: [c0000003e04bbe30] [c000000000000580] syscall_exit+0x0/0x2c
2012-11-14 16:25:55.338164 {DefaultControlEventListener} [mmcs]{753}.3.0: 7c030378 e97900e8 e91900f8
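As a reading aid for the first Oops, here is a minimal sketch that decodes the marked faulting word `<e8e6018a>` from the instruction dump. It assumes the standard PowerPC DS-form encoding for primary opcode 58 (`ld`, `ldu`, `lwa`). The helper `decode_ds_form` is a hypothetical name and is not part of Lustre or the kernel. Under that assumption the word decodes to a load through GPR06 at displacement 0x188. GPR06 is zero in the register dump, which matches the DEAR value of 0x188 and suggests a NULL pointer dereference at a structure offset. This is a sketch, not a confirmed root-cause analysis.

```python
# Hypothetical helper (not from Lustre or the kernel): decode PowerPC
# DS-form loads, i.e. primary opcode 58 with XO selecting ld/ldu/lwa.
DS_FORM_LOADS = {0: "ld", 1: "ldu", 2: "lwa"}


def decode_ds_form(word):
    """Return (mnemonic, rt, ra, displacement) for an opcode-58 word."""
    opcode = (word >> 26) & 0x3F
    if opcode != 58:
        raise ValueError("not a DS-form load (primary opcode %d)" % opcode)
    xo = word & 0x3
    if xo not in DS_FORM_LOADS:
        raise ValueError("reserved XO value %d" % xo)
    rt = (word >> 21) & 0x1F          # target register
    ra = (word >> 16) & 0x1F          # base register
    disp = word & 0xFFFC              # DS field with two zero bits appended
    if disp & 0x8000:                 # sign-extend the 16-bit displacement
        disp -= 0x10000
    return DS_FORM_LOADS[xo], rt, ra, disp


# Values copied from the first Oops above.
FAULTING_WORD = 0xE8E6018A            # the word marked <e8e6018a>
GPR06 = 0x0000000000000000            # third value on the GPR04 line
DEAR = 0x0000000000000188

mnemonic, rt, ra, disp = decode_ds_form(FAULTING_WORD)
effective_address = GPR06 + disp      # base register is r6 in this word

print("%s r%d, 0x%x(r%d) -> EA 0x%x" % (mnemonic, rt, disp, ra, effective_address))
```

The same helper applied to `e8da00f0`, four words earlier in the dump, gives a load into r6 from offset 0xf0 of r26, so the zero in GPR06 was likely read from memory immediately before the fault.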
LU-1650 might be related, but it is not clear to me at first glance.
Jinshan's patch has landed on master. As far as I know, we have not seen the issue with the patch applied, so this can be closed.