Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.3.0, Lustre 2.1.2, Lustre 1.8.8
-
None
-
3
-
4637
Description
Jeff Johnson reported on lustre-discuss that he cannot boot with our patched RHEL5 and RHEL6 kernels, while the stock kernel works just fine. The stack trace is the following:
Starting system logger: ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/scsi/isci/request.h:348
invalid opcode: 0000 [1] SMP
last sysfs file: /class/net/eth0/address
CPU 0
Modules linked in: ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) libcxgbi(U) iw_cxgb3(U) cxgb3(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_en(U) joydev(U) sg(U) mlx4_core(U) igb(U) pcspkr(U) 8021q(U) i2c_i801(U) tpm_tis(U) i2c_core(U) tpm(U) dca(U) tpm_bios(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) ahci(U) shpchp(U) isci(U) libsas(U) libata(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 1337, comm: kjournald Tainted: G ---- 2.6.18.lustre18 #1
RIP: 0010:[<ffffffff88117b22>] [<ffffffff88117b22>] :isci:sci_request_build_sgl+0x110/0x1e4
RSP: 0018:ffff81023f8e19e0 EFLAGS: 00010012
RAX: 0000000000000a20 RBX: ffff810228cdb380 RCX: ffff81023eaf79e0
RDX: 0000000000001000 RSI: ffff810228cdb360 RDI: ffff81023eaf7a2c
RBP: ffff81023eaf7000 R08: ffff81023ea0e100 R09: ffff81023eaf7000
R10: ffff81023ea0e100 R11: 0000000000000060 R12: 0000000000000003
R13: ffff81023eaf7a20 R14: 000000000000004d R15: ffff81023e9e0018
FS: 0000000000000000(0000) GS:ffffffff80431000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b81aeba0380 CR3: 000000043b96a000 CR4: 00000000000006a0
Process kjournald (pid: 1337, threadinfo ffff81023f8e0000, task ffff81043fdcc7a0)
Stack: 0000000000000001 ffff81023eaf7a00 000000000000004d ffff81023eaf7000
 ffff81023ea0e100 000000000009f000 0000000000000001 ffff81023eaf7801
 ffff8102472841f8 ffffffff88117c28 ffff81043f12f400 0000000000000004
Call Trace:
 [<ffffffff88117c28>] :isci:sci_stp_optimized_request_construct+0x32/0x6c
 [<ffffffff88119a04>] :isci:isci_request_execute+0x5b6/0x82e
 [<ffffffff88120d20>] :isci:isci_task_execute_task+0x10e/0x23e
 [<ffffffff88106f6f>] :libsas:sas_ata_qc_issue+0x1fb/0x286
 [<ffffffff880776b7>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff880ceb55>] :libata:ata_qc_issue+0x4ef/0x567
 [<ffffffff880d2ef0>] :libata:ata_scsi_rw_xlat+0x119/0x188
 [<ffffffff880776b7>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff880d2dd7>] :libata:ata_scsi_rw_xlat+0x0/0x188
 [<ffffffff880d309f>] :libata:ata_scsi_translate+0x140/0x16d
 [<ffffffff880776b7>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff88106119>] :libsas:sas_queuecommand+0x86/0x2a3
 [<ffffffff8001cd72>] __mod_timer+0xff/0x10e
 [<ffffffff800155a2>] sync_buffer+0x0/0x3f
 [<ffffffff800155a2>] sync_buffer+0x0/0x3f
 [<ffffffff88077d9d>] :scsi_mod:scsi_dispatch_cmd+0x297/0x351
 [<ffffffff8807d55a>] :scsi_mod:scsi_request_fn+0x2c3/0x392
 [<ffffffff8005a43f>] generic_unplug_device+0x22/0x32
 [<ffffffff800155d8>] sync_buffer+0x36/0x3f
 [<ffffffff80063a0a>] __wait_on_bit+0x40/0x6e
 [<ffffffff800155a2>] sync_buffer+0x0/0x3f
 [<ffffffff80063aa4>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a34ca>] wake_bit_function+0x0/0x23
 [<ffffffff880342fc>] :jbd:journal_commit_transaction+0xa7f/0x132b
 [<ffffffff8003d80e>] lock_timer_base+0x1b/0x3c
 [<ffffffff88038469>] :jbd:kjournald+0xc1/0x213
 [<ffffffff800a349c>] autoremove_wake_function+0x0/0x2e
 [<ffffffff880383a8>] :jbd:kjournald+0x0/0x213
 [<ffffffff800a3284>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032679>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a3284>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003257b>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 6b 4b 12 88 c2 5c 01 eb fe 48 89 c2 48 03 55 50 48
RIP [<ffffffff88117b22>] :isci:sci_request_build_sgl+0x110/0x1e4
 RSP <ffff81023f8e19e0>
<0>Kernel panic - not syncing: Fatal exception
The isci driver manages the new onboard SAS interface embedded in the Intel Sandy Bridge-EP (Xeon E5-2600) chipset. According to Jeff, he can reproduce the same issue with the 2.6.32-220.4.2.el6_lustre kernel shipped with Lustre 2.2, while the stock kernel from Red Hat again works fine.
Looking at the stack trace, we hit the following assertion:
static inline dma_addr_t sci_io_request_get_dma_addr(struct isci_request *ireq, void *virt_addr)
{
        char *requested_addr = (char *)virt_addr;
        char *base_addr = (char *)ireq;

        BUG_ON(requested_addr < base_addr);
        BUG_ON((requested_addr - base_addr) >= sizeof(*ireq)) -> BOOM!!!
The code path is sci_request_build_sgl->to_sgl_element_pair_dma->sci_io_request_get_dma_addr.
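To make the failing condition easier to reason about, here is a minimal user-space sketch of the same bounds check. It is not the driver code: struct fake_request, its fields, and both helper names are hypothetical stand-ins, and the offset-based translation in request_dma_addr is only an assumption about what follows the two BUG_ON checks, since the snippet above is cut off before that point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct isci_request; the real layout differs. */
struct fake_request {
    uint64_t request_daddr;       /* assumed: DMA address of the request itself */
    unsigned char sg_table[256];  /* assumed: SGL elements embedded in the request */
};

/*
 * Mirrors the two BUG_ON conditions: returns true only when virt_addr
 * points somewhere inside *req.
 */
static bool addr_within_request(const struct fake_request *req,
                                const void *virt_addr)
{
    uintptr_t base = (uintptr_t)req;
    uintptr_t addr = (uintptr_t)virt_addr;

    if (addr < base)                      /* first BUG_ON */
        return false;
    return (addr - base) < sizeof(*req);  /* second BUG_ON, the one that fired */
}

/*
 * Assumed translation: DMA address = request DMA base + byte offset.
 * Returns 0 and fills *out on success, -1 instead of crashing when the
 * address lies outside the request.
 */
static int request_dma_addr(const struct fake_request *req,
                            const void *virt_addr, uint64_t *out)
{
    if (!addr_within_request(req, virt_addr))
        return -1;
    *out = req->request_daddr + ((uintptr_t)virt_addr - (uintptr_t)req);
    return 0;
}
```

In this model, any SGL element address at or beyond the end of the request structure is rejected, which corresponds to the second assertion in the trace.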
I suspect that blkdev_tunables-2.6-rhel5.patch is the culprit. Unfortunately, Jeff cannot retest without this patch anytime soon.
Issue Links
- Trackbacks
-
Changelog 1.8 Changes from version 1.8.7wc1 to version 1.8.8wc1 Server support for kernels: 2.6.18-308.4.1.el5 (RHEL5) Client support for unpatched kernels: 2.6.18-308.4.1.el5 (RHEL5) 2.6.32-220.13.1.el6 (RHEL6) Recommended e2fsprogs version: 1.41.90....
-
Changelog 2.1 Changes from version 2.1.1 to version 2.1.2 Server support for kernels: 2.6.18-308.4.1.el5 (RHEL5) 2.6.32-220.17.1.el6 (RHEL6) Client support for unpatched kernels: 2.6.18-308.4.1.el5 (RHEL5) 2.6.32-220.17.1....