Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.1.0, Lustre 1.8.6
-
None
-
Old Lustre Version: 1.8.6-wc1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/
New Lustre Version: 2.0.66.0
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/228/
Clean upgrading (Lustre servers and clients were upgraded all at once) from Lustre 1.8.6-wc1 to Lustre 2.0.66.0 under the following configuration:
OSS1: RHEL5/x86_64
OSS2: RHEL5/x86_64
MDS: RHEL5/x86_64
Client1: RHEL6/x86_64
Client2: RHEL5/x86_64
Old Lustre Version: 1.8.6-wc1 Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/ New Lustre Version: 2.0.66.0 Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/228/ Clean upgrading (Lustre servers and clients were upgraded all at once) from Lustre 1.8.6-wc1 to Lustre 2.0.66.0 under the following configuration: OSS1: RHEL5/x86_64 OSS2: RHEL5/x86_64 MDS: RHEL5/x86_64 Client1: RHEL6/x86_64 Client2: RHEL5/x86_64
-
3
-
4914
Description
After the upgrading, Lustre clients hit the following corruption while running racer test:
Aug 9 20:07:08 client-12 kernel: Lustre: DEBUG MARKER: == runracer test 1: racer on clients: client-12,client-13 DURATION=120 == 20:07:07 (1312945627) Aug 9 20:07:08 client-12 rshd[27104]: root@client-12.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "DURATION=120 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer ");echo XXRETCODE:$?' Aug 9 20:07:08 client-12 kernel: ------------[ cut here ]------------ Aug 9 20:07:08 client-12 kernel: WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Tainted: G ---------------- T) Aug 9 20:07:08 client-12 kernel: Hardware name: X8DTT Aug 9 20:07:08 client-12 kernel: list_add corruption. next->prev should be prev (ffff8803179f3e80), but was ffff8801edb6bcd8. (next=ffff8801edb6bbf0). Aug 9 20:07:08 client-12 kernel: Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb serio_raw ghes hed i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: libcfs] Aug 9 20:07:08 client-12 kernel: Pid: 27276, comm: ln Tainted: G ---------------- T 2.6.32-131.2.1.el6.x86_64 #1 Aug 9 20:07:08 client-12 kernel: Call Trace: Aug 9 20:07:08 client-12 kernel: [<ffffffff810670f7>] ? warn_slowpath_common+0x87/0xc0 Aug 9 20:07:08 client-12 kernel: [<ffffffff810671e6>] ? warn_slowpath_fmt+0x46/0x50 Aug 9 20:07:08 client-12 kernel: [<ffffffff8126a18c>] ? put_dec+0x10c/0x110 Aug 9 20:07:08 client-12 kernel: [<ffffffff81272abd>] ? __list_add+0x6d/0xa0 Aug 9 20:07:08 client-12 kernel: [<ffffffff8108e45b>] ? add_wait_queue+0x3b/0x60 Aug 9 20:07:08 client-12 kernel: [<ffffffffa08937ee>] ? cfs_waitq_add+0xe/0x10 [libcfs] Aug 9 20:07:08 client-12 kernel: [<ffffffffa0276038>] ? seq_client_alloc_fid+0x98/0x8a0 [fid] Aug 9 20:07:08 client-12 kernel: [<ffffffff8126b6e0>] ? string+0x40/0x100 Aug 9 20:07:08 client-12 kernel: [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20 Aug 9 20:07:08 client-12 kernel: [<ffffffff8126c816>] ? vsnprintf+0x2b6/0x5f0 Aug 9 20:07:08 client-12 kernel: [<ffffffffa03644ee>] ? mdc_fid_alloc+0x3e/0x100 [mdc] Aug 9 20:07:08 client-12 kernel: [<ffffffffa08965de>] ? cfs_curproc_is_in_groups+0xe/0x10 [libcfs] Aug 9 20:07:08 client-12 kernel: [<ffffffffa04a5059>] ? ll_i2suppgid+0x19/0x30 [lustre] Aug 9 20:07:08 client-12 kernel: [<ffffffffa037192d>] ? mdc_create+0x1ed/0x620 [mdc] Aug 9 20:07:08 client-12 kernel: [<ffffffffa048a0d8>] ? ll_prep_md_op_data+0x148/0x410 [lustre] Aug 9 20:07:08 client-12 kernel: [<ffffffffa04a4298>] ? ll_new_node+0x198/0x7b0 [lustre] Aug 9 20:07:08 client-12 kernel: [<ffffffff8117e573>] ? generic_permission+0x23/0xb0 Aug 9 20:07:08 client-12 kernel: [<ffffffffa04a4d4e>] ? ll_symlink+0x6e/0x1c0 [lustre] Aug 9 20:07:08 client-12 kernel: [<ffffffff8117f9da>] ? vfs_symlink+0x9a/0xc0 Aug 9 20:07:08 client-12 kernel: [<ffffffff8118202e>] ? sys_symlinkat+0xee/0x110 Aug 9 20:07:08 client-12 kernel: [<ffffffff810d1ac2>] ? audit_syscall_entry+0x272/0x2a0 Aug 9 20:07:08 client-12 kernel: [<ffffffff814e017e>] ? do_page_fault+0x3e/0xa0 Aug 9 20:07:08 client-12 kernel: [<ffffffff81182066>] ? sys_symlink+0x16/0x20 Aug 9 20:07:08 client-12 kernel: [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b Aug 9 20:07:08 client-12 kernel: ---[ end trace 904817c3d0504e64 ]---
Console log on client-13:
Lustre: DEBUG MARKER: == runracer test 1: racer on clients: client-12,client-13 DURATION=120 == 20:07:07 (1312945627) list_add corruption. next->prev should be ffff81030d0df688, but was 00007fff85414ec0 ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at lib/list_debug.c:26 invalid opcode: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:00.0/irq CPU 0 Modules linked in: nfs fscache nfs_acl mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_tcp bnx2i cnic uio cxgb3i iw_cxgb3 cxgb3 libiscsi_tcp ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core mlx4_en sg joydev igb i2c_i801 shpchp tpm_tis tpm 8021q i7core_edac mlx4_core tpm_bios i2c_core edac_mc serio_raw dca pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 9598, comm: file_concat.sh Tainted: G 2.6.18-238.12.1.el5 #1 RIP: 0010:[<ffffffff80157c90>] [<ffffffff80157c90>] __list_add+0x24/0x68 RSP: 0018:ffff8103173f3a28 EFLAGS: 00010082 RAX: 0000000000000058 RBX: ffff8103155a7ae0 RCX: ffffffff8031af28 RDX: ffffffff8031af28 RSI: 0000000000000000 RDI: ffffffff8031af20 RBP: ffff81030d0df688 R08: ffffffff8031af28 R09: 000000000000003e R10: ffff8103173f36c8 R11: 0000000000000080 R12: ffff8103173f3ae0 R13: 0000000000000246 R14: ffff8103173f3ac8 R15: ffff81030d0df5e8 FS: 00002ae601cc1f80(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000006bb424 CR3: 000000030fcbd000 CR4: 00000000000006e0 Process file_concat.sh (pid: 9598, threadinfo ffff8103173f2000, task ffff810318a25860) Stack: ffff8103173f3ac8 ffff81030d0df5c8 ffff81030d0df680 ffffffff80049c8e ffff81030d0df5c0 ffff81030d0df608 ffff81030d0df680 ffffffff88a7d89b 0000000000000001 ffffffff8001a870 0000003000000030 ffff8103173f3bf0 Call Trace: [<ffffffff80049c8e>] add_wait_queue+0x24/0x34 [<ffffffff88a7d89b>] :fid:seq_client_alloc_fid+0x12b/0x860 [<ffffffff8001a870>] vsnprintf+0x3f8/0x627 [<ffffffff8008e430>] default_wake_function+0x0/0xe [<ffffffff887bca7b>] :libcfs:libcfs_debug_vmsg2+0x75b/0x9f0 [<ffffffff88a93c6d>] :mdc:mdc_fid_alloc+0x7d/0x100 [<ffffffff88aa5605>] :mdc:mdc_intent_lock+0x385/0x610 [<ffffffff88bf4c7c>] :lustre:ll_i2gids+0x6c/0xa0 [<ffffffff88bf4fd0>] :lustre:ll_md_blocking_ast+0x0/0x580 [<ffffffff88946cd0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0 [<ffffffff88bf4828>] :lustre:ll_lookup_it+0x8c8/0xc90 [<ffffffff88bf4fd0>] :lustre:ll_md_blocking_ast+0x0/0x580 [<ffffffff88bf5956>] :lustre:ll_lookup_nd+0x1e6/0x400 [<ffffffff80022a76>] d_alloc+0x174/0x1a9 [<ffffffff800370b8>] __lookup_hash+0x10b/0x12f [<ffffffff8001b227>] open_namei+0xf2/0x712 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874 [<ffffffff8002768a>] do_filp_open+0x1c/0x38 [<ffffffff8001a061>] do_sys_open+0x44/0xbe [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 0f 0b 68 cb 3f 2c 80 c2 1a 00 48 8b 55 00 48 39 da 74 1b 48 RIP [<ffffffff80157c90>] __list_add+0x24/0x68 RSP <ffff8103173f3a28> <0>Kernel panic - not syncing: Fatal exception
Maloo report: https://maloo.whamcloud.com/test_sets/4c7fa66c-c308-11e0-9a20-52540025f9af