Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-586

1.8<->2.1 interop: list_add corruption

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.7
    • Labels:
      None
    • Environment:
    • Severity:
      3
    • Rank (Obsolete):
      4914

      Description

      After the upgrading, Lustre clients hit the following corruption while running racer test:

      Aug  9 20:07:08 client-12 kernel: Lustre: DEBUG MARKER: == runracer test 1: racer on clients: client-12,client-13 DURATION=120 == 20:07:07 (1312945627)
      Aug  9 20:07:08 client-12 rshd[27104]: root@client-12.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "DURATION=120 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer ");echo XXRETCODE:$?'
      Aug  9 20:07:08 client-12 kernel: ------------[ cut here ]------------
      Aug  9 20:07:08 client-12 kernel: WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Tainted: G           ---------------- T)
      Aug  9 20:07:08 client-12 kernel: Hardware name: X8DTT
      Aug  9 20:07:08 client-12 kernel: list_add corruption. next->prev should be prev (ffff8803179f3e80), but was ffff8801edb6bcd8. (next=ffff8801edb6bbf0).
      Aug  9 20:07:08 client-12 kernel: Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb serio_raw ghes hed i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: libcfs]
      Aug  9 20:07:08 client-12 kernel: Pid: 27276, comm: ln Tainted: G           ---------------- T 2.6.32-131.2.1.el6.x86_64 #1
      Aug  9 20:07:08 client-12 kernel: Call Trace:
      Aug  9 20:07:08 client-12 kernel: [<ffffffff810670f7>] ? warn_slowpath_common+0x87/0xc0
      Aug  9 20:07:08 client-12 kernel: [<ffffffff810671e6>] ? warn_slowpath_fmt+0x46/0x50
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8126a18c>] ? put_dec+0x10c/0x110
      Aug  9 20:07:08 client-12 kernel: [<ffffffff81272abd>] ? __list_add+0x6d/0xa0
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8108e45b>] ? add_wait_queue+0x3b/0x60
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa08937ee>] ? cfs_waitq_add+0xe/0x10 [libcfs]
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa0276038>] ? seq_client_alloc_fid+0x98/0x8a0 [fid]
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8126b6e0>] ? string+0x40/0x100
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8126c816>] ? vsnprintf+0x2b6/0x5f0
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa03644ee>] ? mdc_fid_alloc+0x3e/0x100 [mdc]
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa08965de>] ? cfs_curproc_is_in_groups+0xe/0x10 [libcfs]
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa04a5059>] ? ll_i2suppgid+0x19/0x30 [lustre]
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa037192d>] ? mdc_create+0x1ed/0x620 [mdc]
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa048a0d8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa04a4298>] ? ll_new_node+0x198/0x7b0 [lustre]
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8117e573>] ? generic_permission+0x23/0xb0
      Aug  9 20:07:08 client-12 kernel: [<ffffffffa04a4d4e>] ? ll_symlink+0x6e/0x1c0 [lustre]
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8117f9da>] ? vfs_symlink+0x9a/0xc0
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8118202e>] ? sys_symlinkat+0xee/0x110
      Aug  9 20:07:08 client-12 kernel: [<ffffffff810d1ac2>] ? audit_syscall_entry+0x272/0x2a0
      Aug  9 20:07:08 client-12 kernel: [<ffffffff814e017e>] ? do_page_fault+0x3e/0xa0
      Aug  9 20:07:08 client-12 kernel: [<ffffffff81182066>] ? sys_symlink+0x16/0x20
      Aug  9 20:07:08 client-12 kernel: [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b
      Aug  9 20:07:08 client-12 kernel: ---[ end trace 904817c3d0504e64 ]---
      

      Console log on client-13:

      Lustre: DEBUG MARKER: == runracer test 1: racer on clients: client-12,client-13 DURATION=120 == 20:07:07 (1312945627)
      list_add corruption. next->prev should be ffff81030d0df688, but was 00007fff85414ec0
      ----------- [cut here ] --------- [please bite here ] ---------
      Kernel BUG at lib/list_debug.c:26
      invalid opcode: 0000 [1] SMP 
      last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
      CPU 0 
      Modules linked in: nfs fscache nfs_acl mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_tcp bnx2i cnic uio cxgb3i iw_cxgb3 cxgb3 libiscsi_tcp ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core mlx4_en sg joydev igb i2c_i801 shpchp tpm_tis tpm 8021q i7core_edac mlx4_core tpm_bios i2c_core edac_mc serio_raw dca pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
      Pid: 9598, comm: file_concat.sh Tainted: G      2.6.18-238.12.1.el5 #1
      RIP: 0010:[<ffffffff80157c90>]  [<ffffffff80157c90>] __list_add+0x24/0x68
      RSP: 0018:ffff8103173f3a28  EFLAGS: 00010082
      RAX: 0000000000000058 RBX: ffff8103155a7ae0 RCX: ffffffff8031af28
      RDX: ffffffff8031af28 RSI: 0000000000000000 RDI: ffffffff8031af20
      RBP: ffff81030d0df688 R08: ffffffff8031af28 R09: 000000000000003e
      R10: ffff8103173f36c8 R11: 0000000000000080 R12: ffff8103173f3ae0
      R13: 0000000000000246 R14: ffff8103173f3ac8 R15: ffff81030d0df5e8
      FS:  00002ae601cc1f80(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000006bb424 CR3: 000000030fcbd000 CR4: 00000000000006e0
      Process file_concat.sh (pid: 9598, threadinfo ffff8103173f2000, task ffff810318a25860)
      Stack:  ffff8103173f3ac8 ffff81030d0df5c8 ffff81030d0df680 ffffffff80049c8e
       ffff81030d0df5c0 ffff81030d0df608 ffff81030d0df680 ffffffff88a7d89b
       0000000000000001 ffffffff8001a870 0000003000000030 ffff8103173f3bf0
      Call Trace:
       [<ffffffff80049c8e>] add_wait_queue+0x24/0x34
       [<ffffffff88a7d89b>] :fid:seq_client_alloc_fid+0x12b/0x860
       [<ffffffff8001a870>] vsnprintf+0x3f8/0x627
       [<ffffffff8008e430>] default_wake_function+0x0/0xe
       [<ffffffff887bca7b>] :libcfs:libcfs_debug_vmsg2+0x75b/0x9f0
       [<ffffffff88a93c6d>] :mdc:mdc_fid_alloc+0x7d/0x100
       [<ffffffff88aa5605>] :mdc:mdc_intent_lock+0x385/0x610
       [<ffffffff88bf4c7c>] :lustre:ll_i2gids+0x6c/0xa0
       [<ffffffff88bf4fd0>] :lustre:ll_md_blocking_ast+0x0/0x580
       [<ffffffff88946cd0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
       [<ffffffff88bf4828>] :lustre:ll_lookup_it+0x8c8/0xc90
       [<ffffffff88bf4fd0>] :lustre:ll_md_blocking_ast+0x0/0x580
       [<ffffffff88bf5956>] :lustre:ll_lookup_nd+0x1e6/0x400
       [<ffffffff80022a76>] d_alloc+0x174/0x1a9
       [<ffffffff800370b8>] __lookup_hash+0x10b/0x12f
       [<ffffffff8001b227>] open_namei+0xf2/0x712
       [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
       [<ffffffff8002768a>] do_filp_open+0x1c/0x38
       [<ffffffff8001a061>] do_sys_open+0x44/0xbe
       [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      
      
      Code: 0f 0b 68 cb 3f 2c 80 c2 1a 00 48 8b 55 00 48 39 da 74 1b 48 
      RIP  [<ffffffff80157c90>] __list_add+0x24/0x68
       RSP <ffff8103173f3a28>
       <0>Kernel panic - not syncing: Fatal exception
      

      Maloo report: https://maloo.whamcloud.com/test_sets/4c7fa66c-c308-11e0-9a20-52540025f9af

        Attachments

          Activity

            People

            • Assignee:
              laisiyao Lai Siyao
              Reporter:
              yujian Jian Yu
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: