Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5378

recovery-small test 57 crash on invalid spinlock.

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.6.0, Lustre 2.7.0, Lustre 2.11.0
    • None
    • 3
    • 14985

    Description

      This seems to be reproducing from time to time on my systems.
      A crash like below that's likely due to lctl still trying to read the proc file while a parallel unmount frees stats.

      <4>[ 7018.594828] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash == 17:20:23 (1405891223)
      <0>[ 7021.841908] BUG: spinlock bad magic on CPU#0, lctl/27044 (Not tainted)
      <4>[ 7021.842559] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
      <4>[ 7021.843161] last sysfs file: /sys/devices/system/cpu/possible
      <4>[ 7021.844006] CPU 0 
      <4>[ 7021.844006] Modules linked in: lustre ofd osp lod ost mdt mdd mgs nodemap osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_balloon virtio_console i2c_piix4 i2c_core virtio_net virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
      <4>[ 7021.844006] 
      <4>[ 7021.844006] Pid: 27044, comm: lctl Not tainted 2.6.32-rhe6.5-debug #2 Red Hat KVM
      <4>[ 7021.844006] RIP: 0010:[<ffffffff81292771>]  [<ffffffff81292771>] spin_bug+0x81/0x100
      <4>[ 7021.844006] RSP: 0018:ffff880089aa1cd8  EFLAGS: 00010002
      <4>[ 7021.844006] RAX: 0000000000000050 RBX: ffff88007aeec348 RCX: 00000000ffffffff
      <4>[ 7021.844006] RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000046
      <4>[ 7021.844006] RBP: ffff880089aa1cf8 R08: 0000000000000000 R09: 000000006b6b6b6b
      <4>[ 7021.844006] R10: 0736072e07340735 R11: 073907360720075b R12: 6b6b6b6b6b6b6b6b
      <4>[ 7021.844006] R13: ffffffff817e1c57 R14: 0000000000000000 R15: 6b6b6b6b6b6b6b6b
      <4>[ 7021.844006] FS:  00007f92cc6d4700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
      <4>[ 7021.844006] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>[ 7021.844006] CR2: 00007f92cbf57800 CR3: 00000000748fa000 CR4: 00000000000006f0
      <4>[ 7021.844006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[ 7021.844006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[ 7021.844006] Process lctl (pid: 27044, threadinfo ffff880089aa0000, task ffff8800502183c0)
      <4>[ 7021.844006] Stack:
      <4>[ 7021.844006]  0000006cffffffff ffff88007aeec348 0000000000000000 ffff880089aa1da8
      <4>[ 7021.844006] <d> ffff880089aa1d48 ffffffff81292935 0000000000000000 ffff880089aa1e60
      <4>[ 7021.844006] <d> ffff880089aa1d38 0000000000000292 0000000000000000 ffff880089aa1da8
      <4>[ 7021.844006] Call Trace:
      <4>[ 7021.844006]  [<ffffffff81292935>] _raw_spin_lock+0xa5/0x180
      <4>[ 7021.844006]  [<ffffffff81516894>] _spin_lock_irqsave+0x24/0x30
      <4>[ 7021.844006]  [<ffffffffa09f0a51>] lprocfs_stats_collect+0x161/0x180 [obdclass]
      <4>[ 7021.844006]  [<ffffffffa09f0ae6>] lprocfs_stats_seq_show+0x76/0x150 [obdclass]
      <4>[ 7021.844006]  [<ffffffff81170393>] ? kmem_cache_alloc_trace+0x143/0x250
      <4>[ 7021.844006]  [<ffffffff811ae778>] seq_read+0xf8/0x420
      <4>[ 7021.844006]  [<ffffffff811ae680>] ? seq_read+0x0/0x420
      <4>[ 7021.844006]  [<ffffffff811f4ae5>] proc_reg_read+0x85/0xc0
      <4>[ 7021.844006]  [<ffffffff81189c95>] vfs_read+0xb5/0x1a0
      <4>[ 7021.844006]  [<ffffffff81189dd1>] sys_read+0x51/0x90
      <4>[ 7021.844006]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
      <4>[ 7021.844006] Code: 8d 8e a0 06 00 00 49 89 c1 4c 89 ee 31 c0 48 c7 c7 f8 1f 7e 81 65 8b 14 25 d8 e0 00 00 e8 72 06 28 00 4d 85 e4 44 8b 4b 08 74 6b <45> 8b 84 24 a8 04 00 00 49 8d 8c 24 a0 06 00 00 8b 53 04 48 89 
      <1>[ 7021.844006] RIP  [<ffffffff81292771>] spin_bug+0x81/0x100
      

      There's also unmount that does the freeing:

      PID: 27018  TASK: ffff8800836f80c0  CPU: 5   COMMAND: "umount"
       #0 [ffff880049b216f8] schedule at ffffffff815133ca
       #1 [ffff880049b217c0] schedule_timeout at ffffffff815142b5
       #2 [ffff880049b21870] wait_for_common at ffffffff81513f2b
       #3 [ffff880049b21900] wait_for_completion at ffffffff8151403d
       #4 [ffff880049b21910] remove_proc_entry at ffffffff811fb7a7
       #5 [ffff880049b219b0] lprocfs_remove_nolock at ffffffffa09efa20 [obdclass]
       #6 [ffff880049b219f0] lprocfs_remove at ffffffffa09efc15 [obdclass]
       #7 [ffff880049b21a10] lprocfs_obd_cleanup at ffffffffa09efc84 [obdclass]
       #8 [ffff880049b21a30] osc_precleanup at ffffffffa0ef94ec [osc]
       #9 [ffff880049b21a60] class_cleanup at ffffffffa0a085c3 [obdclass]
      #10 [ffff880049b21ae0] class_process_config at ffffffffa0a0a67a [obdclass]
      #11 [ffff880049b21b70] class_manual_cleanup at ffffffffa0a0ad59 [obdclass]
      #12 [ffff880049b21c30] lov_putref at ffffffffa0f83276 [lov]
      #13 [ffff880049b21cb0] lov_disconnect at ffffffffa0f8a8a2 [lov]
      #14 [ffff880049b21ce0] ll_put_super at ffffffffa07f51ce [lustre]
      #15 [ffff880049b21e30] generic_shutdown_super at ffffffff8118bb0b
      #16 [ffff880049b21e50] kill_anon_super at ffffffff8118bbf6
      #17 [ffff880049b21e70] lustre_kill_super at ffffffffa0a0cbda [obdclass]
      #18 [ffff880049b21e90] deactivate_super at ffffffff8118c397
      #19 [ffff880049b21eb0] mntput_no_expire at ffffffff811ab40f
      #20 [ffff880049b21ee0] sys_umount at ffffffff811abf7b
      #21 [ffff880049b21f80] system_call_fastpath at ffffffff8100b0b2
      

      This is probably somewhat similar to LU-106

      Sample crashdump in /exports/crashdumps/192.168.10.223-2014-07-20-17:20:28
      tag in my tree: master-20140720

      Attachments

        Issue Links

          Activity

            [LU-5378] recovery-small test 57 crash on invalid spinlock.

            Oleg, now that my patch for LU-10224 has landed, do you agree if we close this ticket too?

            bfaccini Bruno Faccini (Inactive) added a comment - Oleg, now that my patch for LU-10224 has landed, do you agree if we close this ticket too?

            I believe this is the same problem than the one I am working on with LU-10224.

            bfaccini Bruno Faccini (Inactive) added a comment - I believe this is the same problem than the one I am working on with LU-10224 .
            yujian Jian Yu added a comment - More failure instances on master branch: https://testing.hpdd.intel.com/sub_tests/3610f9a8-ceb4-11e7-a066-52540065bddc https://testing.hpdd.intel.com/sub_tests/06772f9e-cc1d-11e7-9c63-52540065bddc https://testing.hpdd.intel.com/sub_tests/d8004f24-cbe0-11e7-8027-52540065bddc https://testing.hpdd.intel.com/sub_tests/444a2980-cb45-11e7-9840-52540065bddc https://testing.hpdd.intel.com/sub_tests/76aef78e-ca55-11e7-9c63-52540065bddc
            yujian Jian Yu added a comment -

            While testing patch https://review.whamcloud.com/29088 based on master branch, recovery-small test 57 hit the following LBUG on client:

            Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 10:10:34 (1510654234)
            Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts 
            Lustre: DEBUG MARKER: lsof -t /mnt/lustre 
            Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
            BUG: unable to handle kernel paging request at 000000000000a078
            IP: [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
            PGD 6b37a067 PUD 5df6f067 PMD 0
            Oops: 0000 [#1] SMP
            Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl auth_rpcgss lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_blk drm libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci serio_raw virtio_ring virtio 8139cp i2c_core mii floppy [last unloaded: libcfs]
            CPU: 1 PID: 1905 Comm: lctl Tainted: G        W  OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
            Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
            task: ffff88007c370fd0 ti: ffff880068c98000 task.ti: ffff880068c98000
            RIP: 0010:[<ffffffffc0806cd1>]  [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
            RSP: 0018:ffff880068c9bdd8  EFLAGS: 00010246
            RAX: 00000000000068c9 RBX: ffff880068c9be20 RCX: 0000000000000007
            RDX: 000000000000a078 RSI: 000000000000a090 RDI: 0000000000000007
            RBP: ffff880068c9be00 R08: 0000000000000078 R09: 0000000000000018
            R10: 0000000000000000 R11: ffff880068c9bca6 R12: ffff880068c95300
            R13: 0000000000000003 R14: ffff88007a3cc000 R15: ffff880065354380
            FS:  00007f8896e2f740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 000000000000a078 CR3: 00000000685ff000 CR4: 00000000000406e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Stack:
             0000000000000000 0000000068eb0a32 0000000000000003 ffff880065354380
             ffff880068c95300 ffff880068c9be70 ffffffffc08078b7 000000005a0ac11b
             0000000035b69261 0000000000000000 0000000000000000 0000000000000000[17354.102007] Call Trace:
             [<ffffffffc08078b7>] lprocfs_stats_seq_show+0x47/0x140 [obdclass]
             [<ffffffff81226550>] seq_read+0x250/0x3b0
             [<ffffffff812702cd>] proc_reg_read+0x3d/0x80
             [<ffffffff81200b1c>] vfs_read+0x9c/0x170
             [<ffffffff812019df>] SyS_read+0x7f/0xe0
             [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
            Code: c1 e0 03 0f 1f 80 00 00 00 00 48 63 d1 49 83 7c d4 18 00 74 48 4c 89 c2 49 03 54 fc 18 41 f6 44 24 04 02 4a 8d 34 0a 48 0f 45 d6 <48> 8b 32 48 01 33 48 8b 72 20 48 01 73 20 48 8b 72 08 48 3b 73
            RIP  [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
             RSP <ffff880068c9bdd8>
            CR2: 000000000000a078
            

            Maloo report: https://testing.hpdd.intel.com/test_sets/dcbc4ef6-c937-11e7-9c63-52540065bddc

            yujian Jian Yu added a comment - While testing patch https://review.whamcloud.com/29088 based on master branch, recovery-small test 57 hit the following LBUG on client: Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 10:10:34 (1510654234) Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts Lustre: DEBUG MARKER: lsof -t /mnt/lustre Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1 BUG: unable to handle kernel paging request at 000000000000a078 IP: [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass] PGD 6b37a067 PUD 5df6f067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl auth_rpcgss lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_blk drm libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci serio_raw virtio_ring virtio 8139cp i2c_core mii floppy [last unloaded: libcfs] CPU: 1 PID: 1905 Comm: lctl Tainted: G W OE ------------ 3.10.0-693.5.2.el7.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 task: ffff88007c370fd0 ti: ffff880068c98000 task.ti: ffff880068c98000 RIP: 0010:[<ffffffffc0806cd1>] [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass] RSP: 0018:ffff880068c9bdd8 EFLAGS: 00010246 RAX: 00000000000068c9 RBX: ffff880068c9be20 RCX: 0000000000000007 RDX: 000000000000a078 RSI: 000000000000a090 RDI: 0000000000000007 RBP: ffff880068c9be00 R08: 0000000000000078 R09: 0000000000000018 R10: 0000000000000000 R11: ffff880068c9bca6 R12: ffff880068c95300 R13: 0000000000000003 R14: ffff88007a3cc000 R15: ffff880065354380 FS: 00007f8896e2f740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000a078 CR3: 00000000685ff000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: 0000000000000000 0000000068eb0a32 0000000000000003 ffff880065354380 ffff880068c95300 ffff880068c9be70 ffffffffc08078b7 000000005a0ac11b 0000000035b69261 0000000000000000 0000000000000000 0000000000000000[17354.102007] Call Trace: [<ffffffffc08078b7>] lprocfs_stats_seq_show+0x47/0x140 [obdclass] [<ffffffff81226550>] seq_read+0x250/0x3b0 [<ffffffff812702cd>] proc_reg_read+0x3d/0x80 [<ffffffff81200b1c>] vfs_read+0x9c/0x170 [<ffffffff812019df>] SyS_read+0x7f/0xe0 [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b Code: c1 e0 03 0f 1f 80 00 00 00 00 48 63 d1 49 83 7c d4 18 00 74 48 4c 89 c2 49 03 54 fc 18 41 f6 44 24 04 02 4a 8d 34 0a 48 0f 45 d6 <48> 8b 32 48 01 33 48 8b 72 20 48 01 73 20 48 8b 72 08 48 3b 73 RIP [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass] RSP <ffff880068c9bdd8> CR2: 000000000000a078 Maloo report: https://testing.hpdd.intel.com/test_sets/dcbc4ef6-c937-11e7-9c63-52540065bddc

            We also have hit with this issue in recovery-small test 57 at lprocfs_stats_collect+177. Have any body found solution for this issue ?

            vinayakh Vinayak (Inactive) added a comment - We also have hit with this issue in recovery-small test 57 at lprocfs_stats_collect+177. Have any body found solution for this issue ?
            green Oleg Drokin added a comment -

            I just had another hit of this with current master

            green Oleg Drokin added a comment - I just had another hit of this with current master

            People

              wc-triage WC Triage
              green Oleg Drokin
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: