LU-2938

MDS unmount deadlock/softlockup in ldiskfs/jbd2


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0
    • Component/s: None
    • Environment: rhel6, kernel 2.6.32-279.2.1.el6
    • Severity: 3
    • Rank: 7056

    Description

      Running the 2.3.62 tag with replay-single in a loop, I got test 73c locked up.
      dmesg is full of soft lockup messages like this:

      [267156.100008] BUG: soft lockup - CPU#4 stuck for 67s! [umount:7615]
      [267156.100273] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 i2c_piix4 i2c_core virtio_balloon virtio_console virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
      [267156.101054] CPU 4 
      [267156.101054] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 i2c_piix4 i2c_core virtio_balloon virtio_console virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
      [267156.101054] 
      [267156.101054] Pid: 7615, comm: umount Not tainted 2.6.32-debug #6 Bochs Bochs
      [267156.101054] RIP: 0010:[<ffffffff810a78fa>]  [<ffffffff810a78fa>] smp_call_function_many+0x1ea/0x260
      [267156.101054] RSP: 0018:ffff88007cc23968  EFLAGS: 00000202
      [267156.101054] RAX: 0000000000000011 RBX: ffff88007cc239a8 RCX: ffff88007cc23828
      [267156.101054] RDX: 0000000000000010 RSI: 800000006d1ad160 RDI: 0000000000000282
      [267156.101054] RBP: ffffffff8100bc0e R08: 0000000000000001 R09: ffff880000000000
      [267156.101054] R10: 0000000000000000 R11: 0000000087654321 R12: 0000000000000282
      [267156.101054] R13: ffff88007cc23918 R14: ffffffff811adb30 R15: ffff8800ba6b2ef0
      [267156.101054] FS:  00007fb972b2c740(0000) GS:ffff880006300000(0000) knlGS:0000000000000000
      [267156.101054] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [267156.101054] CR2: 00007fb9722481a0 CR3: 0000000048165000 CR4: 00000000000006e0
      [267156.101054] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [267156.101054] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [267156.101054] Process umount (pid: 7615, threadinfo ffff88007cc22000, task ffff88008800e440)
      [267156.101054] Stack:
      [267156.101054]  01ff88007cc239d8 0000000000000000 ffffffff81a91560 ffffffff811adb30
      [267156.101054] <d> 0000000000000000 ffff8800989699c0 ffff880098969a40 0000000000000000
      [267156.101054] <d> ffff88007cc239b8 ffffffff810a7992 ffff88007cc239e8 ffffffff810714f4
      [267156.101054] Call Trace:
      [267156.101054]  [<ffffffff811adb30>] ? invalidate_bh_lru+0x0/0x50
      [267156.101054]  [<ffffffff810a7992>] ? smp_call_function+0x22/0x30
      [267156.101054]  [<ffffffff810714f4>] ? on_each_cpu+0x24/0x50
      [267156.101054]  [<ffffffff811ad7ac>] ? invalidate_bh_lrus+0x1c/0x20
      [267156.101054]  [<ffffffff811ae665>] ? invalidate_bdev+0x25/0x50
      [267156.101054]  [<ffffffffa043fa64>] ? ldiskfs_put_super+0x1f4/0x3e0 [ldiskfs]
      [267156.101054]  [<ffffffff8117d6ab>] ? generic_shutdown_super+0x5b/0xe0
      [267156.101054]  [<ffffffff8117d761>] ? kill_block_super+0x31/0x50
      [267156.101054]  [<ffffffff8117e825>] ? deactivate_super+0x85/0xa0
      [267156.101054]  [<ffffffff8119a89f>] ? mntput_no_expire+0xbf/0x110
      [267156.101054]  [<ffffffffa050b5af>] ? osd_device_fini+0x32f/0x380 [osd_ldiskfs]
      [267156.101054]  [<ffffffffa0a35cc7>] ? class_cleanup+0x577/0xda0 [obdclass]
      [267156.101054]  [<ffffffffa0a0be9c>] ? class_name2dev+0x7c/0xe0 [obdclass]
      [267156.101054]  [<ffffffffa0a375ac>] ? class_process_config+0x10bc/0x1c80 [obdclass]
      [267156.101054]  [<ffffffffa0a30f93>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
      [267156.101054]  [<ffffffffa0a382e9>] ? class_manual_cleanup+0x179/0x6e0 [obdclass]
      [267156.101054]  [<ffffffffa0a09ce1>] ? class_export_put+0x101/0x2c0 [obdclass]
      [267156.101054]  [<ffffffffa0512ce4>] ? osd_obd_disconnect+0x174/0x1e0 [osd_ldiskfs]
      [267156.101054]  [<ffffffffa0a3ab2e>] ? lustre_put_lsi+0x17e/0xe20 [obdclass]
      [267156.101054]  [<ffffffffa0a42ad8>] ? lustre_common_put_super+0x5d8/0xc20 [obdclass]
      [267156.101054]  [<ffffffffa0a43eea>] ? server_put_super+0x1ca/0xe60 [obdclass]
      [267156.101054]  [<ffffffff8117d6ab>] ? generic_shutdown_super+0x5b/0xe0
      [267156.101054]  [<ffffffff8117d796>] ? kill_anon_super+0x16/0x60
      [267156.101054]  [<ffffffffa0a3a0e6>] ? lustre_kill_super+0x36/0x60 [obdclass]
      [267156.101054]  [<ffffffff8117e825>] ? deactivate_super+0x85/0xa0
      [267156.101054]  [<ffffffff8119a89f>] ? mntput_no_expire+0xbf/0x110
      [267156.101054]  [<ffffffff8119b34b>] ? sys_umount+0x7b/0x3a0
      [267156.101054]  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
      [267156.101054] Code: 36 45 00 0f ae f0 48 8b 7b 30 ff 15 29 c4 98 00 80 7d c7 00 0f 84 9f fe ff ff f6 43 20 01 0f 84 95 fe ff ff 0f 1f 44 00 00 f3 90 <f6> 43 20 01 75 f8 e9 83 fe ff ff 0f 1f 00 4c 89 ea 4c 89 f6 44 
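
      For reference, the umount thread is stuck in the IPI rendezvous used by
      invalidate_bh_lrus(): on_each_cpu() asks every CPU to flush its buffer-head
      LRU, and smp_call_function_many() then busy-waits until each CPU has
      acknowledged the call. If any CPU sits in a loop with interrupts disabled,
      that acknowledgement never arrives and umount spins right where the watchdog
      caught it. Below is a minimal userspace model of that wait pattern (pthreads
      standing in for CPUs and IPIs; all names are illustrative, this is not the
      kernel code):

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      #define NCPUS 4

      static atomic_int pending[NCPUS];   /* per-"CPU" call-function flag */
      static atomic_int stuck_released;   /* lets the demo finish         */

      static void *cpu_thread(void *arg)
      {
          long cpu = (long)arg;

          for (;;) {
              if (cpu == NCPUS - 1 && !atomic_load(&stuck_released))
                  continue;               /* models a CPU spinning on a lock with
                                           * interrupts off: it never sees the IPI */
              if (atomic_load(&pending[cpu]))
                  atomic_store(&pending[cpu], 0);   /* "invalidate_bh_lru()" runs  */
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t tids[NCPUS];

          for (long i = 0; i < NCPUS; i++)
              pthread_create(&tids[i], NULL, cpu_thread, (void *)i);

          /* "smp_call_function_many": post the request to every CPU ... */
          for (int i = 0; i < NCPUS; i++)
              atomic_store(&pending[i], 1);

          /* ... then busy-wait for every acknowledgement, which is where the
           * umount thread sits in the trace above. */
          for (int i = 0; i < NCPUS; i++) {
              unsigned long spins = 0;
              while (atomic_load(&pending[i])) {
                  if (++spins % 100000000UL == 0)
                      printf("still waiting for CPU %d (this is the soft lockup)\n", i);
                  if (spins == 300000000UL)
                      atomic_store(&stuck_released, 1);   /* end the demo */
              }
          }
          printf("all CPUs acknowledged, unmount could proceed\n");
          return 0;
      }

      Built with gcc -pthread, the caller keeps spinning and periodically reports the
      stall, much like the soft-lockup watchdog above, until the stuck worker is
      released; in the real lockup it never is.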
      

      Examining the information, I found there's also another thread on a different CPU in jbd2:

      (gdb) bt
      #0  __ticket_spin_trylock (lock=0xffff88006b9338e8)
          at /home/green/bk/linux-2.6.32-279.2.1.el6-debug/arch/x86/include/asm/spinlock.h:155
      #1  __raw_spin_trylock (lock=0xffff88006b9338e8)
          at /home/green/bk/linux-2.6.32-279.2.1.el6-debug/arch/x86/include/asm/spinlock.h:217
      #2  __spin_lock_debug (lock=0xffff88006b9338e8) at lib/spinlock_debug.c:109
      #3  _raw_spin_lock (lock=0xffff88006b9338e8) at lib/spinlock_debug.c:132
      #4  0xffffffff814fb054 in __spin_lock_irqsave (lock=<optimized out>)
          at include/linux/spinlock_api_smp.h:256
      #5  _spin_lock_irqsave (lock=<optimized out>) at kernel/spinlock.c:66
      #6  0xffffffff81051f52 in __wake_up (q=0xffff88006b9338e8, mode=3, 
          nr_exclusive=1, key=0x0) at kernel/sched.c:6203
      #7  0xffffffffa0385663 in kjournald2 ()
      #8  0xffffffff8108fa16 in kthread (_create=0xffff880070a175e8)
          at kernel/kthread.c:78
      #9  0xffffffff8100c14a in child_rip () at arch/x86/kernel/entry_64.S:1211
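
      These frames are the debug-spinlock path: __spin_lock_irqsave() has already
      disabled local interrupts, and __spin_lock_debug() then loops on trylock
      waiting for the wait-queue head's lock at 0xffff88006b9338e8 to be released.
      While that CPU loops with interrupts off it cannot service the
      invalidate_bh_lru IPI, which would be consistent with the umount spin above.
      A rough userspace model of that trylock loop (pthread spinlock instead of a
      ticket lock; names are illustrative only, not the 2.6.32 source):

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_spinlock_t lock;

      /* Rough analogue of lib/spinlock_debug.c:__spin_lock_debug(): keep trying
       * the lock and complain if the holder never lets go.  In the real trace
       * this loop runs with local interrupts already disabled. */
      static void spin_lock_debug_model(pthread_spinlock_t *l)
      {
          for (unsigned long loops = 1; ; loops++) {
              if (pthread_spin_trylock(l) == 0)
                  return;                            /* finally got it */
              if (loops % 50000000UL == 0)
                  fprintf(stderr, "BUG: spinlock lockup suspected\n");
          }
      }

      /* Stand-in for whoever is holding the wait-queue head's lock. */
      static void *owner(void *arg)
      {
          (void)arg;
          pthread_spin_lock(&lock);
          sleep(3);                    /* sit on the lock for a while */
          pthread_spin_unlock(&lock);
          return NULL;
      }

      int main(void)
      {
          pthread_t t;

          pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
          pthread_create(&t, NULL, owner, NULL);
          usleep(100000);              /* let the owner grab the lock first */

          spin_lock_debug_model(&lock);  /* where kjournald2 spins above */
          printf("lock acquired, __wake_up() could walk the waiters now\n");
          pthread_spin_unlock(&lock);
          pthread_join(t, NULL);
          return 0;
      }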
      

      I have a crashdump in /exports/crashdumps/t2/hung.dmp, though I am not sure if anybody would like to dig into this.

          People

            Assignee: WC Triage
            Reporter: Oleg Drokin
            Votes: 0
            Watchers: 2
