Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1919

Soft lockup on MGS stop

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 3
    • 4265

    Description

      After a bunch of recent MGS-related landings, I am unable to go into sanity much anymore, first or second mgs unmount attempt hangs on current master.
      100% incident ratio with REFORMAT=yes SLOW=yes sh sanity.sh
      I have 10G ram and 8 CPUs in this vm:

      [  742.243128] Lustre: DEBUG MARKER: == sanity test 17k: symlinks: rsync with xattrs enabled =========================== 18:09:54 (1347487794)
      [  742.751459] Lustre: DEBUG MARKER: == sanity test 17m: run e2fsck against MDT which contains short/long symlink == 18:09:55 (1347487795)
      [  772.153542] Lustre: 4505:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487817/real 1347487817]  req@ffff88025a670bf0 x1412943327533287/t0(0) o400->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 224/224 e 0 to 1 dl 1347487824 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  772.158030] Lustre: lustre-MDT0000-mdc-ffff880239395bf0: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [  777.152548] Lustre: 4511:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487822/real 1347487822]  req@ffff88025e138bf0 x1412943327533293/t0(0) o400->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 224/224 e 0 to 1 dl 1347487829 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  777.153801] LustreError: 166-1: MGC192.168.1.205@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [  777.158799] Lustre: 4511:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [  778.153531] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487824/real 1347487824]  req@ffff880261c0ebf0 x1412943327533296/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487830 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  783.153544] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487829/real 1347487829]  req@ffff88028f55bbf0 x1412943327533297/t0(0) o250->MGC192.168.1.205@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1347487835 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  793.153542] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487834/real 1347487834]  req@ffff8802389adbf0 x1412943327533300/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487845 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  798.153534] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487839/real 1347487839]  req@ffff880231971bf0 x1412943327533303/t0(0) o250->MGC192.168.1.205@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1347487850 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  813.153531] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487849/real 1347487849]  req@ffff880247083bf0 x1412943327533308/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487865 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  817.156401] Lustre: lustre-OST0000: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8802390bdbf0, cur 1347487869 expire 1347487839 last 1347487817
      [  838.152794] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487869/real 1347487869]  req@ffff880262d9bbf0 x1412943327533318/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487890 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  838.157141] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [  840.056007] BUG: soft lockup - CPU#1 stuck for 67s! [obd_zombid:4474]
      [  840.056255] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mds mgs lquota jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic libcfs sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      [  840.056255] CPU 1 
      [  840.056255] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mds mgs lquota jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic libcfs sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      [  840.056255] 
      [  840.056255] Pid: 4474, comm: obd_zombid Not tainted 2.6.32-debug #3 Bochs Bochs
      [  840.056255] RIP: 0010:[<ffffffffa0388808>]  [<ffffffffa0388808>] server_deregister_mount+0xa8/0x390 [obdclass]
      [  840.056255] RSP: 0018:ffff880292c3ddd0  EFLAGS: 00010282
      [  840.056255] RAX: ffff880265478e50 RBX: ffff880292c3dde0 RCX: 0000000000000000
      [  840.056255] RDX: ffff880292e45f08 RSI: 0000000000000000 RDI: ffff88028fa5fc70
      [  840.056255] RBP: ffffffff8100bc0e R08: 00000000ffffffff R09: 000000000000009e
      [  840.056255] R10: 000000000000000f R11: 000000000000000f R12: ffff8802654b008c
      [  840.056255] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffff
      [  840.056255] FS:  00007f3b197eb700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
      [  840.056255] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  840.056255] CR2: ffff880292e45fd8 CR3: 0000000001a25000 CR4: 00000000000006e0
      [  840.056255] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  840.056255] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  840.056255] Process obd_zombid (pid: 4474, threadinfo ffff880292c3c000, task ffff88028d4a23c0)
      [  840.056255] Stack:
      [  840.056255]  ffff880265478e50 ffff8802654b008c ffff880292c3de00 ffffffffa0388c50
      [  840.056255] <d> ffff8802654b0080 ffff8802654b008c ffff880292c3de20 ffffffffa098eb3e
      [  840.056255] <d> ffff8802654b0080 0000000000000000 ffff880292c3de90 ffffffffa037b7f2
      [  840.056255] Call Trace:
      [  840.056255]  [<ffffffffa0388c50>] ? server_put_mount+0x160/0x290 [obdclass]
      [  840.056255]  [<ffffffffa098eb3e>] ? mgs_cleanup+0x4e/0x1c0 [mgs]
      [  840.056255]  [<ffffffffa037b7f2>] ? class_decref+0x212/0x590 [obdclass]
      [  840.056255]  [<ffffffffa0364c74>] ? obd_zombie_impexp_cull+0x314/0x620 [obdclass]
      [  840.056255]  [<ffffffffa0365045>] ? obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
      [  840.056255]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [  840.056255]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [  840.056255]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [  840.056255] Code: a6 3b a0 c7 05 6a d4 08 00 93 00 00 00 48 c7 05 6b d4 08 00 00 00 00 00 c7 05 59 d4 08 00 04 00 00 01 48 8b 50 10 48 85 d2 74 07 <44> 8b 82 d0 00 00 00 4c 89 e1 48 c7 c6 a8 25 3c a0 48 c7 c7 40 
      [  840.056255] Call Trace:
      [  840.056255]  [<ffffffffa0388c50>] ? server_put_mount+0x160/0x290 [obdclass]
      [  840.056255]  [<ffffffffa098eb3e>] ? mgs_cleanup+0x4e/0x1c0 [mgs]
      [  840.056255]  [<ffffffffa037b7f2>] ? class_decref+0x212/0x590 [obdclass]
      [  840.056255]  [<ffffffffa0364c74>] ? obd_zombie_impexp_cull+0x314/0x620 [obdclass]
      [  840.056255]  [<ffffffffa0365045>] ? obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
      [  840.056255]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [  840.056255]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [  840.056255]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Attachments

        Activity

          People

            wc-triage WC Triage
            green Oleg Drokin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: