[LU-1919] Soft lockup on MGS stop Created: 12/Sep/12  Updated: 02/Jan/13  Resolved: 21/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4265

 Description   

After a bunch of recent MGS-related landings, I am unable to go into sanity much anymore, first or second mgs unmount attempt hangs on current master.
100% incident ratio with REFORMAT=yes SLOW=yes sh sanity.sh
I have 10G ram and 8 CPUs in this vm:

[  742.243128] Lustre: DEBUG MARKER: == sanity test 17k: symlinks: rsync with xattrs enabled =========================== 18:09:54 (1347487794)
[  742.751459] Lustre: DEBUG MARKER: == sanity test 17m: run e2fsck against MDT which contains short/long symlink == 18:09:55 (1347487795)
[  772.153542] Lustre: 4505:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487817/real 1347487817]  req@ffff88025a670bf0 x1412943327533287/t0(0) o400->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 224/224 e 0 to 1 dl 1347487824 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  772.158030] Lustre: lustre-MDT0000-mdc-ffff880239395bf0: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[  777.152548] Lustre: 4511:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487822/real 1347487822]  req@ffff88025e138bf0 x1412943327533293/t0(0) o400->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 224/224 e 0 to 1 dl 1347487829 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  777.153801] LustreError: 166-1: MGC192.168.1.205@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[  777.158799] Lustre: 4511:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[  778.153531] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487824/real 1347487824]  req@ffff880261c0ebf0 x1412943327533296/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487830 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  783.153544] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487829/real 1347487829]  req@ffff88028f55bbf0 x1412943327533297/t0(0) o250->MGC192.168.1.205@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1347487835 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  793.153542] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487834/real 1347487834]  req@ffff8802389adbf0 x1412943327533300/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487845 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  798.153534] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487839/real 1347487839]  req@ffff880231971bf0 x1412943327533303/t0(0) o250->MGC192.168.1.205@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1347487850 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  813.153531] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487849/real 1347487849]  req@ffff880247083bf0 x1412943327533308/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487865 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  817.156401] Lustre: lustre-OST0000: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8802390bdbf0, cur 1347487869 expire 1347487839 last 1347487817
[  838.152794] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347487869/real 1347487869]  req@ffff880262d9bbf0 x1412943327533318/t0(0) o38->lustre-MDT0000-mdc-ffff880239395bf0@0@lo:12/10 lens 400/544 e 0 to 1 dl 1347487890 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  838.157141] Lustre: 4504:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[  840.056007] BUG: soft lockup - CPU#1 stuck for 67s! [obd_zombid:4474]
[  840.056255] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mds mgs lquota jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic libcfs sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[  840.056255] CPU 1 
[  840.056255] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mds mgs lquota jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic libcfs sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[  840.056255] 
[  840.056255] Pid: 4474, comm: obd_zombid Not tainted 2.6.32-debug #3 Bochs Bochs
[  840.056255] RIP: 0010:[<ffffffffa0388808>]  [<ffffffffa0388808>] server_deregister_mount+0xa8/0x390 [obdclass]
[  840.056255] RSP: 0018:ffff880292c3ddd0  EFLAGS: 00010282
[  840.056255] RAX: ffff880265478e50 RBX: ffff880292c3dde0 RCX: 0000000000000000
[  840.056255] RDX: ffff880292e45f08 RSI: 0000000000000000 RDI: ffff88028fa5fc70
[  840.056255] RBP: ffffffff8100bc0e R08: 00000000ffffffff R09: 000000000000009e
[  840.056255] R10: 000000000000000f R11: 000000000000000f R12: ffff8802654b008c
[  840.056255] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffff
[  840.056255] FS:  00007f3b197eb700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
[  840.056255] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  840.056255] CR2: ffff880292e45fd8 CR3: 0000000001a25000 CR4: 00000000000006e0
[  840.056255] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  840.056255] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  840.056255] Process obd_zombid (pid: 4474, threadinfo ffff880292c3c000, task ffff88028d4a23c0)
[  840.056255] Stack:
[  840.056255]  ffff880265478e50 ffff8802654b008c ffff880292c3de00 ffffffffa0388c50
[  840.056255] <d> ffff8802654b0080 ffff8802654b008c ffff880292c3de20 ffffffffa098eb3e
[  840.056255] <d> ffff8802654b0080 0000000000000000 ffff880292c3de90 ffffffffa037b7f2
[  840.056255] Call Trace:
[  840.056255]  [<ffffffffa0388c50>] ? server_put_mount+0x160/0x290 [obdclass]
[  840.056255]  [<ffffffffa098eb3e>] ? mgs_cleanup+0x4e/0x1c0 [mgs]
[  840.056255]  [<ffffffffa037b7f2>] ? class_decref+0x212/0x590 [obdclass]
[  840.056255]  [<ffffffffa0364c74>] ? obd_zombie_impexp_cull+0x314/0x620 [obdclass]
[  840.056255]  [<ffffffffa0365045>] ? obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
[  840.056255]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[  840.056255]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[  840.056255]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[  840.056255] Code: a6 3b a0 c7 05 6a d4 08 00 93 00 00 00 48 c7 05 6b d4 08 00 00 00 00 00 c7 05 59 d4 08 00 04 00 00 01 48 8b 50 10 48 85 d2 74 07 <44> 8b 82 d0 00 00 00 4c 89 e1 48 c7 c6 a8 25 3c a0 48 c7 c7 40 
[  840.056255] Call Trace:
[  840.056255]  [<ffffffffa0388c50>] ? server_put_mount+0x160/0x290 [obdclass]
[  840.056255]  [<ffffffffa098eb3e>] ? mgs_cleanup+0x4e/0x1c0 [mgs]
[  840.056255]  [<ffffffffa037b7f2>] ? class_decref+0x212/0x590 [obdclass]
[  840.056255]  [<ffffffffa0364c74>] ? obd_zombie_impexp_cull+0x314/0x620 [obdclass]
[  840.056255]  [<ffffffffa0365045>] ? obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
[  840.056255]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[  840.056255]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[  840.056255]  [<ffffffffa0364f80>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[  840.056255]  [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Alex Zhuravlev [ 13/Sep/12 ]

http://review.whamcloud.com/#change,3982

Comment by Alex Zhuravlev [ 21/Sep/12 ]

can be closed now?

Generated at Sat Feb 10 01:20:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.