[LU-2092] cpu lockup on lustre umount Created: 04/Oct/12 Updated: 05/Oct/12 Resolved: 05/Oct/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4370 |
| Description |
|
Ater recent landings of quota code the first unmount by sanity.sh on my test node (8 cores, 10G RAM) locks up: [23052.136007] BUG: soft lockup - CPU#5 stuck for 67s! [umount:19443] [23052.136988] Modules linked in: lustre obdfilter osp lod ost mdt osd_ldiskfs f sfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptl rpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 m bcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_p iix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [23052.137256] CPU 5[23052.137256] Modules linked in: lustre obdfilter osp lod ost mdt osd_ldiskfs f sfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptl rpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 m bcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_p iix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [23052.137256] [23052.137256] Pid: 19443, comm: umount Tainted: G W --------------- 2.6.32-debug #5 Bochs Bochs[23052.137256] RIP: 0010:[<ffffffffa078c163>] [<ffffffffa078c163>] lprocfs_remo ve_nolock+0x23/0x140 [obdclass] [23052.137256] RSP: 0018:ffff880158fe18c8 EFLAGS: 00010286 [23052.137256] RAX: ffffffffa0814020 RBX: ffff880158fe18f8 RCX: 0000000000000000[23052.137256] RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffff8802187f9988 [23052.137256] RBP: ffffffff8100bc0e R08: ffff880158fe1908 R09: 0000000000000059 [23052.137256] R10: 0000000000000001 R11: 0000000087654321 R12: ffff880158fe19d8 [23052.137256] R13: ffff8802187f98a8 R14: ffff88027a589df0 R15: ffff88027a589ee8 [23052.137256] FS: 00007fca35fa7740(0000) GS:ffff880028340000(0000) knlGS:0000000000000000 [23052.137256] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [23052.137256] CR2: ffff8802187faf78 CR3: 0000000158247000 CR4: 00000000000006e0 [23052.137256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [23052.137256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [23052.137256] Process umount (pid: 19443, threadinfo ffff880158fe0000, task ffff8801b4d4c2c0) [23052.137256] Stack: [23052.137256] ffff880100000010 0000000000000246 ffff8802187f9988 ffff880158fe19d8 [23052.137256] <d> ffff8802187f98a8 ffff88027a589df0 ffff880158fe1918 ffffffffa078c3d5 [23052.137256] <d> ffff880158fe1908 ffff8802187f9930 ffff880158fe1948 ffffffffa046dede [23052.137256] Call Trace: [23052.137256] [<ffffffffa078c3d5>] ? lprocfs_remove+0x25/0x40 [obdclass] [23052.137256] [<ffffffffa046dede>] ? qmt_pool_free+0x3e/0x240 [lquota] [23052.137256] [<ffffffffa046e2fc>] ? qmt_pool_fini+0xbc/0x240 [lquota] [23052.137256] [<ffffffffa0468b00>] ? qmt_device_fini+0x140/0x500 [lquota] [23052.137256] [<ffffffffa07ad3c7>] ? class_cleanup+0x577/0xdc0 [obdclass] [23052.137256] [<ffffffffa07825dc>] ? class_name2dev+0x7c/0xf0 [obdclass] [23052.137256] [<ffffffffa07aecb5>] ? class_process_config+0x10a5/0x1ca0 [obdclass] [23052.137256] [<ffffffffa04bc118>] ? libcfs_log_return+0x28/0x40 [libcfs] [23052.137256] [<ffffffffa07a8671>] ? lustre_cfg_new+0x391/0x7e0 [obdclass] [23052.137256] [<ffffffffa07afa29>] ? class_manual_cleanup+0x179/0x6e0 [obdclass] [23052.137256] [<ffffffffa04bc118>] ? libcfs_log_return+0x28/0x40 [libcfs] [23052.137256] [<ffffffffa046833e>] ? qmt_device_obd_disconnect+0xee/0x130 [lquota] [23052.137256] [<ffffffffa0a3f4fe>] ? mdt_quota_fini+0xee/0x410 [mdt] [23052.137256] [<ffffffffa0a40c54>] ? mdt_device_fini+0x74/0x500 [mdt] [23052.137256] [<ffffffffa07ad3c7>] ? class_cleanup+0x577/0xdc0 [obdclass] [23052.137256] [<ffffffffa07825dc>] ? class_name2dev+0x7c/0xf0 [obdclass] [23052.137256] [<ffffffffa07aecb5>] ? class_process_config+0x10a5/0x1ca0 [obdclass] [23052.137256] [<ffffffffa04bc118>] ? libcfs_log_return+0x28/0x40 [libcfs] [23052.137256] [<ffffffffa07a8671>] ? lustre_cfg_new+0x391/0x7e0 [obdclass] [23052.137256] [<ffffffffa07afa29>] ? class_manual_cleanup+0x179/0x6e0 [obdclass] [23052.137256] [<ffffffffa07825dc>] ? class_name2dev+0x7c/0xf0 [obdclass] [23052.137256] [<ffffffffa07bbf9c>] ? server_put_super+0x5ec/0x1230 [obdclass] [23052.137256] [<ffffffff8117fabb>] ? generic_shutdown_super+0x5b/0xe0 [23052.137256] [<ffffffff8117fba6>] ? kill_anon_super+0x16/0x60 [23052.137256] [<ffffffffa07b1846>] ? lustre_kill_super+0x36/0x60 [obdclass] [23052.137256] [<ffffffff81180c35>] ? deactivate_super+0x85/0xa0 [23052.137256] [<ffffffff8119ccaf>] ? mntput_no_expire+0xbf/0x110 [23052.137256] [<ffffffff8119d75b>] ? sys_umount+0x7b/0x3a0 [23052.137256] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b [23052.137256] Code: 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 83 ec 10 0f 1f 44 00 00 48 8b 1f 48 85 db 74 60 48 c7 07 00 00 00 00 <4c> 8b 73 48 4d 85 f6 75 0f e9 c4 00 00 00 0f 1f 80 00 00 00 00 [23052.137256] Call Trace: [23052.137256] [<ffffffffa04c16d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [23052.137256] [<ffffffffa078c3d5>] ? lprocfs_remove+0x25/0x40 [obdclass] [23052.137256] [<ffffffffa046dede>] ? qmt_pool_free+0x3e/0x240 [lquota] [23052.137256] [<ffffffffa046e2fc>] ? qmt_pool_fini+0xbc/0x240 [lquota] [23052.137256] [<ffffffffa0468b00>] ? qmt_device_fini+0x140/0x500 [lquota] [23052.137256] [<ffffffffa07ad3c7>] ? class_cleanup+0x577/0xdc0 [obdclass] [23052.137256] [<ffffffffa07825dc>] ? class_name2dev+0x7c/0xf0 [obdclass] [23052.137256] [<ffffffffa07aecb5>] ? class_process_config+0x10a5/0x1ca0 [obdclass] [23052.137256] [<ffffffffa04bc118>] ? libcfs_log_return+0x28/0x40 [libcfs] [23052.137256] [<ffffffffa07a8671>] ? lustre_cfg_new+0x391/0x7e0 [obdclass] [23052.137256] [<ffffffffa07afa29>] ? class_manual_cleanup+0x179/0x6e0 [obdclass] [23052.137256] [<ffffffffa04bc118>] ? libcfs_log_return+0x28/0x40 [libcfs] [23052.137256] [<ffffffffa046833e>] ? qmt_device_obd_disconnect+0xee/0x130 [lquota] [23052.137256] [<ffffffffa0a3f4fe>] ? mdt_quota_fini+0xee/0x410 [mdt] [23052.137256] [<ffffffffa0a40c54>] ? mdt_device_fini+0x74/0x500 [mdt] [23052.137256] [<ffffffffa07ad3c7>] ? class_cleanup+0x577/0xdc0 [obdclass] [23052.137256] [<ffffffffa07825dc>] ? class_name2dev+0x7c/0xf0 [obdclass] [23052.137256] [<ffffffffa07aecb5>] ? class_process_config+0x10a5/0x1ca0 [obdclass] [23052.137256] [<ffffffffa04bc118>] ? libcfs_log_return+0x28/0x40 [libcfs] [23052.137256] [<ffffffffa07a8671>] ? lustre_cfg_new+0x391/0x7e0 [obdclass] [23052.137256] [<ffffffffa07afa29>] ? class_manual_cleanup+0x179/0x6e0 [obdclass] [23052.137256] [<ffffffffa07825dc>] ? class_name2dev+0x7c/0xf0 [obdclass] [23052.137256] [<ffffffffa07bbf9c>] ? server_put_super+0x5ec/0x1230 [obdclass] [23052.137256] [<ffffffff8117fabb>] ? generic_shutdown_super+0x5b/0xe0 [23052.137256] [<ffffffff8117fba6>] ? kill_anon_super+0x16/0x60 [23052.137256] [<ffffffffa07b1846>] ? lustre_kill_super+0x36/0x60 [obdclass] [23052.137256] [<ffffffff81180c35>] ? deactivate_super+0x85/0xa0 [23052.137256] [<ffffffff8119ccaf>] ? mntput_no_expire+0xbf/0x110 [23052.137256] [<ffffffff8119d75b>] ? sys_umount+0x7b/0x3a0 [23052.137256] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b much later followed by: [26328.160006] BUG: soft lockup - CPU#7 stuck for 68s! [rsyslogd:1134][26328.160007] Modules linked in: lustre obdfilter osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_p iix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [26328.160017] CPU 7[26328.160018] Modules linked in: lustre obdfilter osp lod ost mdt osd_ldiskfs f sfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_g eneric sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 m bcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_p iix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [26328.160027] [26328.160028] Pid: 1134, comm: rsyslogd Tainted: G W --------------- 2.6.32-debug #5 Bochs Bochs [26328.160029] RIP: 0010:[<ffffffff8104873a>] [<ffffffff8104873a>] flush_tlb_others_ipi+0x11a/0x130 [26328.160033] RSP: 0018:ffff88027ab71d68 EFLAGS: 00000246 [26328.160034] RAX: 0000000000000000 RBX: ffff88027ab71da8 RCX: 0000000000000008 [26328.160034] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff81e09368 [26328.160035] RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000008 [26328.160036] R10: 0000000000000000 R11: 00007f9700000000 R12: ffff88027ab71d58 [26328.160037] R13: ffffffff8100bc0e R14: 0000000000000000 R15: ffff88027a02df28 [26328.160038] FS: 00007f970ffff700(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000 [26328.160039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [26328.160040] CR2: 00000000006dee5c CR3: 000000025a13e000 CR4: 00000000000006e0 [26328.160043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [26328.160045] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [26328.160046] Process rsyslogd (pid: 1134, threadinfo ffff88027ab70000, task ffff88025a1c4580) [26328.160047] Stack: [26328.160048] ffffffffffffffff ffffffff81e09340 00007f970004e000 ffff8802674b2a30 [26328.160049] <d> ffffffffffffffff ffff8802674b2d18 00007f970004f000 ffff880259c9d000 [26328.160050] <d> ffff88027ab71dd8 ffffffff810487c6 ffff88027ab71dd8 ffff8802674b2a30 [26328.160052] Call Trace: [26328.160054] [<ffffffff810487c6>] ? native_flush_tlb_others+0x76/0x90 [26328.160056] [<ffffffff8104899c>] ? flush_tlb_mm+0x5c/0xa0 [26328.160058] [<ffffffff81144ce0>] ? mprotect_fixup+0x680/0x820 [26328.160060] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [26328.160062] [<ffffffff81145005>] ? sys_mprotect+0x185/0x250 [26328.160064] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b [26328.160065] Code: f8 c9 c3 66 0f 1f 44 00 00 48 8b 05 e1 c9 b5 00 41 8d b5 f0 00 00 00 4c 89 e7 ff 90 e0 00 00 00 eb 09 0f 1f 80 00 00 00 00 f3 90 <8b> 35 84 39 b5 00 4c 89 e7 e8 a8 73 23 00 85 c0 74 ec eb 90 66 [26328.160074] Call Trace: [26328.160075] [<ffffffff81048748>] ? flush_tlb_others_ipi+0x128/0x130 [26328.160077] [<ffffffff810487c6>] ? native_flush_tlb_others+0x76/0x90 [26328.160078] [<ffffffff8104899c>] ? flush_tlb_mm+0x5c/0xa0 [26328.160080] [<ffffffff81144ce0>] ? mprotect_fixup+0x680/0x820 [26328.160081] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [26328.160083] [<ffffffff81145005>] ? sys_mprotect+0x185/0x250 [26328.160085] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Oleg Drokin [ 04/Oct/12 ] |
|
run it another time and this time after first soft lookup report caught a kernel panic in the same spot: [ 1151.204782] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 1151.206037] last sysfs file: /sys/devices/pci0000:00/0000:00:06.0/local_cpus[ 1151.207144] CPU 7
[ 1151.207444] Modules linked in: lustre obdfilter osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mds mgs lquota jbd obdecho mgc lov osc mdc lm
v fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic libcfs sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcach
e jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix d
m_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[ 1151.214574]
[ 1151.214813] Pid: 27237, comm: umount Not tainted 2.6.32-debug #5 Bochs Bochs
[ 1151.215941] RIP: 0010:[<ffffffffa036e17b>] [<ffffffffa036e17b>] lprocfs_remo
ve_nolock+0x3b/0x140 [obdclass]
[ 1151.217529] RSP: 0018:ffff8800d56c38c8 EFLAGS: 00010202
[ 1151.218378] RAX: ffffffffa03f5fe0 RBX: 016f5905c7480004 RCX: 0000000000000000[ 1151.219471] RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffff88015849aba8
[ 1151.220571] RBP: ffff8800d56c38f8 R08: ffff8800d56c3908 R09: 0000000000000059
[ 1151.221773] R10: 0000000000000001 R11: 0000000087654321 R12: 016f5905c7480004
[ 1151.222957] R13: ffff88015849aac8 R14: ffffffffa036e3d5 R15: ffff8801583e5ee8
[ 1151.224383] FS: 00007f12881c0740(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000
[ 1151.225599] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1151.226535] CR2: 00007fff9ff07008 CR3: 00000000d717e000 CR4: 00000000000006e0
[ 1151.227686] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1151.228865] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1151.230037] Process umount (pid: 27237, threadinfo ffff8800d56c2000, task ffff8801009d8500)
[ 1151.231403] Stack:
[ 1151.231732] ffff880000000010 0000000000000246 ffff88015849aba8 ffff8800d56c39d8
[ 1151.233030] <d> ffff88015849aac8 ffff8801583e5df0 ffff8800d56c3918 ffffffffa036e3d5
[ 1151.234359] <d> ffff8800d56c3908 ffff88015849ab50 ffff8800d56c3948 ffffffffa0988ede
[ 1151.235680] Call Trace:
[ 1151.236105] [<ffffffffa036e3d5>] lprocfs_remove+0x25/0x40 [obdclass]
[ 1151.237162] [<ffffffffa0988ede>] qmt_pool_free+0x3e/0x240 [lquota]
[ 1151.238222] [<ffffffffa09892fc>] qmt_pool_fini+0xbc/0x240 [lquota]
[ 1151.239226] [<ffffffffa0983b00>] qmt_device_fini+0x140/0x500 [lquota]
[ 1151.240307] [<ffffffffa038f3b7>] class_cleanup+0x577/0xdc0 [obdclass]
[ 1151.241541] [<ffffffffa03645dc>] ? class_name2dev+0x7c/0xf0 [obdclass]
[ 1151.242830] [<ffffffffa0390ca5>] class_process_config+0x10a5/0x1ca0 [obdclass]
[ 1151.244146] [<ffffffffa01f3118>] ? libcfs_log_return+0x28/0x40 [libcfs]
[ 1151.245446] [<ffffffffa038a661>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
[ 1151.246830] [<ffffffffa0391a19>] class_manual_cleanup+0x179/0x6e0 [obdclass]
[ 1151.248127] [<ffffffffa01f3118>] ? libcfs_log_return+0x28/0x40 [libcfs]
[ 1151.249497] [<ffffffffa098333e>] qmt_device_obd_disconnect+0xee/0x130 [lquot
...
|
| Comment by Oleg Drokin [ 04/Oct/12 ] |
|
Ok, so the problem is improper cleanup order, where quota master proc entry is removed before quota pools are stopped, and the quota pools have their own entries registered under quota master proc dir hierarchy (and removed recursively on quota master proc entry removal). Patch at http://review.whamcloud.com/4191 |
| Comment by Oleg Drokin [ 05/Oct/12 ] |
|
Patch landed, fixed |