[LU-3032] crash in ll_ping Created: 26/Mar/13 Updated: 12/Sep/13 Resolved: 06/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 7398 |
| Description |
|
Running conf-sanity.sh in a loop I hit this crash: [ 6625.425296] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC [ 6625.425755] last sysfs file: /sys/devices/system/cpu/possible [ 6625.426029] CPU 1 [ 6625.426068] Modules linked in: ptlrpc(-) obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: fld] [ 6625.429023] [ 6625.429231] Pid: 4991, comm: ll_ping Not tainted 2.6.32-debug #6 Bochs Bochs [ 6625.429247] RIP: 0010:[<ffffffff8104d2e6>] [<ffffffff8104d2e6>] __wake_up_common+0x56/0x90 [ 6625.429247] RSP: 0018:ffff88008f9e5de0 EFLAGS: 00010082 [ 6625.429247] RAX: ffffffffa0d963fa RBX: ffff880098e26f68 RCX: 0000000000000000 [ 6625.429247] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffa0d963fa [ 6625.429247] RBP: ffff88008f9e5e20 R08: 0000000000000000 R09: 000000000000005c [ 6625.429247] R10: 0000000000000001 R11: 0000000000000000 R12: c284e8fffb9889d0 [ 6625.429247] R13: 00000000ffffb502 R14: 0000000000000000 R15: 0000000000000000 [ 6625.429247] FS: 00007f7026720700(0000) GS:ffff880006240000(0000) knlGS:0000000000000000 [ 6625.429247] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 6625.429247] CR2: 000000000089ee20 CR3: 00000000905c0000 CR4: 00000000000006e0 [ 6625.429247] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6625.429247] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 6625.429247] Process ll_ping (pid: 4991, threadinfo ffff88008f9e4000, task ffff8800af3ea5c0) [ 6625.429247] Stack: [ 6625.429247] ffffffffffffffff 0000000300000001 0000000000000001 ffff880098e26f68 [ 6625.429247] <d> 0000000000000282 0000000000000003 0000000000000001 0000000000000000 [ 6625.429247] <d> ffff88008f9e5e60 ffffffff81051f68 ffff88008f9e5e40 0000000000000001 [ 6625.429247] Call Trace: [ 6625.429247] [<ffffffff81051f68>] __wake_up+0x48/0x70 [ 6625.429247] [<ffffffffa0acf7fa>] cfs_waitq_signal+0x1a/0x20 [libcfs] [ 6625.429247] [<ffffffffa0d5f73f>] ptlrpc_pinger_main+0x5cf/0x7f0 [ptlrpc] [ 6625.429247] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [ 6625.429247] [<ffffffffa0d5f170>] ? ptlrpc_pinger_main+0x0/0x7f0 [ptlrpc] [ 6625.429247] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 6625.429247] [<ffffffffa0d5f170>] ? ptlrpc_pinger_main+0x0/0x7f0 [ptlrpc] [ 6625.429247] [<ffffffffa0d5f170>] ? ptlrpc_pinger_main+0x0/0x7f0 [ptlrpc] [ 6625.429247] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 6625.429247] Code: e8 18 48 39 c7 4c 8b 60 18 74 3d 49 83 ec 18 eb 0b 0f 1f 40 00 4c 89 e0 4c 8d 62 e8 44 8b 28 4c 89 f1 44 89 fa 8b 75 cc 48 89 c7 <ff> 50 10 85 c0 74 0c 41 83 e5 01 74 06 83 6d c8 01 74 0a 4c 39 test output had this: == conf-sanity test 22: start a client before osts (should return errs) == 03:20:55 (1364282455)
start mds service on centos6-14.localnet
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
Client mount with ost in logs, but none running
start ost1 service on centos6-14.localnet
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
Started lustre-OST0000
error: list_param: /proc/{fs,sys}/{lnet,lustre}/osc/lustre-OST0000-osc-MDT0000/ost_server_uuid: Found no match
error: list_param: /proc/{fs,sys}/{lnet,lustre}/osc/lustre-OST0000-osc-MDT0000/ost_server_uuid: Found no match
Stopping client centos6-14.localnet /mnt/lustre (opts:)
PASS
Client mount with a running ost
start ost1 service on centos6-14.localnet
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: centos6-14.localnet: -o user_xattr,flock centos6-14.localnet@tcp:/lustre /mnt/lustre
centos6-14.localnet: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 14 sec
centos6-14.localnet: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
setup single mount lustre success
PASS
umount lustre on /mnt/lustre.....
Stopping client centos6-14.localnet /mnt/lustre (opts:)
stop ost1 service on centos6-14.localnet
Stopping /mnt/ost1 (opts:-f) on centos6-14.localnet
waited 0 for 10 ST ost OSS OSS_uuid 0
Crashdump and modules are in /exports/crashdumps/192.168.10.224-2013-03-26-03\:21\:55/ |
| Comments |
| Comment by Li Wei (Inactive) [ 28/Mar/13 ] |
[ 6625.426068] Modules linked in: ptlrpc(-) [...] The ptlrpc kernel module was being removed. ptlrpc_stop_pinger() must have been called. The RIP was at linux-2.6.32-279.2.1.el6-debug/kernel/sched.c: 6182 0xffffffff8104d2da <__wake_up_common+74>: mov %r14,%rcx 0xffffffff8104d2dd <__wake_up_common+77>: mov %r15d,%edx 0xffffffff8104d2e0 <__wake_up_common+80>: mov -0x34(%rbp),%esi 0xffffffff8104d2e3 <__wake_up_common+83>: mov %rax,%rdi 0xffffffff8104d2e6 <__wake_up_common+86>: callq *0x10(%rax) From the "0x10" offset, I think RAX contains the address of "curr", which seems to contain arbitrary data (e.g., flags): $4 = {
flags = 4294948098,
private = 0x33e8fffdd6f8e8ff,
func = 0xfffc803ee8fff8ce,
task_list = {
next = 0xc284e8fffb9889e8,
prev = 0xc9fffacfbfe8fffb
}
}
I suspect the following race happened, resulting a use-after-free situation: ~ ptlrpc_stop_pinger() ~ ptlrpc_pinger_main()
-----------------------------------------------------------------------------
thread_set_flags(SVC_STOPPING)
cfs_waitq_signal()
mutex_unlock() ...
thread_set_flags(SVC_STOPPED)
l_wait_event(thread_is_stopped): Did not sleep
OBD_FREE_PTR(pinger_thread)
cfs_waitq_signal()
The memory used by pinger_thread might have been freed and reallocated to something else, when ptlrpc_pinger_main() used it in cvs_waitq_signal(). |
| Comment by Dmitry Eremin (Inactive) [ 12/Apr/13 ] |
|
I agree with investigation. I'd like to propose a patch http://review.whamcloud.com/6040. |