Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.6.0, Lustre 2.5.1
-
CentOS 6.5
-
3
-
15202
Description
We got the following oops during testing:
<4>PGD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 1
<4>Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc VSMqfs(P)(U) autofs4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ppdev vmware_balloon parport_pc parport vmxnet3 sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 24385, comm: ldlm_cb00_056 Tainted: P --------------- 2.6.32-431.17.1.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
<4>RIP: 0010:[<ffffffff8118a509>] [<ffffffff8118a509>] fput+0x9/0x30
<4>RSP: 0018:ffff88012db55c20 EFLAGS: 00010246
<4>RAX: 00000000ffffffe0 RBX: ffff8800a8ea4fc0 RCX: 0000000000000000
<4>RDX: ffffffffa03c9eb0 RSI: 0000000000000000 RDI: 0000000000000000
<4>RBP: ffff88012db55c20 R08: 00000000ffffff0a R09: 00000000fffffffc
<4>R10: 0000000000000001 R11: 282064656c696166 R12: ffffffffa03c9c60
<4>R13: ffff88005df240f8 R14: 0000000000000000 R15: ffff88013b4ca000
<4>FS: 0000000000000000(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000000030 CR3: 0000000001a85000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ldlm_cb00_056 (pid: 24385, threadinfo ffff88012db54000, task ffff88012da5f500)
<4>Stack:
<4> ffff88012db55c60 ffffffffa0388044 0000000000000002 00000000ffffffe0
<4><d> ffff88005df240f8 ffff88005df24100 ffffffffa07103be ffff8801398d7000
<4><d> ffff88012db55cc0 ffffffffa08649f7 ffff88008c9af3f0 000000008116f303
<4>Call Trace:
<4> [<ffffffffa0388044>] libcfs_kkuc_group_put+0x94/0x180 [libcfs]
<4> [<ffffffffa08649f7>] mdc_set_info_async+0x147/0x780 [mdc]
<4> [<ffffffffa0699fad>] ldlm_callback_handler+0x4dd/0x1800 [ptlrpc]
<4> [<ffffffffa04c321f>] ? keys_fill+0x6f/0x190 [obdclass]
<4> [<ffffffffa06b8f6c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa06bf61b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
<4> [<ffffffffa06c7a35>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa036f4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa03804ff>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa06bf119>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa06c8d9d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
<4> [<ffffffffa06c82b0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
<4> [<ffffffff8109ab56>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109aac0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: fe ff ff 31 d2 48 89 de 83 cf ff ff d0 e9 da fe ff ff 48 89 df e8 f8 63 04 00 e9 bb fe ff ff 0f 1f 00 55 48 89 e5 0f 1f 44 00 00 <f0> 48 ff 4f 30 0f 94 c0 84 c0 75 0b c9 c3 66 0f 1f 84 00 00 00
<1>RIP [<ffffffff8118a509>] fput+0x9/0x30
<4> RSP <ffff88012db55c20>
<4>CR2: 0000000000000030
crash bt:
PID: 24385 TASK: ffff88012da5f500 CPU: 1 COMMAND: "ldlm_cb00_056"
#0 [ffff88012db55810] machine_kexec at ffffffff81038f3b
#1 [ffff88012db55870] crash_kexec at ffffffff810c59f2
#2 [ffff88012db55940] oops_end at ffffffff8152b7f0
#3 [ffff88012db55970] no_context at ffffffff8104a00b
#4 [ffff88012db559c0] __bad_area_nosemaphore at ffffffff8104a295
#5 [ffff88012db55a10] bad_area_nosemaphore at ffffffff8104a363
#6 [ffff88012db55a20] __do_page_fault at ffffffff8104aabf
#7 [ffff88012db55b40] do_page_fault at ffffffff8152d73e
#8 [ffff88012db55b70] page_fault at ffffffff8152aaf5
[exception RIP: fput+9]
RIP: ffffffff8118a509 RSP: ffff88012db55c20 RFLAGS: 00010246
RAX: 00000000ffffffe0 RBX: ffff8800a8ea4fc0 RCX: 0000000000000000
RDX: ffffffffa03c9eb0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88012db55c20 R8: 00000000ffffff0a R9: 00000000fffffffc
R10: 0000000000000001 R11: 282064656c696166 R12: ffffffffa03c9c60
R13: ffff88005df240f8 R14: 0000000000000000 R15: ffff88013b4ca000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88012db55c28] libcfs_kkuc_group_put at ffffffffa0388044 [libcfs]
#10 [ffff88012db55c68] mdc_set_info_async at ffffffffa08649f7 [mdc]
#11 [ffff88012db55cc8] ldlm_callback_handler at ffffffffa0699fad [ptlrpc]
#12 [ffff88012db55d68] ptlrpc_server_handle_request at ffffffffa06c7a35 [ptlrpc]
#13 [ffff88012db55e48] ptlrpc_main at ffffffffa06c8d9d [ptlrpc]
#14 [ffff88012db55ee8] kthread at ffffffff8109ab56
#15 [ffff88012db55f48] kernel_thread at ffffffff8100c20a
The offending line in libcfs_kkuc_group_put() is
fput(reg->kr_fp);
reg comes from kkuc_groups, which is an array of lists.
crash64> rd kkuc_groups 8
ffffffffa03c9c40: 0000000000000000 0000000000000000 ................
ffffffffa03c9c50: 0000000000000000 0000000000000000 ................
ffffffffa03c9c60: ffff8800a8ea4fc0 ffff8800a8ea4fc0 .O.......O......
ffffffffa03c9c70: 0000000000000000 0000000000000000 ................
So only one of the lists, the one at ffffffffa03c9c60 (kkuc_groups+32), is non-empty, and it holds a single element.
list ffffffffa03c9c60 -s kkuc_reg
...
ffff8800a8ea4fc0
struct kkuc_reg {
kr_chain = {
next = 0xffffffffa03c9c60 <kkuc_groups+32>,
prev = 0xffffffffa03c9c60 <kkuc_groups+32>
},
kr_uid = 23389,
kr_fp = 0x0,
kr_data = 0xffff8800a8ea4f80
}
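So reg->kr_fp is NULL, which matches the oops: RDI (the argument to
fput()) is 0 and CR2 is 0x30, i.e. a write at a small offset from a NULL
pointer. I can't be certain this is the reg that was in use at the time of
the crash, but since it's the only one in the list, and
RBX=ffff8800a8ea4fc0, that must be it.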
Looking at libcfs_kkuc_group_put(), it appears that it is not locking
things properly:
down_read(&kg_sem);
cfs_list_for_each_entry(reg, &kkuc_groups[group], kr_chain) {
...
fput(reg->kr_fp);
reg->kr_fp = NULL;
...
up_read(&kg_sem);
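Since reg is modified inside the loop, the lock should be taken with
down_write()/up_write() instead of down_read()/up_read(). I suspect a race
in which two callers executed this function concurrently under the read
lock: the first called fput() and set kr_fp to NULL, and the second then
passed the NULL pointer to fput() and crashed.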