Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.6.0, Lustre 2.5.1
-
Centos 6.5
-
3
-
15202
Description
We got the following oops during testing:
<4>PGD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 1
<4>Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc VSMqfs(P)(U) autofs4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ppdev vmware_balloon parport_pc parport vmxnet3 sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 24385, comm: ldlm_cb00_056 Tainted: P --------------- 2.6.32-431.17.1.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
<4>RIP: 0010:[<ffffffff8118a509>] [<ffffffff8118a509>] fput+0x9/0x30
<4>RSP: 0018:ffff88012db55c20 EFLAGS: 00010246
<4>RAX: 00000000ffffffe0 RBX: ffff8800a8ea4fc0 RCX: 0000000000000000
<4>RDX: ffffffffa03c9eb0 RSI: 0000000000000000 RDI: 0000000000000000
<4>RBP: ffff88012db55c20 R08: 00000000ffffff0a R09: 00000000fffffffc
<4>R10: 0000000000000001 R11: 282064656c696166 R12: ffffffffa03c9c60
<4>R13: ffff88005df240f8 R14: 0000000000000000 R15: ffff88013b4ca000
<4>FS: 0000000000000000(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000000030 CR3: 0000000001a85000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ldlm_cb00_056 (pid: 24385, threadinfo ffff88012db54000, task ffff88012da5f500)
<4>Stack:
<4> ffff88012db55c60 ffffffffa0388044 0000000000000002 00000000ffffffe0
<4><d> ffff88005df240f8 ffff88005df24100 ffffffffa07103be ffff8801398d7000
<4><d> ffff88012db55cc0 ffffffffa08649f7 ffff88008c9af3f0 000000008116f303
<4>Call Trace:
<4> [<ffffffffa0388044>] libcfs_kkuc_group_put+0x94/0x180 [libcfs]
<4> [<ffffffffa08649f7>] mdc_set_info_async+0x147/0x780 [mdc]
<4> [<ffffffffa0699fad>] ldlm_callback_handler+0x4dd/0x1800 [ptlrpc]
<4> [<ffffffffa04c321f>] ? keys_fill+0x6f/0x190 [obdclass]
<4> [<ffffffffa06b8f6c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa06bf61b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
<4> [<ffffffffa06c7a35>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa036f4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa03804ff>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa06bf119>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa06c8d9d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
<4> [<ffffffffa06c82b0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
<4> [<ffffffff8109ab56>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109aac0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: fe ff ff 31 d2 48 89 de 83 cf ff ff d0 e9 da fe ff ff 48 89 df e8 f8 63 04 00 e9 bb fe ff ff 0f 1f 00 55 48 89 e5 0f 1f 44 00 00 <f0> 48 ff 4f 30 0f 94 c0 84 c0 75 0b c9 c3 66 0f 1f 84 00 00 00
<1>RIP [<ffffffff8118a509>] fput+0x9/0x30
<4> RSP <ffff88012db55c20>
<4>CR2: 0000000000000030
crash bt:
PID: 24385  TASK: ffff88012da5f500  CPU: 1  COMMAND: "ldlm_cb00_056"
 #0 [ffff88012db55810] machine_kexec at ffffffff81038f3b
 #1 [ffff88012db55870] crash_kexec at ffffffff810c59f2
 #2 [ffff88012db55940] oops_end at ffffffff8152b7f0
 #3 [ffff88012db55970] no_context at ffffffff8104a00b
 #4 [ffff88012db559c0] __bad_area_nosemaphore at ffffffff8104a295
 #5 [ffff88012db55a10] bad_area_nosemaphore at ffffffff8104a363
 #6 [ffff88012db55a20] __do_page_fault at ffffffff8104aabf
 #7 [ffff88012db55b40] do_page_fault at ffffffff8152d73e
 #8 [ffff88012db55b70] page_fault at ffffffff8152aaf5
    [exception RIP: fput+9]
    RIP: ffffffff8118a509  RSP: ffff88012db55c20  RFLAGS: 00010246
    RAX: 00000000ffffffe0  RBX: ffff8800a8ea4fc0  RCX: 0000000000000000
    RDX: ffffffffa03c9eb0  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88012db55c20   R8: 00000000ffffff0a   R9: 00000000fffffffc
    R10: 0000000000000001  R11: 282064656c696166  R12: ffffffffa03c9c60
    R13: ffff88005df240f8  R14: 0000000000000000  R15: ffff88013b4ca000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88012db55c28] libcfs_kkuc_group_put at ffffffffa0388044 [libcfs]
#10 [ffff88012db55c68] mdc_set_info_async at ffffffffa08649f7 [mdc]
#11 [ffff88012db55cc8] ldlm_callback_handler at ffffffffa0699fad [ptlrpc]
#12 [ffff88012db55d68] ptlrpc_server_handle_request at ffffffffa06c7a35 [ptlrpc]
#13 [ffff88012db55e48] ptlrpc_main at ffffffffa06c8d9d [ptlrpc]
#14 [ffff88012db55ee8] kthread at ffffffff8109ab56
#15 [ffff88012db55f48] kernel_thread at ffffffff8100c20a
The offending line in libcfs_kkuc_group_put() is:
    fput(reg->kr_fp);
reg comes from kkuc_groups, which is an array of lists.
crash64> rd kkuc_groups 8
ffffffffa03c9c40:  0000000000000000 0000000000000000   ................
ffffffffa03c9c50:  0000000000000000 0000000000000000   ................
ffffffffa03c9c60:  ffff8800a8ea4fc0 ffff8800a8ea4fc0   .O.......O......
ffffffffa03c9c70:  0000000000000000 0000000000000000   ................
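So only the list head at ffffffffa03c9c60 is populated, and since its next and prev both point to ffff8800a8ea4fc0, it holds exactly one element.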
list ffffffffa03c9c60 -s kkuc_reg
...
ffff8800a8ea4fc0
struct kkuc_reg {
  kr_chain = {
    next = 0xffffffffa03c9c60 <kkuc_groups+32>,
    prev = 0xffffffffa03c9c60 <kkuc_groups+32>
  },
  kr_uid = 23389,
  kr_fp = 0x0,
  kr_data = 0xffff8800a8ea4f80
}
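To illustrate why matching next/prev pointers imply a single entry, here is a minimal userspace sketch of a circular doubly linked list in the style of the kernel's list_head. The names list_node, list_init, list_add_tail, list_count and single_entry_shape are hypothetical and exist only for this illustration; they are not the kernel or libcfs implementation.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* An empty list is a head that points to itself in both directions. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link item in just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_node *item, struct list_node *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Walk from head until we come back around to it. */
static size_t list_count(const struct list_node *head)
{
    size_t n = 0;

    for (const struct list_node *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/*
 * Returns 1 if a list holding one entry has the shape seen in the dump:
 * head.next == head.prev == entry, and entry.next == entry.prev == head.
 */
int single_entry_shape(void)
{
    struct list_node head, entry;

    list_init(&head);
    list_add_tail(&entry, &head);

    return head.next == &entry && head.prev == &entry &&
           entry.next == &head && entry.prev == &head &&
           list_count(&head) == 1;
}
```

In the dump, the head at ffffffffa03c9c60 plays the role of head here, and the kr_chain member of the kkuc_reg at ffff8800a8ea4fc0 plays the role of entry.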
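So apparently reg->kr_fp is NULL. That matches the oops: RDI, the argument
to fput(), is 0, and the fault address in CR2 is 0x30, a small offset from
a NULL pointer. I can't be certain this is the reg that was being
processed, but it is the only one in the list and RBX=ffff8800a8ea4fc0
points to it, so that must be it.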
Looking at libcfs_kkuc_group_put(), it appears that it is not locking
things properly:
    down_read(&kg_sem);
    cfs_list_for_each_entry(reg, &kkuc_groups[group], kr_chain) {
        ...
        fput(reg->kr_fp);
        reg->kr_fp = NULL;
        ...
    up_read(&kg_sem);
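Since the loop modifies reg (it clears reg->kr_fp), holding kg_sem for
read is not enough, because down_read() lets several callers into the
loop at the same time. The lock should be taken with
down_write()/up_write() instead. I suspect there was a race where two
callers executed that function on the same entry: one won and released
the file, and the second crashed in fput().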