[LU-1539] NULL-pointer dereference in lnet_match_md() Created: 19/Jun/12 Updated: 08/Apr/13 Resolved: 08/Apr/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Li Wei (Inactive) | Assignee: | Li Wei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
A single-node cluster (i.e., an x86 VM with four logical CPUs and 512 MB memory, running 2.6.32-220.4.2.el6.78b37c2 and master at the 2.3.56 tag). |
||
| Severity: | 3 |
| Rank (Obsolete): | 4268 |
| Description |
|
I had been running sanity 103 in a loop for a hour or so before hitting this: CentOS Linux release 6.0 (Final) Kernel 2.6.32-220.4.2.el6.78b37c2 on an i686 h221g login: LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11 LustreError: 3013:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114LustreError: 3606:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 6704:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 9790:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 12877:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 15965:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 2809:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 2422: rc -2 LustreError: 19052:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 22138:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 2808:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 3311: rc -2 LustreError: 2807:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 4053: rc -2 LustreError: 28312:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 28312:0:(mdc_request.c:1429:mdc_quotactl()) Skipped 1 previous similar message LustreError: 2808:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 4255: rc -2 LustreError: 8444:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 8444:0:(mdc_request.c:1429:mdc_quotactl()) Skipped 3 previous similar messages LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 10087: rc -2 LustreError: 671:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114 LustreError: 671:0:(mdc_request.c:1429:mdc_quotactl()) Skipped 7 previous similar messages LustreError: 2807:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 11031: rc -2 LustreError: 2809:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 11568: rc -2 LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 13501: rc -2 LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) Skipped 1 previous similar message LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 15694: rc -2 BUG: unable to handle kernel NULL pointer dereference at 00000002 IP: [<e089cc8e>] lnet_match_md+0x25e/0x410 [lnet] *pdpt = 000000001258c001 *pde = 0000000000000000 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/possible Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) lquota(U) ldiskfs(U) exportfs jbd mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sunrpc ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate dm_mirror dm_region_hash dm_log ppdev parport_pc parport sg i2c_piix4 i2c_core snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix ahci dm_mod [last unloaded: mperf] Pid: 6682, comm: rm Tainted: P W ---------------- 2.6.32-220.4.2.el6.78b37c2 #1 innotek GmbH VirtualBox EIP: 0060:[<e089cc8e>] EFLAGS: 00010246 CPU: 2 EIP is at lnet_match_md+0x25e/0x410 [lnet] EAX: dce57948 EBX: 0000000c ECX: 00000002 EDX: c9638700 ESI: ca675f40 EDI: 00000001 EBP: c9dc36f0 ESP: c9dc36a8 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process rm (pid: 6682, ti=c9dc2000 task=cb486ab0 task.ti=c9dc2000) Stack: 00000000 00090000 00003039 00000240 00000000 f3b57d0e 0004fdfe ca675f40 <0> ca86e200 c9dc377c c9dc3778 dce57948 00000002 00090000 dce57940 ca86e200 <0> 00003039 e08c5b38 c9dc378c e08a7977 00000000 00090000 00003039 00000240 Call Trace: [<e08a7977>] ? lnet_parse+0xdd7/0x1f40 [lnet] [<c05fe095>] ? string+0x35/0xd0 [<c05ff38a>] ? vsnprintf+0x2ea/0x3f0 [<e08a92c6>] ? lolnd_send+0x36/0xb0 [lnet] [<e089f087>] ? lnet_ni_send+0x37/0xf0 [lnet] [<e08a300d>] ? lnet_send+0x3bd/0xe20 [lnet] [<e08a6052>] ? LNetPut+0x332/0xe80 [lnet] [<e18c32b0>] ? ptl_send_buf+0x2a0/0x940 [ptlrpc] [<e18c77a6>] ? ptl_send_rpc+0x686/0xfa0 [ptlrpc] [<e18b7a0e>] ? ptlrpc_send_new_req+0x50e/0xa70 [ptlrpc] [<e18be38e>] ? ptlrpc_set_wait+0x5ee/0xac0 [ptlrpc] [<e18b7fd9>] ? ptlrpc_set_add_req+0x69/0xe0 [ptlrpc] [<e18be8e8>] ? ptlrpc_queue_wait+0x88/0x330 [ptlrpc] [<e188fcd2>] ? ldlm_cli_enqueue+0x402/0x960 [ptlrpc] [<e0a8e712>] ? mdc_enter_request+0x22/0x190 [mdc] [<e18924c0>] ? ldlm_completion_ast+0x0/0xa10 [ptlrpc] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0a93917>] ? mdc_enqueue+0x327/0x1330 [mdc] [<e0a94b6c>] ? mdc_intent_lock+0x24c/0x7cc [mdc] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e18924c0>] ? ldlm_completion_ast+0x0/0xa10 [ptlrpc] [<e0a94920>] ? mdc_intent_lock+0x0/0x7cc [mdc] [<e0b94625>] ? md_intent_lock+0x115/0x5c0 [lmv] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0b982f4>] ? lmv_intent_open+0x364/0x1330 [lmv] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0b995bb>] ? lmv_intent_lock+0x2fb/0x4a0 [lmv] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0b992c0>] ? lmv_intent_lock+0x0/0x4a0 [lmv] [<e0e1f64f>] ? md_intent_lock.clone.0+0xff/0x5b0 [lustre] [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre] [<e0e21119>] ? ll_revalidate_it+0x189/0x1140 [lustre] [<e0e7889c>] ? ll_convert_intent+0xdc/0x310 [lustre] [<e0e2220b>] ? ll_revalidate_nd+0x13b/0x510 [lustre] [<c0536db0>] ? do_lookup+0x40/0x180 [<c05375a3>] ? __link_path_walk+0x5e3/0xd60 [<c051cc9d>] ? kmem_cache_alloc+0xfd/0x110 [<c0537f31>] ? path_walk+0x51/0xc0 [<c05380b9>] ? do_path_lookup+0x59/0x90 [<c0538e0c>] ? do_filp_open+0xdc/0xb00 [<c05057d1>] ? handle_mm_fault+0x131/0x1d0 [<c05286a8>] ? do_sys_open+0x58/0x130 [<c04afebc>] ? audit_syscall_entry+0x21c/0x240 [<c05287b1>] ? sys_openat+0x31/0x50 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 Code: 01 74 22 b8 0c 28 8c e0 c7 05 1c 28 8c e0 00 00 04 00 e8 76 b3 f7 ff 8d b6 00 00 00 00 83 c1 08 e9 4f ff ff ff 8b 4d e8 8b 45 e4 <8b> 11 39 c1 74 0c 89 c8 89 55 e8 e9 53 ff ff ff 66 90 83 ff 02 EIP: [<e089cc8e>] lnet_match_md+0x25e/0x410 [lnet] SS:ESP 0068:c9dc36a8 CR2: 0000000000000002 The test was at [...]
performing setfacl...
[3] $ mkdir d -- ok
[4] $ chown bin:bin d -- ok
[5] $ cd d -- ok
[7] $ su bin -- ok
[8] $ sg bin -- [(1,0)(1 1 2 3,1 1 2 3)]ok
[9] $ umask 027 -- ok
[10] $ touch g -- ok
[11] $ ls -dl g | awk '{print $1}' -- ok
[14] $ setfacl -m m:- g -- ok
[15] $ ls -dl g | awk '{print $1}' -- ok
[18] $ getfacl g -- ok
[28] $ setfacl -x m g -- ok
[29] $ getfacl g -- ok
[38] $ setfacl -m u:daemon:rw g -- ok
[39] $ getfacl g -- ok
[50] $ setfacl -m u::rwx,g::r-x,o:- g -- ok
[51] $ getfacl g -- ok
[62] $ setfacl -m u::rwx,g::r-x,o:-,m:- g -- ok
[63] $ getfacl g -- ok
[74] $ setfacl -m u::rwx,g::r-x,o:-,u:root:-,m:- g -- ok
[75] $ getfacl g -- ok
[87] $ setfacl -m u::rwx,g::r-x,o:-,u:root:-,m:- g -- ok
[88] $ getfacl g -- ok
[100] $ setfacl -m u::rwx,g::r-x,o:-,u:root:- g -- ok
[101] $ getfacl g -- ok
[113] $ setfacl --test -x u: g -- ok
[116] $ setfacl --test -x u:x -- ok
[119] $ setfacl -m d:u:root:rwx g -- ok
[122] $ setfacl -x m g -- ok
[129] $ mkdir d -- ok
[130] $ setfacl --test -m u::rwx,u:bin:rwx,g::r-x,o::--- d -- ok
[133] $ setfacl --test -m u::rwx,u:bin:rwx,g::r-x,m::---,o::--- d -- ok
[136] $ setfacl --test -d -m u::rwx,u:bin:rwx,g::r-x,o::--- d -- ok
[139] $ setfacl --test -d -m u::rwx,u:bin:rwx,g::r-x,m::---,o::--- d -- ok
[142] $ su -- ok
[143] $ cd .. -- ok
[144] $ rm -r d --
|
| Comments |
| Comment by Liang Zhen (Inactive) [ 19/Jun/12 ] |
|
Doug, could you help me on this? I will look into it too |
| Comment by Li Wei (Inactive) [ 19/Jun/12 ] |
|
Unfortunately, kdump didn't take over for some reason. So that's pretty much all information I have. |
| Comment by Doug Oucharek (Inactive) [ 22/Jun/12 ] |
|
After code inspecting lnet_match_md() and any other code which could mess with the MD/ME lists, I cannot see any way for a NULL pointer reference. Also, the call trace seems strange to me with the "string" and "vsnprintf" calls between lnet_parse() and lolnd_send(). I'm suspecting stack corruption so the NULL pointer may not have even happened in lnet_match_md(). I will try to run the sanity test myself to see if I can gather more info. However: I can see from Liang's changes which have landed to master that all of this code has changed (lnet_match_md() does not even exist anymore). Should we just retest when those changes come through? |
| Comment by Li Wei (Inactive) [ 22/Jun/12 ] |
|
I'm fine with leaving it there until I get the chance to try latest master next week. After all, this appeared only once on my VM. |
| Comment by Jodi Levi (Inactive) [ 05/Jul/12 ] |
|
Please reassign as appropriate when you have more information no this bug. |