[LU-1539] NULL-pointer dereference in lnet_match_md() Created: 19/Jun/12  Updated: 08/Apr/13  Resolved: 08/Apr/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Li Wei (Inactive) Assignee: Li Wei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

A single-node cluster (i.e., an x86 VM with four logical CPUs and 512 MB memory, running 2.6.32-220.4.2.el6.78b37c2 and master at the 2.3.56 tag).


Severity: 3
Rank (Obsolete): 4268

 Description   

I had been running sanity 103 in a loop for a hour or so before hitting this:

CentOS Linux release 6.0 (Final)
Kernel 2.6.32-220.4.2.el6.78b37c2 on an i686

h221g login: LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
LustreError: 3013:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114LustreError: 3606:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 6704:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 9790:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 12877:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 15965:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 2809:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 2422: rc -2
LustreError: 19052:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 22138:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 2808:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 3311: rc -2
LustreError: 2807:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 4053: rc -2
LustreError: 28312:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 28312:0:(mdc_request.c:1429:mdc_quotactl()) Skipped 1 previous similar message
LustreError: 2808:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 4255: rc -2
LustreError: 8444:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 8444:0:(mdc_request.c:1429:mdc_quotactl()) Skipped 3 previous similar messages
LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 10087: rc -2
LustreError: 671:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 671:0:(mdc_request.c:1429:mdc_quotactl()) Skipped 7 previous similar messages
LustreError: 2807:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 11031: rc -2
LustreError: 2809:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 11568: rc -2
LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 13501: rc -2
LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) Skipped 1 previous similar message
LustreError: 3017:0:(ldlm_resource.c:1104:ldlm_resource_get()) lvbo_init failed for resource 15694: rc -2
BUG: unable to handle kernel NULL pointer dereference at 00000002
IP: [<e089cc8e>] lnet_match_md+0x25e/0x410 [lnet]
*pdpt = 000000001258c001 *pde = 0000000000000000 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/possible
Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) lquota(U) ldiskfs(U) exportfs jbd mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sunrpc ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate dm_mirror dm_region_hash dm_log ppdev parport_pc parport sg i2c_piix4 i2c_core snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix ahci dm_mod [last unloaded: mperf]

Pid: 6682, comm: rm Tainted: P        W  ----------------   2.6.32-220.4.2.el6.78b37c2 #1 innotek GmbH VirtualBox
EIP: 0060:[<e089cc8e>] EFLAGS: 00010246 CPU: 2
EIP is at lnet_match_md+0x25e/0x410 [lnet]
EAX: dce57948 EBX: 0000000c ECX: 00000002 EDX: c9638700
ESI: ca675f40 EDI: 00000001 EBP: c9dc36f0 ESP: c9dc36a8
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068

Process rm (pid: 6682, ti=c9dc2000 task=cb486ab0 task.ti=c9dc2000)
Stack:
 00000000 00090000 00003039 00000240 00000000 f3b57d0e 0004fdfe ca675f40
<0> ca86e200 c9dc377c c9dc3778 dce57948 00000002 00090000 dce57940 ca86e200
<0> 00003039 e08c5b38 c9dc378c e08a7977 00000000 00090000 00003039 00000240
Call Trace:
 [<e08a7977>] ? lnet_parse+0xdd7/0x1f40 [lnet]
 [<c05fe095>] ? string+0x35/0xd0
 [<c05ff38a>] ? vsnprintf+0x2ea/0x3f0
 [<e08a92c6>] ? lolnd_send+0x36/0xb0 [lnet]
 [<e089f087>] ? lnet_ni_send+0x37/0xf0 [lnet]
 [<e08a300d>] ? lnet_send+0x3bd/0xe20 [lnet]
 [<e08a6052>] ? LNetPut+0x332/0xe80 [lnet]
 [<e18c32b0>] ? ptl_send_buf+0x2a0/0x940 [ptlrpc]
 [<e18c77a6>] ? ptl_send_rpc+0x686/0xfa0 [ptlrpc]
 [<e18b7a0e>] ? ptlrpc_send_new_req+0x50e/0xa70 [ptlrpc]
 [<e18be38e>] ? ptlrpc_set_wait+0x5ee/0xac0 [ptlrpc]
 [<e18b7fd9>] ? ptlrpc_set_add_req+0x69/0xe0 [ptlrpc]
 [<e18be8e8>] ? ptlrpc_queue_wait+0x88/0x330 [ptlrpc]
 [<e188fcd2>] ? ldlm_cli_enqueue+0x402/0x960 [ptlrpc]
 [<e0a8e712>] ? mdc_enter_request+0x22/0x190 [mdc]
 [<e18924c0>] ? ldlm_completion_ast+0x0/0xa10 [ptlrpc]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0a93917>] ? mdc_enqueue+0x327/0x1330 [mdc]
 [<e0a94b6c>] ? mdc_intent_lock+0x24c/0x7cc [mdc]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e18924c0>] ? ldlm_completion_ast+0x0/0xa10 [ptlrpc]
 [<e0a94920>] ? mdc_intent_lock+0x0/0x7cc [mdc]
 [<e0b94625>] ? md_intent_lock+0x115/0x5c0 [lmv]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0b982f4>] ? lmv_intent_open+0x364/0x1330 [lmv]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0b995bb>] ? lmv_intent_lock+0x2fb/0x4a0 [lmv]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0b992c0>] ? lmv_intent_lock+0x0/0x4a0 [lmv]
 [<e0e1f64f>] ? md_intent_lock.clone.0+0xff/0x5b0 [lustre]
 [<e0e72750>] ? ll_md_blocking_ast+0x0/0x9a0 [lustre]
 [<e0e21119>] ? ll_revalidate_it+0x189/0x1140 [lustre]
 [<e0e7889c>] ? ll_convert_intent+0xdc/0x310 [lustre]
 [<e0e2220b>] ? ll_revalidate_nd+0x13b/0x510 [lustre]
 [<c0536db0>] ? do_lookup+0x40/0x180
 [<c05375a3>] ? __link_path_walk+0x5e3/0xd60
 [<c051cc9d>] ? kmem_cache_alloc+0xfd/0x110
 [<c0537f31>] ? path_walk+0x51/0xc0
 [<c05380b9>] ? do_path_lookup+0x59/0x90
 [<c0538e0c>] ? do_filp_open+0xdc/0xb00
 [<c05057d1>] ? handle_mm_fault+0x131/0x1d0
 [<c05286a8>] ? do_sys_open+0x58/0x130
 [<c04afebc>] ? audit_syscall_entry+0x21c/0x240
 [<c05287b1>] ? sys_openat+0x31/0x50
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28
Code: 01 74 22 b8 0c 28 8c e0 c7 05 1c 28 8c e0 00 00 04 00 e8 76 b3 f7 ff 8d b6 00 00 00 00 83 c1 08 e9 4f ff ff ff 8b 4d e8 8b 45 e4 <8b> 11 39 c1 74 0c 89 c8 89 55 e8 e9 53 ff ff ff 66 90 83 ff 02 
EIP: [<e089cc8e>] lnet_match_md+0x25e/0x410 [lnet] SS:ESP 0068:c9dc36a8
CR2: 0000000000000002

The test was at

[...]
performing setfacl...
[3] $ mkdir d -- ok
[4] $ chown bin:bin d -- ok
[5] $ cd d -- ok
[7] $ su bin -- ok
[8] $ sg bin -- [(1,0)(1 1 2 3,1 1 2 3)]ok
[9] $ umask 027 -- ok
[10] $ touch g -- ok
[11] $ ls -dl g | awk '{print $1}' -- ok
[14] $ setfacl -m m:- g -- ok
[15] $ ls -dl g | awk '{print $1}' -- ok
[18] $ getfacl g -- ok
[28] $ setfacl -x m g -- ok
[29] $ getfacl g -- ok
[38] $ setfacl -m u:daemon:rw g -- ok
[39] $ getfacl g -- ok
[50] $ setfacl -m u::rwx,g::r-x,o:- g -- ok
[51] $ getfacl g -- ok
[62] $ setfacl -m u::rwx,g::r-x,o:-,m:- g -- ok
[63] $ getfacl g -- ok
[74] $ setfacl -m u::rwx,g::r-x,o:-,u:root:-,m:- g -- ok
[75] $ getfacl g -- ok
[87] $ setfacl -m u::rwx,g::r-x,o:-,u:root:-,m:- g -- ok
[88] $ getfacl g -- ok
[100] $ setfacl -m u::rwx,g::r-x,o:-,u:root:- g -- ok
[101] $ getfacl g -- ok
[113] $ setfacl --test -x u: g -- ok
[116] $ setfacl --test -x u:x -- ok
[119] $ setfacl -m d:u:root:rwx g -- ok
[122] $ setfacl -x m g -- ok
[129] $ mkdir d -- ok
[130] $ setfacl --test -m u::rwx,u:bin:rwx,g::r-x,o::--- d -- ok
[133] $ setfacl --test -m u::rwx,u:bin:rwx,g::r-x,m::---,o::--- d -- ok
[136] $ setfacl --test -d -m u::rwx,u:bin:rwx,g::r-x,o::--- d -- ok
[139] $ setfacl --test -d -m u::rwx,u:bin:rwx,g::r-x,m::---,o::--- d -- ok
[142] $ su -- ok
[143] $ cd .. -- ok
[144] $ rm -r d -- 


 Comments   
Comment by Liang Zhen (Inactive) [ 19/Jun/12 ]

Doug, could you help me on this? I will look into it too

Comment by Li Wei (Inactive) [ 19/Jun/12 ]

Unfortunately, kdump didn't take over for some reason. So that's pretty much all information I have.

Comment by Doug Oucharek (Inactive) [ 22/Jun/12 ]

After code inspecting lnet_match_md() and any other code which could mess with the MD/ME lists, I cannot see any way for a NULL pointer reference.

Also, the call trace seems strange to me with the "string" and "vsnprintf" calls between lnet_parse() and lolnd_send(). I'm suspecting stack corruption so the NULL pointer may not have even happened in lnet_match_md().

I will try to run the sanity test myself to see if I can gather more info. However: I can see from Liang's changes which have landed to master that all of this code has changed (lnet_match_md() does not even exist anymore). Should we just retest when those changes come through?

Comment by Li Wei (Inactive) [ 22/Jun/12 ]

I'm fine with leaving it there until I get the chance to try latest master next week. After all, this appeared only once on my VM.

Comment by Jodi Levi (Inactive) [ 05/Jul/12 ]

Please reassign as appropriate when you have more information no this bug.

Generated at Sat Feb 10 01:17:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.