[LU-1102] NULL pointer dereference in capa_encrypt_id+0x8b/0x3e0 Created: 14/Feb/12 Updated: 30/Apr/14 Resolved: 24/May/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | paj | ||
| Severity: | 3 |
| Rank (Obsolete): | 4622 |
| Description |
|
We hit a NULL pointer dereference this morning which brought down our dual-purpose MDS/OSS node for a 2.1 cluster of ours. The console reports the following:
2012-02-14 07:01:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 07:01:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15632468 previous similar messages
2012-02-14 07:11:06 LustreError: 13499:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 07:11:06 LustreError: 13499:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15700830 previous similar messages
2012-02-14 07:21:06 LustreError: 17613:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 07:21:06 LustreError: 17613:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15687688 previous similar messages
2012-02-14 07:30:09 Intel AES-NI instructions are not detected.
2012-02-14 07:30:09 Intel AES-NI instructions are not detected.
2012-02-14 07:30:09 padlock: VIA PadLock not detected.
2012-02-14 07:30:09 BUG: unable to handle kernel NULL pointer dereference at 000000000000004e
2012-02-14 07:30:09 IP: [<ffffffffa04fbc9b>] capa_encrypt_id+0x8b/0x3e0 [obdclass]
2012-02-14 07:30:09 PGD 608f7a067 PUD 62cccd067 PMD 0
2012-02-14 07:30:09 Oops: 0000 [#1] SMP
2012-02-14 07:30:09 last sysfs file: /sys/module/cryptd/initstate
2012-02-14 07:30:09 CPU 5
2012-02-14 07:30:09 Modules linked in: aesni_intel(-) cryptd aes_x86_64 aes_generic cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) sd_mod crc_t10dif ib_srp scsi_transport_srp scsi_tgt ko2iblnd(U) lnet(U) libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_intel kvm sg sr_mod cdrom mpt2sas scsi_transport_sas raid_class serio_raw i2c_i801 i2c_core ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: scsi_wait_scan]
2012-02-14 07:30:10
2012-02-14 07:30:10 Pid: 14435, comm: mdt_23 Not tainted 2.6.32-220.4.1.1chaos.ch5.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
2012-02-14 07:30:10 RIP: 0010:[<ffffffffa04fbc9b>] [<ffffffffa04fbc9b>] capa_encrypt_id+0x8b/0x3e0 [obdclass]
2012-02-14 07:30:10 RSP: 0018:ffff8806086f3880 EFLAGS: 00010282
2012-02-14 07:30:10 RAX: fffffffffffffffe RBX: fffffffffffffffe RCX: 0000000000000000
2012-02-14 07:30:10 RDX: 000000000000001c RSI: 0000000000000286 RDI: 0000000000000286
2012-02-14 07:30:10 RBP: ffff8806086f3960 R08: 0000000000000000 R09: ffff88035d73e000
2012-02-14 07:30:10 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000038
2012-02-14 07:30:10 R13: ffff8806086f3990 R14: ffff8805edb45348 R15: ffff8806086f39a0
2012-02-14 07:30:10 FS: 00002aaaab06eb20(0000) GS:ffff88034ac20000(0000) knlGS:0000000000000000
2012-02-14 07:30:10 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-02-14 07:30:10 CR2: 000000000000004e CR3: 00000005ebea9000 CR4: 00000000000006e0
2012-02-14 07:30:10 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-02-14 07:30:10 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-02-14 07:30:10 Process mdt_23 (pid: 14435, threadinfo ffff8806086f2000, task ffff88060893cb00)
2012-02-14 07:30:10 Stack:
2012-02-14 07:30:10 0000000000000004 0000000000000000 ffff8806086f3900 ffffffff8130b695
2012-02-14 07:30:10 <0> 0000000000000004 ffffffff81af6e08 0000000000000000 ffffffffa0a40000
2012-02-14 07:30:10 <0> ffff8806086f38d0 0000000022e6dabb ffff8806086f3930 0000000000000004
2012-02-14 07:30:10 Call Trace:
2012-02-14 07:30:10 [<ffffffff8130b695>] ? extract_entropy+0xe5/0x140
2012-02-14 07:30:10 [<ffffffffa0a40000>] ? ftrace_raw_event_ldiskfs_mb_release_group_pa+0x50/0xd0 [ldiskfs]
2012-02-14 07:30:10 [<ffffffffa0cb5f0d>] osd_capa_get+0x2cd/0x610 [osd_ldiskfs]
2012-02-14 07:30:10 [<ffffffff8119a417>] ? generic_getxattr+0x87/0x90
2012-02-14 07:30:10 [<ffffffffa0bd8ca0>] mdd_capa_get+0xa0/0x2c0 [mdd]
2012-02-14 07:30:10 [<ffffffffa0ce3ced>] cml_capa_get+0x6d/0x180 [cmm]
2012-02-14 07:30:10 [<ffffffffa0c1e270>] mo_capa_get+0x30/0x70 [mdt]
2012-02-14 07:30:10 [<ffffffffa0c29a81>] mdt_getattr_internal+0x6a1/0xc20 [mdt]
2012-02-14 07:30:10 [<ffffffffa0c2f3a2>] mdt_getattr_name_lock+0xb52/0x1540 [mdt]
2012-02-14 07:30:10 [<ffffffffa065000b>] ? __req_capsule_get+0x15b/0x5a0 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa062f524>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa0c301dd>] mdt_intent_getattr+0x24d/0x3c0 [mdt]
2012-02-14 07:30:10 [<ffffffffa0c2dda9>] mdt_intent_policy+0x2d9/0x550 [mdt]
2012-02-14 07:30:10 [<ffffffffa0398b6f>] ? cfs_hash_bd_from_key+0x3f/0xc0 [libcfs]
2012-02-14 07:30:10 [<ffffffffa05f6ac2>] ldlm_lock_enqueue+0x272/0x7e0 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa0615206>] ldlm_handle_enqueue0+0x406/0xd70 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa0c2d94a>] mdt_enqueue+0x4a/0x100 [mdt]
2012-02-14 07:30:10 [<ffffffffa0c2674d>] mdt_handle_common+0x73d/0x12b0 [mdt]
2012-02-14 07:30:10 [<ffffffffa062f334>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa0c27395>] mdt_regular_handle+0x15/0x20 [mdt]
2012-02-14 07:30:10 [<ffffffffa063b181>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa063a4b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-14 07:30:10 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-02-14 07:30:10 [<ffffffffa063a4b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-14 07:30:10 [<ffffffffa063a4b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-14 07:30:10 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-02-14 07:30:10 Code: 05 0d dc ea ff 02 0f 85 44 01 00 00 48 8d 7d 80 ba 0f 00 00 00 be 04 00 00 00 e8 51 93 d3 e0 48 85 c0 48 89 c3 0f 84 e3 02 00 00 <48> 8b 40 50 8b 90 e0 00 00 00 41 39 d4 0f 83 a2 00 00 00 c1 e2
2012-02-14 07:30:10 RIP [<ffffffffa04fbc9b>] capa_encrypt_id+0x8b/0x3e0 [obdclass]
2012-02-14 07:30:10 RSP <ffff8806086f3880>
2012-02-14 07:30:10 CR2: 000000000000004e
There were also the following messages every hour for a few hours prior to the crash:
2012-02-14 06:01:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 06:01:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15581891 previous similar messages
2012-02-14 06:01:06 Feb 14 06:01:06 sumom32 kernel: LsrErr 72::fle_aa.c16fle_uhcp()qoc0040 ocpblt asbe asd<>utero:1560(itr_aac16fle_uhcpa) kpe 5881rvossmlrmsae
2012-02-14 06:11:06 LustreError: 17640:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 06:11:06 LustreError: 17640:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15585479 previous similar messages
2012-02-14 06:21:06 LustreError: 17718:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 06:21:06 LustreError: 17718:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15538316 previous similar messages
2012-02-14 06:31:06 LustreError: 17562:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 06:31:06 LustreError: 17562:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15628102 previous similar messages
2012-02-14 06:41:06 LustreError: 17746:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 06:41:06 LustreError: 17746:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15556085 previous similar messages
2012-02-14 06:51:06 LustreError: 17640:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 06:51:06 LustreError: 17640:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15610175 previous similar messages
2012-02-14 05:01:06 LustreError: 17613:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:01:06 LustreError: 17613:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15608069 previous similar messages
2012-02-14 05:01:06 Feb 14 05:01:06 sumom32 kernel: utero:1630(itrca:4:itrat_aa) eq/oc004:n aaiiyhsbe asd<>utero: 71::fle_aac16fle__aa) kpe 5009peiossmlrmsae
2012-02-14 05:11:06 LustreError: 17560:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:11:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:11:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15645006 previous similar messages
2012-02-14 05:11:06 LustreError: 17560:0:(filter_capa.c:146:filter_auth_capa()) Skipped 671 previous similar messages
2012-02-14 05:21:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:21:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15594099 previous similar messages
2012-02-14 05:31:06 LustreError: 17618:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:31:06 LustreError: 17618:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15582323 previous similar messages
2012-02-14 05:31:06 Feb 14 05:31:06 sumom32 kernel: utero:1560(itrcp.:4:itrat_aa) kpe 5406peiu iia esgs<>utero:1500:(itrcp.:4:itrat_aa) kpe 7 rvossml esgs3>utero:1680(itrcpa.:4:itrat_aa) e/p /x0 ocpblt a enpse
2012-02-14 05:41:06 LustreError: 17802:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:41:06 LustreError: 17802:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15637350 previous similar messages
2012-02-14 05:51:06 LustreError: 13417:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 05:51:06 LustreError: 13417:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15638302 previous similar messages
2012-02-14 04:01:06 LustreError: 17718:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 04:01:06 LustreError: 17718:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15621761 previous similar messages
2012-02-14 04:11:06 LustreError: 17560:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 04:11:06 LustreError: 17560:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15658332 previous similar messages
2012-02-14 04:21:06 LustreError: 13466:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 04:21:06 LustreError: 13466:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15618610 previous similar messages
2012-02-14 04:31:06 LustreError: 17467:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 04:31:06 LustreError: 17467:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15664435 previous similar messages
2012-02-14 04:41:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 04:41:06 LustreError: 17526:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15568067 previous similar messages
2012-02-14 04:51:06 LustreError: 17467:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
2012-02-14 04:51:06 LustreError: 17467:0:(filter_capa.c:146:filter_auth_capa()) Skipped 15626923 previous similar messages |
| Comments |
| Comment by Oleg Drokin [ 14/Feb/12 ] |
|
Did you enable lustre capabilities somehow? |
| Comment by Prakash Surya (Inactive) [ 14/Feb/12 ] |
|
This is the first I've heard of lustre capabilities so I would assume that we are not explicitly enabling it. Is it enabled by default in 2.1? |
| Comment by Christopher Morrone [ 14/Feb/12 ] |
|
We certainly didn't intend to be using capabilities, which definitely made me wonder why we are seeing those messages. |
| Comment by Oleg Drokin [ 14/Feb/12 ] |
|
No, it's not enabled by default, but you somehow managed to enable it. |
| Comment by Prakash Surya (Inactive) [ 14/Feb/12 ] |
|
Yea, just checked the capa flags using lctl and they are enabled. I'm not quite sure why that's the case though, I don't think we did that intentionally. |
| Comment by Christopher Morrone [ 14/Feb/12 ] |
|
We checked obdfilter's and mdt's "capa" file. For obdfilter it reports:
capability on: oss
and mdt reports:
capability on: oss mds
As far as I can tell, this is the default in the 2.1 code. The initializer functions just set the fields "= 1" that the capa proc functions are printing out. |
| Comment by Christopher Morrone [ 14/Feb/12 ] |
|
It looks to me like Fan Yong turned them on by default in commit 79923ef0316c07b09891fb9b5bb31b4009f9731e. And the commit message doesn't imply to me at all that something like that was happening. Supposedly that commit has something to do with ORNL-3, which I can't access. |
| Comment by Christopher Morrone [ 14/Feb/12 ] |
|
Oh, no, sorry, the mdt lines just moved around, that wasn't the original commit that turned them on. |
| Comment by Oleg Drokin [ 15/Feb/12 ] |
|
Hm, strange. otherwise it should not be enabled still. |
| Comment by Prakash Surya (Inactive) [ 15/Feb/12 ] |
|
Would that show up on the client or the servers? Or both? I'm looking on the servers, grepping for that string, and it's not coming up. |
| Comment by Oleg Drokin [ 15/Feb/12 ] |
|
should show up on clients |
| Comment by Peter Jones [ 22/Feb/12 ] |
|
I just wanted to check in on this issue. What is the present status from the LLNL side? Is this still a priority? |
| Comment by Christopher Morrone [ 24/Feb/12 ] |
|
It is probably not "blocker" level, but definitely still high priority. Especially since capa may be incorrectly enabled by default in 2.1. Oleg: We only have 1.8 on the client side at this point, so we would not see those messages from clients at mount time. And frankly that is probably one of the stupid console messages that needs to be removed from 2.1. |
| Comment by Oleg Drokin [ 27/Feb/12 ] |
|
Hm, if you only had 1.8 clients, then the capa connect bit should never be set as well and the code should not be triggered. I wonder if you have a system dump from that crash and can find this particular export and see what's inside it? |
| Comment by Prakash Surya (Inactive) [ 27/Feb/12 ] |
|
It appears that we do have a dump from this crash. I don't have permissions to access it yet, though. |
| Comment by Oleg Drokin [ 27/Feb/12 ] |
|
Bobi, can you please hunt down where the CAPA flags are removed on either the client or the server, which leads to capabilities never getting enabled for any exports even if both client and server are 2.x and have the ability to do capabilities? |
| Comment by Zhenyu Xu [ 27/Feb/12 ] |
|
The lproc capa switch is used to enable/disable the MDT/OST device's capability, and the MDS/OSS always uses the export's connection flag (OBD_CONNECT_MDS_CAPA/OBD_CONNECT_OSS_CAPA) plus the MDT/OST device's capability to decide whether to run the capa get code path. Also, capa_encrypt_id() can only be called with a remote client (OBD_CONNECT_RMT_CLIENT) connection, and 1.8 clients do not have these 3 flags in their connection request. These connection flag values are:
under 1.8
#define OBD_CONNECT_RMT_CLIENT 0x10000ULL /*Remote client */
#define OBD_CONNECT_MDS_CAPA 0x100000ULL /*MDS capability */
#define OBD_CONNECT_OSS_CAPA 0x200000ULL /*OSS capability */
under 2.1
#define OBD_CONNECT_RMT_CLIENT 0x10000ULL /*Remote client */
#define OBD_CONNECT_MDS_CAPA 0x100000ULL /*MDS capability */
#define OBD_CONNECT_OSS_CAPA 0x200000ULL /*OSS capability */
They are consistent. |
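The decision described in the comment above can be sketched in a few lines of C. This is only an illustration of the stated rule, not the Lustre code: the flag values are the ones quoted in the comment, while the helper names (capa_get_path_taken, capa_encrypt_reachable, capa_flags_self_check) and the idea of combining the two conditions into one predicate are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as quoted in the comment (identical in 1.8 and 2.1). */
#define OBD_CONNECT_RMT_CLIENT 0x10000ULL  /* Remote client */
#define OBD_CONNECT_MDS_CAPA   0x100000ULL /* MDS capability */
#define OBD_CONNECT_OSS_CAPA   0x200000ULL /* OSS capability */

/*
 * Hypothetical helper: the capa get path runs only when the device-level
 * switch (the lproc "capa" file) is on AND the export negotiated the
 * matching connect flag (MDS_CAPA for an MDT, OSS_CAPA for an OST).
 */
static bool capa_get_path_taken(uint64_t export_flags, uint64_t capa_flag,
                                bool device_capa_on)
{
    return device_capa_on && (export_flags & capa_flag) != 0;
}

/*
 * Hypothetical helper: per the comment, capa_encrypt_id() is only reached
 * for a remote client connection, so that flag is required as well.
 */
static bool capa_encrypt_reachable(uint64_t export_flags, bool device_capa_on)
{
    return capa_get_path_taken(export_flags, OBD_CONNECT_MDS_CAPA,
                               device_capa_on) &&
           (export_flags & OBD_CONNECT_RMT_CLIENT) != 0;
}

/* Small usage example covering the cases discussed in this ticket. */
static void capa_flags_self_check(void)
{
    /* A 1.8 client sends none of the three flags: nothing should trigger,
     * even though the device switch defaults to "on". */
    assert(!capa_get_path_taken(0, OBD_CONNECT_MDS_CAPA, true));
    assert(!capa_encrypt_reachable(0, true));

    /* Capability flag without the remote-client flag: capa path runs,
     * but the encryption routine is not expected to be reached. */
    assert(capa_get_path_taken(OBD_CONNECT_MDS_CAPA, OBD_CONNECT_MDS_CAPA, true));
    assert(!capa_encrypt_reachable(OBD_CONNECT_MDS_CAPA, true));

    /* Device switch off overrides whatever the export negotiated. */
    assert(!capa_encrypt_reachable(OBD_CONNECT_MDS_CAPA | OBD_CONNECT_RMT_CLIENT,
                                   false));
}
```

Under this reading, a server with only 1.8 clients should never reach capa_encrypt_id(), which is why the crash reported here was unexpected.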
| Comment by Bruno Faccini (Inactive) [ 28/Apr/12 ] |
|
At CEA/T100, we experienced the same crash on one MDS in a cluster with all clients/servers running only Lustre 2.0.0.1. The panicking thread's stack looks the same, and we also had the same preceding messages "Intel AES-NI instructions are not detected/padlock: VIA PadLock not detected". Crash-dump analysis along with Lustre/kernel source code reading seems to indicate that in capa_encrypt_id(), checking only for a NULL return from ll_crypto_alloc_blkcipher() to detect a failure is wrong, since the return value is a -ERRNO (-ENOENT/-2 in our case, likely due to the errors/messages in loading any available encryption modules/methods ...). So, even though I have not yet investigated whether this "capability encryption" scenario is valid in our config/environment, I think that, at least to prevent the Oops, the check in capa_encrypt_id() (and in all other routines like keyblock_init()/capa_[en,de]crypt_id()/gss_[un]wrap_kerberos() calling ll_crypto_alloc_blkcipher()/crypto_alloc_blkcipher() ...) should be changed to something like the following to comply with kernel/module return values:
|
| Comment by Bruno Faccini (Inactive) [ 28/Apr/12 ] |
|
Oops, sorry for not using a good format for my previous fix/patch proposal, so "same player shoot again": here is the way I think the ll_crypto_alloc_blkcipher()/crypto_alloc_blkcipher() return value must be changed/checked in all calling routines:
[root@gaia1 lustre-2.0.0.1]# pwd
/root/rpmbuild/BUILD/lustre-2.0.0.1
[root@gaia1 lustre-2.0.0.1]# diff -urN lustre/obdclass/capa.c.orig lustre/obdclass/capa.c.bfi
--- lustre/obdclass/capa.c.orig 2012-04-28 16:13:39.427591179 +0200
+++ lustre/obdclass/capa.c.bfi 2012-04-28 16:14:18.997589562 +0200
@@ -289,7 +289,7 @@
         /* passing "aes" in a variable instead of a constant string keeps gcc
          * 4.3.2 happy */
         tfm = ll_crypto_alloc_blkcipher(alg, 0, 0 );
-        if (tfm == NULL) {
+        if (IS_ERR(tfm)) {
                 CERROR("failed to load transform for aes\n");
                 RETURN(-EFAULT);
         }
[root@gaia1 lustre-2.0.0.1]#
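To make the reasoning behind this patch concrete, here is a small user-space sketch of the kernel's error-pointer convention. The ERR_PTR/PTR_ERR/IS_ERR helpers below are stand-ins written for this example (modelled on the kernel's include/linux/err.h), and fake_alloc_blkcipher() is a hypothetical stub, not the real allocator. The 0x50 offset is read from the oops in the description: the faulting instruction bytes `<48> 8b 40 50` decode to a load from 0x50(%rax), and RAX held fffffffffffffffe (that is, -2).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True when ptr falls in the top MAX_ERRNO addresses, i.e. encodes -errno. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stub: fails the way the crash dump suggests, by returning
 * -ENOENT encoded as a pointer rather than returning NULL. */
static void *fake_alloc_blkcipher(void)
{
    return ERR_PTR(-ENOENT);
}

/* Address that a load at the given offset from ptr would touch. */
static uintptr_t deref_address(const void *ptr, uintptr_t offset)
{
    return (uintptr_t)ptr + offset;
}

static void capa_is_err_demo(void)
{
    void *tfm = fake_alloc_blkcipher();

    /* The old "tfm == NULL" test does not catch the failure... */
    assert(tfm != NULL);
    /* ...while IS_ERR() does, and PTR_ERR() recovers the errno. */
    assert(IS_ERR(tfm));
    assert(PTR_ERR(tfm) == -ENOENT);
    /* -2 + 0x50 wraps to 0x4e, the CR2 value reported in the oops. */
    assert(deref_address(tfm, 0x50) == 0x4e);
}
```

This matches the reported fault address: with ENOENT equal to 2, dereferencing the unchecked error pointer at offset 0x50 lands on 000000000000004e, which the kernel reports as a NULL pointer dereference even though the pointer itself was not NULL.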
|
| Comment by Zhenyu Xu [ 09/May/12 ] |
|
master patch tracking at http://review.whamcloud.com/2703 |
| Comment by Zhenyu Xu [ 14/May/12 ] |
|
landed for 2.3.x |
| Comment by Susan Coulter [ 30/Apr/14 ] |
|
Is this bug and |
| Comment by Christopher Morrone [ 30/Apr/14 ] |
|
No, this is not the same as FYI, it is best not to ask questions in resolved tickets; the likelihood of a reply is much better on a fully open ticket. |