[LU-7866] BUG: unable to handle kernel NULL pointer dereference at (null) Created: 11/Mar/16 Updated: 30/Jan/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | patch, soak | ||
| Environment: |
lola |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Error occurred during soak testing of build '20160309' (b2_8 RC5) (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309 also). DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration) Sequence of events:
Attached client (lola-32) message, console and vmcore-dmesg.txt file. |
| Comments |
| Comment by Frank Heckes (Inactive) [ 11/Mar/16 ] |
|
Crash file has been saved to lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7866/lola-32/127.0.0.1-2016-03-11-03\:55\:54 |
| Comment by Oleg Drokin [ 11/Mar/16 ] |
|
The actual problem here is two lines above the crash:

<3>LustreError: 201682:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `mdt_body' (2 of 1) in format `LDLM_INTENT_OPEN': 0 vs. 216 (server)
<3>  req@ffff880e244a90c0 x1528368254991312/t489629526959(489629526959) o101->soaked-MDT0001-mdc-ffff88082dd02c00@192.168.1.108@o2ib10:12/10 lens 840/192 e 1 to 0 dl 1457697328 ref 2 fl Complete:R/4/0 rc -107/-107

This then causes the crash because:

void ll_open_cleanup(struct super_block *sb, struct ptlrpc_request *open_req)
{
struct mdt_body *body;
struct md_op_data *op_data;
struct ptlrpc_request *close_req = NULL;
struct obd_export *exp = ll_s2sbi(sb)->ll_md_exp;
ENTRY;
body = req_capsule_server_get(&open_req->rq_pill, &RMF_MDT_BODY); <=== Returns NULL due to message above
OBD_ALLOC_PTR(op_data);
if (op_data == NULL) {
CWARN("%s: cannot allocate op_data to release open handle for "
DFID"\n",
ll_get_fsname(sb, NULL, 0), PFID(&body->mbo_fid1));
RETURN_EXIT;
}
op_data->op_fid1 = body->mbo_fid1; <==== Whoops!
|
| Comment by Peter Jones [ 11/Mar/16 ] |
|
Hongchao, can you please look into this? Oleg has suggested that you should 1) add error handling. Thanks, Peter
| Comment by Oleg Drokin [ 11/Mar/16 ] |
|
Frank - are these builds using the RC5 RPMs, or do you self-build them from the tip of whatever branch? Because it says "build '20160302'", but if you self-build, that does not help us and you need to put the debuginfo vmlinux and debuginfo modules alongside the crashdumps.
| Comment by Frank Heckes (Inactive) [ 14/Mar/16 ] |
|
My apologies, there's a typo in the description field above (I corrected it). The build under test was b2_8 RC5 and had been downloaded from the Jenkins job lustre-2_8. The debuginfo RPMs can be found at lhn.lola.hpdd.intel.com:/scratch/rpms/20160309/notinstalled/server/x86_64. |
| Comment by Hongchao Zhang [ 14/Mar/16 ] |
|
The related request has actually failed with -ENOTCONN (-107), so there are no reply fields (the reply's bufcount is 1, only contains |
| Comment by Hongchao Zhang [ 15/Mar/16 ] |
|
Status update: it's not clear where the request was modified; it could have been changed by the replay, but there is no "P" flag in the request, |
| Comment by Hongchao Zhang [ 15/Mar/16 ] |
|
Hi Frank, |
| Comment by Hongchao Zhang [ 16/Mar/16 ] |
|
The issue can be reproduced if the request is replayed but fails between ll_prep_inode and ll_open_cleanup. |
| Comment by Frank Heckes (Inactive) [ 16/Mar/16 ] |
|
Yes, to extract the debug log from the crash file. For log information on the MDS side (MDT000{0,1}) I'm going to check, too, and will attach the files here. |
| Comment by Frank Heckes (Inactive) [ 16/Mar/16 ] |
|
Odd, I decompressed the Lustre client 'normal' kernel and the debuginfo kernel; both returned the following:

[root@lola-16 crash_lustre]# crash /scratch/crashdumps/lu-7866/lola-32/127.0.0.1-2016-03-11-03\:55\:54/vmcore /tmp/vmlinux-2.6.32-504.30.3.el6.x86_64

crash 6.1.0-6.el6_6
Copyright (C) 2002-2012  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: /tmp/vmlinux-2.6.32-504.30.3.el6.x86_64: no .gnu_debuglink section
crash: /tmp/vmlinux-2.6.32-504.30.3.el6.x86_64: no debugging data available

Path to Oleg's lustre.so is configured via ~/.crashrc. Any ideas? |
| Comment by Hongchao Zhang [ 23/Mar/16 ] |
|
The issue should be caused by a race between the replay and the normal processing of the open request. Thread 1: Thread 2: Then MDT1 was failed over and recovery was initiated; the LDLM_ENQUEUE request in Thread 1 was replayed but failed |
| Comment by Gerrit Updater [ 31/Mar/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/19256 |