Lustre / LU-7866

BUG: unable to handle kernel NULL pointer dereference at (null)

Details


    Description

      Error occurred during soak testing of build '20160309' (b2_8 RC5); see also https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309. DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration.)

      Sequence of events:

      • mds_restart : 2016-03-11 03:41:05,597 - 2016-03-11 03:54:49,109 lola-8
      • 2016-03-11 03:56 Lustre client lola-32 crashed with the following error:
        <1>BUG: unable to handle kernel NULL pointer dereference at (null)
        <1>IP: [<ffffffffa0a0241f>] ll_open_cleanup+0xaf/0x600 [lustre]
        <4>PGD 38a867067 PUD 775372067 PMD 0
        <4>Oops: 0000 [#1] SMP
        <4>last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:06:00.0/infiniband_mad/umad0/port
        <4>CPU 1
        <4>Modules linked in: osc(U) mgc(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsi]
        <4>
        <4>Pid: 201682, comm: simul Not tainted 2.6.32-504.30.3.el6.x86_64 #1 Intel Corporation S2600GZ/S2600GZ
        <4>RIP: 0010:[<ffffffffa0a0241f>]  [<ffffffffa0a0241f>] ll_open_cleanup+0xaf/0x600 [lustre]
        <4>RSP: 0018:ffff8801d3f298e8  EFLAGS: 00010286
        <4>RAX: 0000000000000000 RBX: ffff8807f7271a00 RCX: ffff88102dd13ca0
        <4>RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff88102f395000
        <4>RBP: ffff8801d3f29928 R08: ffff88034f47e9c0 R09: 0000000000000000
        <4>R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000000
        <4>R13: ffff88082dd02c00 R14: ffff88081a5fc800 R15: ffff8801d3f29988
        <4>FS:  00007f72f89d3700(0000) GS:ffff880045e20000(0000) knlGS:0000000000000000
        <4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        <4>CR2: 0000000000000000 CR3: 00000003b89f2000 CR4: 00000000001407e0
        <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        <4>Process simul (pid: 201682, threadinfo ffff8801d3f28000, task ffff8804c6d73520)
        <4>Stack:
        <4> ffff8801d3f298f8 0000000000000000 ffff8801d3f29928 0000000000000001
        <4><d> ffff88082dd02c00 ffff88081a5fc800 ffff8809648898c0 ffff8801d3f29988
        <4><d> ffff8801d3f299f8 ffffffffa0a09c4a fffffffffffffffb 00ff880a531788c0
        <4>Call Trace:
        <4> [<ffffffffa0a09c4a>] ll_prep_inode+0x20a/0xc40 [lustre]
        <4> [<ffffffffa07804b2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
        <4> [<ffffffffa0a214f0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
        <4> [<ffffffffa0a21fe1>] ll_lookup_it_finish+0x321/0x12e0 [lustre]
        <4> [<ffffffffa0723dd0>] ? ldlm_lock_decref_internal+0x2e0/0xa80 [ptlrpc]
        <4> [<ffffffffa0569925>] ? class_handle2object+0x95/0x190 [obdclass]
        <4> [<ffffffff81174ab3>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
        <4> [<ffffffffa0a1e439>] ? ll_i2suppgid+0x19/0x30 [lustre]
        <4> [<ffffffffa0a1e47e>] ? ll_i2gids+0x2e/0xd0 [lustre]
        <4> [<ffffffffa0a214f0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
        <4> [<ffffffffa0a23226>] ll_lookup_it+0x286/0xda0 [lustre]
        <4> [<ffffffffa0a23dc9>] ll_lookup_nd+0x89/0x4f0 [lustre]
        <4> [<ffffffff8119e055>] do_lookup+0x1a5/0x230
        <4> [<ffffffff8119ece4>] __link_path_walk+0x7a4/0x1000
        <4> [<ffffffff8119f7fa>] path_walk+0x6a/0xe0
        <4> [<ffffffff8119fa0b>] filename_lookup+0x6b/0xc0
        <4> [<ffffffff8122daa6>] ? security_file_alloc+0x16/0x20
        <4> [<ffffffff811a0ee4>] do_filp_open+0x104/0xd20
        <4> [<ffffffff81063c63>] ? perf_event_task_sched_out+0x33/0x70
        <4> [<ffffffff8129943a>] ? strncpy_from_user+0x4a/0x90
        <4> [<ffffffff811ae392>] ? alloc_fd+0x92/0x160
        <4> [<ffffffff8118b157>] do_sys_open+0x67/0x130
        <4> [<ffffffff8118b260>] sys_open+0x20/0x30
        <4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
        <4>Code: ba 38 01 00 00 31 f6 e8 30 62 b6 ff f6 05 fd 6b a9 ff 10 74 0d 80 3d f0 6b a9 ff 00 0f 88 ba 01 00 00 48 85 db 0f 84 16 02 00 00 <49> 8b 04 24 48 89 03 49
        <1>RIP  [<ffffffffa0a0241f>] ll_open_cleanup+0xaf/0x600 [lustre]
        <4> RSP <ffff8801d3f298e8>
        <4>CR2: 0000000000000000
        

      Attached are the client's (lola-32) messages, console log and vmcore-dmesg.txt file.

      Attachments

        Activity


          hongchao.zhang Hongchao Zhang added a comment -

          The related request has actually failed with -ENOTCONN (-107), so there are no reply fields in the request (the reply's bufcount is 1 and it contains only PTLRPC_BODY). Still looking into why the failed request can reach ll_prep_inode/ll_open_cleanup.
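          For illustration only (not part of the ticket), a minimal sketch of the kind of guard this implies, using fields that do exist on struct ptlrpc_request; whether the eventual fix checks rq_status this way is an assumption:

                  /* Hypothetical helper: a request that failed on the server
                   * (rq_status < 0, e.g. -ENOTCONN) replies with only
                   * PTLRPC_BODY, so there are no MD reply fields to unpack. */
                  static inline bool reply_has_md_fields(struct ptlrpc_request *req)
                  {
                          return req != NULL && req->rq_repmsg != NULL &&
                                 req->rq_status == 0;
                  }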

          heckes Frank Heckes (Inactive) added a comment -

          My apologies, there's a typo in the description field above (I corrected it). The build under test was b2_8 RC5 and had been downloaded from the Jenkins job lustre-2_8. The debuginfo RPMs can be found at lhn.lola.hpdd.intel.com:/scratch/rpms/20160309/notinstalled/server/x86_64.

          green Oleg Drokin added a comment -

          Frank - are these builds using the RC5 RPMs, or do you self-build them from the tip of whatever branch? It says "build '20160302'", but if you self-build, that does not help us, and you need to put the debuginfo vmlinux and debuginfo modules alongside the crashdumps.

          pjones Peter Jones added a comment -

          Hongchao

          Can you please look into this? Oleg has suggested that you should

          1. add error handling
          2. figure out why the request reply does not have what we think it should have inside

          Thanks

          Peter

          green Oleg Drokin added a comment -

          The actual problem here is two lines above the crash:

          <3>LustreError: 201682:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `mdt_body' (2 of 1) in format `LDLM_INTENT_OPEN': 0 vs. 216 (server)
          <3>  req@ffff880e244a90c0 x1528368254991312/t489629526959(489629526959) o101->soaked-MDT0001-mdc-ffff88082dd02c00@192.168.1.108@o2ib10:12/10 lens 840/192 e 1 to 0 dl 1457697328 ref 2 fl Complete:R/4/0 rc -107/-107
          

          This then causes the crash because:

          void ll_open_cleanup(struct super_block *sb, struct ptlrpc_request *open_req)
          {
                  struct mdt_body                 *body;
                  struct md_op_data               *op_data;
                  struct ptlrpc_request           *close_req = NULL;
                  struct obd_export               *exp       = ll_s2sbi(sb)->ll_md_exp;
                  ENTRY;
          
                  body = req_capsule_server_get(&open_req->rq_pill, &RMF_MDT_BODY);  <=== Returns NULL due to message above
                  OBD_ALLOC_PTR(op_data);
                  if (op_data == NULL) {
                          CWARN("%s: cannot allocate op_data to release open handle for "
                                DFID"\n",
                                ll_get_fsname(sb, NULL, 0), PFID(&body->mbo_fid1));
          
                          RETURN_EXIT;
                  }
          
                  op_data->op_fid1 = body->mbo_fid1;       <==== Whoops!
          
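          A minimal sketch of the error handling this points to (an assumption for illustration, not the landed patch): have ll_open_cleanup() bail out when the reply carries no mdt_body instead of dereferencing NULL:

                  body = req_capsule_server_get(&open_req->rq_pill, &RMF_MDT_BODY);
                  if (body == NULL) {
                          /* the failed request (rc -107 above) replied without
                           * an mdt_body, so there is no open handle to release */
                          RETURN_EXIT;
                  }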

          heckes Frank Heckes (Inactive) added a comment -

          Crash file has been saved to lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7866/lola-32/127.0.0.1-2016-03-11-03\:55\:54

          People

            hongchao.zhang Hongchao Zhang
            heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 6
