Details
- Type: Bug
- Resolution: Unresolved
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.8.0
- Environment: lola; build: https://build.hpdd.intel.com/job/lustre-b2_8/12/
- Severity: 3
Description
Error occurred during soak testing of build '20160309' (b2_8 RC5) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)
Sequence of events:
- mds_restart : 2016-03-11 03:41:05,597 - 2016-03-11 03:54:49,109 lola-8
- 2016-03-11 03:56 Lustre client lola-32 crashed with the following error:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa0a0241f>] ll_open_cleanup+0xaf/0x600 [lustre]
PGD 38a867067 PUD 775372067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:06:00.0/infiniband_mad/umad0/port
CPU 1
Modules linked in: osc(U) mgc(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsi]
Pid: 201682, comm: simul Not tainted 2.6.32-504.30.3.el6.x86_64 #1 Intel Corporation S2600GZ/S2600GZ
RIP: 0010:[<ffffffffa0a0241f>] [<ffffffffa0a0241f>] ll_open_cleanup+0xaf/0x600 [lustre]
RSP: 0018:ffff8801d3f298e8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8807f7271a00 RCX: ffff88102dd13ca0
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff88102f395000
RBP: ffff8801d3f29928 R08: ffff88034f47e9c0 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88082dd02c00 R14: ffff88081a5fc800 R15: ffff8801d3f29988
FS: 00007f72f89d3700(0000) GS:ffff880045e20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000003b89f2000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process simul (pid: 201682, threadinfo ffff8801d3f28000, task ffff8804c6d73520)
Stack:
 ffff8801d3f298f8 0000000000000000 ffff8801d3f29928 0000000000000001
 ffff88082dd02c00 ffff88081a5fc800 ffff8809648898c0 ffff8801d3f29988
 ffff8801d3f299f8 ffffffffa0a09c4a fffffffffffffffb 00ff880a531788c0
Call Trace:
 [<ffffffffa0a09c4a>] ll_prep_inode+0x20a/0xc40 [lustre]
 [<ffffffffa07804b2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [<ffffffffa0a214f0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
 [<ffffffffa0a21fe1>] ll_lookup_it_finish+0x321/0x12e0 [lustre]
 [<ffffffffa0723dd0>] ? ldlm_lock_decref_internal+0x2e0/0xa80 [ptlrpc]
 [<ffffffffa0569925>] ? class_handle2object+0x95/0x190 [obdclass]
 [<ffffffff81174ab3>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
 [<ffffffffa0a1e439>] ? ll_i2suppgid+0x19/0x30 [lustre]
 [<ffffffffa0a1e47e>] ? ll_i2gids+0x2e/0xd0 [lustre]
 [<ffffffffa0a214f0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
 [<ffffffffa0a23226>] ll_lookup_it+0x286/0xda0 [lustre]
 [<ffffffffa0a23dc9>] ll_lookup_nd+0x89/0x4f0 [lustre]
 [<ffffffff8119e055>] do_lookup+0x1a5/0x230
 [<ffffffff8119ece4>] __link_path_walk+0x7a4/0x1000
 [<ffffffff8119f7fa>] path_walk+0x6a/0xe0
 [<ffffffff8119fa0b>] filename_lookup+0x6b/0xc0
 [<ffffffff8122daa6>] ? security_file_alloc+0x16/0x20
 [<ffffffff811a0ee4>] do_filp_open+0x104/0xd20
 [<ffffffff81063c63>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8129943a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811ae392>] ? alloc_fd+0x92/0x160
 [<ffffffff8118b157>] do_sys_open+0x67/0x130
 [<ffffffff8118b260>] sys_open+0x20/0x30
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: ba 38 01 00 00 31 f6 e8 30 62 b6 ff f6 05 fd 6b a9 ff 10 74 0d 80 3d f0 6b a9 ff 00 0f 88 ba 01 00 00 48 85 db 0f 84 16 02 00 00 <49> 8b 04 24 48 89 03 49
RIP [<ffffffffa0a0241f>] ll_open_cleanup+0xaf/0x600 [lustre]
 RSP <ffff8801d3f298e8>
CR2: 0000000000000000
Attached are the client (lola-32) messages, console log, and vmcore-dmesg.txt file.
The issue appears to be caused by a race between replay and the normal processing of the open request.
Thread 1:
Sends the LDLM_ENQUEUE request to MDT1 to open the directory, and after receiving the reply from MDT1, sends requests to the other MDTs to get the attributes. It then gets stuck waiting on one of those requests to the other MDTs.
Thread 2:
Closes the file, so the "rq_replay" flag of the open request is cleared in "mdc_close".
Then MDT1 failed over and recovery was initiated. The LDLM_ENQUEUE request from Thread 1 was replayed to MDT1 but failed with -ENOTCONN, so its "rq_repmsg" contains only the "PTLRPC_BODY" field.
The issue is then triggered in "ll_open_cleanup", which dereferences reply fields that are no longer present.