[LU-12956] oops in target_handle_connect() Created: 10/Nov/19  Updated: 19/Nov/20  Resolved: 19/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
Lustre: DEBUG MARKER: == replay-single test 111b: DNE: unlink striped dir, fail MDT2 ======================================= 01:31:46 (1573417906)
...
LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Lustre: DEBUG MARKER: tmp.kWfRXHK7nj: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 0 P4D 0 
Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
CPU: 0 PID: 11132 Comm: mdt00_000 Tainted: G        W  O     --------- ---  4.18.0 #31
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: 0010:          (null)
Code: Bad RIP value.
RSP: 0018:ffff88017a7dbcb8 EFLAGS: 00010202
RAX: ffffffff8216dd00 RBX: ffff880172b72a80 RCX: 0000000000000000
RDX: 0000000000000098 RSI: 0000000000000000 RDI: ffff88012e41cd48
RBP: ffff880172364208 R08: 0000000000000028 R09: ffff880172b72f30
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801723643a8
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff880198e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000014e73c000 CR4: 00000000000006b0
Call Trace:
 ? target_handle_connect+0x1ca9/0x2eb0 [ptlrpc]
 ? tgt_request_handle+0x148/0x17b0 [ptlrpc]
 ? ptlrpc_main+0x1182/0x3180 [ptlrpc]
 ? __kthread_parkme+0x2a/0x80
 ? kthread+0x100/0x140
 ? ptlrpc_register_service+0x15a0/0x15a0 [ptlrpc]
 ? kthread_flush_work_fn+0x10/0x10
 ? ret_from_fork+0x24/0x30


 Comments   
Comment by Andreas Dilger [ 10/Nov/19 ]

Can you please run "list *(target_handle_connect+0x1ca9)" with gdb on the ptlrpc.ko module to see where that is failing.

Comment by Alex Zhuravlev [ 11/Nov/19 ]

sure

(gdb) list *(target_handle_connect+0x1ca9)
0x15979 is in target_handle_connect (include/linux/hrtimer.h:285).
280		return ktime_to_ns(timer->node.expires);
281	}
282	
283	static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
284	{
285		return ktime_sub(timer->node.expires, timer->base->get_time());
286	}
287	
288	static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
289	{

and objdump:

                        stale = target->obd_stale_clients;
000000000001591a <target_handle_connect+0x1c7a> mov    0x5b8(%rbp),%r13d
0000000000015921 <target_handle_connect+0x1c81> mov    0x6dc(%rbp),%r14d
0000000000015928 <target_handle_connect+0x1c88> mov    0x5b0(%rbp),%r15d
000000000001592f <target_handle_connect+0x1c8f> mov    %eax,0x10(%rsp)
0000000000015933 <target_handle_connect+0x1c93> mov    0x688(%rbp),%rax
000000000001593a <target_handle_connect+0x1c9a> mov    %rax,0x8(%rsp)
000000000001593f <target_handle_connect+0x1c9f> mov    0x6a0(%rbp),%rax
0000000000015946 <target_handle_connect+0x1ca6> callq  *0x60(%rax)
0000000000015949 <target_handle_connect+0x1ca9> mov    0x8(%rsp),%rcx
000000000001594e <target_handle_connect+0x1cae> mov    $0x3b9aca00,%esi
                                msg = "to recover in";
0000000000015953 <target_handle_connect+0x1cb3> mov    $0x0,%r10
                        15956: R_X86_64_32S     .rodata.str1.1+0xfaa
000000000001595a <target_handle_connect+0x1cba> sub    %rax,%rcx
000000000001595d <target_handle_connect+0x1cbd> mov    %rcx,%rax
0000000000015960 <target_handle_connect+0x1cc0> cqto   
0000000000015962 <target_handle_connect+0x1cc2> idiv   %rsi

looks like obd_recovery_timer wasn't initialized

Comment by Alex Zhuravlev [ 15/Nov/19 ]

Lustre: DEBUG MARKER: == replay-single test 80a: DNE: create remote dir, drop update rep from MDT0, fail MDT0 ============== 14:07:34 (1573808854)
LustreError: 5753:0:(ldlm_lib.c:3080:target_send_reply_msg()) @@@ dropping reply req@00000000432bd521 x1650253602473600/t365072220167(0) o1000->lustre-MDT0001-mdtlov_UUID@192.168.122.230@tcp:36/0 lens 1056/4320 e 0 to 0 dl 1573808861 ref 1 fl Interpret:/0/0 rc 0/0 job:'osp_up0-1.0'
Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
LustreError: 8083:0:(ldlm_lib.c:1331:target_handle_connect()) ASSERTION( timer->base->get_time ) failed:
LustreError: 8083:0:(ldlm_lib.c:1331:target_handle_connect()) LBUG
Call Trace:
? __schedule+0x2ad/0xb00
schedule+0x34/0x80
lbug_with_loc+0x79/0x80 [libcfs]
? target_handle_connect+0x2c62/0x2f20 [ptlrpc]
? tgt_request_handle+0x148/0x17b0 [ptlrpc]
? ptlrpc_main+0x1182/0x3180 [ptlrpc]
? kthread+0x100/0x140

Comment by Alexander Boyko [ 29/Oct/20 ]

Looks like hrtimer_expires_remaining(timer) should be changed to hrtimer_get_remaining(timer). With hrtimer_expires_remaining() race is possible with hrtimer_start()>hrtimer_start_range_ns()>switch_hrtimer_base()

Comment by Gerrit Updater [ 02/Nov/20 ]

Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/40513
Subject: LU-12956 ldlm: fix hrtimer using
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f787eddc0a1f6f41aae87ceda94dbf7fcd92f646

Comment by Gerrit Updater [ 19/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40513/
Subject: LU-12956 ldlm: fix hrtimer using
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 01a70a56540f095b3dc30656b7135636b4b3abef

Comment by Peter Jones [ 19/Nov/20 ]

Landed for 2.14

Generated at Sat Feb 10 02:57:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.