[LU-12956] oops in target_handle_connect() Created: 10/Nov/19 Updated: 19/Nov/20 Resolved: 19/Nov/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alex Zhuravlev | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
Lustre: DEBUG MARKER: == replay-single test 111b: DNE: unlink striped dir, fail MDT2 ======================================= 01:31:46 (1573417906)
...
LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Lustre: DEBUG MARKER: tmp.kWfRXHK7nj: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid,mdc.lustre-MDT0001-mdc-*.mds_server_uuid
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 0 P4D 0
Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
CPU: 0 PID: 11132 Comm: mdt00_000 Tainted: G W O --------- --- 4.18.0 #31
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: 0010: (null)
Code: Bad RIP value.
RSP: 0018:ffff88017a7dbcb8 EFLAGS: 00010202
RAX: ffffffff8216dd00 RBX: ffff880172b72a80 RCX: 0000000000000000
RDX: 0000000000000098 RSI: 0000000000000000 RDI: ffff88012e41cd48
RBP: ffff880172364208 R08: 0000000000000028 R09: ffff880172b72f30
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801723643a8
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff880198e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000014e73c000 CR4: 00000000000006b0
Call Trace:
? target_handle_connect+0x1ca9/0x2eb0 [ptlrpc]
? tgt_request_handle+0x148/0x17b0 [ptlrpc]
? ptlrpc_main+0x1182/0x3180 [ptlrpc]
? __kthread_parkme+0x2a/0x80
? kthread+0x100/0x140
? ptlrpc_register_service+0x15a0/0x15a0 [ptlrpc]
? kthread_flush_work_fn+0x10/0x10
? ret_from_fork+0x24/0x30
|
| Comments |
| Comment by Andreas Dilger [ 10/Nov/19 ] |
|
Can you please run "list *(target_handle_connect+0x1ca9)" with gdb on the ptlrpc.ko module to see where that is failing. |
| Comment by Alex Zhuravlev [ 11/Nov/19 ] |
|
sure (gdb) list *(target_handle_connect+0x1ca9) 0x15979 is in target_handle_connect (include/linux/hrtimer.h:285). 280 return ktime_to_ns(timer->node.expires); 281 } 282 283 static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer) 284 { 285 return ktime_sub(timer->node.expires, timer->base->get_time()); 286 } 287 288 static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer) 289 { and objdump:
stale = target->obd_stale_clients;
000000000001591a <target_handle_connect+0x1c7a> mov 0x5b8(%rbp),%r13d
0000000000015921 <target_handle_connect+0x1c81> mov 0x6dc(%rbp),%r14d
0000000000015928 <target_handle_connect+0x1c88> mov 0x5b0(%rbp),%r15d
000000000001592f <target_handle_connect+0x1c8f> mov %eax,0x10(%rsp)
0000000000015933 <target_handle_connect+0x1c93> mov 0x688(%rbp),%rax
000000000001593a <target_handle_connect+0x1c9a> mov %rax,0x8(%rsp)
000000000001593f <target_handle_connect+0x1c9f> mov 0x6a0(%rbp),%rax
0000000000015946 <target_handle_connect+0x1ca6> callq *0x60(%rax)
0000000000015949 <target_handle_connect+0x1ca9> mov 0x8(%rsp),%rcx
000000000001594e <target_handle_connect+0x1cae> mov $0x3b9aca00,%esi
msg = "to recover in";
0000000000015953 <target_handle_connect+0x1cb3> mov $0x0,%r10
15956: R_X86_64_32S .rodata.str1.1+0xfaa
000000000001595a <target_handle_connect+0x1cba> sub %rax,%rcx
000000000001595d <target_handle_connect+0x1cbd> mov %rcx,%rax
0000000000015960 <target_handle_connect+0x1cc0> cqto
0000000000015962 <target_handle_connect+0x1cc2> idiv %rsi
looks like obd_recovery_timer wasn't initialized |
| Comment by Alex Zhuravlev [ 15/Nov/19 ] |
|
Lustre: DEBUG MARKER: == replay-single test 80a: DNE: create remote dir, drop update rep from MDT0, fail MDT0 ============== 14:07:34 (1573808854) |
| Comment by Alexander Boyko [ 29/Oct/20 ] |
|
Looks like hrtimer_expires_remaining(timer) should be changed to hrtimer_get_remaining(timer). With hrtimer_expires_remaining() race is possible with hrtimer_start() |
| Comment by Gerrit Updater [ 02/Nov/20 ] |
|
Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/40513 |
| Comment by Gerrit Updater [ 19/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40513/ |
| Comment by Peter Jones [ 19/Nov/20 ] |
|
Landed for 2.14 |