Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- Lustre 2.1.2
- 3
- 6298
Description
We recently had a report of a General Protection Fault on a client node running a purge workload.
The console message:
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
CPU 0
Modules linked in: cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput ahci ib_qib(U) ib_mad ib_core dcdbas microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp xt_owner ipt_LOG xt_multiport ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]
Pid: 3895, comm: ll_sa_25915 Tainted: G W ---------------- 2.6.32-220.23.1.1chaos.ch5.x86_64 #1 Dell XS23-TY35 /0GW08P
RIP: 0010:[<ffffffff8118fc8c>] [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
RSP: 0018:ffff88049ad3dcc0 EFLAGS: 00010202
RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
RBP: ffff88049ad3dd10 R08: ffff880589fdad30 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
FS: 00002aaaab05db20(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffffffc010 CR3: 0000000297c13000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process ll_sa_25915 (pid: 3895, threadinfo ffff88049ad3c000, task ffff8804f49a0aa0)
Stack:
ffff8802fc86b3b8 0000000f00000246 000000000000000f ffff88049ad3ddc0
<0> ffff880028215fc0 0000000002170c3c ffff88049ad3ddc0 ffff880421557300
<0> ffff880421557300 ffff8803a22822b8 ffff88049ad3dd40 ffffffff811908fc
Call Trace:
[<ffffffff811908fc>] d_lookup+0x3c/0x60
[<ffffffffa09b672c>] ll_statahead_one+0x1ec/0x14a0 [lustre]
[<ffffffff81051ba3>] ? __wake_up+0x53/0x70
[<ffffffff8109144c>] ? remove_wait_queue+0x3c/0x50
[<ffffffffa09b7c98>] ll_statahead_thread+0x2b8/0x890 [lustre]
[<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
[<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
[<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 48 03 05 88 4b a7 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89
RIP [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
RSP <ffff88049ad3dcc0>
The stack reported by crash:
crash> bt
PID: 3895 TASK: ffff8804f49a0aa0 CPU: 0 COMMAND: "ll_sa_25915"
#0 [ffff88049ad3da50] machine_kexec at ffffffff8103216b
#1 [ffff88049ad3dab0] crash_kexec at ffffffff810b8d12
#2 [ffff88049ad3db80] oops_end at ffffffff814f2b00
#3 [ffff88049ad3dbb0] die at ffffffff8100f26b
#4 [ffff88049ad3dbe0] do_general_protection at ffffffff814f2692
#5 [ffff88049ad3dc10] general_protection at ffffffff814f1e65
[exception RIP: __d_lookup+140]
RIP: ffffffff8118fc8c RSP: ffff88049ad3dcc0 RFLAGS: 00010202
RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
RBP: ffff88049ad3dd10 R8: ffff880589fdad30 R9: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff88049ad3dd18] d_lookup at ffffffff811908fc
#7 [ffff88049ad3dd48] ll_statahead_one at ffffffffa09b672c [lustre]
#8 [ffff88049ad3de18] ll_statahead_thread at ffffffffa09b7c98 [lustre]
#9 [ffff88049ad3df48] kernel_thread at ffffffff8100c14a
And source level information from GDB:
(gdb) l *__d_lookup+140
0xffffffff8118fc8c is in __d_lookup (fs/dcache.c:1409).
1404 rcu_read_lock();
1405
1406 hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
1407 struct qstr *qstr;
1408
1409 if (dentry->d_name.hash != hash)
1410 continue;
1411 if (dentry->d_parent != parent)
1412 continue;
1413
Here's the assembly of __d_lookup around the RIP of the failure:
crash> x/80i __d_lookup
...
0xffffffff8118fc86 <__d_lookup+134>: je 0xffffffff8118fced <__d_lookup+237>
0xffffffff8118fc88 <__d_lookup+136>: lea -0x18(%rbx),%r12
0xffffffff8118fc8c <__d_lookup+140>: cmp %r14d,0x30(%r12)
0xffffffff8118fc91 <__d_lookup+145>: jne 0xffffffff8118fc80 <__d_lookup+128>
0xffffffff8118fc93 <__d_lookup+147>: cmp %r13,0x28(%r12)
...
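As a reading aid for the disassembly above, the following sketch redoes the address arithmetic of the instructions at +136 and +140 using the register values from the oops. It is only an illustration: the helper names are hypothetical, and reading offset 0x18 as d_hash and offset 0x30 as d_name.hash is an inference from the source listing at fs/dcache.c:1409, not something checked against the struct layout of this kernel.

```python
# Register values copied from the oops / crash backtrace.
RBX = 0x2E342036343A3732
R12 = 0x2E342036343A371A
R14 = 0x000000009E75E374

MASK64 = (1 << 64) - 1


def lea(base, disp):
    """Model `lea disp(%base),%dst`: 64-bit wrapping add, no memory access."""
    return (base + disp) & MASK64


def effective_address(base, disp):
    """Address that `cmp %r14d,disp(%base)` reads from memory."""
    return (base + disp) & MASK64


# lea -0x18(%rbx),%r12: step back from the hash-list node to the start of
# the enclosing structure (ASSUMED here to be the candidate dentry).
computed_r12 = lea(RBX, -0x18)

# cmp %r14d,0x30(%r12): 32-bit load at r12+0x30, compared with the low
# 32 bits of %r14 (ASSUMED to be dentry->d_name.hash versus hash).
hash_addr = effective_address(computed_r12, 0x30)
r14d = R14 & 0xFFFFFFFF

print(f"r12 from lea : {computed_r12:#018x}")
print(f"r12 in oops  : {R12:#018x}")
print(f"load address : {hash_addr:#018x}")
print(f"r14d         : {r14d:#010x}")
```

The value computed from %rbx equals the %r12 in the register dump, which is consistent with %r12 being the candidate dentry pointer derived from the list node in %rbx, and with the faulting access being the load of that candidate's hash field.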
The backtrace above from crash shows the register contents when the failure occurred. %r14 appears to contain the hash value for the qstr we are looking up (i.e. hash in the source code). I tend to trust this because I believe I found that qstr structure on the stack here:
crash> p -x *(struct qstr *)0xffff88049ad3ddc0
$30 = {
hash = 0x9e75e374,
len = 0xf,
name = 0xffff8802fc86b3b8 "MultiFab_D_07813"
}
The value in %r14 matches the hash of the dumped qstr. I'm not sure what the value in %r12 represents, though, and my assembly is a bit rusty, so I'm not entirely sure how to decode the cmp instruction that we fail on. My only guess is that we're dereferencing %r12, which isn't a valid pointer value.
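One way to test the guess about %r12 is to check whether it is a canonical x86-64 address, and to look at the raw bytes of %rbx, from which %r12 is derived. The sketch below assumes 48-bit virtual addressing, as used by 2.6.32 x86_64 kernels; the function names are hypothetical, and %r13 is included only for contrast as what appears to be the parent dentry pointer.

```python
RBX = 0x2E342036343A3732
R12 = 0x2E342036343A371A
R13 = 0xFFFF880421557300  # apparently the parent dentry, used for contrast


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def memory_bytes(value):
    """The 8 bytes a 64-bit value occupies in little-endian memory."""
    return value.to_bytes(8, "little")


raw = memory_bytes(RBX)
# Printable ASCII range is 0x20..0x7e inclusive.
printable = all(0x20 <= b < 0x7F for b in raw)

print("r12 canonical:", is_canonical(R12))
print("r13 canonical:", is_canonical(R13))
print("rbx bytes    :", raw)
print("all printable:", printable)
```

Under these assumptions %r12 is non-canonical, and a memory access through a non-canonical address raises a general protection fault rather than a page fault, which fits the console message. Every byte of %rbx is printable ASCII, which hints (without proving) that the hash chain pointer had been overwritten with text data.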
Issue Links
- is related to
- LU-7973 Lustre client crash in __d_lookup() - BUG: unable to handle kernel paging request (Resolved)