[LU-2704] GPF in __d_lookup called from ll_statahead_one Created: 29/Jan/13 Updated: 03/Apr/16 Resolved: 03/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 6298 |
| Description |
|
We recently had a report of a General Protection Fault on a client node running a purge workload. The console message:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
CPU 0
Modules linked in: cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput ahci ib_qib(U) ib_mad ib_core dcdbas microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp xt_owner ipt_LOG xt_multiport ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]
Pid: 3895, comm: ll_sa_25915 Tainted: G W ---------------- 2.6.32-220.23.1.1chaos.ch5.x86_64 #1 Dell XS23-TY35 /0GW08P
RIP: 0010:[<ffffffff8118fc8c>] [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
RSP: 0018:ffff88049ad3dcc0 EFLAGS: 00010202
RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
RBP: ffff88049ad3dd10 R08: ffff880589fdad30 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
FS: 00002aaaab05db20(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffffffc010 CR3: 0000000297c13000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process ll_sa_25915 (pid: 3895, threadinfo ffff88049ad3c000, task ffff8804f49a0aa0)
Stack:
 ffff8802fc86b3b8 0000000f00000246 000000000000000f ffff88049ad3ddc0
<0> ffff880028215fc0 0000000002170c3c ffff88049ad3ddc0 ffff880421557300
<0> ffff880421557300 ffff8803a22822b8 ffff88049ad3dd40 ffffffff811908fc
Call Trace:
 [<ffffffff811908fc>] d_lookup+0x3c/0x60
 [<ffffffffa09b672c>] ll_statahead_one+0x1ec/0x14a0 [lustre]
 [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
 [<ffffffff8109144c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffffa09b7c98>] ll_statahead_thread+0x2b8/0x890 [lustre]
 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
 [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
 [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 48 03 05 88 4b a7 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89
RIP [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
RSP <ffff88049ad3dcc0>

The stack reported by crash:

crash> bt
PID: 3895 TASK: ffff8804f49a0aa0 CPU: 0 COMMAND: "ll_sa_25915"
#0 [ffff88049ad3da50] machine_kexec at ffffffff8103216b
#1 [ffff88049ad3dab0] crash_kexec at ffffffff810b8d12
#2 [ffff88049ad3db80] oops_end at ffffffff814f2b00
#3 [ffff88049ad3dbb0] die at ffffffff8100f26b
#4 [ffff88049ad3dbe0] do_general_protection at ffffffff814f2692
#5 [ffff88049ad3dc10] general_protection at ffffffff814f1e65
[exception RIP: __d_lookup+140]
RIP: ffffffff8118fc8c RSP: ffff88049ad3dcc0 RFLAGS: 00010202
RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
RBP: ffff88049ad3dd10 R8: ffff880589fdad30 R9: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff88049ad3dd18] d_lookup at ffffffff811908fc
#7 [ffff88049ad3dd48] ll_statahead_one at ffffffffa09b672c [lustre]
#8 [ffff88049ad3de18] ll_statahead_thread at ffffffffa09b7c98 [lustre]
#9 [ffff88049ad3df48] kernel_thread at ffffffff8100c14a
And source-level information from GDB:

(gdb) l *__d_lookup+140
0xffffffff8118fc8c is in __d_lookup (fs/dcache.c:1409).
1404 rcu_read_lock();
1405
1406 hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
1407 struct qstr *qstr;
1408
1409 if (dentry->d_name.hash != hash)
1410 continue;
1411 if (dentry->d_parent != parent)
1412 continue;
1413
Here's the assembly of __d_lookup around the RIP of the failure:

crash> x/80i __d_lookup
...
0xffffffff8118fc86 <__d_lookup+134>: je 0xffffffff8118fced <__d_lookup+237>
0xffffffff8118fc88 <__d_lookup+136>: lea -0x18(%rbx),%r12
0xffffffff8118fc8c <__d_lookup+140>: cmp %r14d,0x30(%r12)
0xffffffff8118fc91 <__d_lookup+145>: jne 0xffffffff8118fc80 <__d_lookup+128>
0xffffffff8118fc93 <__d_lookup+147>: cmp %r13,0x28(%r12)
...

The backtrace above from crash shows the register contents when the failure occurred. %r14 appears to contain the hash value for the qstr we are looking up (i.e. hash in the source code). I tend to trust this because I believe I found that qstr structure on the stack here:

crash> p -x *(struct qstr *)0xffff88049ad3ddc0
$30 = {
hash = 0x9e75e374,
len = 0xf,
name = 0xffff8802fc86b3b8 "MultiFab_D_07813"
}
The value in %r14 matches the hash of the dumped qstr. I'm not sure what the value in %r12 represents, though, and my assembly is a bit rusty, so I'm not entirely sure how to decode the cmp instruction we fail on. My best guess is that we are dereferencing %r12, which does not hold a valid pointer.
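To decode the faulting instruction: lea -0x18(%rbx),%r12 converts the hlist_node pointer in %rbx into the containing dentry (d_hash sits at offset 0x18 in this kernel's struct dentry), so cmp %r14d,0x30(%r12) reads dentry->d_name.hash, and the following cmp %r13,0x28(%r12) reads dentry->d_parent. Below is a minimal sketch reproducing those offsets, assuming the RHEL6/2.6.32 field layout; the struct is a reduced mock-up for illustration, not the kernel definition.

/* Reduced mock-up of the 2.6.32-era struct dentry head; field order
 * and sizes are chosen to reproduce the offsets in the disassembly. */
#include <stdio.h>
#include <stddef.h>

struct hlist_node { void *next, **pprev; };
struct qstr { unsigned int hash; unsigned int len; const unsigned char *name; };

struct dentry {
	int d_count;              /* 0x00, atomic_t */
	unsigned int d_flags;     /* 0x04 */
	unsigned int d_lock;      /* 0x08, spinlock_t */
	int d_mounted;            /* 0x0c */
	void *d_inode;            /* 0x10 */
	struct hlist_node d_hash; /* 0x18 -- %rbx points here */
	struct dentry *d_parent;  /* 0x28 -- cmp %r13,0x28(%r12) */
	struct qstr d_name;       /* 0x30 -- cmp %r14d,0x30(%r12) */
};

int main(void)
{
	/* %r12 = %rbx - 0x18, i.e. container_of(node, struct dentry, d_hash) */
	printf("d_hash   0x%zx\n", offsetof(struct dentry, d_hash));
	printf("d_parent 0x%zx\n", offsetof(struct dentry, d_parent));
	printf("d_name   0x%zx\n", offsetof(struct dentry, d_name));
	return 0;
}

Read this way, the guess is consistent with the registers: %r12 (0x2e342036343a371a) is exactly %rbx (0x2e342036343a3732) minus 0x18, and %rbx holds what looks like printable ASCII rather than a kernel pointer, so the hash-list link itself appears to have been overwritten and the cmp faulted dereferencing it. |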
| Comments |
| Comment by Bruno Faccini (Inactive) [ 29/Jan/13 ] |
|
Yes, but %r12 could either be inherited directly from a bad "parent" first parameter if we had just entered __d_lookup(), or be the result of a corrupted dentry on the hash list if we were already looping... So the next step is to check the parent dentry pointer value at the time __d_lookup() was called. |
| Comment by Bruno Faccini (Inactive) [ 06/Feb/13 ] |
|
Hello Prakash, have you been able to find the dentry pointer value passed as the "parent" parameter in the call to __d_lookup() from ll_statahead_one() (with d_lookup() inlined)? Is the crash dump available?
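For context on why the parameter must be recovered from the caller's frame: d_lookup() in this kernel generation is only a seqlock retry loop around __d_lookup(), so once it is inlined into ll_statahead_one() there is no separate stack frame holding "parent". Paraphrased from 2.6.32-era fs/dcache.c (a kernel excerpt for reference, not a standalone program):

/* d_lookup(): thin seqlock-retry wrapper around __d_lookup(); the
 * compiler readily inlines it, so "parent" lives only in the caller's
 * registers and stack. */
struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
{
	struct dentry *dentry = NULL;
	unsigned seq;

	do {
		seq = read_seqbegin(&rename_lock);
		dentry = __d_lookup(parent, name);
		if (dentry)
			break;
	} while (read_seqretry(&rename_lock, seq));
	return dentry;
}

This is why the value has to be dug out of ll_statahead_one()'s frame. |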
| Comment by Prakash Surya (Inactive) [ 06/Feb/13 ] |
|
I did find an address pointing to a dentry structure (0xffff880421557300), but it's hard to say whether that was the parent or just an entry in the list during traversal. I do have a crash dump; how do I post it to your FTP server? |
| Comment by Prakash Surya (Inactive) [ 07/Feb/13 ] |
|
I uploaded a tarball. The dump opens with:

$ crash LU-2704/vmcore-ansel12-2013-01-26-09\:07\:05 LU-2704/vmlinux

If you need anything else, let me know. |
| Comment by Bruno Faccini (Inactive) [ 13/Feb/13 ] |
|
Sorry for the lack of updates; I saved your tarball and am currently transferring it so I can continue the crash-dump analysis. I will keep you updated. |
| Comment by Bruno Faccini (Inactive) [ 15/Feb/13 ] |
|
Ok, I have been able to dig into the crash dump, and I am actually a bit puzzled by what I found:

_ To answer one of the questions I had before: the "parent"/1st parameter of [__]d_lookup() is valid and points to a dentry with the name "Level_0":

$3 = {
d_count = {
counter = 0x3
},
d_flags = 0x0,
d_lock = {
raw_lock = {
slock = 0x78b078b
}
},
d_mounted = 0x0,
d_inode = 0xffff8803a22822b8,
d_hash = {
next = 0x0,
pprev = 0xffffc90001489a50
},
d_parent = 0xffff8804215a2980,
d_name = {
hash = 0x4816d9c5,
len = 0x7,
name = 0xffff8804215573a0
},
d_lru = {
next = 0xffff880421557340,
prev = 0xffff880421557340
},
d_u = {
d_child = {
next = 0xffff880421576910,
prev = 0xffff8804215a29e0
},
d_rcu = {
next = 0xffff880421576910,
func = 0xffff8804215a29e0
}
},
d_subdirs = {
next = 0xffff880421557360,
prev = 0xffff880421557360
},
d_alias = {
next = 0xffff8803a22822e8,
prev = 0xffff8803a22822e8
},
d_time = 0x0,
d_op = 0xffffffffa09d3280,
d_sb = 0xffff880622524000,
d_fsdata = 0xffff8804170c1440,
d_iname = {0x4c, 0x65, 0x76, 0x65, 0x6c, 0x5f, 0x30, 0x0, 0x5f, 0x4d, 0x46, 0x5f, 0x44, 0x5f, 0x30, 0x37, 0x38, 0x32, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
_ As you found too, the "name"/2nd parameter of [__]d_lookup() is also valid and points to this:

$7 = {
hash = 0x9e75e374,
len = 0xf,
name = 0xffff8802fc86b3b8 -->> "MultiFab_D_07813"
}
_ Due to optimisation there are no intermediate registers/results from which to track/check the d_hash() inlined-function computations, so I had to replay them by hand (and verified the same method works with dentry->d_parent and its own hash!), with the help of the __d_lookup() disassembly and source code (the 2.6.32 d_hash() helper is sketched at the end of this comment for reference). What I found is that the dentry_hashtable[] entry contains what looks like a valid pointer to the 1st element/dentry linked on its d_hash list. But when trying to resolve/dereference it, crash complains that it is not present/dumped; moreover, the page this address belongs to is currently assigned to a Lustre inode mapping!

_ This explains why the page was not dumped, and likely the kind of data that was accessed to trigger the GPF.

_ There are 2 other threads currently holding old references to this inode in their stacks, "ptlrpcd-brw" and "flush-lustre-2", but they have not been scheduled recently (I mean, within the last second!).

_ More interesting: there is another thread, "lpurge", running on another core, which also triggers a GPF in the dentry layer!! I will dig into it now.

_ Last, there have been earlier OOM situations logged, and also these messages found in the kernel log:

LustreError: 29314:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 29314:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 29314:0:(file.c:158:ll_close_inode_openhandle()) inode 5053885323872960658 mdc close failed: rc = -108
LustreError: 29314:0:(file.c:158:ll_close_inode_openhandle()) inode 5044781017488031745 mdc close failed: rc = -108
LustreError: 29314:0:(dir.c:477:ll_get_dir_page()) read cache page: [0x24d18001:0x3db440f0:0x0] at 3512391207215765018: rc -108
LustreError: 29314:0:(dir.c:648:ll_readdir()) error reading dir [0x24d18001:0x3db440f0:0x0] at 3512391207215765018: rc -108
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 11-0: lsc-MDT0000-mdc-ffff880622524000: Communicating with 172.19.1.100@o2ib100, operation mds_getattr_lock failed with -107.
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection to lsc-MDT0000 (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lsc-MDT0000; in progress operations using this service will fail.
LustreError: 25915:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 4613:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-MDT0000-mdc-ffff880622524000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 4613:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88012b155800 (2500481170767/301/0/0) (rc: 1)
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 25915:0:(mdc_locks.c:723:mdc_enqueue()) Skipped 1001 previous similar messages
LustreError: 11-0: lsc-MDT0000-mdc-ffff880622524000: Communicating with 172.19.1.100@o2ib100, operation mds_getattr_lock failed with -107.
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection to lsc-MDT0000 (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lsc-MDT0000; in progress operations using this service will fail.
LustreError: 9659:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 9681:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-MDT0000-mdc-ffff880622524000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 9681:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880135b3ba80 (2957967665419/1/0/0) (rc: 1)
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 9659:0:(mdc_locks.c:723:mdc_enqueue()) Skipped 4 previous similar messages
LustreError: 9659:0:(file.c:2196:ll_inode_revalidate_fini()) failure -4 inode 8764462226200395777
general protection fault: 0000 [#1] SMP
.....

which may well be related... So, more to come, but it seems to me that we may have crashed due to an orphan reference left in dentry_hashtable[], possibly left behind under some specific eviction conditions.
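For reference, the bucket computation that had to be replayed by hand: the 2.6.32-era hash helper in fs/dcache.c looks approximately like this (a paraphrased kernel excerpt; GOLDEN_RATIO_PRIME comes from <linux/hash.h>, and D_HASHBITS/D_HASHMASK reflect the boot-time size of dentry_hashtable[]):

/* Paraphrased from 2.6.32-era fs/dcache.c: maps (parent, name hash)
 * to a bucket of the global dentry_hashtable[]. */
static inline struct hlist_head *d_hash(struct dentry *parent,
                                        unsigned long hash)
{
	hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
	hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
	return dentry_hashtable + (hash & D_HASHMASK);
}

Replaying this with the parent pointer (0xffff880421557300) and name hash (0x9e75e374) from the dump is what locates the bucket whose corrupted content is described above. |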
| Comment by Bruno Faccini (Inactive) [ 22/Feb/13 ] |
|
The 2nd thread, "lpurge", also handling a GPF, is in fact working on the same parent directory and hash value, and thus hits the same corrupted dentry_hashtable[] entry. That is consistent, since the 1st thread in GPF is an ll_sa_<PID> companion thread of the 2nd!! On the other hand, the Lustre inode whose mapping now owns the page that the corrupted reference points into corresponds to one of the current file descriptors of this 2nd thread, "lpurge"... I am not familiar with the "lpurge" command; is it doing/using anything tricky from a Lustre point of view?? |
| Comment by Christopher Morrone [ 22/Feb/13 ] |
|
lpurge is part of the lustre-tools-llnl package. You can see the code here: https://github.com/chaos/lustre-tools-llnl/blob/master/src/lpurge.c

lpurge tries to constrain itself to talking only to the MDS when possible, so it uses the IOC_MDC_GETFILEINFO ioctl, for instance (sketched below). Other than that, nothing too tricky that I can remember. It is not even multi-threaded. If there were two lpurge instances running, that is likely because the sysadmin ran lpurge more than once. I would expect each copy to be run in a different subdirectory (likely different users' directories in Lustre), though it is possible that someone made a mistake. If no mistake was made in running the command, I would not be surprised if they share some top-level directory.
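For readers unfamiliar with that ioctl, here is a rough sketch of the calling convention, modeled on the liblustreapi pattern of this era: the file name is copied into a buffer, the ioctl is issued on the parent directory's file descriptor, and the buffer comes back holding a struct lov_user_mds_data with the MDS-side attributes. The buffer sizing and struct details here are assumptions that vary across Lustre versions, so treat this as illustrative rather than authoritative:

/* Hypothetical sketch of an IOC_MDC_GETFILEINFO call: stat a file via
 * the MDS alone, through an ioctl on its parent directory.  Buffer
 * sizing and struct layout are assumptions and vary across Lustre
 * versions. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <lustre/lustre_user.h>  /* IOC_MDC_GETFILEINFO, struct lov_user_mds_data */

static int mds_stat(const char *dirpath, const char *fname)
{
	char buf[64 * 1024];  /* file name in, lov_user_mds_data out */
	struct lov_user_mds_data *lmd = (struct lov_user_mds_data *)buf;
	int rc, dirfd;

	dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;

	snprintf(buf, sizeof(buf), "%s", fname);
	rc = ioctl(dirfd, IOC_MDC_GETFILEINFO, buf);
	if (rc == 0)
		printf("%s: size=%llu\n", fname,
		       (unsigned long long)lmd->lmd_st.st_size);
	close(dirfd);
	return rc;
}

The relevance to this ticket is that such calls retrieve attributes without ever opening the files themselves, which is how lpurge keeps its traffic confined to the MDS where possible. |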
| Comment by Bruno Faccini (Inactive) [ 26/Feb/13 ] |
|
Thanks for this info, Chris, but I never have enough, sorry... Can you confirm whether lpurge was run with -R (i.e., removing the selected files) at the time of the crash? I cannot verify it myself in the crash dump, since user-context pages are not dumped by default. I think I may have found a path in the code that could allow/explain the dentry_hashtable[] corruption I found in your crash dump; I will give it a try and expose it to our regression tests. |
| Comment by Bruno Faccini (Inactive) [ 26/Feb/13 ] |
|
Patch against b2_1 at http://review.whamcloud.com/5533. |
| Comment by Prakash Surya (Inactive) [ 27/Feb/13 ] |
|
Bruno, lpurge was run with the "-R" option to remove the files. |
| Comment by Bruno Faccini (Inactive) [ 06/Mar/13 ] |
|
I realize I forgot to ask, but I presume this has been a single crash so far? BTW, the way it can happen is still unexplained, and at the moment I only have a debug patch ready to go, intended to detect unsafe dentry_hashtable[] accesses... |
| Comment by Prakash Surya (Inactive) [ 06/Mar/13 ] |
|
AFAIK, it's only been hit once. But I've heard vague reports from our admins about crashes when running lpurge before, so there's a chance there are unreported incidents of this crash. |
| Comment by Bruno Faccini (Inactive) [ 12/Mar/13 ] |
|
Exposure of the patch to the full test suite went through a lot of external conditions (mainly the one tracked in |
| Comment by Bruno Faccini (Inactive) [ 29/Apr/13 ] |
|
Prakash, Chris, |
| Comment by Sebastien Piechurski [ 23/Jul/14 ] |
|
I am not sure this will be of any help, but I wanted to mention that we had several occurrences of this issue at one of our customer sites. |
| Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ] |
|
Hello Seb, |
| Comment by Sebastien Piechurski [ 24/Jul/14 ] |
|
The customer is running our version based on 2.1.6. Your patch is not part of it, so I will check whether the customer is still seeing occurrences and, if so, try to get the patch applied there. |
| Comment by Andreas Dilger [ 03/Apr/16 ] |
|
Have not had any feedback on this 2.1 bug in 2 years. Closing. |