[LU-2704] GPF in __d_lookup called from ll_statahead_one Created: 29/Jan/13  Updated: 03/Apr/16  Resolved: 03/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-7973 Lustre client crash in __d_lookup() -... Resolved
Severity: 3
Rank (Obsolete): 6298

 Description   

We recently had a report of a General Protection Fault on a client node running a purge workload.

The console message:

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
CPU 0 
Modules linked in: cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput ahci ib_qib(U) ib_mad ib_core dcdbas microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp xt_owner ipt_LOG xt_multiport ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]

Pid: 3895, comm: ll_sa_25915 Tainted: G        W  ----------------   2.6.32-220.23.1.1chaos.ch5.x86_64 #1 Dell       XS23-TY35       /0GW08P
RIP: 0010:[<ffffffff8118fc8c>]  [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
RSP: 0018:ffff88049ad3dcc0  EFLAGS: 00010202
RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
RBP: ffff88049ad3dd10 R08: ffff880589fdad30 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
FS:  00002aaaab05db20(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffffffc010 CR3: 0000000297c13000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process ll_sa_25915 (pid: 3895, threadinfo ffff88049ad3c000, task ffff8804f49a0aa0)
Stack:
 ffff8802fc86b3b8 0000000f00000246 000000000000000f ffff88049ad3ddc0
<0> ffff880028215fc0 0000000002170c3c ffff88049ad3ddc0 ffff880421557300
<0> ffff880421557300 ffff8803a22822b8 ffff88049ad3dd40 ffffffff811908fc
Call Trace:
 [<ffffffff811908fc>] d_lookup+0x3c/0x60
 [<ffffffffa09b672c>] ll_statahead_one+0x1ec/0x14a0 [lustre]
 [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
 [<ffffffff8109144c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffffa09b7c98>] ll_statahead_thread+0x2b8/0x890 [lustre]
 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
 [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
 [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 48 03 05 88 4b a7 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89 
RIP  [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
 RSP <ffff88049ad3dcc0>

The stack reported by crash:

crash> bt
PID: 3895   TASK: ffff8804f49a0aa0  CPU: 0   COMMAND: "ll_sa_25915"
 #0 [ffff88049ad3da50] machine_kexec at ffffffff8103216b
 #1 [ffff88049ad3dab0] crash_kexec at ffffffff810b8d12
 #2 [ffff88049ad3db80] oops_end at ffffffff814f2b00
 #3 [ffff88049ad3dbb0] die at ffffffff8100f26b
 #4 [ffff88049ad3dbe0] do_general_protection at ffffffff814f2692
 #5 [ffff88049ad3dc10] general_protection at ffffffff814f1e65
    [exception RIP: __d_lookup+140]
    RIP: ffffffff8118fc8c  RSP: ffff88049ad3dcc0  RFLAGS: 00010202
    RAX: 000000000000000f  RBX: 2e342036343a3732  RCX: 0000000000000016
    RDX: 018721e08df08940  RSI: ffff88049ad3ddc0  RDI: ffff880421557300
    RBP: ffff88049ad3dd10   R8: ffff880589fdad30   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 0000000000000000  R12: 2e342036343a371a
    R13: ffff880421557300  R14: 000000009e75e374  R15: ffff8803a22822b8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff88049ad3dd18] d_lookup at ffffffff811908fc
 #7 [ffff88049ad3dd48] ll_statahead_one at ffffffffa09b672c [lustre]
 #8 [ffff88049ad3de18] ll_statahead_thread at ffffffffa09b7c98 [lustre]
 #9 [ffff88049ad3df48] kernel_thread at ffffffff8100c14a

And source level information from GDB:

(gdb) l *__d_lookup+140
0xffffffff8118fc8c is in __d_lookup (fs/dcache.c:1409).
1404            rcu_read_lock();
1405
1406            hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
1407                    struct qstr *qstr;
1408
1409                    if (dentry->d_name.hash != hash)
1410                            continue;
1411                    if (dentry->d_parent != parent)
1412                            continue;
1413

Here's the assembly of __d_lookup around the RIP of the failure:

crash> x/80i __d_lookup
...
   0xffffffff8118fc86 <__d_lookup+134>: je     0xffffffff8118fced <__d_lookup+237>
   0xffffffff8118fc88 <__d_lookup+136>: lea    -0x18(%rbx),%r12
   0xffffffff8118fc8c <__d_lookup+140>: cmp    %r14d,0x30(%r12)
   0xffffffff8118fc91 <__d_lookup+145>: jne    0xffffffff8118fc80 <__d_lookup+128>
   0xffffffff8118fc93 <__d_lookup+147>: cmp    %r13,0x28(%r12)
...

The backtrace above from crash shows the register contents when the failure occurred. %r14 appears to contain the hash value for the qstr we are looking up (i.e. hash in the source code). I tend to trust this because I believe I found that qstr structure on the stack here:

crash> p -x *(struct qstr *)0xffff88049ad3ddc0
$30 = {
  hash = 0x9e75e374, 
  len = 0xf, 
  name = 0xffff8802fc86b3b8 "MultiFab_D_07813"
}

The value in %r14 and the hash of the dumped qstr match up. I'm not sure what the value in %r12 represents, though, and my assembly is a bit rusty, so I'm not entirely sure how to decode the cmp instruction we fail on. My only guess is that we're dereferencing %r12, which isn't a valid pointer value.
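
Assuming this kernel's struct dentry puts d_hash at offset 0x18, d_parent at 0x28 and d_name at 0x30 (which is what the cmp offsets and the lea suggest), here is a rough sketch of what the loop body at fs/dcache.c:1406 appears to reduce to, with hlist_for_each_entry_rcu() expanded by hand — my reading of the disassembly, not verified against the exact RHEL6 source:

/* Hand-expanded sketch of the __d_lookup() loop, mapped to the registers
 * in the oops above. */
struct hlist_node *node = head->first;                    /* %rbx */
while (node) {
        struct dentry *dentry =                           /* %r12 = %rbx - 0x18 */
                container_of(node, struct dentry, d_hash);

        if (dentry->d_name.hash != hash)                  /* cmp %r14d,0x30(%r12) <- GPF */
                goto next;
        if (dentry->d_parent != parent)                   /* cmp %r13,0x28(%r12) */
                goto next;
        /* ... qstr comparison, d_count handling ... */
next:
        node = node->next;                                /* mov (%rbx),%rbx */
}

If that mapping is right, %r12 is the candidate dentry recovered from the hash-chain node held in %rbx, and the faulting cmp is the dentry->d_name.hash read. It may also be worth noting that the garbage in %rbx (2e342036343a3732) decodes as ASCII text ("27:46 4." when read as little-endian bytes), as if the chain pointer had been overwritten with file data.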



 Comments   
Comment by Bruno Faccini (Inactive) [ 29/Jan/13 ]

Yes, but %r12 can either be directly inherited from a wrong "parent" 1st parameter if we had just entered __d_lookup(), or be the result of a corrupted dentry on the hash-list if we were already looping ...

So now you need to check the parent dentry pointer value at the time of the call to __d_lookup().

Comment by Bruno Faccini (Inactive) [ 06/Feb/13 ]

Hello Prakash,

Have you been able to find the dentry pointer value passed as the "parent" parameter in the call to __d_lookup() from ll_statahead_one() (with d_lookup() inlined) ??

Is the crash-dump available ?

Comment by Prakash Surya (Inactive) [ 06/Feb/13 ]

I did find an address pointing to a dentry structure (0xffff880421557300), but it's hard to say if that was the parent or just an entry in the list during traversal. I do have a crash dump; how do I post it to your FTP server?

Comment by Prakash Surya (Inactive) [ 07/Feb/13 ]

I uploaded a tarball named LU-2704.tar to that directory. It has the vmcore, vmlinux, and Lustre modules. I believe you should be able to inspect it like so:

$ crash LU-2704/vmcore-ansel12-2013-01-26-09\:07\:05 LU-2704/vmlinux

If you need anything else, let me know.

Comment by Bruno Faccini (Inactive) [ 13/Feb/13 ]

Sorry for the lack of updates; I saved your tarball and am currently transferring it so I can continue the crash-dump analysis. I will keep you updated.

Comment by Bruno Faccini (Inactive) [ 15/Feb/13 ]

Ok, I have been able to dig into the crash-dump, and actually I am a bit confused, because here is what I found:

_ to answer one of the questions I had before, the "parent"/1st parameter of [__]d_lookup() is valid and points to a dentry with the name "Level_0":

$3 = {
  d_count = {
    counter = 0x3
  },
  d_flags = 0x0,
  d_lock = {
    raw_lock = {
      slock = 0x78b078b
    }
  },
  d_mounted = 0x0,
  d_inode = 0xffff8803a22822b8,
  d_hash = {
    next = 0x0,
    pprev = 0xffffc90001489a50
  },
  d_parent = 0xffff8804215a2980,
  d_name = {
    hash = 0x4816d9c5,
    len = 0x7,
    name = 0xffff8804215573a0
  },
  d_lru = {
    next = 0xffff880421557340,
    prev = 0xffff880421557340
  },
  d_u = {
    d_child = {
      next = 0xffff880421576910,
      prev = 0xffff8804215a29e0
    },
    d_rcu = {
      next = 0xffff880421576910,
      func = 0xffff8804215a29e0
    }
  },
  d_subdirs = {
    next = 0xffff880421557360,
    prev = 0xffff880421557360
  },
  d_alias = {
    next = 0xffff8803a22822e8,
    prev = 0xffff8803a22822e8
  },
  d_time = 0x0,
  d_op = 0xffffffffa09d3280,
  d_sb = 0xffff880622524000,
  d_fsdata = 0xffff8804170c1440,
  d_iname = {0x4c, 0x65, 0x76, 0x65, 0x6c, 0x5f, 0x30, 0x0, 0x5f, 0x4d, 0x46, 0x5f, 0x44, 0x5f, 0x30, 0x37, 0x38, 0x32, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}

_ as you found too, the "name"/2nd parameter of [__]d_lookup() is also valid and points to this:

$7 = {
  hash = 0x9e75e374,
  len = 0xf,
  name = 0xffff8802fc86b3b8 -->> "MultiFab_D_07813"
}

_ due to optimisation there are no intermediate registers/results left to track/check some of the inlined d_hash() computations, so I had to replay the hash by hand (and I verified the same method works with dentry->d_parent and its own hash!) with the help of the __d_lookup() disassembly and source code (see the d_hash() sketch at the end of this comment). What I found is that the dentry_hashtable[] entry contains what looks like a valid pointer to the 1st element/dentry linked on its d_hash list. But when trying to resolve/dereference it, crash complains that it is not present/dumped, and moreover the page this address belongs to is currently assigned to a Lustre inode mapping !!!

_ this explains why the page has not been dumped, and likely the kind of data that was accessed to trigger the GPF !!!

_ there are 2 other current threads holding old references to this inode in their stacks, "ptlrpcd-brw" and "flush-lustre-2", but they have not been scheduled recently (not within the last 1s, I mean!).

_ more interesting, there is another thread, "lpurge", running on another core, which also triggers a GPF in the dentry layer !!... I will dig into it now.

_ last, there have been earlier OOM situations logged, and also these messages found in the kernel log:

LustreError: 29314:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 29314:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 29314:0:(file.c:158:ll_close_inode_openhandle()) inode 5053885323872960658 mdc close failed: rc = -108
LustreError: 29314:0:(file.c:158:ll_close_inode_openhandle()) inode 5044781017488031745 mdc close failed: rc = -108
LustreError: 29314:0:(dir.c:477:ll_get_dir_page()) read cache page: [0x24d18001:0x3db440f0:0x0] at 3512391207215765018: rc -108
LustreError: 29314:0:(dir.c:648:ll_readdir()) error reading dir [0x24d18001:0x3db440f0:0x0] at 3512391207215765018: rc -108
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 11-0: lsc-MDT0000-mdc-ffff880622524000: Communicating with 172.19.1.100@o2ib100, operation mds_getattr_lock failed with -107.
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection to lsc-MDT0000 (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lsc-MDT0000; in progress operations using this service will fail.
LustreError: 25915:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 4613:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-MDT0000-mdc-ffff880622524000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 4613:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88012b155800 (2500481170767/301/0/0) (rc: 1)
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 25915:0:(mdc_locks.c:723:mdc_enqueue()) Skipped 1001 previous similar messages
LustreError: 11-0: lsc-MDT0000-mdc-ffff880622524000: Communicating with 172.19.1.100@o2ib100, operation mds_getattr_lock failed with -107.
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection to lsc-MDT0000 (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lsc-MDT0000; in progress operations using this service will fail.
LustreError: 9659:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 9681:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-MDT0000-mdc-ffff880622524000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 9681:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880135b3ba80 (2957967665419/1/0/0) (rc: 1)
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 9659:0:(mdc_locks.c:723:mdc_enqueue()) Skipped 4 previous similar messages
LustreError: 9659:0:(file.c:2196:ll_inode_revalidate_fini()) failure -4 inode 8764462226200395777
general protection fault: 0000 [#1] SMP
.....

which may likely be related ...

So, more to come, but it seems to me that we may have crashed due to an orphan reference left in dentry_hashtable[], which may have occurred under some specific eviction conditions ....
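
For reference, here is the chain-head computation I replayed by hand, as it appears in mainline 2.6.32 fs/dcache.c (a sketch — the RHEL6 build may differ in minor details; D_HASHBITS/D_HASHMASK resolve to the boot-time d_hash_shift/d_hash_mask values, which can be read from the crash-dump):

static struct hlist_head *d_hash(struct dentry *parent, unsigned long hash)
{
        /* mix the parent pointer into the name hash, then fold and mask
         * down to an index into dentry_hashtable[] */
        hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
        hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
        return dentry_hashtable + (hash & D_HASHMASK);
}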

Comment by Bruno Faccini (Inactive) [ 22/Feb/13 ]

The 2nd thread/"lpurge" also handling a GPF is in fact working on the same parent directory and hash value and thus also triggers the same+corrupted dentry_hashtable[] entry. This is ok, since the 1st thread in GPF is a ll_sa_<PID> companion thread of the 2nd !!

On the other hand, the Lustre inode whose mapping owns the page that the corrupted reference points into is one of the current file descriptors of this 2nd thread, "lpurge" ...

I am not familiar with the "lpurge" command, but does it do anything tricky from a Lustre point of view ??

Comment by Christopher Morrone [ 22/Feb/13 ]

lpurge is part of the lustre-tools-llnl package. You can see the code here:

https://github.com/chaos/lustre-tools-llnl/blob/master/src/lpurge.c

lpurge tries to constrain itself to only talking to the MDS when possible, so it uses the IOC_MDC_GETFILEINFO ioctl, for instance.

Other than that, nothing too tricky that I can remember. It is not even multi-threaded. If there were two lpurge threads running, that is likely because the sysadmin ran lpurge more than once. But I would expect each copy to be run in a different subdirectory (likely different users' directories in lustre). It is possible that someone made a mistake though.

I would not be surprised if they share some top-level directory if no mistake was made in running the command.
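
For reference, a rough, hypothetical sketch of that ioctl pattern (following the liblustreapi convention; the helper name and buffer sizing are illustrative, not lifted from lpurge.c) — the entry name goes into the buffer, and the MDC overwrites it with the MDS-side attributes and striping EA, so the stat never touches the OSTs:

#include <string.h>
#include <dirent.h>
#include <sys/ioctl.h>
#include <lustre/lustre_user.h>  /* IOC_MDC_GETFILEINFO, struct lov_user_mds_data, lstat_t */

/* Hypothetical helper: stat "name" (an entry of the already-open directory
 * "dir") by asking only the MDS, the way lfs and lpurge do. */
static int mds_stat(DIR *dir, const char *name, lstat_t *st)
{
        /* big enough for the entry name on input and for the returned
         * lov_user_mds_data (stat + striping EA) on output */
        char buf[4096];
        struct lov_user_mds_data *lmd = (struct lov_user_mds_data *)buf;

        if (strlen(name) >= sizeof(buf))
                return -1;
        strcpy(buf, name);                               /* name in ... */
        if (ioctl(dirfd(dir), IOC_MDC_GETFILEINFO, buf) < 0)
                return -1;                               /* ... attributes out */

        *st = lmd->lmd_st;   /* size/blocks here reflect the MDS view only */
        return 0;
}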

Comment by Bruno Faccini (Inactive) [ 26/Feb/13 ]

Thanks for this info Chris, but I never have enough, sorry ... Can you confirm whether lpurge was run with -R (i.e., removing selected files) at the time of the crash ? I cannot verify this myself in the crash-dump since, by default, user-context pages are not dumped.

I think I may have found a path in the code that could allow/explain the dentry_hashtable[] corruption I found in your crash-dump; I will give it a try and expose it to our regression tests.

Comment by Bruno Faccini (Inactive) [ 26/Feb/13 ]

Patch against b2_1 at http://review.whamcloud.com/5533.

Comment by Prakash Surya (Inactive) [ 27/Feb/13 ]

Bruno, lpurge was run with the "-R" option to remove the files.

Comment by Bruno Faccini (Inactive) [ 06/Mar/13 ]

I realize that I forgot to ask, but I presume this has been a single occurrence/crash until now ??

BTW, the way it can happen is still unexplained, and actually I only have a debug patch ready to go, in order to detect unsafe dentry_hashtable[] accesses ...

Comment by Prakash Surya (Inactive) [ 06/Mar/13 ]

AFAIK, it's only been hit once. But I've heard vague reports from our admins about crashes when running lpurge before, so there's a chance there have been unreported incidents of this crash.

Comment by Bruno Faccini (Inactive) [ 12/Mar/13 ]

Exposure of the patch to the full test-suite ran into a lot of external conditions (mainly the one tracked in LU-2458) causing some tests to fail randomly. Since no side-effects from the patch were detected, I re-submitted it with no extra test parameters.

Comment by Bruno Faccini (Inactive) [ 29/Apr/13 ]

Prakash, Chris,
Did you try integrating my debug patch into your Lustre builds ?? Also, did you experience any new occurrences of this crash when running lpurge ??

Comment by Sebastien Piechurski [ 23/Jul/14 ]

I am not sure this will be of any help, but I wanted to mention that we had several occurrences of this issue at one of our customer sites.
I have a crash dump in case it can give any hints as to what happened, but I guess all the interesting traces are already gone by the time of the crash.
If you want to take a look anyway, just tell me.

Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ]

Hello Seb,
Very late/night update !
Does this customer also run 2.1.2, or a later version?
Yes, I can have a look at the crash-dump to confirm whether or not it is the same scenario as the original. But if you have already had several occurrences, you may want to integrate my debug patch into your distro; it did not introduce any regressions during intensive testing and will help identify the root cause.

Comment by Sebastien Piechurski [ 24/Jul/14 ]

The customer is running our version based on 2.1.6. Your patch is not part of it, so I will check if the customer still has some occurrences and try to put the patch there if that's the case.

Comment by Andreas Dilger [ 03/Apr/16 ]

Have not had any feedback on this 2.1 bug in 2 years. Closing.
