Lustre / LU-2704

GPF in __d_lookup called from ll_statahead_one

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.2
    • 3
    • 6298

    Description

      We recently had a report of a general protection fault (GPF) on a client node running a purge workload.

      The console message:

      general protection fault: 0000 [#1] SMP 
      last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
      CPU 0 
      Modules linked in: cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput ahci ib_qib(U) ib_mad ib_core dcdbas microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp xt_owner ipt_LOG xt_multiport ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]
      
      Pid: 3895, comm: ll_sa_25915 Tainted: G        W  ----------------   2.6.32-220.23.1.1chaos.ch5.x86_64 #1 Dell       XS23-TY35       /0GW08P
      RIP: 0010:[<ffffffff8118fc8c>]  [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
      RSP: 0018:ffff88049ad3dcc0  EFLAGS: 00010202
      RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
      RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
      RBP: ffff88049ad3dd10 R08: ffff880589fdad30 R09: 00000000ffffffff
      R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
      R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
      FS:  00002aaaab05db20(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fffffffc010 CR3: 0000000297c13000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
      Process ll_sa_25915 (pid: 3895, threadinfo ffff88049ad3c000, task ffff8804f49a0aa0)
      Stack:
       ffff8802fc86b3b8 0000000f00000246 000000000000000f ffff88049ad3ddc0
      <0> ffff880028215fc0 0000000002170c3c ffff88049ad3ddc0 ffff880421557300
      <0> ffff880421557300 ffff8803a22822b8 ffff88049ad3dd40 ffffffff811908fc
      Call Trace:
       [<ffffffff811908fc>] d_lookup+0x3c/0x60
       [<ffffffffa09b672c>] ll_statahead_one+0x1ec/0x14a0 [lustre]
       [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
       [<ffffffff8109144c>] ? remove_wait_queue+0x3c/0x50
       [<ffffffffa09b7c98>] ll_statahead_thread+0x2b8/0x890 [lustre]
       [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
       [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
       [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 48 03 05 88 4b a7 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89 
      RIP  [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
       RSP <ffff88049ad3dcc0>
      

      The stack reported by crash:

      crash> bt
      PID: 3895   TASK: ffff8804f49a0aa0  CPU: 0   COMMAND: "ll_sa_25915"
       #0 [ffff88049ad3da50] machine_kexec at ffffffff8103216b
       #1 [ffff88049ad3dab0] crash_kexec at ffffffff810b8d12
       #2 [ffff88049ad3db80] oops_end at ffffffff814f2b00
       #3 [ffff88049ad3dbb0] die at ffffffff8100f26b
       #4 [ffff88049ad3dbe0] do_general_protection at ffffffff814f2692
       #5 [ffff88049ad3dc10] general_protection at ffffffff814f1e65
          [exception RIP: __d_lookup+140]
          RIP: ffffffff8118fc8c  RSP: ffff88049ad3dcc0  RFLAGS: 00010202
          RAX: 000000000000000f  RBX: 2e342036343a3732  RCX: 0000000000000016
          RDX: 018721e08df08940  RSI: ffff88049ad3ddc0  RDI: ffff880421557300
          RBP: ffff88049ad3dd10   R8: ffff880589fdad30   R9: 00000000ffffffff
          R10: 0000000000000000  R11: 0000000000000000  R12: 2e342036343a371a
          R13: ffff880421557300  R14: 000000009e75e374  R15: ffff8803a22822b8
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #6 [ffff88049ad3dd18] d_lookup at ffffffff811908fc
       #7 [ffff88049ad3dd48] ll_statahead_one at ffffffffa09b672c [lustre]
       #8 [ffff88049ad3de18] ll_statahead_thread at ffffffffa09b7c98 [lustre]
       #9 [ffff88049ad3df48] kernel_thread at ffffffff8100c14a
      

      And source level information from GDB:

      (gdb) l *__d_lookup+140
      0xffffffff8118fc8c is in __d_lookup (fs/dcache.c:1409).
      1404            rcu_read_lock();
      1405
      1406            hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
      1407                    struct qstr *qstr;
      1408
      1409                    if (dentry->d_name.hash != hash)
      1410                            continue;
      1411                    if (dentry->d_parent != parent)
      1412                            continue;
      1413
      

      Here's the assembly of __d_lookup around the RIP of the failure:

      crash> x/80i __d_lookup
      ...
         0xffffffff8118fc86 <__d_lookup+134>: je     0xffffffff8118fced <__d_lookup+237>
         0xffffffff8118fc88 <__d_lookup+136>: lea    -0x18(%rbx),%r12
         0xffffffff8118fc8c <__d_lookup+140>: cmp    %r14d,0x30(%r12)
         0xffffffff8118fc91 <__d_lookup+145>: jne    0xffffffff8118fc80 <__d_lookup+128>
         0xffffffff8118fc93 <__d_lookup+147>: cmp    %r13,0x28(%r12)
      ...
      

      The crash backtrace above shows the register contents at the time of the failure. %r14 appears to hold the hash of the qstr we are looking up (i.e. hash in the source code). I tend to trust this because I believe I found that qstr structure on the stack, at the address held in %rsi:

      crash> p -x *(struct qstr *)0xffff88049ad3ddc0
      $30 = {
        hash = 0x9e75e374, 
        len = 0xf, 
        name = 0xffff8802fc86b3b8 "MultiFab_D_07813"
      }
      

      The value in %r14 matches the hash of the dumped qstr. I'm not sure what the value in %r12 represents, though, and my assembly is a bit rusty, so I'm not entirely sure how to decode the cmp instruction we fail on. My best guess is that we are dereferencing %r12, which does not hold a valid pointer.
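
      The register arithmetic can be checked with a short script. This is only a sketch built from the values in the dumps above. It assumes that the 0x18 in the lea instruction is the offset of d_hash within struct dentry, and that 0x30(%r12) is therefore dentry->d_name.hash, as the source listing suggests; neither offset was verified against this kernel's struct layout. It also assumes 48-bit virtual addressing, where a pointer is canonical only if bits 63..47 are all equal. The helper names are hypothetical.

```python
import struct

# Values copied from the register dump above.
RBX = 0x2E342036343A3732  # list node pointer loaded from the hash chain
R12 = 0x2E342036343A371A  # base register of the faulting cmp
D_HASH_OFFSET = 0x18      # assumed offset of d_hash, from "lea -0x18(%rbx),%r12"


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all zeros or all ones (48-bit x86-64)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def as_ascii(value: int) -> str:
    """Render a 64-bit value as it would sit in memory (little-endian)."""
    raw = struct.pack("<Q", value)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# %r12 is the candidate dentry: the list node pointer minus the d_hash offset.
print(hex(RBX - D_HASH_OFFSET))          # 0x2e342036343a371a, equal to R12
print(is_canonical(R12))                 # False
print(is_canonical(0xFFFF880421557300))  # True (the parent dentry in %r13)
print(repr(as_ascii(RBX)))               # '27:46 4.'
```

      If those assumptions hold, %r12 is the dentry pointer that hlist_for_each_entry_rcu derived from the next hash-chain node, and the cmp faults because that address is non-canonical, which raises a general protection fault rather than a page fault. The bytes of %rbx also happen to be printable ASCII, which might mean the chain pointer was overwritten with text, but that is an interpretation of the dump, not a confirmed diagnosis.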

            People

              bfaccini Bruno Faccini (Inactive)
              prakash Prakash Surya (Inactive)