LU-7973

Lustre client crash in __d_lookup() - BUG: unable to handle kernel paging request


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0

    Description

      There have been several occurrences in the field (at sites running various 2.1/2.5-based Lustre versions) of this kind of crash, with signatures and stacks such as the following:

      <1>BUG: unable to handle kernel paging request at ffffffff00000018
      <1>IP: [<ffffffff811ad11c>] __d_lookup+0x8c/0x150
      <4>PGD 1a8f067 PUD 0 
      <4>Oops: 0000 [#1] SMP 
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 0 
      <4>Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ko2iblnd(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) libcfs(U) nfs fscache iptable_filter ip_tables nfsd nfs_acl auth_rpcgss exportfs autofs4 sha512_generic crc32c_intel lockd sunrpc bonding ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm sg ipmi_devintf joydev microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2x libcrc32c mdio dcdbas sb_edac edac_core lpc_ich mfd_core shpchp ext4 jbd2 mbcache mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core sd_mod crc_t10dif ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      <4>
      <4>Pid: 25256, comm: rsync Not tainted 2.6.32-573.1.1.el6.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
      <4>RIP: 0010:[<ffffffff811ad11c>]  [<ffffffff811ad11c>] __d_lookup+0x8c/0x150
      <4>RSP: 0018:ffff881b8cd4fb98  EFLAGS: 00010286
      <4>RAX: 0000000000000010 RBX: ffffffff00000000 RCX: 0000000000000018
      <4>RDX: 018721e0b8549035 RSI: ffff881b8cd4fcd8 RDI: ffff880b4e78c300
      <4>RBP: ffff881b8cd4fbe8 R08: 0000000000000001 R09: 0000000000000000
      <4>R10: 0000000000000001 R11: 0000000000000001 R12: fffffffeffffffe8
      <4>R13: ffff880b4e78c300 R14: 00000000e58e7d29 R15: ffff881ed58a1520
      <4>FS:  00007f09288c4700(0000) GS:ffff880062400000(0000) knlGS:0000000000000000
      <4>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>CR2: ffffffff00000018 CR3: 0000001d76c3a000 CR4: 00000000001407f0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process rsync (pid: 25256, threadinfo ffff881b8cd4c000, task ffff881ed58a1520)
      <4>Stack:
      <4> ffff880c2a16d046 000000108119f01c 0000000000000010 ffff881b8cd4fcd8
      <4><d> 0000000000000001 ffff881b8cd4fdb8 ffff881b8cd4fce8 ffff881b8cd4fcd8
      <4><d> ffff8820662e5a80 ffff881ed58a1520 ffff881b8cd4fc48 ffffffff811a16f6
      <4>Call Trace:
      <4> [<ffffffff811a16f6>] do_lookup+0x36/0x230
      <4> [<ffffffffa0d31cd2>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
      <4> [<ffffffff811a24f4>] __link_path_walk+0x7a4/0x1000
      <4> [<ffffffffa11b7cf2>] ? osc_find_cbdata+0xa2/0x150 [osc]
      <4> [<ffffffff811a300a>] path_walk+0x6a/0xe0
      <4> [<ffffffff811a321b>] filename_lookup+0x6b/0xc0
      <4> [<ffffffff811a4347>] user_path_at+0x57/0xa0
      <4> [<ffffffff810f326e>] ? call_rcu+0xe/0x10
      <4> [<ffffffff811ab90f>] ? d_free+0x3f/0x60
      <4> [<ffffffff811b43d0>] ? mntput_no_expire+0x30/0x110
      <4> [<ffffffff811a0331>] ? path_put+0x31/0x40
      <4> [<ffffffff81197750>] vfs_fstatat+0x50/0xa0
      <4> [<ffffffff8119780e>] vfs_lstat+0x1e/0x20
      <4> [<ffffffff81197834>] sys_newlstat+0x24/0x50
      <4> [<ffffffff810e8ab7>] ? audit_syscall_entry+0x1d7/0x200
      <4> [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
      <4> [<ffffffff8153be5e>] ? do_device_not_available+0xe/0x10
      <4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      <4>Code: 48 03 05 d8 ea a6 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89 
      <1>RIP  [<ffffffff811ad11c>] __d_lookup+0x8c/0x150
      <4> RSP <ffff881b8cd4fb98>
      <4>CR2: ffffffff00000018
      

      or

      <1>BUG: unable to handle kernel paging request at ffffffff00000008
      <1>IP: [<ffffffffa13821d5>] ll_md_blocking_ast+0x615/0x7d0 [lustre]
      <4>PGD 1a87067 PUD 0
      <4>Oops: 0002 [#1] SMP
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 21
      <4>Modules linked in: nfs fscache lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ptlrpc(U) obdclass(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) compat(U) microcode iTCO_wdt iTCO_vendor_support power_meter sg nvidia(P)(U) i2c_i801 lpc_ich mfd_core shpchp igb dca i2c_algo_bit i2c_core ptp pps_core be2net ext4 jbd2 mbcache sd_mod crc_t10dif megaraid_sas xhci_hcd ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      <4>
      <4>Pid: 2309, comm: ll_imp_inval Tainted: P --------------- 2.6.32-431.el6.x86_64 #1 Supermicro X10DRi/X10DRi-T
      <4>RIP: 0010:[<ffffffffa13821d5>] [<ffffffffa13821d5>] ll_md_blocking_ast+0x615/0x7d0 [lustre]
      <4>RSP: 0018:ffff887e10b8db40 EFLAGS: 00010286
      <4>RAX: ffff88401d60a8c8 RBX: ffff883e6e1a8d40 RCX: ffffc90002ad37f8
      <4>RDX: ffffffff00000000 RSI: 0000000000000000 RDI: ffff88401d60a8c8
      <4>RBP: ffff887e10b8dbe0 R08: 0000000000000003 R09: 000000000000001b
      <4>R10: 0000000000015dfb R11: 0000000000000000 R12: ffff88401d60a8c0
      <4>R13: ffff887ce475f638 R14: ffff887d1d695ea0 R15: ffff887d1d695e40
      <4>FS: 0000000000000000(0000) GS:ffff880190ec0000(0000) knlGS:0000000000000000
      <4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: ffffffff00000008 CR3: 0000004023d54000 CR4: 00000000001407e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process ll_imp_inval (pid: 2309, threadinfo ffff887e10b8c000, task ffff887eaf803500)
      <4>Stack:
      <4> 1700fade1700fade ffff883cdfae4c4b ffffffffa10cdf50 ffff88401d60a8c8
      <4><d> ffff887ce475f668 ffff887d1d695e48 ffffc9004abf71f0 ffff887b9c92d9c0
      <4><d> ffff887e10b8dba0 ffffffffa0e5a23c ffff887b9c92d9c0 0000000000000013
      <4>Call Trace:
      <4> [<ffffffffa0e5a23c>] ? class_handle_unhash+0x3c/0x50 [obdclass]
      <4> [<ffffffffa103903c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
      <4> [<ffffffffa104859a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      <4> [<ffffffffa104d030>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
      <4> [<ffffffffa1041e3d>] cleanup_resource+0x18d/0x310 [ptlrpc]
      <4> [<ffffffffa0d07ade>] ? cfs_hash_spin_lock+0xe/0x10 [libcfs]
      <4> [<ffffffffa1041fef>] ldlm_resource_clean+0x2f/0x60 [ptlrpc]
      <4> [<ffffffffa0d07d5c>] cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
      <4> [<ffffffffa1041fc0>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
      <4> [<ffffffffa1041fc0>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
      <4> [<ffffffffa0d096ef>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
      <4> [<ffffffffa103ed7e>] ldlm_namespace_cleanup+0x2e/0xc0 [ptlrpc]
      <4> [<ffffffffa11e7cc9>] mdc_import_event+0x1e9/0xa30 [mdc]
      <4> [<ffffffffa108b27c>] ptlrpc_invalidate_import+0x2bc/0x8f0 [ptlrpc]
      <4> [<ffffffffa0d019f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      <4> [<ffffffffa108e550>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlrpc]
      <4> [<ffffffffa108e598>] ptlrpc_invalidate_import_thread+0x48/0x2e0 [ptlrpc]
      <4> [<ffffffff8109aef6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>Code: 8b 14 24 85 d2 75 37 41 8b 54 24 04 f6 c2 10 75 2d 83 ca 10 49 8b 4c 24 20 41 89 54 24 04 49 8b 54 24 18 48 85 d2 48 89 11 74 04 <48> 89 4a 08 48 ba 00 02 20 00 00 00 ad de 49 89 54 24 20 66 ff
      <1>RIP [<ffffffffa13821d5>] ll_md_blocking_ast+0x615/0x7d0 [lustre]
      <4> RSP <ffff887e10b8db40>
      <4>CR2: ffffffff00000008
      

      or

      <1>BUG: unable to handle kernel paging request at ffffffff00000018
      <1>IP: [<ffffffff811a375c>] __d_lookup+0x8c/0x150
      <4>PGD 1a87067 PUD 0
      <4>Oops: 0000 [#1] SMP
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 10
      <4>Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ptlrpc(U) obdclass(U) ko2iblnd(U) lnet(U) libcfs(U) bridge stp llc nfs fscache sha512_generic sha256_generic crc32c_intel nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) compat(U) microcode iTCO_wdt iTCO_vendor_support power_meter nvidia(P)(U) i2c_i801 sg lpc_ich mfd_core shpchp igb dca i2c_algo_bit i2c_core ptp pps_core be2net ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom megaraid_sas xhci_hcd ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      <4>
      <4>Pid: 33054, comm: tar Tainted: P --------------- 2.6.32-431.el6.x86_64 #1 Supermicro SYS-7048R-TR/X10DRi
      <4>RIP: 0010:[<ffffffff811a375c>] [<ffffffff811a375c>] __d_lookup+0x8c/0x150
      <4>RSP: 0018:ffff880ba631fbd8 EFLAGS: 00010286
      <4>RAX: 0000000000000003 RBX: ffffffff00000000 RCX: 000000000000001a
      <4>RDX: 018721de7d980c23 RSI: ffff880ba631fd18 RDI: ffff8860a3ef4e40
      <4>RBP: ffff880ba631fc28 R08: 0000000000000000 R09: 0000000000000000
      <4>R10: 0000000000000001 R11: 0000000000000001 R12: fffffffeffffffe8
      <4>R13: ffff8860a3ef4e40 R14: 000000000027beea R15: ffff8838a4086080
      <4>FS: 00007fbfb40467a0(0000) GS:ffff884161400000(0000) knlGS:0000000000000000
      <4>CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>CR2: ffffffff00000018 CR3: 00000015850f1000 CR4: 00000000001407e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process tar (pid: 33054, threadinfo ffff880ba631e000, task ffff8838a4086080)
      <4>Stack:
      <4> ffff8878808c004d 0000000381196213 0000000000000003 ffff880ba631fd18
      <4><d> 0000000000000001 ffff880ba631fe08 ffff880ba631fd28 ffff880ba631fd18
      <4><d> ffff888024d055c0 ffff8838a4086080 ffff880ba631fc88 ffffffff811988c6
      <4>Call Trace:
      <4> [<ffffffff811988c6>] do_lookup+0x36/0x230
      <4> [<ffffffffa0f08462>] ? ldlm_res_hop_get_locked+0x12/0x20 [ptlrpc]
      <4> [<ffffffff81198dc0>] __link_path_walk+0x200/0xff0
      <4> [<ffffffffa0f09766>] ? ldlm_resource_putref+0x66/0x280 [ptlrpc]
      <4> [<ffffffff81199e6a>] path_walk+0x6a/0xe0
      <4> [<ffffffff8119b64a>] do_filp_open+0x1fa/0xd20
      <4> [<ffffffff810ec785>] ? call_rcu_sched+0x15/0x20
      <4> [<ffffffff810ec79e>] ? call_rcu+0xe/0x10
      <4> [<ffffffff81282705>] ? _atomic_dec_and_lock+0x55/0x80
      <4> [<ffffffff811aaa20>] ? mntput_no_expire+0x30/0x110
      <4> [<ffffffff811a8212>] ? alloc_fd+0x92/0x160
      <4> [<ffffffff81185d29>] do_sys_open+0x69/0x140
      <4> [<ffffffff8100c715>] ? math_state_restore+0x45/0x60
      <4> [<ffffffff81185e40>] sys_open+0x20/0x30
      <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>Code: 48 03 05 f8 6c a6 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89
      <1>RIP [<ffffffff811a375c>] __d_lookup+0x8c/0x150
      <4> RSP <ffff880ba631fbd8>
      <4>CR2: ffffffff00000018
      

      or

      general protection fault: 0000 [#1] SMP 
      last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
      CPU 0 
      Modules linked in: cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput ahci ib_qib(U) ib_mad ib_core dcdbas microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp xt_owner ipt_LOG xt_multiport ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]
      
      Pid: 3895, comm: ll_sa_25915 Tainted: G        W  ----------------   2.6.32-220.23.1.1chaos.ch5.x86_64 #1 Dell       XS23-TY35       /0GW08P
      RIP: 0010:[<ffffffff8118fc8c>]  [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
      RSP: 0018:ffff88049ad3dcc0  EFLAGS: 00010202
      RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
      RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
      RBP: ffff88049ad3dd10 R08: ffff880589fdad30 R09: 00000000ffffffff
      R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
      R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
      FS:  00002aaaab05db20(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fffffffc010 CR3: 0000000297c13000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
      Process ll_sa_25915 (pid: 3895, threadinfo ffff88049ad3c000, task ffff8804f49a0aa0)
      Stack:
       ffff8802fc86b3b8 0000000f00000246 000000000000000f ffff88049ad3ddc0
      <0> ffff880028215fc0 0000000002170c3c ffff88049ad3ddc0 ffff880421557300
      <0> ffff880421557300 ffff8803a22822b8 ffff88049ad3dd40 ffffffff811908fc
      Call Trace:
       [<ffffffff811908fc>] d_lookup+0x3c/0x60
       [<ffffffffa09b672c>] ll_statahead_one+0x1ec/0x14a0 [lustre]
       [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
       [<ffffffff8109144c>] ? remove_wait_queue+0x3c/0x50
       [<ffffffffa09b7c98>] ll_statahead_thread+0x2b8/0x890 [lustre]
       [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
       [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
       [<ffffffffa09b79e0>] ? ll_statahead_thread+0x0/0x890 [lustre]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 48 03 05 88 4b a7 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89 
      RIP  [<ffffffff8118fc8c>] __d_lookup+0x8c/0x150
       RSP <ffff88049ad3dcc0>
      

      Analysis of the crash dumps shows the same problem in every case: a corrupted dentry->d_hash.next pointer.
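
      As a rough cross-check of that conclusion, the register values in the dumps above can be tied to the faulting addresses with a little arithmetic. The sketch below is only an illustration: the offsets are read from the "Code:" bytes in the dumps (lea -0x18(%rbx),%r12 and cmp %r14d,0x30(%r12) in __d_lookup, mov %rcx,0x8(%rdx) in ll_md_blocking_ast), not from kernel headers, and the helper names are hypothetical. Reading RBX and RDX as the corrupted hash-chain "next" pointer is an interpretation of the dumps, not something the kernel reports.

```python
import struct

# Offsets taken from the "Code:" bytes in the oops dumps (an assumption of
# this sketch, not values copied from kernel headers).
D_HASH_OFFSET = 0x18      # lea -0x18(%rbx),%r12 : hash node -> dentry
HASH_FIELD_OFFSET = 0x30  # cmp %r14d,0x30(%r12) : field compared per entry
PPREV_OFFSET = 0x08       # mov %rcx,0x8(%rdx)   : store into next->pprev
MASK64 = (1 << 64) - 1


def lookup_access(node_ptr):
    """Return (dentry, address read) for one chain step in __d_lookup."""
    dentry = (node_ptr - D_HASH_OFFSET) & MASK64
    return dentry, (dentry + HASH_FIELD_OFFSET) & MASK64


def unlink_access(next_ptr):
    """Return the address written when a node is unlinked from its chain."""
    return (next_ptr + PPREV_OFFSET) & MASK64


def is_canonical(addr):
    """x86_64 canonical form: bits 63..47 are all zero or all one."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def as_ascii(value):
    """Show a 64-bit value as the 8 bytes it occupies in memory (little-endian)."""
    return struct.pack("<Q", value).decode("ascii", errors="replace")


# rsync and tar dumps: RBX = ffffffff00000000
dentry, fault = lookup_access(0xFFFFFFFF00000000)
print(hex(dentry))   # 0xfffffffeffffffe8, the R12 value in both dumps
print(hex(fault))    # 0xffffffff00000018, the CR2 value in both dumps

# ll_imp_inval dump: RDX = ffffffff00000000
print(hex(unlink_access(0xFFFFFFFF00000000)))  # 0xffffffff00000008 (CR2)

# ll_sa dump: RBX = 2e342036343a3732, a non-canonical address, which is
# consistent with a general protection fault instead of a paging request.
print(is_canonical(0x2E342036343A3732))  # False
print(as_ascii(0x2E342036343A3732))      # 27:46 4.
```

      In the first three dumps the bad pointer is ffffffff00000000, which is canonical but unmapped, hence the paging-request fault. In the ll_sa dump the pointer bytes are printable ASCII, which suggests that text was written over the pointer.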

            People

              bfaccini Bruno Faccini (Inactive)