LU-6596: GPF: RIP [<ffffffffa076924b>] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.4
    • Labels: None
    • Environment: kernel 2.6.32-504.12.2.el6, lustre-2.5.3.90 with some Bull patches on clients and servers
    • Severity: 3

    Description

      Since updating from 2.5.3 to 2.5.3.90, one of our customers has been hitting the following GPF on client nodes while the MDT is in recovery.

      <4>general protection fault: 0000 [#1] SMP
      <4>last sysfs file: /sys/module/ipv6/initstate
      <4>CPU 21
      <4>Modules linked in: iptable_mangle iptable_filter lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) mic(U) uinput ipmi_si ipmi_msghandler sg compat(U) lpc_ich mfd_core ioatdma myri10ge igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache ahci sd_mod crc_t10dif dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
      <4>
      <4>Pid: 10457, comm: ptlrpcd_rcv Not tainted 2.6.32-504.12.2.el6.Bull.72.x86_64 #1 BULL bullx super-node
      <4>RIP: 0010:[<ffffffffa076924b>] [<ffffffffa076924b>] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]
      <4>RSP: 0018:ffff88086d375bb0 EFLAGS: 00010296
      <4>RAX: 5a5a5a5a5a5a5a5a RBX: ffff88107ad2d800 RCX: ffff8806e2f65d10
      <4>RDX: ffff88107ad2d8b0 RSI: ffff88086d375c1c RDI: ffff88107ad2d800
      <4>RBP: ffff88086d375be0 R08: 0000000000000000 R09: 0000000000000000
      <4>R10: ffff88087ca2fe50 R11: 0000000000000000 R12: 0000000000000000
      <4>R13: ffff88107ad2da90 R14: ffff88086d375c1c R15: ffff881e7f308000
      <4>FS: 0000000000000000(0000) GS:ffff88089c540000(0000) knlGS:0000000000000000
      <4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: 00002ba018b2c000 CR3: 0000000001a85000 CR4: 00000000000007e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process ptlrpcd_rcv (pid: 10457, threadinfo ffff88086d374000, task ffff88087b2b6040)
      <4>Stack:
      <4> 0000000000000000 ffff88107ad2d800 ffff880917d81800 ffff88107ad2da90
      <4><d> 0000000000000000 ffff880d22697cc0 ffff88086d375c40 ffffffffa078e570
      <4><d> 0000000000000000 ffff88107b316078 0000000000000000 ffff880d22697cc0
      <4>Call Trace:
      <4> [<ffffffffa078e570>] ptlrpc_import_recovery_state_machine+0x360/0xc30 [ptlrpc]
      <4> [<ffffffffa078fc69>] ptlrpc_connect_interpret+0x779/0x21d0 [ptlrpc]
      <4> [<ffffffffa0784d6b>] ? ptlrpc_pinger_commit_expected+0x1b/0x90 [ptlrpc]
      <4> [<ffffffffa076605d>] ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
      <4> [<ffffffff81087fdb>] ? try_to_del_timer_sync+0x7b/0xe0
      <4> [<ffffffffa0792613>] ptlrpcd_check+0x533/0x550 [ptlrpc]
      <4> [<ffffffffa0792b2b>] ptlrpcd+0x20b/0x370 [ptlrpc]
      <4> [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0792920>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>Code: c0 00 00 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 75 18 eb 23 0f 1f 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 0f 84 8c 00 00 00 <4c> 3b 60 f0 4c 8d b8 f0 fe ff ff 73 e0 4d 85 ff 74 7a f6 83 95
      <1>RIP [<ffffffffa076924b>] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]
      <4> RSP <ffff88086d375bb0>
      

      Our MDS is an active/passive HA cluster. This GPF can occur during failover or failback of the MDT.

      The crash occurred on 05/04, 05/07, and 05/12. During the last occurrence, we lost 200 compute nodes and 2 login nodes.

      The stack trace looks similar to the one in LU-6022.
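
      For what it's worth, RAX holding 0x5a5a5a5a5a5a5a5a suggests a use-after-free: with memory poisoning enabled, OBD_FREE() fills freed buffers with 0x5a, and that pattern is a non-canonical address on x86-64, which would explain a general protection fault rather than a page fault. The snippet below is only a minimal user-space sketch of that failure mode, with made-up struct and field names rather than the real ptlrpc structures, to illustrate how a stale replay cursor could end up loading a poisoned "next" pointer:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical stand-ins, NOT the real ptlrpc request structures. */
      struct fake_req {
          struct fake_req *next;     /* list linkage */
          long             transno;  /* transaction number */
      };

      int main(void)
      {
          struct fake_req *first  = calloc(1, sizeof(*first));
          struct fake_req *second = calloc(1, sizeof(*second));

          first->next = second;

          /* A stale cursor (think of something like the import's replay
           * cursor) still references 'second'. */
          struct fake_req *cursor = second;

          /* Emulate the POISON(ptr, 0x5a, size) that OBD_FREE() applies;
           * the actual free is skipped so the sketch stays well-defined
           * in user space. */
          memset(second, 0x5a, sizeof(*second));

          /* The cursor now yields a poisoned "next" pointer, i.e. the
           * 0x5a5a5a5a5a5a5a5a seen in RAX; in the kernel, dereferencing
           * that non-canonical address raises a GPF. */
          printf("stale next pointer: %p\n", (void *)cursor->next);

          free(second);
          free(first);
          return 0;
      }

      This is just to illustrate the symptom; we do not yet know which request gets freed while still referenced.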

      Could you help us on this one?

            People

              Assignee: Bruno Faccini (bfaccini) (Inactive)
              Reporter: Bruno Travouillon (bruno.travouillon) (Inactive)
              Votes: 0
              Watchers: 12
