Lustre / LU-8459

Need LU-6596 fix in Lustre 2.5


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.5.3
    • Severity: 3

    Description

      We hit this crash on a Lustre 2.5 client. Its stack trace matches the one reported in LU-6596.

      2016-07-29 14:07:39 [442101.409121] LustreError: 11-0: lsd-MDT0000-mdc-ffff88200c40b800: Communicating with 172.19.2.102@o2ib100, operation ldlm_enqueue failed with -19.
      2016-07-29 14:07:39 [442101.423805] Lustre: lsd-MDT0000-mdc-ffff88200c40b800: Connection to lsd-MDT0000 (at 172.19.2.102@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
      2016-07-29 14:07:39 [442101.443039] Lustre: Skipped 12 previous similar messages
      2016-07-29 14:08:36 [442159.149684] LustreError: 166-1: MGC172.19.2.102@o2ib100: Connection to MGS (at 172.19.2.102@o2ib100) was lost; in progress operations using this service will fail
      2016-07-29 14:08:36 [442159.166010] LustreError: Skipped 1 previous similar message
      2016-07-29 14:10:42 [442284.831307] Lustre: Evicted from MGS (at 172.19.2.102@o2ib100) after server handle changed from 0xd01d02d77b80e403 to 0xd01d02d8bb9f7cc6
      2016-07-29 14:10:42 [442284.845115] Lustre: Skipped 1 previous similar message
      2016-07-29 14:10:42 [442284.851163] Lustre: MGC172.19.2.102@o2ib100: Connection restored to MGS (at 172.19.2.102@o2ib100)
      2016-07-29 14:10:43 [442285.410347] general protection fault: 0000 [#1] SMP
      2016-07-29 14:10:43 [442285.416010] Modules linked in: xfs libcrc32c lmv(OE) fld(OE) mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) fid(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack nf_log_ipv4 nf_log_common xt_LOG xt_multiport iptable_filter nfsv3 nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm ib_sa ocrdma iw_cxgb4 iw_cm iw_cxgb3 intel_powerclamp coretemp hfi1(OE) intel_rapl iTCO_wdt ib_mad ib_core iTCO_vendor_support ipmi_devintf kvm ib_addr sg mei_me sb_edac pcspkr mei lpc_ich shpchp i2c_i801 edac_core mfd_core ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 ghash_clmulni_intel syscopyarea sysfillrect sysimgblt drm_kms_helper aesni_intel be2net lrw ttm igb gf128mul glue_helper vxlan ahci dca ablk_helper libahci ip6_udp_tunnel ptp drm cryptd udp_tunnel libata pps_core i2c_algo_bit i2c_core wmi sunrpc dm_mirror dm_region_hash dm_log iscsi_tcp be2iscsi bnx2i cnic uio cxgb4 cxgb3 mdio libcxgbi libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate dm_multipath dm_mod
      2016-07-29 14:10:43 [442285.556322] CPU: 12 PID: 9331 Comm: ptlrpcd_rcv Tainted: P           OE  ------------   3.10.0-327.22.2.1chaos.ch6.x86_64 #1
      2016-07-29 14:10:43 [442285.568946] Hardware name: Penguin Computing Relion 2900e/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      2016-07-29 14:10:43 [442285.581180] task: ffff88101b35c500 ti: ffff880ff8cd0000 task.ti: ffff880ff8cd0000
      2016-07-29 14:10:43 [442285.589631] RIP: 0010:[<ffffffffa0f1ed13>]  [<ffffffffa0f1ed13>] ptlrpc_replay_next+0xd3/0x390 [ptlrpc]
      2016-07-29 14:10:43 [442285.600278] RSP: 0018:ffff880ff8cd3bd0  EFLAGS: 00010296
      2016-07-29 14:10:43 [442285.606305] RAX: 5a5a5a5a5a5a5a5a RBX: ffff88200c75f000 RCX: 0000004b0947e4a5
      2016-07-29 14:10:43 [442285.614370] RDX: ffff88200c75f0b0 RSI: ffff880e36ef5110 RDI: ffff88200c75f298
      2016-07-29 14:10:43 [442285.622431] RBP: ffff880ff8cd3bf8 R08: 0000000000000004 R09: 0000000000000028
      2016-07-29 14:10:43 [442285.630493] R10: ffff88200c75f080 R11: 0000000000000096 R12: ffff88200c75f298
      2016-07-29 14:10:43 [442285.638554] R13: ffff880ff8cd3c10 R14: 0000000000000000 R15: ffff880e36ef4d10
      2016-07-29 14:10:43 [442285.646618] FS:  0000000000000000(0000) GS:ffff88103f380000(0000) knlGS:0000000000000000
      2016-07-29 14:10:43 [442285.655749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      2016-07-29 14:10:43 [442285.662260] CR2: 00005555557efcd0 CR3: 0000000001962000 CR4: 00000000003407e0
      2016-07-29 14:10:43 [442285.670323] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2016-07-29 14:10:43 [442285.678385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      2016-07-29 14:10:43 [442285.686446] Stack:
      2016-07-29 14:10:43 [442285.688785]  ffff88200c75f000 ffff88200c75f298 0000000000000001 ffff8810272d4800
      2016-07-29 14:10:43 [442285.697180]  ffff880e54f620c0 ffff880ff8cd3c38 ffffffffa0f42ff2 ffff880e54f62000
      2016-07-29 14:10:43 [442285.705576]  ffff882000000000 0000000000000001 000000008573ebfd ffff88200c75f000
      2016-07-29 14:10:43 [442285.713975] Call Trace:
      2016-07-29 14:10:43 [442285.716825]  [<ffffffffa0f42ff2>] ptlrpc_import_recovery_state_machine+0x1b2/0xc80 [ptlrpc]
      2016-07-29 14:10:43 [442285.726261]  [<ffffffffa0f45c56>] ptlrpc_connect_interpret+0x7c6/0x22b0 [ptlrpc]
      2016-07-29 14:10:43 [442285.734629]  [<ffffffffa0f1b794>] ptlrpc_check_set.part.21+0x2c4/0x2090 [ptlrpc]
      2016-07-29 14:10:43 [442285.742989]  [<ffffffff810907be>] ? try_to_del_timer_sync+0x5e/0x90
      2016-07-29 14:10:43 [442285.750098]  [<ffffffffa0f1d5bb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-07-29 14:10:43 [442285.757402]  [<ffffffffa0f4920b>] ptlrpcd_check+0x4eb/0x730 [ptlrpc]
      2016-07-29 14:10:43 [442285.764606]  [<ffffffffa0f4967f>] ptlrpcd+0x22f/0x3f0 [ptlrpc]
      2016-07-29 14:10:43 [442285.771221]  [<ffffffff810bd4b0>] ? wake_up_state+0x20/0x20
      2016-07-29 14:10:43 [442285.777551]  [<ffffffffa0f49450>] ? ptlrpcd_check+0x730/0x730 [ptlrpc]
      2016-07-29 14:10:43 [442285.784942]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-07-29 14:10:43 [442285.790483]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-07-29 14:10:43 [442285.797874]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-07-29 14:10:43 [442285.803998]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-07-29 14:10:43 [442285.811382] Code: 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 75 1b e9 b0 02 00 00 0f 1f 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 0f 84 9c 00 00 00 <4c> 3b 70 f0 73 e7 4c 8d b8 f0 fe ff ff 4d 85 ff 0f 84 86 00 00
      2016-07-29 14:10:43 [442285.833162] RIP  [<ffffffffa0f1ed13>] ptlrpc_replay_next+0xd3/0x390 [ptlrpc]
      2016-07-29 14:10:43 [442285.841150]  RSP <ffff880ff8cd3bd0>
      2016-07-29 14:10:44 [442286.427802] ---[ end trace 3534deb517e19fd7 ]---
      2016-07-29 14:10:44 [442286.487942] Kernel panic - not syncing: Fatal exception
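One way to check a claim like "matches the stack trace from LU-6596" is to reduce each oops to the ordered list of function names from its RIP line and Call Trace. This discards addresses and offsets, which differ between builds, and skips frames the kernel marks with "?" because those are unreliable stack entries. The Python sketch below is a hypothetical helper, not a Lustre or kernel tool. The names `trace_signature` and `same_crash` and the default comparison depth of 4 frames are illustrative assumptions, and the sketch assumes the log line format shown above.

```python
import re

# One stack frame, e.g. "[<ffffffffa0f1ed13>] ptlrpc_replay_next+0xd3/0x390".
# Group 1 captures the "? " marker the kernel prints for unreliable frames;
# group 2 captures the symbol name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# GCC clone suffixes such as ".part.21" can vary between builds, so strip them.
CLONE_RE = re.compile(r"(\.(part|isra|constprop)\.\d+)+$")


def trace_signature(log_text):
    """Return [faulting function, caller, caller's caller, ...] from an oops."""
    signature = []
    in_trace = False
    for line in log_text.splitlines():
        if "RIP:" in line and not signature:
            match = FRAME_RE.search(line)
            if match:
                signature.append(CLONE_RE.sub("", match.group(2)))
        elif "Call Trace:" in line:
            in_trace = True
        elif in_trace:
            match = FRAME_RE.search(line)
            if match is None:
                break  # the first non-frame line (e.g. "Code: ...") ends the trace
            if match.group(1) is None:  # keep only reliable frames
                signature.append(CLONE_RE.sub("", match.group(2)))
    return signature


def same_crash(log_a, log_b, depth=4):
    """Heuristic: do the top `depth` reliable frames of two oopses agree?"""
    sig_a = trace_signature(log_a)[:depth]
    sig_b = trace_signature(log_b)[:depth]
    return bool(sig_a) and sig_a == sig_b
```

For the log above, the signature starts with ptlrpc_replay_next, ptlrpc_import_recovery_state_machine, ptlrpc_connect_interpret, ptlrpc_check_set. Comparing names rather than offsets is deliberate, since the LU-6596 reporter may have run a different ptlrpc build in which the same instruction sits at a different offset.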
      

            People

              Assignee: Bob Glossman (Inactive)
              Reporter: Ned Bass (Inactive)
              Votes: 0
              Watchers: 5