Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.10.0
    • Lustre 2.10.0
    • Soak test cluster, version=lustre: 2.9.51_45_g3b3eeeb - tip of master
    • 3

    Description

      Failover of lola-11 to lola-10

      2017-01-24 16:44:56,220:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
      

      Immediately prior to the failover mount, lola-10 reports hung processes.

      Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304863/real 1485304863]  req@ffff8804131dd680 x1557446666855008/t0(0) o38->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304880 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Jan 24 16:41:35 lola-10 kernel: LNet: Service thread pid 6159 was inactive for 226.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jan 24 16:41:35 lola-10 kernel: Pid: 6159, comm: mdt00_005
      Jan 24 16:41:35 lola-10 kernel:
      Jan 24 16:41:35 lola-10 kernel: Call Trace:
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff8105e4e3>] ? __wake_up+0x53/0x70
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a33bb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11b4aac>] lod_trans_stop+0x2bc/0x330 [lod]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1259f31>] mdd_trans_stop+0x21/0xc6 [mdd]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1246e63>] mdd_create+0x1373/0x17e0 [mdd]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1105c44>] ? mdt_version_save+0x84/0x1a0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa110fcfc>] mdt_reint_create+0xbdc/0xfe0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11013bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09bb2bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1102a2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1104f5d>] mdt_reint_rec+0x5d/0x200 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10efd9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10f066b>] mdt_reint+0x6b/0x120 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a1f17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09ca66b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa067c84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09c56e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09cba11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09caf50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Jan 24 16:41:35 lola-10 kernel:
      Jan 24 16:41:35 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1485304895.6159
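The epoch values embedded in the messages above (the `sent` field of the expired request and the suffix of the dump file name) can be lined up with the syslog timestamps. A minimal sketch, assuming the node logs in UTC-8; syslog prints no zone, so the offset is an assumption chosen because it makes the numbers agree:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the soak cluster logs local time at UTC-8 (not stated in the ticket).
LOCAL_TZ = timezone(timedelta(hours=-8))


def epoch_to_local(epoch_seconds: int) -> str:
    """Render a Unix epoch as a wall-clock string in the assumed local zone."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=LOCAL_TZ)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


# "sent 1485304863" from the expired o38 request
sent = epoch_to_local(1485304863)
# suffix of /tmp/lustre-log.1485304895.6159
dump = epoch_to_local(1485304895)

print(sent)  # 2017-01-24 16:41:03
print(dump)  # 2017-01-24 16:41:35
```

Under that assumption both values match the syslog lines they appear in, which makes it easier to locate the same moment in the attached debug log.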
      

      Mount occurs

      Jan 24 16:46:22 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      ...
      Jan 24 16:46:27 lola-10 kernel: Lustre: soaked-MDT0003: Will be in recovery for at least 2:30, or until 22 clients reconnect
      ...
      Jan 24 16:47:44 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Jan 24 16:47:44 lola-10 kernel:  req@ffff88041171b980 x1557448608035648/t21476913294(0) o36->57c15fb3-4c15-952f-3686-35f4c9caa941@192.168.1.126@o2ib100:-1/-1 lens 632/568 e 19 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Jan 24 16:47:47 lola-10 kernel:  req@ffff880737e13cc0 x1557448490734944/t21476914413(0) o36->1e7aaa7d-f417-ebd3-eff8-7787d5dac449@192.168.1.123@o2ib100:-1/-1 lens 616/3128 e 19 to 0 dl 1485305272 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
      Jan 24 16:47:50 lola-10 kernel: Lustre: soaked-MDT0002: Client ffcd9b78-a534-1d1b-1494-f25c5b00edf1 (at 192.168.1.131@o2ib100) reconnecting
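The `dl` field of the two requests above can be compared with the time the warning was logged. A rough sketch, again assuming the node logs in UTC-8 (the helper name and the year argument are illustrative, not taken from Lustre):

```python
from datetime import datetime, timedelta, timezone

# Assumption: syslog timestamps are local time at UTC-8.
LOCAL_TZ = timezone(timedelta(hours=-8))


def seconds_to_deadline(syslog_stamp: str, deadline_epoch: int, year: int = 2017) -> int:
    """Seconds between a syslog timestamp and a request deadline (dl) epoch."""
    logged = datetime.strptime(f"{year} {syslog_stamp}", "%Y %b %d %H:%M:%S")
    logged = logged.replace(tzinfo=LOCAL_TZ)
    return deadline_epoch - int(logged.timestamp())


slack_first = seconds_to_deadline("Jan 24 16:47:44", 1485305269)
slack_second = seconds_to_deadline("Jan 24 16:47:47", 1485305272)

print(slack_first, slack_second)  # 5 5
```

Both requests were 5 seconds from their deadline when the early reply was skipped. That agrees with the first number in `(5/5)`, which suggests, but does not prove, that the message prints the time remaining before the deadline.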
      

      Recovery appears to stall

      Jan 24 16:50:18 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
      Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
      Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) Skipped 2 previous similar messages
      Jan 24 16:50:21 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
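The recovery window figures quoted in the mount messages can be checked with a little arithmetic. This sketch assumes the recovery timer starts at the "Will be in recovery" message (16:46:27); the ticket does not state the actual start time:

```python
from datetime import datetime, timedelta

SOFT_LIMIT_S = 150  # from "recovery window shrunk from 300-900 down to 150-900"
HARD_LIMIT_S = 900


def mmss(seconds: int) -> str:
    """Format a duration in seconds as M:SS."""
    return f"{seconds // 60}:{seconds % 60:02d}"


# Assumption: the timer starts when "Will be in recovery" is logged.
start = datetime(2017, 1, 24, 16, 46, 27)
soft_end = start + timedelta(seconds=SOFT_LIMIT_S)
hard_end = start + timedelta(seconds=HARD_LIMIT_S)

# First "reaching hard limit" message in the log above.
first_hard_limit_msg = datetime(2017, 1, 24, 16, 50, 18)
elapsed = int((first_hard_limit_msg - start).total_seconds())

print(mmss(SOFT_LIMIT_S))  # 2:30
print(soft_end.time())     # 16:48:57
print(hard_end.time())     # 17:01:27
print(elapsed)             # 231
```

The 150-second soft limit matches the "at least 2:30" in the log. The first hard-limit message appears only 231 seconds in, so it seems to report an extension request being capped at 900 rather than 900 seconds having elapsed; that reading is an inference from the timestamps, not something the ticket confirms.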
      

      System is still spinning at the present time.

      Attachments

        1. stack-dump-lola-10.txt
          1.22 MB
        2. lustre-log.1485304895.6159.txt.gz
          9.70 MB
        3. trace
          1.03 MB
        4. lu-9049.tar.gz
          4.47 MB


          Activity

            [LU-9049] DNE MDT Never completes recovery
            laisiyao Lai Siyao added a comment -

            https://review.whamcloud.com/#/c/27708/ is the fix for LU-9678; it is an osd-zfs fix, which is server code.
            https://review.whamcloud.com/#/c/28165/ is the fix for LU-9203, which affects both client and server.
            https://review.whamcloud.com/#/c/26965/ is the fix for this ticket. It is a server-side fix, but it modifies lu_object_put(), so it affects client code too.

            I'd suggest porting the latter two patches; there is no direct dependency between them.


            simmonsja James A Simmons added a comment -

            While porting this to the upstream kernel, the question came up of which dependencies need to land before this patch lands.

            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26965/
            Subject: LU-9049 obdclass: change object lookup to no wait mode
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fa14bdf6b648d1d4023a4fa88789059d185f4a07

            yong.fan nasf (Inactive) added a comment -

            The new lnet_cpt_of_md trouble looks more like LU-9203.

            May have spoken too soon. Right after I posted the above, we had this:
            MDT0000

            Jun 22 01:13:20 soak-8 kernel: mdt_out01_020: page allocation failure: order:4, mode:0x10c050
            Jun 22 01:13:20 soak-8 kernel: CPU: 10 PID: 6106 Comm: mdt_out01_020 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
            Jun 22 01:13:20 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            Jun 22 01:13:20 soak-8 kernel: 000000000010c050 000000005af41680 ffff8803eb6c38a8 ffffffff81687163
            Jun 22 01:13:20 soak-8 kernel: ffff8803eb6c3938 ffffffff81187090 0000000000000000 ffff88083ffd8000
            Jun 22 01:13:20 soak-8 kernel: 0000000000000004 000000000010c050 ffff8803eb6c3938 000000005af41680
            Jun 22 01:13:20 soak-8 kernel: Call Trace:
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff81687163>] dump_stack+0x19/0x1b
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff81187090>] warn_alloc_failed+0x110/0x180
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff81682cf7>] __alloc_pages_slowpath+0x6b7/0x725
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff8118b645>] __alloc_pages_nodemask+0x405/0x420
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff811cf7fa>] alloc_pages_current+0xaa/0x170
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff81185f6e>] __get_free_pages+0xe/0x50
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff811db09e>] kmalloc_order_trace+0x2e/0xa0
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff811dd871>] __kmalloc+0x221/0x240
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0cd0399>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f82e1d>] out_handle+0xa5d/0x1920 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f15b82>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f78a99>] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f798a5>] tgt_request_handle+0x915/0x1360 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f23143>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f20938>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
            Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            Jun 22 01:13:20 soak-8 kernel: Mem-Info:
            Jun 22 01:13:20 soak-8 kernel: active_anon:8213 inactive_anon:9497 isolated_anon:0#012 active_file:4337450 inactive_file:1514906 isolated_file:0#012 unevictable:2118 dirty:265 writeback:1 unstable:0#012 slab_reclaimable:1636391 slab_unreclaimable:224759#012 mapped:10001 shmem:4325 pagetables:1435 bounce:0#012 free:77455 free_pcp:174 free_cma:0
            Jun 22 01:13:20 soak-8 kernel: Node 1 Normal free:129652kB min:45728kB low:57160kB high:68592kB active_anon:22588kB inactive_anon:24092kB active_file:9510128kB inactive_file:2799920kB unevictable:6448kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:6448kB dirty:1388kB writeback:0kB mapped:19960kB shmem:16728kB slab_reclaimable:2991008kB slab_unreclaimable:678664kB kernel_stack:6800kB pagetables:3076kB unstable:0kB bounce:0kB free_pcp:812kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
            Jun 22 01:13:20 soak-8 kernel: lowmem_reserve[]: 0 0 0 0
            Jun 22 01:13:20 soak-8 kernel: Node 1 Normal: 7564*4kB (UEM) 4942*8kB (UEM) 3212*16kB (UEM) 192*32kB (EM) 12*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 128096kB
            Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            Jun 22 01:13:20 soak-8 kernel: 5859102 total pagecache pages
            Jun 22 01:13:20 soak-8 kernel: 4 pages in swap cache
            Jun 22 01:13:20 soak-8 kernel: Swap cache stats: add 16, delete 12, find 0/0
            Jun 22 01:13:20 soak-8 kernel: Free swap  = 16319420kB
            Jun 22 01:13:21 soak-8 kernel: Total swap = 16319484kB
            Jun 22 01:13:21 soak-8 kernel: 8369064 pages RAM
            Jun 22 01:13:21 soak-8 kernel: 0 pages HighMem/MovableOnly
            Jun 22 01:13:21 soak-8 kernel: 241204 pages reserved
            Jun 22 01:13:21 soak-8 kernel: BUG: unable to handle kernel paging request at ffffeb04007e4c40
            Jun 22 01:13:21 soak-8 kernel: IP: [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
            Jun 22 01:13:21 soak-8 kernel: PGD 0
            Jun 22 01:13:21 soak-8 kernel: Oops: 0000 [#1] SMP
            Jun 22 01:13:21 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin sb_edac ioatdma edac_core ipmi_devintf sg ntb mei_me mei iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich ipmi_ssif shpchp pcspkr wmi
            Jun 22 01:13:21 soak-8 kernel: ipmi_si ipmi_msghandler dm_multipath nfsd dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper isci syscopyarea sysfillrect sysimgblt igb fb_sys_fops ahci crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel libsas libahci pps_core drm mlx4_core mpt2sas dca libata raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
            Jun 22 01:13:21 soak-8 kernel: CPU: 9 PID: 5062 Comm: mdt_out01_014 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
            Jun 22 01:13:21 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            Jun 22 01:13:21 soak-8 kernel: task: ffff880829e50fb0 ti: ffff88080f344000 task.ti: ffff88080f344000
            Jun 22 01:13:21 soak-8 kernel: RIP: 0010:[<ffffffffa0c3c2ff>]  [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
            Jun 22 01:13:21 soak-8 kernel: RSP: 0018:ffff88080f347a18  EFLAGS: 00010202
            Jun 22 01:13:21 soak-8 kernel: RAX: 00000104007e4c40 RBX: 00050000c0a8016d RCX: 000077ff80000000
            Jun 22 01:13:21 soak-8 kernel: RDX: ffffea0000000000 RSI: 0000000000000001 RDI: ffff8804182023c0
            Jun 22 01:13:21 soak-8 kernel: RBP: ffff88080f347a18 R08: 0000000000000009 R09: 00000000000090c0
            Jun 22 01:13:21 soak-8 kernel: R10: ffffffffa0c46171 R11: ffffc9001f931100 R12: ffff880503b5c000
            Jun 22 01:13:21 soak-8 kernel: R13: 00050000c0a8016d R14: 0000000000000001 R15: ffff8804acad0600
            Jun 22 01:13:21 soak-8 kernel: FS:  0000000000000000(0000) GS:ffff88082d840000(0000) knlGS:0000000000000000
            Jun 22 01:13:21 soak-8 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40 CR3: 00000000019be000 CR4: 00000000000407e0
            Jun 22 01:13:21 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Jun 22 01:13:21 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Jun 22 01:13:21 soak-8 kernel: Stack:
            Jun 22 01:13:21 soak-8 kernel: ffff88080f347ad8 ffffffffa0c437dc ffff88080f347a90 ffff8804acad0638
            Jun 22 01:13:21 soak-8 kernel: ffffea0020b0de00 ffff88080f347af8 ffffffffffffffff ffff880829e50fb0
            Jun 22 01:13:21 soak-8 kernel: ffff880829e50fb0 ffff88082c379000 ffff88082c379000 0000000100017a88
            Jun 22 01:13:21 soak-8 kernel: Call Trace:
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c437dc>] lnet_select_pathway+0x5c/0x1140 [lnet]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c45fb1>] lnet_send+0x51/0x180 [lnet]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c46325>] LNetPut+0x245/0x7a0 [lnet]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f0ebc6>] ptl_send_buf+0x146/0x530 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0baacde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f31fb7>] ? at_measured+0x1c7/0x380 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f11e6b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0ed098e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0edb476>] target_send_reply+0x306/0x730 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f18b07>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f79517>] tgt_request_handle+0x587/0x1360 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f23143>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f20938>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
            Jun 22 01:13:21 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
            Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            Jun 22 01:13:21 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
            Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            Jun 22 01:13:21 soak-8 kernel: Code: ff 77 00 00 48 8b 3d 21 53 03 00 48 01 d0 48 0f 42 0d 26 8d d8 e0 48 ba 00 00 00 00 00 ea ff ff 48 01 c8 48 c1 e8 0c 48 c1 e0 06 <48> 8b 34 10 48 c1 ee 36 e8 e4 ec f6 ff 5d c3 66 90 b8 ff ff ff
            Jun 22 01:13:21 soak-8 kernel: RIP  [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
            Jun 22 01:13:21 soak-8 kernel: RSP <ffff88080f347a18>
            Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40
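The `order:4` in the page allocation failure above can be related to the free-list line for Node 1 Normal. A small sketch that parses that line and counts the free blocks big enough for an order-4 request, assuming the 4 kB base page implied by the smallest block size in the log (the helper names are illustrative):

```python
import re

PAGE_KB = 4  # base page size; matches the smallest block in the free list


def order_to_kb(order: int) -> int:
    """Size in kB of a contiguous allocation of 2**order pages."""
    return PAGE_KB << order


def parse_buddy_line(line: str) -> dict:
    """Return {block_size_kb: free_count} from a 'count*sizekB' free-list line."""
    return {int(kb): int(count) for count, kb in re.findall(r"(\d+)\*(\d+)kB", line)}


line = ("Node 1 Normal: 7564*4kB (UEM) 4942*8kB (UEM) 3212*16kB (UEM) "
        "192*32kB (EM) 12*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB "
        "0*2048kB 0*4096kB = 128096kB")

free = parse_buddy_line(line)
needed_kb = order_to_kb(4)
total_kb = sum(kb * count for kb, count in free.items())
blocks_large_enough = sum(count for kb, count in free.items() if kb >= needed_kb)

print(needed_kb)            # 64
print(total_kb)             # 128096
print(blocks_large_enough)  # 12
```

An order-4 request needs 64 kB of contiguous memory. Of the roughly 125 MB free in this zone, only 12 blocks are that large and none are larger, so the zone is heavily fragmented. Whether those 12 blocks were usable by this particular allocation depends on watermarks and migrate types, which this sketch does not model.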
            

            And on MDT0001

            Jun 22 01:13:20 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -12
            Jun 22 01:13:20 soak-9 kernel: LustreError: 4612:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff88081121e300 x1570858955230384/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 376/192 e 0 to 0 dl 1498094043 ref 2 fl Interpret:ReM/0/0 rc -12/-12
            Jun 22 01:13:20 soak-9 kernel: general protection fault: 0000 [#1] SMP
            Jun 22 01:13:20 soak-9 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel ipmi_ssif aesni_intel sg lrw gf128mul ipmi_devintf glue_helper ablk_helper cryptd mei_me mei ntb ipmi_si ioatdma ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich wmi pcspkr sb_edac i2c_i801
            Jun 22 01:13:20 soak-9 kernel: edac_core shpchp dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm isci crct10dif_pclmul crct10dif_common ptp ahci libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
            Jun 22 01:13:20 soak-9 kernel: CPU: 11 PID: 4612 Comm: osp_up0-1 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
            Jun 22 01:13:20 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            Jun 22 01:13:20 soak-9 kernel: task: ffff8808155c0fb0 ti: ffff8808155c8000 task.ti: ffff8808155c8000
            Jun 22 01:13:20 soak-9 kernel: RIP: 0010:[<ffffffffa14fe0d3>]  [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp]
            Jun 22 01:13:20 soak-9 kernel: RSP: 0018:ffff8808155cbe00  EFLAGS: 00010282
            Jun 22 01:13:20 soak-9 kernel: RAX: ffff8803f3ad1900 RBX: ffff8804014772a0 RCX: 000000018040003e
            Jun 22 01:13:20 soak-9 kernel: RDX: 000000018040003f RSI: 5a5a5a5a5a5a5a5a RDI: 0000000040000000
            Jun 22 01:13:20 soak-9 kernel: RBP: ffff8808155cbec0 R08: ffff8807c6b12600 R09: 000000018040003e
            Jun 22 01:13:20 soak-9 kernel: R10: 00000000c6b11e01 R11: ffffea001f1ac400 R12: ffff8808155c0fb0
            Jun 22 01:13:20 soak-9 kernel: R13: ffff880401b55000 R14: 00000000fffffff4 R15: ffff8804014772b0
            Jun 22 01:13:20 soak-9 kernel: FS:  0000000000000000(0000) GS:ffff88082d8c0000(0000) knlGS:0000000000000000
            Jun 22 01:13:20 soak-9 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            Jun 22 01:13:20 soak-9 kernel: CR2: 00007f8d82582000 CR3: 00000000019be000 CR4: 00000000000407e0
            Jun 22 01:13:20 soak-9 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Jun 22 01:13:20 soak-9 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Jun 22 01:13:20 soak-9 kernel: Stack:
            Jun 22 01:13:20 soak-9 kernel: ffff880401b551e8 ffff8808155c0fb0 ffff8808155c0fb0 ffff8803f3ad1900
            Jun 22 01:13:20 soak-9 kernel: 0000000000000000 0000000000000000 0000000000000000 ffff8808155c0fb0
            Jun 22 01:13:20 soak-9 kernel: ffffffff810c54e0 dead000000000100 dead000000000200 0000000210000003
            Jun 22 01:13:20 soak-9 kernel: Call Trace:
            Jun 22 01:13:20 soak-9 kernel: [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
            Jun 22 01:13:20 soak-9 kernel: [<ffffffffa14fdf00>] ? osp_invalidate_request+0x390/0x390 [osp]
            Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            Jun 22 01:13:20 soak-9 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
            Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            Jun 22 01:13:20 soak-9 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 fd 08 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 <f0> ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d
            Jun 22 01:13:20 soak-9 kernel: RIP  [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp]
            Jun 22 01:13:20 soak-9 kernel: RSP <ffff8808155cbe00>
            Jun 22 01:13:20 soak-9 kernel: ---[ end trace 6ac75bb1f736a48a ]---
            Jun 22 01:13:20 soak-9 kernel: Kernel panic - not syncing: Fatal exception
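The `rc = -12` reported by soak-9 can be decoded with the standard errno table. The sketch below also reinterprets the R14 value from the register dump as a signed 32-bit integer; that R14 holds the same return code is a guess based on the matching value, not something the trace states:

```python
import errno
import os


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


rc = -12  # from "operation out_update to node 192.168.1.108@o2ib failed: rc = -12"
name = errno.errorcode[-rc]
r14 = to_signed32(0x00000000FFFFFFF4)  # R14 in the soak-9 register dump

print(name, "-", os.strerror(-rc))
print(r14)  # -12
```

`-12` is `ENOMEM`, which fits the order-4 allocation failure logged on soak-8 in the same second: the out_update request appears to have failed for lack of memory on the remote MDT before soak-9 faulted in osp_send_update_thread().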

            Is this related or new?

            cliffw Cliff White (Inactive) added a comment - May have spoken too soon: the soak-8 and soak-9 kernel logs above appeared right after the previous post. Is this related or new?
__wake_up_common+0x58/0x90 Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc] Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc] Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 Jun 22 01:13:21 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90 Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 Jun 22 01:13:21 soak-8 kernel: Code: ff 77 00 00 48 8b 3d 21 53 03 00 48 01 d0 48 0f 42 0d 26 8d d8 e0 48 ba 00 00 00 00 00 ea ff ff 48 01 c8 48 c1 e8 0c 48 c1 e0 06 <48> 8b 34 10 48 c1 ee 36 e8 e4 ec f6 ff 5d c3 66 90 b8 ff ff ff Jun 22 01:13:21 soak-8 kernel: RIP [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet] Jun 22 01:13:21 soak-8 kernel: RSP <ffff88080f347a18> Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40 And on MDT001 Jun 22 01:13:20 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -12 Jun 22 01:13:20 soak-9 kernel: LustreError: 4612:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply ' (1 of 1) in format `OUT_UPDATE' : 0 vs. 
4096 (server)#012 req@ffff88081121e300 x1570858955230384/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 376/192 e 0 to 0 dl 1498094043 ref 2 fl Interpret:ReM/0/0 rc -12/-12 Jun 22 01:13:20 soak-9 kernel: general protection fault: 0000 [#1] SMP Jun 22 01:13:20 soak-9 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel ipmi_ssif aesni_intel sg lrw gf128mul ipmi_devintf glue_helper ablk_helper cryptd mei_me mei ntb ipmi_si ioatdma ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich wmi pcspkr sb_edac i2c_i801 Jun 22 01:13:20 soak-9 kernel: edac_core shpchp dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm isci crct10dif_pclmul crct10dif_common ptp ahci libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes Jun 22 01:13:20 soak-9 kernel: CPU: 11 PID: 4612 Comm: osp_up0-1 Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1 Jun 22 01:13:20 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 Jun 22 01:13:20 soak-9 kernel: task: ffff8808155c0fb0 ti: ffff8808155c8000 task.ti: ffff8808155c8000 Jun 22 01:13:20 soak-9 kernel: RIP: 0010:[<ffffffffa14fe0d3>] [<ffffffffa14fe0d3>] 
osp_send_update_thread+0x1d3/0x600 [osp] Jun 22 01:13:20 soak-9 kernel: RSP: 0018:ffff8808155cbe00 EFLAGS: 00010282 Jun 22 01:13:20 soak-9 kernel: RAX: ffff8803f3ad1900 RBX: ffff8804014772a0 RCX: 000000018040003e Jun 22 01:13:20 soak-9 kernel: RDX: 000000018040003f RSI: 5a5a5a5a5a5a5a5a RDI: 0000000040000000 Jun 22 01:13:20 soak-9 kernel: RBP: ffff8808155cbec0 R08: ffff8807c6b12600 R09: 000000018040003e Jun 22 01:13:20 soak-9 kernel: R10: 00000000c6b11e01 R11: ffffea001f1ac400 R12: ffff8808155c0fb0 Jun 22 01:13:20 soak-9 kernel: R13: ffff880401b55000 R14: 00000000fffffff4 R15: ffff8804014772b0 Jun 22 01:13:20 soak-9 kernel: FS: 0000000000000000(0000) GS:ffff88082d8c0000(0000) knlGS:0000000000000000 Jun 22 01:13:20 soak-9 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 22 01:13:20 soak-9 kernel: CR2: 00007f8d82582000 CR3: 00000000019be000 CR4: 00000000000407e0 Jun 22 01:13:20 soak-9 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 22 01:13:20 soak-9 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 22 01:13:20 soak-9 kernel: Stack: Jun 22 01:13:20 soak-9 kernel: ffff880401b551e8 ffff8808155c0fb0 ffff8808155c0fb0 ffff8803f3ad1900 Jun 22 01:13:20 soak-9 kernel: 0000000000000000 0000000000000000 0000000000000000 ffff8808155c0fb0 Jun 22 01:13:20 soak-9 kernel: ffffffff810c54e0 dead000000000100 dead000000000200 0000000210000003 Jun 22 01:13:20 soak-9 kernel: Call Trace: Jun 22 01:13:20 soak-9 kernel: [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20 Jun 22 01:13:20 soak-9 kernel: [<ffffffffa14fdf00>] ? osp_invalidate_request+0x390/0x390 [osp] Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 Jun 22 01:13:20 soak-9 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90 Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? 
kthread_create_on_node+0x140/0x140 Jun 22 01:13:20 soak-9 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 fd 08 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 <f0> ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d Jun 22 01:13:20 soak-9 kernel: RIP [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp] Jun 22 01:13:20 soak-9 kernel: RSP <ffff8808155cbe00> Jun 22 01:13:20 soak-9 kernel: ---[ end trace 6ac75bb1f736a48a ]--- Jun 22 01:13:20 soak-9 kernel: Kernel panic - not syncing: Fatal exception ~ ~ ~ Is this related or new?

            cliffw Cliff White (Inactive) added a comment - Ran 40+ hours in soak, 12+ hours with LFSCK. Did not hit the LBUG.

            laisiyao Lai Siyao added a comment - Cliff, could you test soak again with both https://review.whamcloud.com/26965 and https://review.whamcloud.com/#/c/27708/ ? The latter is the fix for LU-9678, which, in Nasf's opinion, may be easily triggered by the former (though the former is not its cause).

            cliffw Cliff White (Inactive) added a comment - Tested the latest version of the patch. Had a hard failure on an OSS after OSS failover; it may not be related.

            laisiyao Lai Siyao added a comment - I just rebased the patch onto the latest master. Cliff, could you test the latest update again to confirm whether the failure is caused by this patch?

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes: 1
              Watchers: 13

              Dates

                Created:
                Updated:
                Resolved: