[LU-9049] DNE MDT Never completes recovery Created: 25/Jan/17  Updated: 08/May/18  Resolved: 22/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 1
Labels: soak
Environment:

Soak test cluster, version=lustre: 2.9.51_45_g3b3eeeb - tip of master


Attachments: File lu-9049.tar.gz     File lustre-log.1485304895.6159.txt.gz     Text File stack-dump-lola-10.txt     HTML File trace    
Issue Links:
Related
is related to LU-8815 MDS fails to complete recovery Closed
is related to LU-8481 MDT hung in recovery Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Failover of lola-11 to lola-10

2017-01-24 16:44:56,220:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...

Immediately prior to the failover mount, lola-10 reports hung processes.

Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304863/real 1485304863]  req@ffff8804131dd680 x1557446666855008/t0(0) o38->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304880 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:41:35 lola-10 kernel: LNet: Service thread pid 6159 was inactive for 226.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 24 16:41:35 lola-10 kernel: Pid: 6159, comm: mdt00_005
Jan 24 16:41:35 lola-10 kernel:
Jan 24 16:41:35 lola-10 kernel: Call Trace:
Jan 24 16:41:35 lola-10 kernel: [<ffffffff8105e4e3>] ? __wake_up+0x53/0x70
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a33bb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11b4aac>] lod_trans_stop+0x2bc/0x330 [lod]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1259f31>] mdd_trans_stop+0x21/0xc6 [mdd]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1246e63>] mdd_create+0x1373/0x17e0 [mdd]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1105c44>] ? mdt_version_save+0x84/0x1a0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa110fcfc>] mdt_reint_create+0xbdc/0xfe0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11013bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09bb2bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1102a2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1104f5d>] mdt_reint_rec+0x5d/0x200 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10efd9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10f066b>] mdt_reint+0x6b/0x120 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a1f17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09ca66b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa067c84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09c56e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09cba11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09caf50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Jan 24 16:41:35 lola-10 kernel:
Jan 24 16:41:35 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1485304895.6159

Mount occurs

Jan 24 16:46:22 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
...
Jan 24 16:46:27 lola-10 kernel: Lustre: soaked-MDT0003: Will be in recovery for at least 2:30, or until 22 clients reconnect
...
Jan 24 16:47:44 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-10 kernel:  req@ffff88041171b980 x1557448608035648/t21476913294(0) o36->57c15fb3-4c15-952f-3686-35f4c9caa941@192.168.1.126@o2ib100:-1/-1 lens 632/568 e 19 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:47 lola-10 kernel:  req@ffff880737e13cc0 x1557448490734944/t21476914413(0) o36->1e7aaa7d-f417-ebd3-eff8-7787d5dac449@192.168.1.123@o2ib100:-1/-1 lens 616/3128 e 19 to 0 dl 1485305272 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
Jan 24 16:47:50 lola-10 kernel: Lustre: soaked-MDT0002: Client ffcd9b78-a534-1d1b-1494-f25c5b00edf1 (at 192.168.1.131@o2ib100) reconnecting

Recovery appears to stall

Jan 24 16:50:18 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) Skipped 2 previous similar messages
Jan 24 16:50:21 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1

System is still spinning at the present time.



 Comments   
Comment by Cliff White (Inactive) [ 25/Jan/17 ]

Attached the stack dump from lola-10 and the first lustre-log dump. The system is available if anyone wishes to have a look.

Comment by Cliff White (Inactive) [ 25/Jan/17 ]

At the same time, lola-8 (MGS/MDT0) is having hangs:

Jan 24 16:40:04 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304804/real 1485304804]  req@ffff8804e8eb1cc0 x1557447056153856/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304815 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:40:04 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:41:09 lola-8 kernel: LNet: Service thread pid 6336 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 24 16:41:09 lola-8 kernel: Pid: 6336, comm: mdt00_014
Jan 24 16:41:09 lola-8 kernel:
Jan 24 16:41:09 lola-8 kernel: Call Trace:
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0baebb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa134caac>] lod_trans_stop+0x2bc/0x330 [lod]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa136dc3c>] ? lod_attr_get+0x9c/0x1b0 [lod]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa13f1f31>] mdd_trans_stop+0x21/0xc6 [mdd]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa13d67f2>] mdd_unlink+0x532/0x1020 [mdd]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa129cec8>] mdo_unlink+0x18/0x50 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa12a5fcf>] mdt_reint_unlink+0xcaf/0x10c0 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa129cf5d>] mdt_reint_rec+0x5d/0x200 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa1287d9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa128866b>] mdt_reint+0x6b/0x120 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Jan 24 16:41:09 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 24 16:41:09 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Jan 24 16:41:09 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Jan 24 16:41:09 lola-8 kernel:

Mount occurs

Jan 24 16:41:44 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304904/real 1485304904]  req@ffff8804e8e9a3c0 x1557447056161856/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304925 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:41:44 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jan 24 16:42:49 lola-8 kernel: Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Jan 24 16:42:49 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to soaked-MDT0000-lwp-MDT0002_UUID (at 192.168.1.110@o2ib10)
Jan 24 16:44:39 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485305079/real 1485305079]  req@ffff8804eb39d080 x1557447056175808/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485305110 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:44:39 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:45:01 lola-8 CROND[7107]: (root) CMD (logger -p local0.notice -t TIME "Time stamp for console")
Jan 24 16:45:01 lola-8 TIME: Time stamp for console
Jan 24 16:46:22 lola-8 kernel: Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.110@o2ib10, removing former export from 192.168.1.111@o2ib10
Jan 24 16:46:22 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to soaked-MDT0003-mdtlov_UUID (at 192.168.1.110@o2ib10)
Jan 24 16:47:44 lola-8 kernel: Lustre: 6298:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-8 kernel:  req@ffff8805945686c0 x1557448469206752/t21499374434(0) o36->905c8d4a-b076-6b11-d52f-9bf74da1631a@192.168.1.120@o2ib100:-1/-1 lens 624/3128 e 24 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:50 lola-8 kernel: Lustre: soaked-MDT0000: Client 905c8d4a-b076-6b11-d52f-9bf74da1631a (at 192.168.1.120@o2ib100) reconnecting
Jan 24 16:47:50 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to 905c8d4a-b076-6b11-d52f-9bf74da1631a (at 192.168.1.120@o2ib100)

System continues to have blocked processes

Jan 24 16:53:38 lola-8 kernel: INFO: task mdt_out00_002:6070 blocked for more than 120 seconds.
Jan 24 16:53:38 lola-8 kernel:      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Jan 24 16:53:38 lola-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 24 16:53:38 lola-8 kernel: mdt_out00_002 D 0000000000000016     0  6070      2 0x00000080
Jan 24 16:53:38 lola-8 kernel: ffff88082230bb00 0000000000000046 0000000000000000 ffff88082230bcc0
Jan 24 16:53:38 lola-8 kernel: ffff8803c91c2f00 ffff8803eb000118 000007f09680dbc0 ffff8803eb000118
Jan 24 16:53:38 lola-8 kernel: ffff88082230bb00 000000010080a9dc ffff8808276045f8 ffff88082230bfd8
Jan 24 16:53:38 lola-8 kernel: Call Trace:
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0939f8d>] lu_object_find_at+0x3d/0xe0 [obdclass]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b5c612>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b31e10>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa093ad1c>] dt_locate_at+0x1c/0xa0 [obdclass]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0ba2e07>] out_handle+0x1067/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b9308a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Jan 24 16:53:38 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 24 16:53:38 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Jan 24 16:53:38 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Comment by Di Wang [ 25/Jan/17 ]

There are a few interesting processes on lola-8. It may be a deadlock; it needs a bit of investigation. Nasf: could you please check whether this lfsck thread is suspicious here? Thanks.

lfsck         S 0000000000000010     0  6061      2 0x00000080
 ffff880821d9f710 0000000000000046 0000000000000000 ffff880821d9f6d4
 0000000000000000 ffff88043fe83e00 00003c78463d5afe ffff880038615a00
 0000000000000200 0000000103f26d04 ffff880821d07068 ffff880821d9ffd8
Call Trace:
 [<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81089fa0>] ? process_timeout+0x0/0x10
 [<ffffffffa0b2ae61>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
 [<ffffffffa0b200d0>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b373c5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0b2b521>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa1458322>] osp_remote_sync+0xf2/0x1e0 [osp]
 [<ffffffffa143b677>] osp_attr_get+0x447/0x710 [osp]
 [<ffffffffa09167be>] ? lprocfs_counter_add+0x9e/0x140 [obdclass]
 [<ffffffffa143cd45>] osp_object_init+0x1f5/0x360 [osp]
 [<ffffffffa0938a58>] lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa0939e41>] lu_object_find_try+0x151/0x260 [obdclass]
 [<ffffffffa0db4f3a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
 [<ffffffffa093a001>] lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffffa0db8cf3>] ? fld_server_lookup+0x53/0x330 [fld]
 [<ffffffffa093a06f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa120f522>] lfsck_namespace_verify_stripe_slave+0x232/0xbf0 [lfsck]
 [<ffffffffa10628cb>] ? ldiskfs_xattr_trusted_get+0x2b/0x30 [ldiskfs]
 [<ffffffff8153c200>] ? hrtimer_nanosleep_restart+0x10/0x90
 [<ffffffffa11cd375>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
 [<ffffffffa11c3466>] lfsck_open_dir+0xa6/0x3a0 [lfsck]
 [<ffffffffa11c6267>] lfsck_exec_oit+0x667/0xb30 [lfsck]
 [<ffffffffa0db4f3a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
 [<ffffffffa11c6f55>] lfsck_master_oit_engine+0x825/0x15b0 [lfsck]
 [<ffffffff81067662>] ? default_wake_function+0x12/0x20
 [<ffffffffa11ca650>] lfsck_master_engine+0xb50/0x1460 [lfsck]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa11c9b00>] ? lfsck_master_engine+0x0/0x1460 [lfsck]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt_out00_002 D 0000000000000016     0  6070      2 0x00000080
 ffff88082230bb00 0000000000000046 0000000000000000 ffff88082230bcc0
 ffff8803c91c2f00 ffff8803eb000118 000007f09680dbc0 ffff8803eb000118
 ffff88082230bb00 000000010080a9dc ffff8808276045f8 ffff88082230bfd8
Call Trace:
 [<ffffffffa0939f8d>] lu_object_find_at+0x3d/0xe0 [obdclass]
 [<ffffffffa0b5c612>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b31e10>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
 [<ffffffffa093ad1c>] dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa0ba2e07>] out_handle+0x1067/0x18d0 [ptlrpc]
 [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0b9308a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt00_004     S 0000000000000013     0  6077      2 0x00000080
 ffff880837067870 0000000000000046 0000000000000000 ffffffff810a1c1c
 ffff880836aaf520 ffff8808370677f0 0000240d5abd2b10 0000000000000000
 ffff880837067830 0000000102588d0f ffff880836aafad8 ffff880837067fd8
Call Trace:
 [<ffffffff810a1c1c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffffa0b04170>] ? ldlm_expired_completion_wait+0x0/0x230 [ptlrpc]
 [<ffffffffa0b094dd>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b0886e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
 [<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa128d4e4>] mdt_object_local_lock+0x3a4/0xb00 [mdt]
 [<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [<ffffffffa093a001>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffffa128e143>] mdt_object_lock_internal+0x63/0x320 [mdt]
 [<ffffffffa128e614>] mdt_object_lock+0x14/0x20 [mdt]
 [<ffffffffa1295266>] mdt_getattr_name_lock+0xdd6/0x1910 [mdt]
 [<ffffffffa12962c2>] mdt_intent_getattr+0x292/0x470 [mdt]
 [<ffffffffa128691e>] mdt_intent_policy+0x4be/0xc70 [mdt]
 [<ffffffffa0ae76d7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
 [<ffffffffa0b134bb>] ldlm_handle_enqueue0+0x98b/0x14e0 [ptlrpc]
 [<ffffffffa0b98fd1>] tgt_enqueue+0x61/0x230 [ptlrpc]
 [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt00_009     S 0000000000000000     0  6304      2 0x00000080
 ffff88063d177930 0000000000000046 ffffffff00000018 ffff88063d177900
 ffff88063d177890 ffffffff8153a9ba ffff88063d1778d0 0000000000000286
 0000000000000000 000000005888a5eb ffff88063d163068 ffff88063d177fd8
Call Trace:
 [<ffffffff8153a9ba>] ? schedule_timeout+0x19a/0x2e0
 [<ffffffffa0b04170>] ? ldlm_expired_completion_wait+0x0/0x230 [ptlrpc]
 [<ffffffffa0b094dd>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b0886e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
 [<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa128d4e4>] mdt_object_local_lock+0x3a4/0xb00 [mdt]
 [<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [<ffffffffa128e143>] mdt_object_lock_internal+0x63/0x320 [mdt]
 [<ffffffffa128e614>] mdt_object_lock+0x14/0x20 [mdt]
 [<ffffffffa12a781a>] mdt_reint_create+0x6fa/0xfe0 [mdt]
 [<ffffffffa12993bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
 [<ffffffffa0b362bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
 [<ffffffffa12997df>] ? ucred_set_jobid+0x5f/0x70 [mdt]
 [<ffffffffa129aa2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt]
 [<ffffffffa129cf5d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa1287d9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
 [<ffffffffa128866b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt00_014     S 0000000000000012     0  6336      2 0x00000080
 ffff88062f783990 0000000000000046 0000000000000000 ffff8804137e8398
 0000000000000282 0000000000000003 0000073705fbcc95 0000000000000282
 ffff88062f783940 0000000100748091 ffff88063393c5f8 ffff88062f783fd8
Call Trace:
 [<ffffffffa0baebb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa134caac>] lod_trans_stop+0x2bc/0x330 [lod]
 [<ffffffffa136dc3c>] ? lod_attr_get+0x9c/0x1b0 [lod]
 [<ffffffffa13f1f31>] mdd_trans_stop+0x21/0xc6 [mdd]
 [<ffffffffa13d67f2>] mdd_unlink+0x532/0x1020 [mdd]
 [<ffffffffa129cec8>] mdo_unlink+0x18/0x50 [mdt]
 [<ffffffffa12a5fcf>] mdt_reint_unlink+0xcaf/0x10c0 [mdt]
 [<ffffffffa129cf5d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa1287d9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
 [<ffffffffa128866b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Comment by Joseph Gmitter (Inactive) [ 26/Jan/17 ]

Hi Alex,
Since Nasf is out for the Chinese holiday, do you think you could look at the lfsck thread above to see if you have any input?
Thanks.
Joe

Comment by Alex Zhuravlev [ 26/Jan/17 ]

OK.

Comment by nasf (Inactive) [ 27/Jan/17 ]

As explained in LU-8815, something holds a reference on the dying object and thereby blocks every other thread trying to locate an object with the same FID. If the reference holder is itself blocked for some other reason, this causes a cascading block that makes the system look hung. Here, according to the stack trace, the LFSCK engine is one of the threads blocked by the dying-object reference holder. What needs to be found out:

1) what the dying object is.
2) who holds the reference on the dying object.
3) why the reference holder does not release the dying object.

In fact, once we have the answers to 1) and/or 2), getting the answer to 3) should not be difficult. Unfortunately, I do not yet have an answer to any of them.

Comment by Alex Zhuravlev [ 27/Jan/17 ]

I think we're ready to get rid of the check for LU_OBJECT_BANSHEE in htable_lookup(): the only reason for it was to prevent access to such objects, and we now have that check in every method.

Comment by Jian Yu [ 29/Jan/17 ]

sanity-lfsck test 30 hit a similar failure on the master branch in a review-dne-part-2 test session:
https://testing.hpdd.intel.com/test_sets/1fd15748-e5ce-11e6-978f-5254006e85c2

Console log on one of the two MDSs:

Lustre: DEBUG MARKER: == sanity-lfsck test 30: LFSCK can recover the orphans from backend /lost+found ====================== 20:00:39 (1485633639)
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x161d
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1624
general protection fault: 0000 [#1] SMP 
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev parport_pc parport pcspkr i2c_piix4 virtio_balloon nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crct10dif_common ttm ata_piix crc32c_intel 8139too virtio_pci virtio_ring serio_raw virtio drm i2c_core 8139cp mii libata floppy
CPU: 0 PID: 11163 Comm: mdt00_001 Tainted: G           OE  ------------   3.10.0-514.2.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff880043db5e20 ti: ffff880044818000 task.ti: ffff880044818000
RIP: 0010:[<ffffffff811de300>]  [<ffffffff811de300>] kmem_cache_alloc_trace+0x80/0x200
RSP: 0018:ffff88004481b820  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff880036ad1c00 RCX: 000000000068969a
RDX: 0000000000689699 RSI: 0000000000008050 RDI: ffff88007d001a00
RBP: ffff88004481b858 R08: 0000000000019aa0 R09: ffff88007d001a00
R10: ffffffffa0aeb29e R11: 64657474696d6d6f R12: 0002680000000100
R13: 0000000000008050 R14: 0000000000000058 R15: ffff88007d001a00
FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f203169a000 CR3: 000000007b855000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff88007d001a00 ffffffffa0aeb29e ffff880036ad1c00 ffff8800787d3000
 ffff880078d57840 ffff88003f82d800 0000000000000000 ffff88004481b8d0
 ffffffffa0aeb29e ffff8800395034bb 0000000900000659 0000000000000000
Call Trace:
 [<ffffffffa0aeb29e>] ? tgt_last_rcvd_update+0x64e/0xc80 [ptlrpc]
 [<ffffffffa0aeb29e>] tgt_last_rcvd_update+0x64e/0xc80 [ptlrpc]
 [<ffffffffa0aef49d>] tgt_txn_stop_cb+0x19d/0x490 [ptlrpc]
 [<ffffffffa07308b8>] ? libcfs_debug_vmsg2+0x708/0xb70 [libcfs]
 [<ffffffffa087063e>] dt_txn_hook_stop+0x7e/0xd0 [obdclass]
 [<ffffffffa0cb71fb>] osd_trans_stop+0x12b/0x590 [osd_ldiskfs]
 [<ffffffffa0b09509>] top_trans_stop+0x99/0x970 [ptlrpc]
 [<ffffffffa0f27399>] lod_trans_stop+0x259/0x340 [lod]
 [<ffffffffa0fb5bad>] mdd_trans_stop+0x1d/0x28 [mdd]
 [<ffffffffa0fa5493>] mdd_unlink+0x523/0xd80 [mdd]
 [<ffffffffa0ea3145>] mdo_unlink+0x43/0x45 [mdt]
 [<ffffffffa0e770c4>] mdt_reint_unlink+0xc04/0x11b0 [mdt]
 [<ffffffffa0e7a500>] mdt_reint_rec+0x80/0x210 [mdt]
 [<ffffffffa0e5cdb2>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
 [<ffffffffa0e680f7>] mdt_reint+0x67/0x140 [mdt]
 [<ffffffffa0af57b5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
 [<ffffffffa0aa282b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [<ffffffffa0aa03e8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
 [<ffffffff810c4ff2>] ? default_wake_function+0x12/0x20
 [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
 [<ffffffffa0aa6830>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
 [<ffffffffa0aa5d90>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [<ffffffff810b064f>] kthread+0xcf/0xe0
 [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff81696898>] ret_from_fork+0x58/0x90
 [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
Code: d0 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 24 01 00 00 48 85 c0 0f 84 1b 01 00 00 49 63 47 20 48 8d 4a 01 4d 8b 07 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63 
RIP  [<ffffffff811de300>] kmem_cache_alloc_trace+0x80/0x200
 RSP <ffff88004481b820>
Comment by Cliff White (Inactive) [ 19/Apr/17 ]

Restarted soak with the latest lustre-master (build 3561) and mounted all nodes.
soak-10 does not complete recovery; the recovery timer keeps extending until it reaches the hard limit.
Attached the Lustre kernel dump and system logs.

Comment by Cliff White (Inactive) [ 20/Apr/17 ]

I left the system overnight, and recovery did eventually complete.

[17166.319310] Lustre: 3626:0:(ldlm_lib.c:1781:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
[17166.339932] Lustre: 3626:0:(ldlm_lib.c:1781:extend_recovery_timer()) Skipped 211354 previous similar messages
[17462.141184] Lustre: 3626:0:(ldlm_lib.c:2046:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
[17462.156428] Lustre: 3626:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[17462.174477] Lustre: 3626:0:(ldlm_lib.c:2552:target_recovery_thread()) too long recovery - read logs
[17462.187265] Lustre: soaked-MDT0002: Recovery over after 283:48, of 3 clients 3 recovered and 0 were evicted.
[17462.187267] LustreError: dumping log to /tmp/lustre-log.1492647191.3626

I would still consider this an issue.

Comment by Peter Jones [ 20/Apr/17 ]

Lai

Could you please advise on this one

Peter

Comment by Gerrit Updater [ 05/May/17 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/26965
Subject: LU-9049 obdclass: unhash object when destroying object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4e1fe3886bfa945c9860dc53d45d2d0a1a0b0e0

Comment by Cliff White (Inactive) [ 13/Jun/17 ]

Tested the patch on soak, using patch set 7, build 48150.

  • MDS aborts recovery on multiple occasions after failover/restart; clients are evicted.
    Two MDS failures:
    Jun 12 22:28:52 soak-8 kernel: Lustre: 4536:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1497306524/real 0]  req@ffff8803fc27f200 x1570030437379168/t0(0) o400->soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 224/224 e 0 to 1 dl 1497306531 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
    Jun 12 22:28:52 soak-8 kernel: Lustre: 4536:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
    Jun 12 22:28:52 soak-8 kernel: Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
    Jun 12 22:29:07 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.120@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
    Jun 12 22:29:10 soak-8 kernel: mdt_out01_013: page allocation failure: order:4, mode:0x10c050
    Jun 12 22:29:10 soak-8 kernel: CPU: 10 PID: 5150 Comm: mdt_out01_013 Tainted: P        W  OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
    Jun 12 22:29:10 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
    Jun 12 22:29:10 soak-8 kernel: Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0002_UUID (at 192.168.1.110@o2ib10) because it is indicating it is a new client
    Jun 12 22:29:10 soak-8 kernel: 000000000010c050
    Jun 12 22:29:10 soak-8 kernel: 00000000cc5d6083 ffff8807e65978a8 ffffffff81686d1f
    Jun 12 22:29:10 soak-8 kernel: ffff8807e6597938 ffffffff81187010 0000000000000000 ffff88083ffd8000
    Jun 12 22:29:10 soak-8 kernel: 0000000000000004 000000000010c050 ffff8807e6597938 00000000cc5d6083
    Jun 12 22:29:10 soak-8 kernel: Call Trace:
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff81686d1f>] dump_stack+0x19/0x1b
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff81187010>] warn_alloc_failed+0x110/0x180
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff816828d7>] __alloc_pages_slowpath+0x6b7/0x725
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff8118b5c5>] __alloc_pages_nodemask+0x405/0x420
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff811cf77a>] alloc_pages_current+0xaa/0x170
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff81185eee>] __get_free_pages+0xe/0x50
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff811db01e>] kmalloc_order_trace+0x2e/0xa0
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff811dd7f1>] __kmalloc+0x221/0x240
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0cc9599>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f8212d>] out_handle+0xa5d/0x1920 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810ce474>] ? update_curr+0x104/0x190
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810c9ba8>] ? __enqueue_entity+0x78/0x80
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f14ce2>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f77da9>] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f78bb5>] tgt_request_handle+0x915/0x1360 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f222b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f207a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810c54d2>] ? default_wake_function+0x12/0x20
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f262c0>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffffa0f25820>] ? ptlrpc_register_service+0xe60/0xe60 [ptlrpc]
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
    Jun 12 22:29:10 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
    Jun 12 22:29:10 soak-8 kernel: Mem-Info:
    Jun 12 22:29:10 soak-8 kernel: active_anon:7323 inactive_anon:7468 isolated_anon:0#012 active_file:4687260 inactive_file:2131312 isolated_file:0#012 unevictable:2202 dirty:8519 writeback:0 unstable:0#012 slab_reclaimable:292721 slab_unreclaimable:421232#012 mapped:14619 shmem:2286 pagetables:1440 bounce:0#012 free:271081 free_pcp:195 free_cma:0
    Jun 12 22:29:10 soak-8 kernel: Node 1 Normal free:401704kB min:45728kB low:57160kB high:68592kB active_anon:17004kB inactive_anon:17016kB active_file:10443492kB inactive_file:3351644kB unevictable:4592kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:4592kB dirty:19776kB writeback:0kB mapped:10104kB shmem:8712kB slab_reclaimable:661888kB slab_unreclaimable:1229584kB kernel_stack:7616kB pagetables:2428kB unstable:0kB bounce:0kB free_pcp:780kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    Jun 12 22:29:10 soak-8 kernel: lowmem_reserve[]:
    

    Second MDS

    Jun 12 22:09:27 soak-11 systemd: Stopping user-0.slice.
    Jun 12 22:29:10 soak-11 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0003: operation out_update to node 192.168.1.108@o2ib10 failed: rc = -12
    Jun 12 22:29:10 soak-11 kernel: LustreError: 4611:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff88075ff89e00 x1570036086727488/t0(0) o1000->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib10:24/4 lens 376/192 e 0 to 0 dl 1497306559 ref 2 fl Interpret:ReM/0/0 rc -12/-12
    Jun 12 22:29:11 soak-11 kernel: general protection fault: 0000 [#1] SMP
    Jun 12 22:29:11 soak-11 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core dm_round_robin intel_powerclamp coretemp intel_rapl iosf_mbi sg kvm ipmi_ssif ipmi_devintf irqbypass ipmi_si crc32_pclmul ghash_clmulni_intel aesni_intel mei_me mei lrw i2c_i801 gf128mul glue_helper ablk_helper cryptd sb_edac pcspkr ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich edac_core ioatdma
    Jun 12 22:29:11 soak-11 kernel: ntb shpchp wmi dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper isci syscopyarea ahci sysfillrect libsas sysimgblt libahci igb fb_sys_fops crct10dif_pclmul ttm crct10dif_common ptp mpt3sas crc32c_intel pps_core libata drm mlx4_core raid_class dca i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
    Jun 12 22:29:11 soak-11 kernel: CPU: 14 PID: 4611 Comm: osp_up0-3 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
    Jun 12 22:29:11 soak-11 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
    Jun 12 22:29:11 soak-11 kernel: task: ffff88042b320000 ti: ffff88081efc8000 task.ti: ffff88081efc8000
    Jun 12 22:29:11 soak-11 kernel: RIP: 0010:[<ffffffffa14fd583>]  [<ffffffffa14fd583>] osp_send_update_thread+0x1d3/0x600 [osp]
    Jun 12 22:29:11 soak-11 kernel: RSP: 0018:ffff88081efcbe00  EFLAGS: 00010282
    Jun 12 22:29:11 soak-11 kernel: RAX: ffff880762b38180 RBX: ffff88082885e960 RCX: 00000000005466fe
    Jun 12 22:29:11 soak-11 kernel: RDX: 00000000005466fd RSI: 5a5a5a5a5a5a5a5a RDI: ffff88017fc07600
    Jun 12 22:29:11 soak-11 kernel: Lustre: 4521:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1497306543/real 0]  req@ffff880368255d00 x1570036086580928/t0(0) o400->soaked-MDT0001-osp-MDT0003@192.168.1.109@o2ib10:24/4 lens 224/224 e 0 to 1 dl 1497306550 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
    Jun 12 22:29:11 soak-11 kernel: Lustre: soaked-MDT0001-osp-MDT0003: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
    Jun 12 22:29:11 soak-11 kernel: RBP: ffff88081efcbec0 R08: 0000000000019b20 R09: ffffffffa0cf4b61
    Jun 12 22:29:11 soak-11 kernel: R10: ffff88082d999b20 R11: ffffea001d0f9600 R12: ffff88042b320000
    Jun 12 22:29:11 soak-11 kernel: R13: ffff8803fdf77000 R14: 00000000fffffff4 R15: ffff88082885e970
    Jun 12 22:29:11 soak-11 kernel: FS:  0000000000000000(0000) GS:ffff88082d980000(0000) knlGS:0000000000000000
    Jun 12 22:29:11 soak-11 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jun 12 22:29:11 soak-11 kernel: CR2: 00007fcbe8826000 CR3: 00000000019be000 CR4: 00000000000407e0
    Jun 12 22:29:11 soak-11 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Jun 12 22:29:11 soak-11 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Jun 12 22:29:11 soak-11 kernel: Call Trace:
    Jun 12 22:29:11 soak-11 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
    Jun 12 22:29:11 soak-11 kernel: [<ffffffffa14fd3b0>] ? osp_invalidate_request+0x390/0x390 [osp]
    Jun 12 22:29:11 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
    Jun 12 22:29:11 soak-11 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
    Jun 12 22:29:11 soak-11 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
    Jun 12 22:29:11 soak-11 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
    Jun 12 22:29:11 soak-11 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 ed 0f 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 <f0> ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d
    Jun 12 22:29:11 soak-11 kernel: RIP  [<ffffffffa14fd583>] osp_send_update_thread+0x1d3/0x600 [osp]
    Jun 12 22:29:11 soak-11 kernel: RSP <ffff88081efcbe00>
    Jun 12 22:29:11 soak-11 kernel: ---[ end trace 02ea2c24fb7e0d30 ]---
    

    The dump has been messed up on soak; will see if we get more data.

Comment by Lai Siyao [ 13/Jun/17 ]

This looks to be caused by OOM, and it is not directly related to the patch.

Comment by Lai Siyao [ 13/Jun/17 ]

I just rebased the patch to the latest master. Cliff, will you test the latest update again to confirm whether this is caused by the patch?

Comment by Cliff White (Inactive) [ 16/Jun/17 ]

Tested the latest version of the patch. Had a hard failure on an OSS after OSS failover; it may not be related.

Comment by Lai Siyao [ 20/Jun/17 ]

Cliff, could you test soak again with both https://review.whamcloud.com/26965 and https://review.whamcloud.com/#/c/27708/ ? The latter is the fix for LU-9678, which, in Nasf's opinion, may be easily triggered by the former (though the former is not its cause).

Comment by Cliff White (Inactive) [ 22/Jun/17 ]

Ran 40+ hours in soak, 12+ hours with LFSCK. Did not hit the LBUG.

Comment by Cliff White (Inactive) [ 22/Jun/17 ]

I may have spoken too soon. Right after I posted the above, we had this:
MDT000

Jun 22 01:13:20 soak-8 kernel: mdt_out01_020: page allocation failure: order:4, mode:0x10c050
Jun 22 01:13:20 soak-8 kernel: CPU: 10 PID: 6106 Comm: mdt_out01_020 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:20 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:20 soak-8 kernel: 000000000010c050 000000005af41680 ffff8803eb6c38a8 ffffffff81687163
Jun 22 01:13:20 soak-8 kernel: ffff8803eb6c3938 ffffffff81187090 0000000000000000 ffff88083ffd8000
Jun 22 01:13:20 soak-8 kernel: 0000000000000004 000000000010c050 ffff8803eb6c3938 000000005af41680
Jun 22 01:13:20 soak-8 kernel: Call Trace:
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81687163>] dump_stack+0x19/0x1b
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81187090>] warn_alloc_failed+0x110/0x180
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81682cf7>] __alloc_pages_slowpath+0x6b7/0x725
Jun 22 01:13:20 soak-8 kernel: [<ffffffff8118b645>] __alloc_pages_nodemask+0x405/0x420
Jun 22 01:13:20 soak-8 kernel: [<ffffffff811cf7fa>] alloc_pages_current+0xaa/0x170
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81185f6e>] __get_free_pages+0xe/0x50
Jun 22 01:13:20 soak-8 kernel: [<ffffffff811db09e>] kmalloc_order_trace+0x2e/0xa0
Jun 22 01:13:20 soak-8 kernel: [<ffffffff811dd871>] __kmalloc+0x221/0x240
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0cd0399>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f82e1d>] out_handle+0xa5d/0x1920 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f15b82>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f78a99>] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f798a5>] tgt_request_handle+0x915/0x1360 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f23143>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f20938>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-8 kernel: Mem-Info:
Jun 22 01:13:20 soak-8 kernel: active_anon:8213 inactive_anon:9497 isolated_anon:0#012 active_file:4337450 inactive_file:1514906 isolated_file:0#012 unevictable:2118 dirty:265 writeback:1 unstable:0#012 slab_reclaimable:1636391 slab_unreclaimable:224759#012 mapped:10001 shmem:4325 pagetables:1435 bounce:0#012 free:77455 free_pcp:174 free_cma:0
Jun 22 01:13:20 soak-8 kernel: Node 1 Normal free:129652kB min:45728kB low:57160kB high:68592kB active_anon:22588kB inactive_anon:24092kB active_file:9510128kB inactive_file:2799920kB unevictable:6448kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:6448kB dirty:1388kB writeback:0kB mapped:19960kB shmem:16728kB slab_reclaimable:2991008kB slab_unreclaimable:678664kB kernel_stack:6800kB pagetables:3076kB unstable:0kB bounce:0kB free_pcp:812kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 22 01:13:20 soak-8 kernel: lowmem_reserve[]: 0 0 0 0
Jun 22 01:13:20 soak-8 kernel: Node 1 Normal: 7564*4kB (UEM) 4942*8kB (UEM) 3212*16kB (UEM) 192*32kB (EM) 12*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 128096kB
Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 22 01:13:20 soak-8 kernel: 5859102 total pagecache pages
Jun 22 01:13:20 soak-8 kernel: 4 pages in swap cache
Jun 22 01:13:20 soak-8 kernel: Swap cache stats: add 16, delete 12, find 0/0
Jun 22 01:13:20 soak-8 kernel: Free swap  = 16319420kB
Jun 22 01:13:21 soak-8 kernel: Total swap = 16319484kB
Jun 22 01:13:21 soak-8 kernel: 8369064 pages RAM
Jun 22 01:13:21 soak-8 kernel: 0 pages HighMem/MovableOnly
Jun 22 01:13:21 soak-8 kernel: 241204 pages reserved
Jun 22 01:13:21 soak-8 kernel: BUG: unable to handle kernel paging request at ffffeb04007e4c40
Jun 22 01:13:21 soak-8 kernel: IP: [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: PGD 0
Jun 22 01:13:21 soak-8 kernel: Oops: 0000 [#1] SMP
Jun 22 01:13:21 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin sb_edac ioatdma edac_core ipmi_devintf sg ntb mei_me mei iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich ipmi_ssif shpchp pcspkr wmi
Jun 22 01:13:21 soak-8 kernel: ipmi_si ipmi_msghandler dm_multipath nfsd dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper isci syscopyarea sysfillrect sysimgblt igb fb_sys_fops ahci crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel libsas libahci pps_core drm mlx4_core mpt2sas dca libata raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 22 01:13:21 soak-8 kernel: CPU: 9 PID: 5062 Comm: mdt_out01_014 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:21 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:21 soak-8 kernel: task: ffff880829e50fb0 ti: ffff88080f344000 task.ti: ffff88080f344000
Jun 22 01:13:21 soak-8 kernel: RIP: 0010:[<ffffffffa0c3c2ff>]  [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: RSP: 0018:ffff88080f347a18  EFLAGS: 00010202
Jun 22 01:13:21 soak-8 kernel: RAX: 00000104007e4c40 RBX: 00050000c0a8016d RCX: 000077ff80000000
Jun 22 01:13:21 soak-8 kernel: RDX: ffffea0000000000 RSI: 0000000000000001 RDI: ffff8804182023c0
Jun 22 01:13:21 soak-8 kernel: RBP: ffff88080f347a18 R08: 0000000000000009 R09: 00000000000090c0
Jun 22 01:13:21 soak-8 kernel: R10: ffffffffa0c46171 R11: ffffc9001f931100 R12: ffff880503b5c000
Jun 22 01:13:21 soak-8 kernel: R13: 00050000c0a8016d R14: 0000000000000001 R15: ffff8804acad0600
Jun 22 01:13:21 soak-8 kernel: FS:  0000000000000000(0000) GS:ffff88082d840000(0000) knlGS:0000000000000000
Jun 22 01:13:21 soak-8 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 22 01:13:21 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 01:13:21 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 01:13:21 soak-8 kernel: Stack:
Jun 22 01:13:21 soak-8 kernel: ffff88080f347ad8 ffffffffa0c437dc ffff88080f347a90 ffff8804acad0638
Jun 22 01:13:21 soak-8 kernel: ffffea0020b0de00 ffff88080f347af8 ffffffffffffffff ffff880829e50fb0
Jun 22 01:13:21 soak-8 kernel: ffff880829e50fb0 ffff88082c379000 ffff88082c379000 0000000100017a88
Jun 22 01:13:21 soak-8 kernel: Call Trace:
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c437dc>] lnet_select_pathway+0x5c/0x1140 [lnet]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c45fb1>] lnet_send+0x51/0x180 [lnet]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c46325>] LNetPut+0x245/0x7a0 [lnet]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f0ebc6>] ptl_send_buf+0x146/0x530 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0baacde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f31fb7>] ? at_measured+0x1c7/0x380 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f11e6b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0ed098e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0edb476>] target_send_reply+0x306/0x730 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f18b07>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f79517>] tgt_request_handle+0x587/0x1360 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f23143>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f20938>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:21 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:21 soak-8 kernel: Code: ff 77 00 00 48 8b 3d 21 53 03 00 48 01 d0 48 0f 42 0d 26 8d d8 e0 48 ba 00 00 00 00 00 ea ff ff 48 01 c8 48 c1 e8 0c 48 c1 e0 06 <48> 8b 34 10 48 c1 ee 36 e8 e4 ec f6 ff 5d c3 66 90 b8 ff ff ff
Jun 22 01:13:21 soak-8 kernel: RIP  [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: RSP <ffff88080f347a18>
Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40

And on MDT001

Jun 22 01:13:20 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -12
Jun 22 01:13:20 soak-9 kernel: LustreError: 4612:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff88081121e300 x1570858955230384/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 376/192 e 0 to 0 dl 1498094043 ref 2 fl Interpret:ReM/0/0 rc -12/-12
Jun 22 01:13:20 soak-9 kernel: general protection fault: 0000 [#1] SMP
Jun 22 01:13:20 soak-9 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel ipmi_ssif aesni_intel sg lrw gf128mul ipmi_devintf glue_helper ablk_helper cryptd mei_me mei ntb ipmi_si ioatdma ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich wmi pcspkr sb_edac i2c_i801
Jun 22 01:13:20 soak-9 kernel: edac_core shpchp dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm isci crct10dif_pclmul crct10dif_common ptp ahci libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 22 01:13:20 soak-9 kernel: CPU: 11 PID: 4612 Comm: osp_up0-1 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:20 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:20 soak-9 kernel: task: ffff8808155c0fb0 ti: ffff8808155c8000 task.ti: ffff8808155c8000
Jun 22 01:13:20 soak-9 kernel: RIP: 0010:[<ffffffffa14fe0d3>]  [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 22 01:13:20 soak-9 kernel: RSP: 0018:ffff8808155cbe00  EFLAGS: 00010282
Jun 22 01:13:20 soak-9 kernel: RAX: ffff8803f3ad1900 RBX: ffff8804014772a0 RCX: 000000018040003e
Jun 22 01:13:20 soak-9 kernel: RDX: 000000018040003f RSI: 5a5a5a5a5a5a5a5a RDI: 0000000040000000
Jun 22 01:13:20 soak-9 kernel: RBP: ffff8808155cbec0 R08: ffff8807c6b12600 R09: 000000018040003e
Jun 22 01:13:20 soak-9 kernel: R10: 00000000c6b11e01 R11: ffffea001f1ac400 R12: ffff8808155c0fb0
Jun 22 01:13:20 soak-9 kernel: R13: ffff880401b55000 R14: 00000000fffffff4 R15: ffff8804014772b0
Jun 22 01:13:20 soak-9 kernel: FS:  0000000000000000(0000) GS:ffff88082d8c0000(0000) knlGS:0000000000000000
Jun 22 01:13:20 soak-9 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 01:13:20 soak-9 kernel: CR2: 00007f8d82582000 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 22 01:13:20 soak-9 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 01:13:20 soak-9 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 01:13:20 soak-9 kernel: Stack:
Jun 22 01:13:20 soak-9 kernel: ffff880401b551e8 ffff8808155c0fb0 ffff8808155c0fb0 ffff8803f3ad1900
Jun 22 01:13:20 soak-9 kernel: 0000000000000000 0000000000000000 0000000000000000 ffff8808155c0fb0
Jun 22 01:13:20 soak-9 kernel: ffffffff810c54e0 dead000000000100 dead000000000200 0000000210000003
Jun 22 01:13:20 soak-9 kernel: Call Trace:
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
Jun 22 01:13:20 soak-9 kernel: [<ffffffffa14fdf00>] ? osp_invalidate_request+0x390/0x390 [osp]
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-9 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-9 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 fd 08 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 <f0> ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d
Jun 22 01:13:20 soak-9 kernel: RIP  [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 22 01:13:20 soak-9 kernel: RSP <ffff8808155cbe00>
Jun 22 01:13:20 soak-9 kernel: ---[ end trace 6ac75bb1f736a48a ]---
Jun 22 01:13:20 soak-9 kernel: Kernel panic - not syncing: Fatal exception

Is this related or new?

Comment by nasf (Inactive) [ 22/Jun/17 ]

The new lnet_cpt_of_md trouble looks more like LU-9203.

Comment by Gerrit Updater [ 22/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26965/
Subject: LU-9049 obdclass: change object lookup to no wait mode
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fa14bdf6b648d1d4023a4fa88789059d185f4a07

Comment by Peter Jones [ 22/Jun/17 ]

Landed for 2.10

Comment by James A Simmons [ 07/May/18 ]

While porting this to the upstream kernel, the question came up of what dependencies need to land before this patch lands.

Comment by Lai Siyao [ 08/May/18 ]

https://review.whamcloud.com/#/c/27708/ is the fix for LU-9678; it is an osd-zfs fix, which is server code.
https://review.whamcloud.com/#/c/28165/ is the fix for LU-9203, which affects both client and server.
https://review.whamcloud.com/#/c/26965/ is the fix for this ticket; it is a server-side fix, but it modifies lu_object_put(), so it affects client code too.

I'd suggest porting the latter two patches; there is no direct dependency between them.

Generated at Sat Feb 10 02:22:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.