[LU-9983] LBUG llog_osd.c:327:llog_osd_declare_write_rec() - all DNE MDS Created: 13/Sep/17 Updated: 02/Jan/19 Resolved: 17/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.6 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
version=2.10.52_83_g1fc4ed3 lustre-master build 3637 - RHEL 7.3 distro with 7.4 kernels |
||
| Attachments: |
|
||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||
| Description |
|
soak is started, jobs begin to run. Sep 13 15:53:37 soak-8 kernel: LustreError: 4873:0:(out_handler.c:597:out_write()) soaked-MDT0000: empty or wrong size 0 pos: rc = -61 Sep 13 15:53:37 soak-8 kernel: LustreError: 4990:0:(out_handler.c:1000:out_handle()) soaked-MDT0000: invalid update buffer magic 0 expect bdde0002: rc = -71 Sep 13 15:53:37 soak-8 kernel: LustreError: 4873:0:(out_handler.c:597:out_write()) Skipped 2 previous similar messages Sep 13 15:53:44 soak-8 kernel: LustreError: 5117:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 344. All other DNE MDTS (01-03) promptly LBUG: Sep 13 15:53:37 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -71 Sep 13 15:53:37 soak-9 kernel: LustreError: 4734:0:(layout.c:2082:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff880808e79500 x1578438980902080/t0(0) o1000->soaked-MDT0002-osp-MDT0001@192.168.1.110@o2ib:24/4 lens 392/192 e 0 to 0 dl 1505318024 ref 2 fl Interpret:RM/0/0 rc -71/-71 Sep 13 15:53:37 soak-9 kernel: LustreError: Skipped 1 previous similar message Sep 13 15:53:47 soak-9 kernel: LustreError: 4719:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed: Sep 13 15:53:47 soak-9 kernel: LustreError: 4719:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 13 15:53:47 soak-9 kernel: Pid: 4719, comm: dist_txn-1 Sep 13 15:53:47 soak-9 kernel: #012Call Trace: Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0e667ae>] libcfs_call_trace+0x4e/0x60 [libcfs] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0e6683c>] lbug_with_loc+0x4c/0xb0 [libcfs] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f658f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f5a0e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f5a862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f60f6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0e71ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 13 15:53:47 soak-9 kernel: [<ffffffffc11f22d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 13 15:53:47 soak-9 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0 Sep 13 15:53:47 soak-9 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20 Sep 13 15:53:47 soak-9 kernel: [<ffffffffc11f1d40>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc] Sep 13 15:53:47 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 13 15:53:47 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 13 15:53:47 soak-9 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 Sep 13 15:53:47 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 13 15:53:47 soak-9 kernel: Sep 13 15:53:47 soak-9 kernel: Kernel panic - not syncing: LBUG Sep 13 15:53:50 soak-9 kernel: CPU: 25 PID: 4719 Comm: dist_txn-1 Tainted: P OE ------------ 3.10.0-693.1.1.el7_lustre.x86_64 #1 Sep 13 15:53:50 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 Sep 13 15:53:50 soak-9 kernel: ffff880419612f00 000000001ae378cc ffff880418eafb90 ffffffff816a3d6d Sep 13 15:53:50 soak-9 kernel: ffff880418eafc10 ffffffff8169dc54 ffffffff00000008 ffff880418eafc20 Sep 13 15:53:50 soak-9 kernel: ffff880418eafbc0 000000001ae378cc 000000001ae378cc ffff88082da4f8b8 Sep 13 15:53:50 soak-9 kernel: Call Trace: Sep 13 15:53:50 soak-9 kernel: [<ffffffff816a3d6d>] dump_stack+0x19/0x1b Sep 13 15:53:50 soak-9 kernel: [<ffffffff8169dc54>] panic+0xe8/0x20d Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0e66854>] lbug_with_loc+0x64/0xb0 [libcfs] Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f658f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f5a0e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f5a862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f60f6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0e71ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 13 15:53:50 soak-9 kernel: [<ffffffffc11f22d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 13 15:53:50 soak-9 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0 Sep 13 15:53:50 soak-9 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 Sep 13 15:53:50 soak-9 kernel: [<ffffffffc11f1d40>] ? sub_trans_commit_cb+0x20/0x20 [ptlrpc] Sep 13 15:53:50 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 13 15:53:50 soak-9 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 Sep 13 15:53:50 soak-9 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 Sep 13 15:53:50 soak-9 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 Sep 13 15:53:50 soak-9 kernel: Kernel Offset: disabled Sep 13 15:53:50 soak-9 kernel: ------------[ cut here ]------------ soak-10 Sep 13 15:53:37 soak-10 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0002: operation out_update to node 192.168.1.108@o2ib failed: rc = -71 Sep 13 15:53:37 soak-10 kernel: LustreError: 14955:0:(layout.c:2082:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff880818bb5400 x1578438986144240/t0(0) o1000->soaked-MDT0000-osp-MDT0002@192.168.1.108@o2ib:24/4 lens 392/192 e 0 to 0 dl 1505318024 ref 2 fl Interpret:RM/0/0 rc -71/-71 Sep 13 15:53:37 soak-10 kernel: LustreError: 14939:0:(out_handler.c:597:out_write()) soaked-MDT0002: empty or wrong size 0 pos: rc = -61 Sep 13 15:53:37 soak-10 kernel: LustreError: 14939:0:(out_handler.c:1000:out_handle()) soaked-MDT0002: invalid update buffer magic 0 expect bdde0002: rc = -71 Sep 13 15:53:44 soak-10 kernel: LustreError: 14961:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed: Sep 13 15:53:44 soak-10 kernel: LustreError: 14961:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 13 15:53:44 soak-10 kernel: Pid: 14961, comm: dist_txn-2 Sep 13 15:53:44 soak-10 kernel: #012Call Trace: Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0e087ae>] libcfs_call_trace+0x4e/0x60 [libcfs] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0e0883c>] lbug_with_loc+0x4c/0xb0 [libcfs] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f208f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f150e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f15862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f1bf6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0e13ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 13 15:53:44 soak-10 kernel: [<ffffffffc11fc2d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 13 15:53:44 soak-10 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20 Sep 13 15:53:44 soak-10 kernel: [<ffffffffc11fbd40>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc] Sep 13 15:53:44 soak-10 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 13 15:53:44 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 13 15:53:44 soak-10 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 Sep 13 15:53:44 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 13 15:53:44 soak-10 kernel: Sep 13 15:53:44 soak-10 kernel: Kernel panic - not syncing: LBUG Sep 13 15:53:45 soak-10 kernel: CPU: 10 PID: 14961 Comm: dist_txn-2 Tainted: P OE ------------ 3.10.0-693.1.1.el7_lustre.x86_64 #1 Sep 13 15:53:45 soak-10 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 Sep 13 15:53:45 soak-10 kernel: ffff8803f8e7bf00 000000009b9e8911 ffff8803f8ea7b90 ffffffff816a3d6d Sep 13 15:53:45 soak-10 kernel: ffff8803f8ea7c10 ffffffff8169dc54 ffffffff00000008 ffff8803f8ea7c20 Sep 13 15:53:45 soak-10 kernel: ffff8803f8ea7bc0 000000009b9e8911 000000009b9e8911 ffff88082d88f8b8 Sep 13 15:53:45 soak-10 kernel: Call Trace: Sep 13 15:53:45 soak-10 kernel: [<ffffffff816a3d6d>] dump_stack+0x19/0x1b Sep 13 15:53:45 soak-10 kernel: [<ffffffff8169dc54>] panic+0xe8/0x20d Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0e08854>] lbug_with_loc+0x64/0xb0 [libcfs] Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f208f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f150e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f15862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f1bf6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0e13ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 13 15:53:45 soak-10 kernel: [<ffffffffc11fc2d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 13 15:53:45 soak-10 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 Sep 13 15:53:45 soak-10 kernel: [<ffffffffc11fbd40>] ? sub_trans_commit_cb+0x20/0x20 [ptlrpc] Sep 13 15:53:45 soak-10 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 13 15:53:45 soak-10 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 Sep 13 15:53:45 soak-10 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 Sep 13 15:53:45 soak-10 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 Sep 13 15:53:45 soak-10 kernel: Kernel Offset: disabled Sep 13 15:53:45 soak-10 kernel: ------------[ cut here ]------------ soak-11 Sep 13 15:53:43 soak-11 kernel: LustreError: 4755:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed: Sep 13 15:53:43 soak-11 kernel: LustreError: 4755:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 13 15:53:43 soak-11 kernel: Pid: 4755, comm: dist_txn-3 Sep 13 15:53:43 soak-11 kernel: #012Call Trace: Sep 13 15:53:43 soak-11 kernel: [<ffffffffc0d6f7ae>] libcfs_call_trace+0x4e/0x60 [libcfs] Sep 13 15:53:43 soak-11 kernel: [<ffffffffc0d6f83c>] lbug_with_loc+0x4c/0xb0 [libcfs] Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e878f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e7c0e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e7c862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e82f6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0d7aba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 13 15:53:44 soak-11 kernel: [<ffffffffc11632d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 13 15:53:44 soak-11 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20 Sep 13 15:53:44 soak-11 kernel: [<ffffffffc1162d40>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc] Sep 13 15:53:44 soak-11 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 13 15:53:44 soak-11 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 13 15:53:44 soak-11 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 Sep 13 15:53:44 soak-11 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 13 15:53:44 soak-11 kernel: Sep 13 15:53:44 soak-11 kernel: Kernel panic - not syncing: LBUG |
| Comments |
| Comment by Cliff White (Inactive) [ 18/Sep/17 ] |
|
Hit this again on two MDS over the weekend |
| Comment by Cliff White (Inactive) [ 22/Sep/17 ] |
|
Hit this again after the RHEL 7.4 update - soak really is not useable with this bug. |
| Comment by Gerrit Updater [ 22/Sep/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29177 |
| Comment by Cliff White (Inactive) [ 23/Sep/17 ] |
|
Loaded that build, immediately hit LBUG. Sep 23 00:09:42 soak-9 kernel: LustreError: 2294:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff8808103d1800 x1579283258584816/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 392/192 e 0 to 0 dl 1506125390 ref 2 fl Interpret:RM/0/0 rc -71/-71 Sep 23 00:09:53 soak-9 kernel: LustreError: 2351:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed: Sep 23 00:09:53 soak-9 kernel: LustreError: 2351:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 23 00:09:53 soak-9 kernel: Pid: 2351, comm: dist_txn-1 Sep 23 00:09:53 soak-9 kernel: #012Call Trace: Sep 23 00:09:53 soak-9 kernel: [<ffffffffc08907ae>] libcfs_call_trace+0x4e/0x60 [libcfs] Sep 23 00:09:53 soak-9 kernel: [<ffffffffc089083c>] lbug_with_loc+0x4c/0xb0 [libcfs] Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a16b75>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a0b0e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a0b862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a1233a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 23 00:09:53 soak-9 kernel: [<ffffffffc089bbc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 23 00:09:54 soak-9 kernel: [<ffffffffc0cf2139>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 23 00:09:54 soak-9 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20 Sep 23 00:09:54 soak-9 kernel: [<ffffffffc0cf1ba0>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc] Sep 23 00:09:54 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 23 00:09:54 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 23 00:09:54 soak-9 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 Sep 23 00:09:54 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 23 00:09:54 soak-9 kernel: |
| Comment by Cliff White (Inactive) [ 23/Sep/17 ] |
|
Will restart with panic_on_lbug=0 |
| Comment by Di Wang [ 23/Sep/17 ] |
|
This actually include 2 bugs. 1. Bulk transfer between MDTs somehow is screwed up, so those update request can not pass the audit. Sep 13 15:53:37 soak-8 kernel: LustreError: 4873:0:(out_handler.c:597:out_write()) soaked-MDT0000: empty or wrong size 0 pos: rc = -61 Sep 13 15:53:37 soak-8 kernel: LustreError: 4990:0:(out_handler.c:1000:out_handle()) soaked-MDT0000: invalid update buffer magic 0 expect bdde0002: rc = -71 2. when the master MDT get failure of remote request caused by 1, there seems a race between invalidate the request and cancel log. 2 seems easy to fix, but 1 is the major issue here. Did we land sth recently which might break the bulk transfer between MDTs? |
| Comment by Gerrit Updater [ 23/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29178 |
| Comment by Gerrit Updater [ 23/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29179 |
| Comment by Di Wang [ 23/Sep/17 ] |
|
https://review.whamcloud.com/29178 is the fix for issue 2. Unfortunately, I am still not sure what cause the issue1. And it is also very interesting, according to the debug log, the update requests still valid when packing the RPC, but somehow on the server side, those request magic number is reset to 0. On the source MDT, see those magic 00000004:00080000:15.0:1506127961.257426:0:3000:0:(osp_trans.c:1330:osp_get_next_request()) ou ffff88042ece15a0 version 3 rpc_version 3 00000004:00000001:15.0:1506127961.257444:0:3000:0:(osp_trans.c:1119:osp_send_update_req()) Process entered 00000004:00000001:15.0:1506127961.257446:0:3000:0:(osp_trans.c:361:osp_prep_update_req()) Process entered 00000004:00000040:15.0:1506127961.257448:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x2:0x0] op = create params = 1 batchid = 0 size = 256 repsize 0 00000004:00000040:15.0:1506127961.257453:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 1 fid = [0x2c0003ab2:0x2:0x0] op = ref_add params = 0 batchid = 0 size = 40 repsize 0 00000004:00000040:15.0:1506127961.257456:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 2 fid = [0x2c0003ab2:0x2:0x0] op = insert params = 3 batchid = 0 size = 96 repsize 0 00000004:00000040:15.0:1506127961.257458:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 3 fid = [0x2c0003ab2:0x2:0x0] op = insert params = 3 batchid = 0 size = 96 repsize 0 00000004:00000040:15.0:1506127961.257461:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 4 fid = [0x2c0003ab2:0x2:0x0] op = xattr_set params = 3 batchid = 0 size = 144 repsize 0 00000004:00000040:15.0:1506127961.257464:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 5 fid = [0x2c0003ab2:0x2:0x0] op = xattr_set params = 3 batchid = 0 size = 160 repsize 0 00000004:00000040:15.0:1506127961.257467:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 6 fid = [0x2c0003ab2:0x2:0x0] op = xattr_set params = 3 batchid = 0 size = 96 repsize 0 00000004:00000040:15.0:1506127961.257469:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 7 fid = [0x2c0003ab2:0x3:0x0] op = create params = 1 batchid = 0 size = 256 repsize 0 00000004:00000040:15.0:1506127961.257472:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 8 fid = [0x20000000a:0x3:0x0] op = insert params = 3 batchid = 0 size = 112 repsize 0 00000004:00000040:15.0:1506127961.257475:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff8804b92e0000 magic = bdde0002 count = 9 size = 1256 00000004:00000040:15.0:1506127961.257478:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x1:0x0] op = write params = 2 batchid = 0 size = 32832 repsize 0 00000004:00000040:15.0:1506127961.257482:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff880806030000 magic = bdde0002 count = 1 size = 32832 00000004:00000040:15.0:1506127961.257484:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x1:0x0] op = write params = 2 batchid = 0 size = 128 repsize 0 00000004:00000040:15.0:1506127961.257487:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff8808067f6000 magic = bdde0002 count = 1 size = 128 00000004:00000040:15.0:1506127961.257489:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x3:0x0] op = write params = 2 batchid = 0 size = 32832 repsize 0 00000004:00000040:15.0:1506127961.257492:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff880806040000 magic = bdde0002 count = 1 size = 32832 00000004:00000040:15.0:1506127961.257494:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x3:0x0] op = write params = 2 batchid = 0 size = 2696 repsize 0 00000004:00000040:15.0:1506127961.257497:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff8808067f7000 magic = bdde0002 count = 1 size = 2696 But on the receiver, the magic has been changed to zero. 00000800:00000200:10.0:1506127961.258292:0:3044:0:(o2iblnd_cb.c:1386:kiblnd_launch_tx()) conn[ffff88082dfb7200] (20)++ 00000800:00000200:10.0:1506127961.258294:0:3044:0:(o2iblnd_cb.c:1156:kiblnd_queue_tx_locked()) conn[ffff88082dfb7200] (21)++ 00000800:00000200:10.0:1506127961.258299:0:3044:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff88082dfb7200] (22)-- 00000100:00000200:10.0:1506127961.258302:0:3044:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 5 pages 77970 bytes via portal 14 id 12345-192.168.1.109@o2ib mbits 0x59c5acbf08b80-0x59c5acbf08b80 00000100:00000001:10.0:1506127961.258306:0:3044:0:(niobuf.c:264:ptlrpc_start_bulk_transfer()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:10.0:1506127961.258595:0:3044:0:(ldlm_lib.c:3268:target_bulk_io()) Process leaving (rc=0 : 0 : 0) 00000020:00020000:10.0:1506127961.258600:0:3044:0:(out_handler.c:1000:out_handle()) soaked-MDT0003: invalid update buffer magic 0 expect bdde0002: rc = -71 00000020:00000001:10.0:1506127961.279044:0:3044:0:(out_handler.c:1001:out_handle()) Process leaving via out_free (rc=18446744073709547449 : -4167 : 0xffffffffffffefb9) 00000020:00000010:10.0:1506127961.279048:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 4096 at ffff88083fcb1000. 00000020:00000010:10.0:1506127961.279066:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 32841 at ffff88080f180000. 00000020:00000010:10.0:1506127961.279086:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 4096 at ffff88080f54f000. 00000020:00000010:10.0:1506127961.279090:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 32841 at ffff88080f190000. 00000020:00000010:10.0:1506127961.279115:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 4096 at ffff880817a78000. 00000020:00000010:10.0:1506127961.279117:0:3044:0:(out_handler.c:1169:out_handle()) kfreed 'update_bufs': 40 at ffff88042b7fc340. Sarah told me this failure did not happen for build 18, whose top is So I checked all of commit since |
| Comment by Cliff White (Inactive) [ 24/Sep/17 ] |
|
That patch produced the old kernel, (3.10.0-514.26.2) which doesn't work with CENTOS 7.4, can you re-base on a current kernel? (3.10.0-693.2.2) |
| Comment by Gerrit Updater [ 25/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29185 |
| Comment by Gerrit Updater [ 25/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29186 |
| Comment by Di Wang [ 25/Sep/17 ] |
can you re-base on a current kernel? (3.10.0-693.2.2) I just finished up-grading soak to 7.4, really don't want to downgrade. Ok, then let's use bisect to find out which patch caused this issue. Could you please try patch 29185 and 29186? And please remove the update log before re-run thanks. |
| Comment by Cliff White (Inactive) [ 25/Sep/17 ] |
|
Sunday, I re-loaded build 26 and reformatted without DNE. Soak has run for 14 hours without any issues. |
| Comment by Di Wang [ 25/Sep/17 ] |
|
Cliff tried 29185 and it also hit this LBUG. Build 29185 is basically build 18 + 2 kernel upgrade patches, Since we did not hit this problem for build 18. so it seems the kernel upgrade is the culprit here? (somehow it break the bulk transfer between MDTs). I will upload the client/server debug log here, could some network guys have a look? |
| Comment by Andreas Dilger [ 25/Sep/17 ] |
|
Cliff will also try running IOR with data verification (which we should probably be running on soak all the time), so we can verify if bulk transfers of data are working properly or not. This would help isolate if there are problems with the core bulk data transfer mechanism (which would point more in the direction of LNet or OFED changes), or if this is isolated to OUT. Di, it is also bad if the MDS is LASSERTing on data that it got over the network. It should only report an error in this case. |
| Comment by Di Wang [ 25/Sep/17 ] |
Di, it is also bad if the MDS is LASSERTing on data that it got over the network. It should only report an error in this case. Oh, I posted a patch to fix this LASSERT https://review.whamcloud.com/29178 , please check. thanks. |
| Comment by Cliff White (Inactive) [ 25/Sep/17 ] |
|
Loaded patch 29185 - immediate LBUG. loading next patch Sep 25 16:22:16 soak-10 kernel: LustreError: 13509:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed: Sep 25 16:22:16 soak-10 kernel: LustreError: 13509:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 25 16:22:16 soak-10 kernel: Pid: 13509, comm: dist_txn-2 Sep 25 16:22:16 soak-10 kernel: #012Call Trace: Sep 25 16:22:16 soak-10 kernel: [<ffffffffc09187ae>] libcfs_call_trace+0x4e/0x60 [libcfs] Sep 25 16:22:16 soak-10 kernel: [<ffffffffc091883c>] lbug_with_loc+0x4c/0xb0 [libcfs] Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a3bac5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a300e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a30862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a3728a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0923bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 25 16:22:17 soak-10 kernel: [<ffffffffc0d170e9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 25 16:22:17 soak-10 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20 Sep 25 16:22:17 soak-10 kernel: [<ffffffffc0d16b50>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc] Sep 25 16:22:17 soak-10 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 25 16:22:17 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 25 16:22:17 soak-10 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 Sep 25 16:22:17 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 25 16:22:17 soak-10 kernel: Sep 25 16:22:17 soak-10 kernel: Kernel panic - not syncing: LBUG |
| Comment by Cliff White (Inactive) [ 25/Sep/17 ] |
|
Errors from soak-8 Sep 25 16:21:41 soak-8 systemd: Removed slice User Slice of root. Sep 25 16:21:41 soak-8 systemd: Stopping User Slice of root. Sep 25 16:22:05 soak-8 kernel: LustreError: 13225:0:(out_handler.c:597:out_write()) soaked-MDT0000: empty or wrong size 0 pos: rc = -61 Sep 25 16:22:05 soak-8 kernel: LustreError: 13706:0:(out_handler.c:1000:out_handle()) soaked-MDT0000: invalid update buffer magic 0 expect bdde0002: rc = -71 Sep 25 16:22:47 soak-8 sshd[13757]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Sep 25 16:22:47 soak-8 sshd[13757]: Accepted publickey for root from 10.10.1.116 port 50002 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM Sep 25 16:22:47 soak-8 systemd: Created slice User Slice of root. Sep 25 16:22:47 soak-8 systemd: Starting User Slice of root. Sep 25 16:22:47 soak-8 systemd-logind: New session 26 of user root. Sep 25 16:22:47 soak-8 systemd: Started Session 26 of user root. Sep 25 16:22:47 soak-8 systemd: Starting Session 26 of user root. Sep 25 16:22:47 soak-8 sshd[13757]: pam_unix(sshd:session): session opened for user root by (uid=0) Sep 25 16:22:47 soak-8 sshd[13757]: pam_unix(sshd:session): session closed for user root Sep 25 16:22:47 soak-8 systemd-logind: Removed session 26. Sep 25 16:22:47 soak-8 systemd: Removed slice User Slice of root. Sep 25 16:22:47 soak-8 systemd: Stopping User Slice of root. Sep 25 16:22:49 soak-8 kernel: Lustre: 12799:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1506356561/real 0] req@ffff8803f3a70f00 x1579526007953840/t0(0) o400->soaked-MDT0002-osp-MDT0000@192.168.1.110@o2ib:24/4 lens 224/224 e 0 to 1 dl 1506356568 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Sep 25 16:22:49 soak-8 kernel: Lustre: soaked-MDT0003-osp-MDT0000: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete Sep 25 16:22:49 soak-8 kernel: Lustre: 12799:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Sep 25 16:22:50 soak-8 sshd[13780]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Sep 25 16:22:50 soak-8 sshd[13780]: Accepted publickey for root from 10.10.1.116 port 50026 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM |
| Comment by Cliff White (Inactive) [ 25/Sep/17 ] |
|
Attempt to run the second patch with IOR and data verification. removed all old update_logs from MDTs. Soak started 2017-09-25 18:02:05 One IOR single-shared-file test completed without error. soak-9 and soak-11 immediately crashed: <15 seconds after start of test. soak-9.log:Sep 25 18:02:17 soak-9 kernel: LustreError: 2686:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 25 18:01:01 soak-9 systemd: Stopping User Slice of root. Sep 25 18:02:06 soak-9 kernel: LustreError: 2615:0:(out_handler.c:597:out_write()) soaked-MDT0001: empty or wrong size 0 pos: rc = -61 Sep 25 18:02:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -71 Sep 25 18:02:06 soak-9 kernel: LustreError: 2630:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff88080daa3900 x1579533157282240/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 392/192 e 0 to 0 dl 1506362537 ref 2 fl Interpret:RM/0/0 rc -71/-71 Sep 25 18:02:17 soak-9 kernel: LustreError: 2686:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed: Sep 25 18:02:17 soak-9 kernel: LustreError: 2686:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG Sep 25 18:02:17 soak-9 kernel: Pid: 2686, comm: dist_txn-1 Sep 25 18:02:17 soak-9 kernel: #012Call Trace: Sep 25 18:02:17 soak-9 kernel: [<ffffffffc09097ae>] libcfs_call_trace+0x4e/0x60 [libcfs] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc090983c>] lbug_with_loc+0x4c/0xb0 [libcfs] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a93ac5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a880e4>] llog_declare_write_rec+0x84/0x200 [obdclass] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a88862>] llog_cancel_rec+0xe2/0x870 [obdclass] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a8f28a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0914bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0d20139>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc] Sep 25 18:02:17 soak-9 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0 Sep 25 18:02:17 soak-9 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20 Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0d1fba0>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc] Sep 25 18:02:17 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0 Sep 25 18:02:17 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 25 18:02:17 soak-9 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 Sep 25 18:02:17 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0 Sep 25 18:02:17 soak-9 kernel: Sep 25 18:02:17 soak-9 kernel: Kernel panic - not syncing: LBUG Next steps? |
| Comment by Cliff White (Inactive) [ 25/Sep/17 ] |
|
Per request, lustre-log from soak-8 attached. |
| Comment by Di Wang [ 25/Sep/17 ] |
|
Ah, so this seems to isolated to OUT. hmm, let me add more debug log. |
| Comment by Gerrit Updater [ 25/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29202 |
| Comment by Gerrit Updater [ 25/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29207 Please ignore this |
| Comment by Di Wang [ 25/Sep/17 ] |
Cliff will also try running IOR with data verification (which we should probably be running on soak all the time), so we can verify if bulk transfers of data are working properly or not. This would help isolate if there are problems with the core bulk data transfer mechanism (which would point more in the direction of LNet or OFED changes), or if this is isolated to OUT. Note: bulk between MDTs are using different iovec ops than OSC<--> OST. See If bulk transferring between OSC and OST are proved to be correct, then maybe we should check those iovec ops for MDTs. |
| Comment by Gerrit Updater [ 26/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29208 Although this patch can avoid the LBUG, but the real reason still needs to be understood. It looks like PTLRPC (or Lnet with RHEL 7.4 kernel) can not handle non-pagesize aligned buffer on the receiver side. |
| Comment by Cliff White (Inactive) [ 26/Sep/17 ] |
|
Previous patch (build 50791) ran 4 hours in soak without LBUG. Switching to https://review.whamcloud.com/#/c/29208/ - build 50793 |
| Comment by Gerrit Updater [ 26/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29218 |
| Comment by Gerrit Updater [ 26/Sep/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29222 |
| Comment by Di Wang [ 28/Sep/17 ] |
|
Finally, I got two nodes with IB in Onyx, and can reproduce the issue easily. And also it works fine with tcp lnd. So it is definitely O2ib issue. If needed, I can ask Sarah to test it in RHEL7.3 to see if it works there. |
| Comment by Gerrit Updater [ 28/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29238 |
| Comment by Gerrit Updater [ 28/Sep/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29240 |
| Comment by Cliff White (Inactive) [ 28/Sep/17 ] |
|
Running https://build.hpdd.intel.com/job/lustre-reviews/50862/,
|
| Comment by Gerrit Updater [ 29/Sep/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29208/ |
| Comment by Gerrit Updater [ 29/Sep/17 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/29270 |
| Comment by Cliff White (Inactive) [ 02/Oct/17 ] |
|
Running the (almost) latest patch, seeing quite a few of these: /scratch/logs/syslog/soak-8.log:Oct 1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 4 previous similar messages /scratch/logs/syslog/soak-8.log:Oct 1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 944 actual 416. /scratch/logs/syslog/soak-9.log:Oct 1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages /scratch/logs/syslog/soak-9.log:Oct 1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 872 actual 416. /scratch/logs/syslog/soak-10.log:Oct 1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 10 previous similar messages /scratch/logs/syslog/soak-10.log:Oct 1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416. /scratch/logs/syslog/soak-10.log:Oct 1 22:42:26 soak-10 kernel: LustreError: 4181:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416. /scratch/logs/syslog/soak-10.log:Oct 1 22:44:04 soak-10 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416. /scratch/logs/syslog/soak-9.log:Oct 1 22:44:04 soak-9 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 848 actual 416. /scratch/logs/syslog/soak-10.log:Oct 1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 8 previous similar messages /scratch/logs/syslog/soak-10.log:Oct 1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416. /scratch/logs/syslog/soak-9.log:Oct 1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 9 previous similar messages /scratch/logs/syslog/soak-9.log:Oct 1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 800 actual 416. /scratch/logs/syslog/soak-9.log:Oct 1 22:59:06 soak-9 kernel: LustreError: 2357:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 776 actual 416. We see some timeouts/lustre-log dumps, but no LBUGs. Further details tomorrow, timeouts may be failover-related. |
| Comment by Andreas Dilger [ 02/Oct/17 ] |
|
I think the mdt_lvbo_fill() error is fallout from PFL and has been seen on master (LU-9825). That said, it is so commonly printed as to make it very annoying, and IMHO something that should be fixed. Lai, could you please take a look. |
| Comment by Peter Jones [ 02/Oct/17 ] |
|
Lai will be out this week so is there someone else who could confirm that this is benign enough to live with in 2.10.1 or not? |
| Comment by Gerrit Updater [ 02/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29290 |
| Comment by Cliff White (Inactive) [ 02/Oct/17 ] |
|
Should I move soak to this patch? |
| Comment by Joseph Gmitter (Inactive) [ 06/Oct/17 ] |
|
Yes, please test the viability of John's patch as a fix for this issue. It has been validated already by Amir/John on their side. After a successful run at the soak scale, we can create a tmp patch to revert Di's fix and have only John's patch present for a final verification. |
| Comment by Andreas Dilger [ 06/Oct/17 ] |
|
I don't think we need to revert Di's patch, even if John's patch works. |
| Comment by Cliff White (Inactive) [ 06/Oct/17 ] |
|
I'll do this on Monday. |
| Comment by Joseph Gmitter (Inactive) [ 07/Oct/17 ] |
|
Thanks Cliff. Di - do you agree with Andreas that your fix is useful independent of John's proposed fix? |
| Comment by Di Wang [ 07/Oct/17 ] |
|
Yes, Joe. I agree with Andreas. |
| Comment by Cliff White (Inactive) [ 09/Oct/17 ] |
|
With this patch, multiple errors when attempting client mounts. Servers mount. [ 4377.206482] Lustre: 10866:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507572416/real 1507572416] req@ffff880825ec0300 x1580800666501216/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507572460 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [ 4377.238308] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [ 4377.254213] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. [ 4377.254668] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) |
| Comment by Cliff White (Inactive) [ 09/Oct/17 ] |
|
Clients do not mount with this patch - set -1 debug, dumped lustre-log from MDS and client after failure. Attached. |
| Comment by Gerrit Updater [ 11/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29238/ |
| Comment by Gerrit Updater [ 16/Oct/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29290/ |
| Comment by Chris Horn [ 07/Nov/17 ] |
|
Cray is seeing this issue in our 2.10 testing. I read through the comments in this ticket but I did not see a clear description of the root cause of this issue. Could someone provide a description? I see a ko2iblnd change was made in association with this patch, and I'm concerned that gnilnd may be vulnerable as well. |
| Comment by John Hammond [ 10/Nov/17 ] |
|
Chris, the root cause is the bug addressed by https://review.whamcloud.com/29290 See the commit message for an explanation. This in turn triggers a second bug which is addressed by https://review.whamcloud.com/29238/. You can also see the failed assertion in some OOM cases on an MDS. |
| Comment by Chris Horn [ 10/Nov/17 ] |
|
Got it. Thanks, John. |
| Comment by Gerrit Updater [ 22/Nov/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29218/ |
| Comment by Gerrit Updater [ 17/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29270/ |
| Comment by Peter Jones [ 17/Dec/17 ] |
|
It looks like all the parts of this have finally landed to master |
| Comment by Gerrit Updater [ 18/Dec/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30580 |