[LU-9983] LBUG llog_osd.c:327:llog_osd_declare_write_rec() - all DNE MDS Created: 13/Sep/17  Updated: 02/Jan/19  Resolved: 17/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.6

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: soak
Environment:

version=2.10.52_83_g1fc4ed3 lustre-master build 3637 - RHEL 7.3 distro with 7.4 kernels


Attachments: File MDS.lustre.log.txt.gz     Text File client.log     File client.lustre.log.txt.gz     Text File server.log     File soak-8.lustre.log.gz     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-9500 MOFED 4/mlx5: Aligning non-aligned pa... Resolved
is related to LU-9026 Adapt to the removal of ib_get_dma_mr() Resolved
is related to LU-9810 Melanox OFED 4.1 support Resolved
is related to LU-10089 kiblnd_fmr_pool_map() Failed to map m... Resolved
is related to LU-10068 OST fails to mount:LustreError: 14558... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

soak is started, jobs begin to run.
soak-8 (MDT0000) reports and error:

Sep 13 15:53:37 soak-8 kernel: LustreError: 4873:0:(out_handler.c:597:out_write()) soaked-MDT0000: empty or wrong size 0 pos: rc = -61
Sep 13 15:53:37 soak-8 kernel: LustreError: 4990:0:(out_handler.c:1000:out_handle()) soaked-MDT0000: invalid update buffer magic 0 expect bdde0002: rc = -71
Sep 13 15:53:37 soak-8 kernel: LustreError: 4873:0:(out_handler.c:597:out_write()) Skipped 2 previous similar messages
Sep 13 15:53:44 soak-8 kernel: LustreError: 5117:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 344.

All other DNE MDTS (01-03) promptly LBUG:
soak-9

Sep 13 15:53:37 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -71
Sep 13 15:53:37 soak-9 kernel: LustreError: 4734:0:(layout.c:2082:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff880808e79500 x1578438980902080/t0(0) o1000->soaked-MDT0002-osp-MDT0001@192.168.1.110@o2ib:24/4 lens 392/192 e 0 to 0 dl 1505318024 ref 2 fl Interpret:RM/0/0 rc -71/-71
Sep 13 15:53:37 soak-9 kernel: LustreError: Skipped 1 previous similar message
Sep 13 15:53:47 soak-9 kernel: LustreError: 4719:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed:
Sep 13 15:53:47 soak-9 kernel: LustreError: 4719:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 13 15:53:47 soak-9 kernel: Pid: 4719, comm: dist_txn-1
Sep 13 15:53:47 soak-9 kernel: #012Call Trace:
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0e667ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0e6683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f658f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f5a0e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f5a862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0f60f6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc0e71ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc11f22d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 13 15:53:47 soak-9 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0 
Sep 13 15:53:47 soak-9 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Sep 13 15:53:47 soak-9 kernel: [<ffffffffc11f1d40>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc]
Sep 13 15:53:47 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 13 15:53:47 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 13 15:53:47 soak-9 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Sep 13 15:53:47 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 13 15:53:47 soak-9 kernel:
Sep 13 15:53:47 soak-9 kernel: Kernel panic - not syncing: LBUG
Sep 13 15:53:50 soak-9 kernel: CPU: 25 PID: 4719 Comm: dist_txn-1 Tainted: P           OE  ------------   3.10.0-693.1.1.el7_lustre.x86_64 #1
Sep 13 15:53:50 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Sep 13 15:53:50 soak-9 kernel: ffff880419612f00 000000001ae378cc ffff880418eafb90 ffffffff816a3d6d
Sep 13 15:53:50 soak-9 kernel: ffff880418eafc10 ffffffff8169dc54 ffffffff00000008 ffff880418eafc20
Sep 13 15:53:50 soak-9 kernel: ffff880418eafbc0 000000001ae378cc 000000001ae378cc ffff88082da4f8b8
Sep 13 15:53:50 soak-9 kernel: Call Trace:
Sep 13 15:53:50 soak-9 kernel: [<ffffffff816a3d6d>] dump_stack+0x19/0x1b
Sep 13 15:53:50 soak-9 kernel: [<ffffffff8169dc54>] panic+0xe8/0x20d
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0e66854>] lbug_with_loc+0x64/0xb0 [libcfs]
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f658f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f5a0e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f5a862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0f60f6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc0e71ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc11f22d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 13 15:53:50 soak-9 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Sep 13 15:53:50 soak-9 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Sep 13 15:53:50 soak-9 kernel: [<ffffffffc11f1d40>] ? sub_trans_commit_cb+0x20/0x20 [ptlrpc]
Sep 13 15:53:50 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 13 15:53:50 soak-9 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Sep 13 15:53:50 soak-9 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Sep 13 15:53:50 soak-9 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Sep 13 15:53:50 soak-9 kernel: Kernel Offset: disabled
Sep 13 15:53:50 soak-9 kernel: ------------[ cut here ]------------

soak-10

Sep 13 15:53:37 soak-10 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0002: operation out_update to node 192.168.1.108@o2ib failed: rc = -71
Sep 13 15:53:37 soak-10 kernel: LustreError: 14955:0:(layout.c:2082:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff880818bb5400 x1578438986144240/t0(0) o1000->soaked-MDT0000-osp-MDT0002@192.168.1.108@o2ib:24/4 lens 392/192 e 0 to 0 dl 1505318024 ref 2 fl Interpret:RM/0/0 rc -71/-71
Sep 13 15:53:37 soak-10 kernel: LustreError: 14939:0:(out_handler.c:597:out_write()) soaked-MDT0002: empty or wrong size 0 pos: rc = -61
Sep 13 15:53:37 soak-10 kernel: LustreError: 14939:0:(out_handler.c:1000:out_handle()) soaked-MDT0002: invalid update buffer magic 0 expect bdde0002: rc = -71
Sep 13 15:53:44 soak-10 kernel: LustreError: 14961:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed:
Sep 13 15:53:44 soak-10 kernel: LustreError: 14961:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 13 15:53:44 soak-10 kernel: Pid: 14961, comm: dist_txn-2
Sep 13 15:53:44 soak-10 kernel: #012Call Trace:
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0e087ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0e0883c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f208f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f150e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f15862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0f1bf6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc0e13ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc11fc2d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 13 15:53:44 soak-10 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Sep 13 15:53:44 soak-10 kernel: [<ffffffffc11fbd40>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc]
Sep 13 15:53:44 soak-10 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 13 15:53:44 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 13 15:53:44 soak-10 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Sep 13 15:53:44 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 13 15:53:44 soak-10 kernel:
Sep 13 15:53:44 soak-10 kernel: Kernel panic - not syncing: LBUG
Sep 13 15:53:45 soak-10 kernel: CPU: 10 PID: 14961 Comm: dist_txn-2 Tainted: P           OE  ------------   3.10.0-693.1.1.el7_lustre.x86_64 #1
Sep 13 15:53:45 soak-10 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Sep 13 15:53:45 soak-10 kernel: ffff8803f8e7bf00 000000009b9e8911 ffff8803f8ea7b90 ffffffff816a3d6d
Sep 13 15:53:45 soak-10 kernel: ffff8803f8ea7c10 ffffffff8169dc54 ffffffff00000008 ffff8803f8ea7c20
Sep 13 15:53:45 soak-10 kernel: ffff8803f8ea7bc0 000000009b9e8911 000000009b9e8911 ffff88082d88f8b8
Sep 13 15:53:45 soak-10 kernel: Call Trace:
Sep 13 15:53:45 soak-10 kernel: [<ffffffff816a3d6d>] dump_stack+0x19/0x1b
Sep 13 15:53:45 soak-10 kernel: [<ffffffff8169dc54>] panic+0xe8/0x20d
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0e08854>] lbug_with_loc+0x64/0xb0 [libcfs]
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f208f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f150e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f15862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0f1bf6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc0e13ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc11fc2d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 13 15:53:45 soak-10 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Sep 13 15:53:45 soak-10 kernel: [<ffffffffc11fbd40>] ? sub_trans_commit_cb+0x20/0x20 [ptlrpc]
Sep 13 15:53:45 soak-10 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 13 15:53:45 soak-10 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Sep 13 15:53:45 soak-10 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Sep 13 15:53:45 soak-10 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Sep 13 15:53:45 soak-10 kernel: Kernel Offset: disabled
Sep 13 15:53:45 soak-10 kernel: ------------[ cut here ]------------

soak-11

Sep 13 15:53:43 soak-11 kernel: LustreError: 4755:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed:
Sep 13 15:53:43 soak-11 kernel: LustreError: 4755:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 13 15:53:43 soak-11 kernel: Pid: 4755, comm: dist_txn-3
Sep 13 15:53:43 soak-11 kernel: #012Call Trace:
Sep 13 15:53:43 soak-11 kernel: [<ffffffffc0d6f7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Sep 13 15:53:43 soak-11 kernel: [<ffffffffc0d6f83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e878f5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e7c0e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e7c862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0e82f6a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc0d7aba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc11632d9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 13 15:53:44 soak-11 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Sep 13 15:53:44 soak-11 kernel: [<ffffffffc1162d40>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc]
Sep 13 15:53:44 soak-11 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 13 15:53:44 soak-11 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 13 15:53:44 soak-11 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Sep 13 15:53:44 soak-11 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 13 15:53:44 soak-11 kernel:
Sep 13 15:53:44 soak-11 kernel: Kernel panic - not syncing: LBUG


 Comments   
Comment by Cliff White (Inactive) [ 18/Sep/17 ]

Hit this again on two MDS over the weekend

Comment by Cliff White (Inactive) [ 22/Sep/17 ]

Hit this again after the RHEL 7.4 update - soak really is not useable with this bug.

Comment by Gerrit Updater [ 22/Sep/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29177
Subject: LU-9983 llog: revert LU-9848
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 1c11b55cc8d600db09bd7502b01bf8515821a9a3

Comment by Cliff White (Inactive) [ 23/Sep/17 ]

Loaded that build, immediately hit LBUG.

Sep 23 00:09:42 soak-9 kernel: LustreError: 2294:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff8808103d1800 x1579283258584816/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 392/192 e 0 to 0 dl 1506125390 ref 2 fl Interpret:RM/0/0 rc -71/-71
Sep 23 00:09:53 soak-9 kernel: LustreError: 2351:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed:
Sep 23 00:09:53 soak-9 kernel: LustreError: 2351:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 23 00:09:53 soak-9 kernel: Pid: 2351, comm: dist_txn-1
Sep 23 00:09:53 soak-9 kernel: #012Call Trace:
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc08907ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc089083c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a16b75>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a0b0e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a0b862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc0a1233a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 23 00:09:53 soak-9 kernel: [<ffffffffc089bbc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 23 00:09:54 soak-9 kernel: [<ffffffffc0cf2139>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 23 00:09:54 soak-9 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Sep 23 00:09:54 soak-9 kernel: [<ffffffffc0cf1ba0>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc]
Sep 23 00:09:54 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 23 00:09:54 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 23 00:09:54 soak-9 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Sep 23 00:09:54 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 23 00:09:54 soak-9 kernel:
Comment by Cliff White (Inactive) [ 23/Sep/17 ]

Will restart with panic_on_lbug=0

Comment by Di Wang [ 23/Sep/17 ]

This actually include 2 bugs.

1. Bulk transfer between MDTs somehow is screwed up, so those update request can not pass the audit.

Sep 13 15:53:37 soak-8 kernel: LustreError: 4873:0:(out_handler.c:597:out_write()) soaked-MDT0000: empty or wrong size 0 pos: rc = -61
Sep 13 15:53:37 soak-8 kernel: LustreError: 4990:0:(out_handler.c:1000:out_handle()) soaked-MDT0000: invalid update buffer magic 0 expect bdde0002: rc = -71

2. when the master MDT get failure of remote request caused by 1, there seems a race between invalidate the request and cancel log.

2 seems easy to fix, but 1 is the major issue here. Did we land sth recently which might break the bulk transfer between MDTs?

Comment by Gerrit Updater [ 23/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29178
Subject: LU-9983 llog: skip non-exist log cancellation
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 7e54ccd5d638cddd77d1aee76aeeae0e26361974

Comment by Gerrit Updater [ 23/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29179
Subject: LU-9983 kernel: revert kernel upgrade
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 144226402da2a74b1ab5391393ba4ff64962d11a

Comment by Di Wang [ 23/Sep/17 ]

https://review.whamcloud.com/29178 is the fix for issue 2.

Unfortunately, I am still not sure what cause the issue1. And it is also very interesting, according to the debug log, the update requests still valid when packing the RPC, but somehow on the server side, those request magic number is reset to 0.

On the source MDT, see those magic

00000004:00080000:15.0:1506127961.257426:0:3000:0:(osp_trans.c:1330:osp_get_next_request()) ou ffff88042ece15a0 version 3 rpc_version 3
00000004:00000001:15.0:1506127961.257444:0:3000:0:(osp_trans.c:1119:osp_send_update_req()) Process entered
00000004:00000001:15.0:1506127961.257446:0:3000:0:(osp_trans.c:361:osp_prep_update_req()) Process entered
00000004:00000040:15.0:1506127961.257448:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x2:0x0] op = create params = 1 batchid = 0 size = 256 repsize 0
00000004:00000040:15.0:1506127961.257453:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 1 fid = [0x2c0003ab2:0x2:0x0] op = ref_add params = 0 batchid = 0 size = 40 repsize 0
00000004:00000040:15.0:1506127961.257456:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 2 fid = [0x2c0003ab2:0x2:0x0] op = insert params = 3 batchid = 0 size = 96 repsize 0
00000004:00000040:15.0:1506127961.257458:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 3 fid = [0x2c0003ab2:0x2:0x0] op = insert params = 3 batchid = 0 size = 96 repsize 0
00000004:00000040:15.0:1506127961.257461:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 4 fid = [0x2c0003ab2:0x2:0x0] op = xattr_set params = 3 batchid = 0 size = 144 repsize 0
00000004:00000040:15.0:1506127961.257464:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 5 fid = [0x2c0003ab2:0x2:0x0] op = xattr_set params = 3 batchid = 0 size = 160 repsize 0
00000004:00000040:15.0:1506127961.257467:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 6 fid = [0x2c0003ab2:0x2:0x0] op = xattr_set params = 3 batchid = 0 size = 96 repsize 0
00000004:00000040:15.0:1506127961.257469:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 7 fid = [0x2c0003ab2:0x3:0x0] op = create params = 1 batchid = 0 size = 256 repsize 0
00000004:00000040:15.0:1506127961.257472:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 8 fid = [0x20000000a:0x3:0x0] op = insert params = 3 batchid = 0 size = 112 repsize 0
00000004:00000040:15.0:1506127961.257475:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff8804b92e0000 magic = bdde0002 count = 9 size = 1256
00000004:00000040:15.0:1506127961.257478:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x1:0x0] op = write params = 2 batchid = 0 size = 32832 repsize 0
00000004:00000040:15.0:1506127961.257482:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff880806030000 magic = bdde0002 count = 1 size = 32832
00000004:00000040:15.0:1506127961.257484:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x1:0x0] op = write params = 2 batchid = 0 size = 128 repsize 0
00000004:00000040:15.0:1506127961.257487:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff8808067f6000 magic = bdde0002 count = 1 size = 128
00000004:00000040:15.0:1506127961.257489:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x3:0x0] op = write params = 2 batchid = 0 size = 32832 repsize 0
00000004:00000040:15.0:1506127961.257492:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff880806040000 magic = bdde0002 count = 1 size = 32832
00000004:00000040:15.0:1506127961.257494:0:3000:0:(osp_trans.c:272:object_update_request_dump()) i = 0 fid = [0x2c0003ab2:0x3:0x0] op = write params = 2 batchid = 0 size = 2696 repsize 0
00000004:00000040:15.0:1506127961.257497:0:3000:0:(osp_trans.c:278:object_update_request_dump()) updates = ffff8808067f7000 magic = bdde0002 count = 1 size = 2696

But on the receiver, the magic has been changed to zero.

00000800:00000200:10.0:1506127961.258292:0:3044:0:(o2iblnd_cb.c:1386:kiblnd_launch_tx()) conn[ffff88082dfb7200] (20)++
00000800:00000200:10.0:1506127961.258294:0:3044:0:(o2iblnd_cb.c:1156:kiblnd_queue_tx_locked()) conn[ffff88082dfb7200] (21)++
00000800:00000200:10.0:1506127961.258299:0:3044:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff88082dfb7200] (22)--
00000100:00000200:10.0:1506127961.258302:0:3044:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 5 pages 77970 bytes via portal 14 id 12345-192.168.1.109@o2ib mbits 0x59c5acbf08b80-0x59c5acbf08b80
00000100:00000001:10.0:1506127961.258306:0:3044:0:(niobuf.c:264:ptlrpc_start_bulk_transfer()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:10.0:1506127961.258595:0:3044:0:(ldlm_lib.c:3268:target_bulk_io()) Process leaving (rc=0 : 0 : 0)
00000020:00020000:10.0:1506127961.258600:0:3044:0:(out_handler.c:1000:out_handle()) soaked-MDT0003: invalid update buffer magic 0 expect bdde0002: rc = -71
00000020:00000001:10.0:1506127961.279044:0:3044:0:(out_handler.c:1001:out_handle()) Process leaving via out_free (rc=18446744073709547449 : -4167 : 0xffffffffffffefb9)
00000020:00000010:10.0:1506127961.279048:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 4096 at ffff88083fcb1000.
00000020:00000010:10.0:1506127961.279066:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 32841 at ffff88080f180000.
00000020:00000010:10.0:1506127961.279086:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 4096 at ffff88080f54f000.
00000020:00000010:10.0:1506127961.279090:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 32841 at ffff88080f190000.
00000020:00000010:10.0:1506127961.279115:0:3044:0:(out_handler.c:1165:out_handle()) kfreed 'update_bufs[i]': 4096 at ffff880817a78000.
00000020:00000010:10.0:1506127961.279117:0:3044:0:(out_handler.c:1169:out_handle()) kfreed 'update_bufs': 40 at ffff88042b7fc340.

Sarah told me this failure did not happen for build 18, whose top is
LU-9903 kernel: kernel update RHEL6.9 [2.6.32-696.10.1.el6]

So I checked all of commit since LU-9725, the most suspicious patches are kernel upgrade, so I revert them https://review.whamcloud.com/#/c/29179/ . Cliff: could you please try this? Thanks.

Comment by Cliff White (Inactive) [ 24/Sep/17 ]

That patch produced the old kernel, (3.10.0-514.26.2) which doesn't work with CENTOS 7.4, can you re-base on a current kernel? (3.10.0-693.2.2)
I just finished up-grading soak to 7.4, really don't want to downgrade.

Comment by Gerrit Updater [ 25/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29185
Subject: LU-9983 lustre: performance v1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 16343af007dc503f38926fae3a7ebd4e7d66e5a0

Comment by Gerrit Updater [ 25/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29186
Subject: LU-9983 lustre: 9983 try v2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d7a390d2f9bb337476c8a1a2b20b524ebc11e7c9

Comment by Di Wang [ 25/Sep/17 ]
 can you re-base on a current kernel? (3.10.0-693.2.2)
I just finished up-grading soak to 7.4, really don't want to downgrade.

Ok, then let's use bisect to find out which patch caused this issue. Could you please try patch 29185 and 29186? And please remove the update log before re-run thanks.

Comment by Cliff White (Inactive) [ 25/Sep/17 ]

Sunday, I re-loaded build 26 and reformatted without DNE. Soak has run for 14 hours without any issues.

Comment by Di Wang [ 25/Sep/17 ]

Cliff tried 29185 and it also hit this LBUG. Build 29185 is basically build 18 + 2 kernel upgrade patches,
https://review.whamcloud.com/29002
https://review.whamcloud.com/28532

Since we did not hit this problem for build 18. so it seems the kernel upgrade is the culprit here? (somehow it break the bulk transfer between MDTs).

I will upload the client/server debug log here, could some network guys have a look?

Comment by Andreas Dilger [ 25/Sep/17 ]

Cliff will also try running IOR with data verification (which we should probably be running on soak all the time), so we can verify if bulk transfers of data are working properly or not. This would help isolate if there are problems with the core bulk data transfer mechanism (which would point more in the direction of LNet or OFED changes), or if this is isolated to OUT.

Di, it is also bad if the MDS is LASSERTing on data that it got over the network. It should only report an error in this case.

Comment by Di Wang [ 25/Sep/17 ]
Di, it is also bad if the MDS is LASSERTing on data that it got over the network. It should only report an error in this case.

Oh, I posted a patch to fix this LASSERT https://review.whamcloud.com/29178 , please check. thanks.

Comment by Cliff White (Inactive) [ 25/Sep/17 ]

Loaded patch 29185 - immediate LBUG. loading next patch

Sep 25 16:22:16 soak-10 kernel: LustreError: 13509:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed:
Sep 25 16:22:16 soak-10 kernel: LustreError: 13509:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 25 16:22:16 soak-10 kernel: Pid: 13509, comm: dist_txn-2
Sep 25 16:22:16 soak-10 kernel: #012Call Trace:
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc09187ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc091883c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a3bac5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a300e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a30862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0a3728a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 25 16:22:16 soak-10 kernel: [<ffffffffc0923bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 25 16:22:17 soak-10 kernel: [<ffffffffc0d170e9>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 25 16:22:17 soak-10 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Sep 25 16:22:17 soak-10 kernel: [<ffffffffc0d16b50>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc]
Sep 25 16:22:17 soak-10 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 25 16:22:17 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 25 16:22:17 soak-10 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Sep 25 16:22:17 soak-10 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 25 16:22:17 soak-10 kernel:
Sep 25 16:22:17 soak-10 kernel: Kernel panic - not syncing: LBUG
Comment by Cliff White (Inactive) [ 25/Sep/17 ]

Errors from soak-8

Sep 25 16:21:41 soak-8 systemd: Removed slice User Slice of root.
Sep 25 16:21:41 soak-8 systemd: Stopping User Slice of root.
Sep 25 16:22:05 soak-8 kernel: LustreError: 13225:0:(out_handler.c:597:out_write()) soaked-MDT0000: empty or wrong size 0 pos: rc = -61
Sep 25 16:22:05 soak-8 kernel: LustreError: 13706:0:(out_handler.c:1000:out_handle()) soaked-MDT0000: invalid update buffer magic 0 expect bdde0002: rc = -71
Sep 25 16:22:47 soak-8 sshd[13757]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 25 16:22:47 soak-8 sshd[13757]: Accepted publickey for root from 10.10.1.116 port 50002 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Sep 25 16:22:47 soak-8 systemd: Created slice User Slice of root.
Sep 25 16:22:47 soak-8 systemd: Starting User Slice of root.
Sep 25 16:22:47 soak-8 systemd-logind: New session 26 of user root.
Sep 25 16:22:47 soak-8 systemd: Started Session 26 of user root.
Sep 25 16:22:47 soak-8 systemd: Starting Session 26 of user root.
Sep 25 16:22:47 soak-8 sshd[13757]: pam_unix(sshd:session): session opened for user root by (uid=0)
Sep 25 16:22:47 soak-8 sshd[13757]: pam_unix(sshd:session): session closed for user root
Sep 25 16:22:47 soak-8 systemd-logind: Removed session 26.
Sep 25 16:22:47 soak-8 systemd: Removed slice User Slice of root.
Sep 25 16:22:47 soak-8 systemd: Stopping User Slice of root.
Sep 25 16:22:49 soak-8 kernel: Lustre: 12799:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1506356561/real 0]  req@ffff8803f3a70f00 x1579526007953840/t0(0) o400->soaked-MDT0002-osp-MDT0000@192.168.1.110@o2ib:24/4 lens 224/224 e 0 to 1 dl 1506356568 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 25 16:22:49 soak-8 kernel: Lustre: soaked-MDT0003-osp-MDT0000: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 25 16:22:49 soak-8 kernel: Lustre: 12799:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 25 16:22:50 soak-8 sshd[13780]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 25 16:22:50 soak-8 sshd[13780]: Accepted publickey for root from 10.10.1.116 port 50026 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Comment by Cliff White (Inactive) [ 25/Sep/17 ]

Attempt to run the second patch with IOR and data verification. removed all old update_logs from MDTs. Soak started 2017-09-25 18:02:05 One IOR single-shared-file test completed without error. soak-9 and soak-11 immediately crashed: <15 seconds after start of test.

soak-9.log:Sep 25 18:02:17 soak-9 kernel: LustreError: 2686:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 25 18:01:01 soak-9 systemd: Stopping User Slice of root.
Sep 25 18:02:06 soak-9 kernel: LustreError: 2615:0:(out_handler.c:597:out_write()) soaked-MDT0001: empty or wrong size 0 pos: rc = -61
Sep 25 18:02:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -71
Sep 25 18:02:06 soak-9 kernel: LustreError: 2630:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012  req@ffff88080daa3900 x1579533157282240/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 392/192 e 0 to 0 dl 1506362537 ref 2 fl Interpret:RM/0/0 rc -71/-71
Sep 25 18:02:17 soak-9 kernel: LustreError: 2686:0:(llog_osd.c:327:llog_osd_declare_write_rec()) ASSERTION( rec ) failed:
Sep 25 18:02:17 soak-9 kernel: LustreError: 2686:0:(llog_osd.c:327:llog_osd_declare_write_rec()) LBUG
Sep 25 18:02:17 soak-9 kernel: Pid: 2686, comm: dist_txn-1
Sep 25 18:02:17 soak-9 kernel: #012Call Trace:
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc09097ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc090983c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a93ac5>] llog_osd_declare_write_rec+0x3c5/0x3d0 [obdclass]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a880e4>] llog_declare_write_rec+0x84/0x200 [obdclass]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a88862>] llog_cancel_rec+0xe2/0x870 [obdclass]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0a8f28a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0914bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0d20139>] distribute_txn_commit_thread+0x599/0xca0 [ptlrpc]
Sep 25 18:02:17 soak-9 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Sep 25 18:02:17 soak-9 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Sep 25 18:02:17 soak-9 kernel: [<ffffffffc0d1fba0>] ? distribute_txn_commit_thread+0x0/0xca0 [ptlrpc]
Sep 25 18:02:17 soak-9 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Sep 25 18:02:17 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 25 18:02:17 soak-9 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Sep 25 18:02:17 soak-9 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Sep 25 18:02:17 soak-9 kernel:
Sep 25 18:02:17 soak-9 kernel: Kernel panic - not syncing: LBUG

Next steps?

Comment by Cliff White (Inactive) [ 25/Sep/17 ]

Per request, lustre-log from soak-8 attached.

Comment by Di Wang [ 25/Sep/17 ]

Ah, so this seems to isolated to OUT. hmm, let me add more debug log.

Comment by Gerrit Updater [ 25/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29202
Subject: LU-9983 out: try 3
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 4e13d3da5beffabe3cb69268349c366acd36a825

Comment by Gerrit Updater [ 25/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29207
Subject: LU-9983 out: try 3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6fcce248254b265f7521deaacb300420445e826b

Please ignore this

Comment by Di Wang [ 25/Sep/17 ]
Cliff will also try running IOR with data verification (which we should probably be running on soak all the time), so we can verify if bulk transfers of data are working properly or not. This would help isolate if there are problems with the core bulk data transfer mechanism (which would point more in the direction of LNet or OFED changes), or if this is isolated to OUT.

Note: bulk between MDTs are using different iovec ops than OSC<--> OST. See LU-5835 http://review.whamcloud.com/12525/

If bulk transferring between OSC and OST are proved to be correct, then maybe we should check those iovec ops for MDTs.

Comment by Gerrit Updater [ 26/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29208
Subject: LU-9983 osp: align the OSP request size by 4k
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: d741a50ea589969c92685b0fff63e22488b44bb4

Although this patch can avoid the LBUG, but the real reason still needs to be understood. It looks like PTLRPC (or Lnet with RHEL 7.4 kernel) can not handle non-pagesize aligned buffer on the receiver side.

Comment by Cliff White (Inactive) [ 26/Sep/17 ]

Previous patch (build 50791) ran 4 hours in soak without LBUG. Switching to https://review.whamcloud.com/#/c/29208/ - build 50793

Comment by Gerrit Updater [ 26/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29218
Subject: LU-9983 llog: skip non-exist log cancellation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4df15f1acd5ccdb0d5628a2ce620d03d90ea9171

Comment by Gerrit Updater [ 26/Sep/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29222
Subject: LU-9983 lnet: test to see if this is the cause
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6f109b6b5d3fadd077df0414292c52f33659c6e0

Comment by Di Wang [ 28/Sep/17 ]

Finally, I got two nodes with IB in Onyx, and can reproduce the issue easily. And also it works fine with tcp lnd. So it is definitely O2ib issue.

If needed, I can ask Sarah to test it in RHEL7.3 to see if it works there.

Comment by Gerrit Updater [ 28/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29238
Subject: LU-9983 llog: skip non-exist log cancellation
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 27fe551a1c18e8d90c2c04ebb5a2dee315b26321

Comment by Gerrit Updater [ 28/Sep/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/29240
Subject: LU-9983 osp: align the OSP request size by 4k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d6bc57faa857d2fb5353476afbf416cf3fde255f

Comment by Cliff White (Inactive) [ 28/Sep/17 ]

Running https://build.hpdd.intel.com/job/lustre-reviews/50862/,

  • connection drops have stopped
  • random job failures have stopped.
  • so far no LBUGS.
    This build is way better than the previous tests.
Comment by Gerrit Updater [ 29/Sep/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29208/
Subject: LU-9983 osp: align the OSP request size by 4k
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: c535d7a4021e3c520246bdde118be97414938266

Comment by Gerrit Updater [ 29/Sep/17 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/29270
Subject: LU-9983 osp: align the OSP request size by 4k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9994585785eeb46fdb42b6b409764c03ea5a7651

Comment by Cliff White (Inactive) [ 02/Oct/17 ]

Running the (almost) latest patch, seeing quite a few of these:

/scratch/logs/syslog/soak-8.log:Oct  1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 4 previous similar messages
/scratch/logs/syslog/soak-8.log:Oct  1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 944 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages
/scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 872 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 10 previous similar messages
/scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 4181:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:44:04 soak-10 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:44:04 soak-9 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 848 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 8 previous similar messages
/scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 9 previous similar messages
/scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 800 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:59:06 soak-9 kernel: LustreError: 2357:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 776 actual 416.

We see some timeouts/lustre-log dumps, but no LBUGs. Further details tomorrow, timeouts may be failover-related.

Comment by Andreas Dilger [ 02/Oct/17 ]

I think the mdt_lvbo_fill() error is fallout from PFL and has been seen on master (LU-9825). That said, it is so commonly printed as to make it very annoying, and IMHO something that should be fixed. Lai, could you please take a look.

Comment by Peter Jones [ 02/Oct/17 ]

Lai will be out this week so is there someone else who could confirm that this is benign enough to live with in 2.10.1 or not?

Comment by Gerrit Updater [ 02/Oct/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29290
Subject: LU-9983 ko2iblnd: allow for discontiguous fragments
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cfce4303780fb6a12d2b08b784e333a2de7de07b

Comment by Cliff White (Inactive) [ 02/Oct/17 ]

Should I move soak to this patch?

Comment by Joseph Gmitter (Inactive) [ 06/Oct/17 ]

Yes, please test the viability of John's patch as a fix for this issue. It has been validated already by Amir/John on their side. After a successful run at the soak scale, we can create a tmp patch to revert Di's fix and have only John's patch present for a final verification.

Comment by Andreas Dilger [ 06/Oct/17 ]

I don't think we need to revert Di's patch, even if John's patch works.

Comment by Cliff White (Inactive) [ 06/Oct/17 ]

I'll do this on Monday.

Comment by Joseph Gmitter (Inactive) [ 07/Oct/17 ]

Thanks Cliff.

Di - do you agree with Andreas that your fix is useful independent of John's proposed fix?

Comment by Di Wang [ 07/Oct/17 ]

Yes, Joe. I agree with Andreas.

Comment by Cliff White (Inactive) [ 09/Oct/17 ]

With this patch, multiple errors when attempting client mounts. Servers mount.

[ 4377.206482] Lustre: 10866:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507572416/real 1507572416]  req@ffff880825ec0300 x1580800666501216/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507572460 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 4377.238308] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[ 4377.254213] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[ 4377.254668] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Comment by Cliff White (Inactive) [ 09/Oct/17 ]

Clients do not mount with this patch - set -1 debug, dumped lustre-log from MDS and client after failure. Attached.

Comment by Gerrit Updater [ 11/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29238/
Subject: LU-9983 llog: skip non-exist log cancellation
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 858f0ee15eb0ba2eca525297d510213ba6c5ec5e

Comment by Gerrit Updater [ 16/Oct/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29290/
Subject: LU-9983 ko2iblnd: allow for discontiguous fragments
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7954a520428002de8629916d4ae7b6660818dc6c

Comment by Chris Horn [ 07/Nov/17 ]

Cray is seeing this issue in our 2.10 testing. I read through the comments in this ticket but I did not see a clear description of the root cause of this issue. Could someone provide a description? I see a ko2iblnd change was made in association with this patch, and I'm concerned that gnilnd may be vulnerable as well.

Comment by John Hammond [ 10/Nov/17 ]

Chris, the root cause is the bug addressed by https://review.whamcloud.com/29290

See the commit message for an explanation. This in turn triggers a second bug which is addressed by https://review.whamcloud.com/29238/.

You can also see the failed assertion in some OOM cases on an MDS.

Comment by Chris Horn [ 10/Nov/17 ]

Got it. Thanks, John.

Comment by Gerrit Updater [ 22/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29218/
Subject: LU-9983 llog: skip non-exist log cancellation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 16d7737894c0a8a111f64f04d7c6bdf0ddc86167

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29270/
Subject: LU-9983 osp: align the OSP request size by 4k
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1f50b1e494ff1b4988508c6d6398ee6769467931

Comment by Peter Jones [ 17/Dec/17 ]

It looks like all the parts of this have finally landed to master

Comment by Gerrit Updater [ 18/Dec/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30580
Subject: LU-9983 ko2iblnd: allow for discontiguous fragments
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 0d10af5408ef98801920c04daf86d1d21c8d6dda

Generated at Sat Feb 10 02:31:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.