[LU-14402] LBUG: osd_write_commit() ASSERTION( !PageDirty(lnb[i].lnb_page) ) failed
Created: 04/Feb/21  Updated: 23/Oct/21  Resolved: 23/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Shaun Tancheff
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14640 ASSERTION( !PageDirty(lnb[i].lnb_page... Resolved
Severity: 3

 Description   
[ 6733.016726] perf: interrupt took too long (3163 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[ 6953.568157] LustreError: 29459:0:(osd_io.c:1558:osd_write_commit()) ASSERTION( !PageDirty(lnb[i].lnb_page) ) failed: 
[ 6953.579769] LustreError: 29459:0:(osd_io.c:1558:osd_write_commit()) LBUG
[ 6953.587203] Pid: 29459, comm: ll_ost_io00_523 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1 SMP Mon Jan 13 18:26:28 PST 2020
[ 6953.598541] Call Trace:
[ 6953.601529]  [<ffffffffc11e862c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 6953.608765]  [<ffffffffc11e894c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 6953.615539]  [<ffffffffc1dffcac>] osd_write_commit+0x52c/0x870 [osd_ldiskfs]
[ 6953.623529]  [<ffffffffc1c24769>] ofd_commitrw_write+0xde9/0x1480 [ofd]
[ 6953.630806]  [<ffffffffc1c2861d>] ofd_commitrw+0x2ad/0x9a0 [ofd]
[ 6953.637512]  [<ffffffffc171dac9>] tgt_brw_write+0xfd9/0x1cc0 [ptlrpc]
[ 6953.644747]  [<ffffffffc1719c4a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[ 6953.652474]  [<ffffffffc16bd136>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 6953.661074]  [<ffffffffc16c1c8c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[ 6953.668522]  [<ffffffffba6c1c31>] kthread+0xd1/0xe0
[ 6953.674378]  [<ffffffffbad76c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 6953.681517]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 6953.687422] Kernel panic - not syncing: LBUG
[ 6953.692298] CPU: 11 PID: 29459 Comm: ll_ost_io00_523 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1
[ 6953.706675] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[ 6953.715123] Call Trace:
[ 6953.718301]  [<ffffffffbad63e41>] dump_stack+0x19/0x1b
[ 6953.724122]  [<ffffffffbad5d550>] panic+0xe8/0x21f
[ 6953.729700]  [<ffffffffc11e899b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 6953.736628]  [<ffffffffc1dffcac>] osd_write_commit+0x52c/0x870 [osd_ldiskfs]
[ 6953.744395]  [<ffffffffc1c24769>] ofd_commitrw_write+0xde9/0x1480 [ofd]
[ 6953.751678]  [<ffffffffc1c2861d>] ofd_commitrw+0x2ad/0x9a0 [ofd]
[ 6953.758374]  [<ffffffffc171dac9>] tgt_brw_write+0xfd9/0x1cc0 [ptlrpc]
[ 6953.765463]  [<ffffffffba6db748>] ? __enqueue_entity+0x78/0x80
[ 6953.771941]  [<ffffffffba6e236f>] ? enqueue_entity+0x2ef/0xbe0
[ 6953.778575]  [<ffffffffc16d6d97>] ? __req_capsule_get+0x427/0x6b0 [ptlrpc]
[ 6953.786210]  [<ffffffffc1719c4a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[ 6953.793817]  [<ffffffffc16f3bc1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[ 6953.802150]  [<ffffffffc11e502e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[ 6953.809942]  [<ffffffffc16bd136>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 6953.818313]  [<ffffffffba6cec64>] ? __wake_up+0x44/0x50
[ 6953.824396]  [<ffffffffc16c1c8c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[ 6953.831440]  [<ffffffffc16c1150>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[ 6953.839649]  [<ffffffffba6c1c31>] kthread+0xd1/0xe0
[ 6953.845394]  [<ffffffffba6c1b60>] ? insert_kthread_work+0x40/0x40
[ 6953.852303]  [<ffffffffbad76c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 6953.859587]  [<ffffffffba6c1b60>] ? insert_kthread_work+0x40/0x40

 Comments   
Comment by Shaun Tancheff [ 19/Sep/21 ]

As of: v2_14_54-52-g1887169365

I am still hitting this LBUG.

Ex:

Sep 19 09:21:52 snx11922n005 kernel: LustreError: 24267:0:(osd_io.c:1608:osd_write_commit()) ASSERTION( !PageDirty(lnb[i].lnb_page) ) failed: 
Sep 19 09:21:52 snx11922n005 kernel: LustreError: 24267:0:(osd_io.c:1608:osd_write_commit()) LBUG
Sep 19 09:21:52 snx11922n005 kernel: Pid: 24267, comm: ll_ost_io00_713 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1 SMP Mon Jan 13 18:26:28 PST 2020
Sep 19 09:21:52 snx11922n005 kernel: IEC: 026000003: LASSERT: { "pid": "24267", "ext_pid": "0", "filename": "osd_io.c", "line": "1608", "func_name": "osd_write_commit", "assert_info": "( !PageDirty(lnb[i].lnb_page) ) failed: " }
Sep 19 09:21:52 snx11922n005 kernel: IEC: 026000004: LBUG: { "pid": "24267", "ext_pid": "0", "filename": "osd_io.c", "line": "1608", "func_name": "osd_write_commit" }
Sep 19 09:21:52 snx11922n005 kernel: Call Trace:
Sep 19 09:21:52 snx11922n005 kernel: [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] osd_write_commit+0x52c/0x880 [osd_ldiskfs]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ofd_commitrw_write+0xef9/0x15d0 [ofd]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ofd_commitrw+0x335/0x9f0 [ofd]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] tgt_brw_write+0x176a/0x2310 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] tgt_request_handle+0x823/0x1850 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb10 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] kthread+0xd1/0xe0
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ret_from_fork_nospec_begin+0x7/0x21
Sep 19 09:21:52 snx11922n005 kernel: [<0>] 0xfffffffffffffffe
Sep 19 09:21:52 snx11922n005 kernel: Kernel panic - not syncing: LBUG

Also, kindly note that this build includes the LU-14640 fix: https://review.whamcloud.com/#/c/43462/

Comment by Gerrit Updater [ 29/Sep/21 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45086
Subject: LU-14402 osd-ldiskfs: Page cache pages dirtied in writeback
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d35690a18d558be68e1701c71f9f02901125cf8

Comment by Gerrit Updater [ 29/Sep/21 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45087
Subject: LU-14402 osd-ldiskfs: disable pagecache bypass feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 060e753054494e44eb15bef9695d9038ba834a4e

Comment by Andreas Dilger [ 29/Sep/21 ]

Shaun, this same ASSERTION() was reported fixed via patch https://review.whamcloud.com/43462 "LU-14640 osd: ASSERTION(!PageDirty(lnb[i].lnb_page)".

The problem wasn't so much the uncached IO patches that you are reverting as that fallocate() wasn't cleaning up partially-zeroed pages in the cache.

Ah, I see your addendum that this patch is included in your recent failure. Do you have any details of your workload that is triggering this error? We've been running with the uncached IO patches for many months without any similar reports in the field, so there must be something specific in your workload that is triggering it.

Comment by Shaun Tancheff [ 30/Sep/21 ]

The crash hits quite early (30 minutes or so) in our io-stress suite. The suite is large, with bits of unaligned I/O mixed in, doing the usual aio/dio, mmap, ior, and ltp tests.

Comment by Shaun Tancheff [ 22/Oct/21 ]

As of 14d07b6237 this did not reproduce.

Comment by Andreas Dilger [ 23/Oct/21 ]

The range v2_14_55-52..14d07b6237 contains 87 patches, but if this is restricted to changes to lustre/osd-ldiskfs and lustre/ofd the list is more manageable:

[root@centos7 lustre-copy]# git log --oneline v2_14_55~$((98-52))..14d07b6237 lustre/osd-ldiskfs  lustre/ofd
5daf86607877 LU-12268 osd: BUG_ON for IAM corruption
882a9f784de2 LU-14927 scrub: create shared scrub_needs_check() function.
0daeebcbdc4e LU-14797 nodemap: map project id
bbfdc7c1670c LU-14739 quota: fix quota with root squash enabled
bb5d81ea9550 LU-14543 target: prevent overflowing of tgd->tgd_tot_granted
da1d93513fdf LU-14475 log: Rewrite some log messages
2a24b6ec67da LU-14734 ldiskfs: improve message for large_dir
7fdd664b3518 LU-14895 osd-ldiskfs: combine checksum functions
c18d5d892b62 LU-14889 lproc: Add server checksum_type

The patch https://review.whamcloud.com/45072 "LU-12268 osd: BUG_ON for IAM corruption" might "prevent" memory corruption in some cases, but it would trigger BUG_ON() instead, because the patch doesn't actually fix the core problem.

Beyond that, I don't have any concrete suggestions other than to try a git bisect, if you want to figure out which patch fixed the problem.
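
A bisect here would be hunting for the commit that fixed the problem rather than one that broke it, so git's default good/bad labels read backwards. The sketch below is a toy illustration only, not something run against lustre-release: it builds a throwaway five-commit repository in which a hypothetical fix lands in commit 4, then uses git bisect with custom terms (--term-old/--term-new) to find it. The file name "state", the commit messages, and the grep check are all made up for the demonstration; a real check would have to build the tree and run the io-stress suite.

```shell
#!/bin/sh
# Toy demonstration: bisect for the commit that FIXED a problem.
set -e

# Throwaway repository (everything in it is hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "bisect-demo@example.com"
git config user.name "bisect demo"

# Five commits; the pretend fix lands in commit 4.
for n in 1 2 3 4 5; do
    if [ "$n" -ge 4 ]; then
        echo fixed > state
    else
        echo broken > state
    fi
    echo "$n" > counter
    git add state counter
    git commit -q -m "commit $n"
done

first=$(git rev-list --max-parents=0 HEAD)

# Custom terms: "broken" is the old state, "fixed" is the new state.
git bisect start --term-old=broken --term-new=fixed >/dev/null
git bisect fixed HEAD >/dev/null
git bisect broken "$first" >/dev/null

# With "bisect run", exit status 0 marks the OLD term (broken) and
# exit status 1 marks the NEW term (fixed). grep -q exits 0 while the
# word "broken" is still present in the state file.
git bisect run sh -c 'grep -q broken state' >/dev/null

# refs/bisect/<new-term> points at the first commit carrying the fix.
found=$(git log -1 --format=%s refs/bisect/fixed)
echo "first fixed commit: $found"

git bisect reset >/dev/null 2>&1
```

Against the real tree the same terms would apply: mark 14d07b6237 as fixed and 1887169365 (from the v2_14_54-52-g1887169365 report above) as broken, assuming both are on the same branch. Note that the exit-status convention of git bisect run is the opposite of what a pass/fail test script naturally returns when looking for a fix, so a wrapper around the stress suite would need to invert its result.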
