[LU-14402] LBUG: osd_write_commit() ASSERTION( !PageDirty(lnb[i].lnb_page) ) failed Created: 04/Feb/21 Updated: 23/Oct/21 Resolved: 23/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Shaun Tancheff | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
[ 6733.016726] perf: interrupt took too long (3163 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[ 6953.568157] LustreError: 29459:0:(osd_io.c:1558:osd_write_commit()) ASSERTION( !PageDirty(lnb[i].lnb_page) ) failed:
[ 6953.579769] LustreError: 29459:0:(osd_io.c:1558:osd_write_commit()) LBUG
[ 6953.587203] Pid: 29459, comm: ll_ost_io00_523 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1 SMP Mon Jan 13 18:26:28 PST 2020
[ 6953.598541] Call Trace:
[ 6953.601529] [<ffffffffc11e862c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 6953.608765] [<ffffffffc11e894c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 6953.615539] [<ffffffffc1dffcac>] osd_write_commit+0x52c/0x870 [osd_ldiskfs]
[ 6953.623529] [<ffffffffc1c24769>] ofd_commitrw_write+0xde9/0x1480 [ofd]
[ 6953.630806] [<ffffffffc1c2861d>] ofd_commitrw+0x2ad/0x9a0 [ofd]
[ 6953.637512] [<ffffffffc171dac9>] tgt_brw_write+0xfd9/0x1cc0 [ptlrpc]
[ 6953.644747] [<ffffffffc1719c4a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[ 6953.652474] [<ffffffffc16bd136>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 6953.661074] [<ffffffffc16c1c8c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[ 6953.668522] [<ffffffffba6c1c31>] kthread+0xd1/0xe0
[ 6953.674378] [<ffffffffbad76c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 6953.681517] [<ffffffffffffffff>] 0xffffffffffffffff
[ 6953.687422] Kernel panic - not syncing: LBUG
[ 6953.692298] CPU: 11 PID: 29459 Comm: ll_ost_io00_523 Kdump: loaded Tainted: P OE ------------ 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1
[ 6953.706675] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[ 6953.715123] Call Trace:
[ 6953.718301] [<ffffffffbad63e41>] dump_stack+0x19/0x1b
[ 6953.724122] [<ffffffffbad5d550>] panic+0xe8/0x21f
[ 6953.729700] [<ffffffffc11e899b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 6953.736628] [<ffffffffc1dffcac>] osd_write_commit+0x52c/0x870 [osd_ldiskfs]
[ 6953.744395] [<ffffffffc1c24769>] ofd_commitrw_write+0xde9/0x1480 [ofd]
[ 6953.751678] [<ffffffffc1c2861d>] ofd_commitrw+0x2ad/0x9a0 [ofd]
[ 6953.758374] [<ffffffffc171dac9>] tgt_brw_write+0xfd9/0x1cc0 [ptlrpc]
[ 6953.765463] [<ffffffffba6db748>] ? __enqueue_entity+0x78/0x80
[ 6953.771941] [<ffffffffba6e236f>] ? enqueue_entity+0x2ef/0xbe0
[ 6953.778575] [<ffffffffc16d6d97>] ? __req_capsule_get+0x427/0x6b0 [ptlrpc]
[ 6953.786210] [<ffffffffc1719c4a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[ 6953.793817] [<ffffffffc16f3bc1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[ 6953.802150] [<ffffffffc11e502e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[ 6953.809942] [<ffffffffc16bd136>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 6953.818313] [<ffffffffba6cec64>] ? __wake_up+0x44/0x50
[ 6953.824396] [<ffffffffc16c1c8c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[ 6953.831440] [<ffffffffc16c1150>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[ 6953.839649] [<ffffffffba6c1c31>] kthread+0xd1/0xe0
[ 6953.845394] [<ffffffffba6c1b60>] ? insert_kthread_work+0x40/0x40
[ 6953.852303] [<ffffffffbad76c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 6953.859587] [<ffffffffba6c1b60>] ? insert_kthread_work+0x40/0x40 |
| Comments |
| Comment by Shaun Tancheff [ 19/Sep/21 ] |
|
As of v2_14_54-52-g1887169365 I am still hitting this LBUG. Ex:
Sep 19 09:21:52 snx11922n005 kernel: LustreError: 24267:0:(osd_io.c:1608:osd_write_commit()) ASSERTION( !PageDirty(lnb[i].lnb_page) ) failed:
Sep 19 09:21:52 snx11922n005 kernel: LustreError: 24267:0:(osd_io.c:1608:osd_write_commit()) LBUG
Sep 19 09:21:52 snx11922n005 kernel: Pid: 24267, comm: ll_ost_io00_713 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1 SMP Mon Jan 13 18:26:28 PST 2020
Sep 19 09:21:52 snx11922n005 kernel: IEC: 026000003: LASSERT: { "pid": "24267", "ext_pid": "0", "filename": "osd_io.c", "line": "1608", "func_name": "osd_write_commit", "assert_info": "( !PageDirty(lnb[i].lnb_page) ) failed: " }
Sep 19 09:21:52 snx11922n005 kernel: IEC: 026000004: LBUG: { "pid": "24267", "ext_pid": "0", "filename": "osd_io.c", "line": "1608", "func_name": "osd_write_commit" }
Sep 19 09:21:52 snx11922n005 kernel: Call Trace:
Sep 19 09:21:52 snx11922n005 kernel: [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] osd_write_commit+0x52c/0x880 [osd_ldiskfs]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ofd_commitrw_write+0xef9/0x15d0 [ofd]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ofd_commitrw+0x335/0x9f0 [ofd]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] tgt_brw_write+0x176a/0x2310 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] tgt_request_handle+0x823/0x1850 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb10 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
Sep 19 09:21:52 snx11922n005 kernel: [<0>] kthread+0xd1/0xe0
Sep 19 09:21:52 snx11922n005 kernel: [<0>] ret_from_fork_nospec_begin+0x7/0x21
Sep 19 09:21:52 snx11922n005 kernel: [<0>] 0xfffffffffffffffe
Sep 19 09:21:52 snx11922n005 kernel: Kernel panic - not syncing: LBUG
Also kindly note that this includes the |
| Comment by Gerrit Updater [ 29/Sep/21 ] |
|
"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45086 |
| Comment by Gerrit Updater [ 29/Sep/21 ] |
|
"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45087 |
| Comment by Andreas Dilger [ 29/Sep/21 ] |
|
Ah, I see your addendum that this patch is included in your recent failure. Do you have any details of your workload that is triggering this error? We've been running with the uncached IO patches for many months without any similar reports in the field, so there must be something specific in your workload that is triggering it. |
| Comment by Shaun Tancheff [ 30/Sep/21 ] |
|
The crash hits quite early (30 minutes or so) in our io-stress suite. The suite is large, with bits of unaligned I/O mixed in alongside the usual aio/dio, mmap, ior, and ltp tests. |
| Comment by Shaun Tancheff [ 22/Oct/21 ] |
|
As of 14d07b6237 this did not reproduce. |
| Comment by Andreas Dilger [ 23/Oct/21 ] |
|
The range v2_14_55-52..14d07b6237 contains 87 patches, but if this is restricted to changes to lustre/osd-ldiskfs and lustre/ofd the list is more manageable:
[root@centos7 lustre-copy]# git log --oneline v2_14_55~$((98-52))..14d07b6237 lustre/osd-ldiskfs lustre/ofd
5daf86607877 LU-12268 osd: BUG_ON for IAM corruption
882a9f784de2 LU-14927 scrub: create shared scrub_needs_check() function.
0daeebcbdc4e LU-14797 nodemap: map project id
bbfdc7c1670c LU-14739 quota: fix quota with root squash enabled
bb5d81ea9550 LU-14543 target: prevent overflowing of tgd->tgd_tot_granted
da1d93513fdf LU-14475 log: Rewrite some log messages
2a24b6ec67da LU-14734 ldiskfs: improve message for large_dir
7fdd664b3518 LU-14895 osd-ldiskfs: combine checksum functions
c18d5d892b62 LU-14889 lproc: Add server checksum_type
The patch https://review.whamcloud.com/45072 "
Other than that, I don't really have any concrete suggestions other than to try bisect, if you want to figure out which patch fixed the problem. |
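For anyone following up on the bisect suggestion, here is a rough sketch of how such a run could be set up. Because the goal is to find the patch that fixed the LBUG rather than one that introduced it, it uses git's custom bisect terms. The helper name start_fix_bisect and the term names "broken"/"fixed" are illustrative choices, not anything from the Lustre tree. The endpoints in the usage comment are the two commits named in this ticket. Limiting the search to lustre/osd-ldiskfs and lustre/ofd assumes the fix actually landed in one of those directories, which has not been confirmed here.

```shell
#!/bin/sh
# Sketch only: start a bisect that looks for the commit that FIXED a bug.
# start_fix_bisect and the terms "broken"/"fixed" are hypothetical names.

start_fix_bisect() {
    old_rev=$1   # revision that still hits the LBUG
    new_rev=$2   # revision where the LBUG no longer reproduces
    shift 2
    # Any remaining arguments are pathspecs; git then only stops on
    # commits that touch those paths.
    git bisect start --term-old broken --term-new fixed \
        "$new_rev" "$old_rev" -- "$@"
}

# Usage, from the top of a Lustre checkout (endpoints taken from this ticket):
#   start_fix_bisect 1887169365 14d07b6237 lustre/osd-ldiskfs lustre/ofd
#
# At each step, build, run the io-stress suite, then record the outcome:
#   git bisect broken   # the assertion still fires
#   git bisect fixed    # the suite ran without the LBUG
#   git bisect skip     # this commit could not be built or tested
#   git bisect reset    # return to the original HEAD when finished
```

Since the failure took about 30 minutes of io-stress to appear, a single clean run is weak evidence for "fixed"; it is probably worth running the suite noticeably longer (or more than once) before marking a commit that way.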