[LU-12593] update_log corruption Created: 26/Jul/19 Updated: 17/Feb/21 Resolved: 23/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
During upgrade process from a old version, we saw errors Jun 13 18:16:48 282 kernel: LustreError: 24685:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0005-mdtlov: invalid FID [0x0:0x1cf60001:0x0] Jun 13 18:17:01 280 kernel: LustreError: 26388:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0003-mdtlov: invalid FID [0x0:0x2050100:0x0] Jun 13 18:20:30 279 kernel: LustreError: 26741:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0002-mdtlov: invalid FID [0x0:0x1cf60001:0x0] Jun 13 18:26:48 282 kernel: LustreError: 24685:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0005-mdtlov: invalid FID [0x0:0x1cf60001:0x0] Jun 13 18:27:01 280 kernel: LustreError: 26388:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0003-mdtlov: invalid FID [0x0:0x2050100:0x0] Jun 13 18:30:30 279 kernel: LustreError: 26741:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0002-mdtlov: invalid FID [0x0:0x1cf60001:0x0] Jun 13 19:03:39 281 kernel: LustreError: 17588:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0004-mdtlov: invalid FID [0x0:0x1d:0x0] Jun 13 19:03:41 281 kernel: LustreError: 17588:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0004-mdtlov: invalid FID [0x0:0x1d:0x0] Jun 13 19:03:43 281 kernel: LustreError: 17588:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0004-mdtlov: invalid FID [0x0:0x1d:0x0] Lustre logs shows the next info 00000004:00080000:12.0:1560446017.274394:0:17588:0:(lod_dev.c:423:lod_sub_recovery_thread()) lustre02-MDT0005-osp-MDT0004 get update log failed -5, retry 00000040:01000000:12.0:1560446017.274395:0:17588:0:(llog_osd.c:2027:llog_osd_get_cat_list()) cat list: disk size=192, read=32 00000004:00020000:12.0:1560446017.274467:0:17588:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0004-mdtlov: invalid FID [0x0:0x1d:0x0] 00000004:00080000:12.0:1560446017.274468:0:17588:0:(lod_dev.c:423:lod_sub_recovery_thread()) lustre02-MDT0005-osp-MDT0004 get update log failed -5, retry 00000040:01000000:12.0:1560446017.274469:0:17588:0:(llog_osd.c:2027:llog_osd_get_cat_list()) cat list: disk size=192, read=32 00000004:00020000:12.0:1560446017.274548:0:17588:0:(lod_dev.c:131:lod_fld_lookup()) lustre02-MDT0004-mdtlov: invalid FID [0x0:0x1d:0x0] 00000004:00080000:12.0:1560446017.274549:0:17588:0:(lod_dev.c:423:lod_sub_recovery_thread()) lustre02-MDT0005-osp-MDT0004 get update log failed -5, retry 00000040:01000000:12.0:1560446017.274549:0:17588:0:(llog_osd.c:2027:llog_osd_get_cat_list()) cat list: disk size=192, read=32 00000040:00000001:1.0:1560447502.778180:0:17588:0:(llog_osd.c:1968:llog_osd_get_cat_list()) Process entered 00000004:00000001:1.0:1560447502.778181:0:17588:0:(osp_object.c:546:osp_attr_get()) Process entered 00000004:00000001:1.0:1560447502.778181:0:17588:0:(osp_object.c:556:osp_attr_get()) Process leaving (rc=0 : 0 : 0) 00000040:01000000:1.0:1560447502.778182:0:17588:0:(llog_osd.c:2027:llog_osd_get_cat_list()) cat list: disk size=192, read=32 .... 00000004:00000040:1.0:1560447502.778187:0:17588:0:(osp_md_object.c:1148:osp_md_read()) lustre02-MDT0005-osp-MDT0004 [0x200000009:0x5:0x0] read offset 128 size 32 ... 00000040:00000001:1.0:1560447502.778489:0:17588:0:(llog_osd.c:2056:llog_osd_get_cat_list()) Process leaving (rc=0 : 0 : 0) ... 00000004:00000040:1.0:1560447502.778495:0:17588:0:(mdt_handler.c:5441:mdt_object_init()) object init, fid = [0x0:0x1d:0x0] 00000004:00000010:1.0:1560447502.778495:0:17588:0:(mdd_object.c:282:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at ffff880f5ce8cc60. 00000004:00000001:1.0:1560447502.778496:0:17588:0:(mdt_handler.c:5450:mdt_object_init()) Process leaving (rc=0 : 0 : 0) the above was a remote read of update_log from MDT0005 on 282, 32 bytes on offset 128, The part of update_log from MDT0005 fetched by MDT0004 : $ od -X -j 128 -N 32 -A d update_log 0000128 00000000 00000003 0000001d 00000004 0000144 0000001d 00000000 7473756c 32306572 0000160 The issue is reproducible, reproducer do the next steps
The root cause for update_log corruption. if (map.m_flags & LDISKFS_MAP_NEW) {
J_ASSERT(create != 0);
J_ASSERT(handle != NULL);
/*
* Now that we do not always journal data, we should
* keep in mind whether this should always journal the
* new buffer as metadata. For now, regular file
* writes use ldiskfs_get_block instead, so it's not a
* problem.
*/
lock_buffer(bh);
BUFFER_TRACE(bh, "call get_create_access");
fatal = ldiskfs_journal_get_create_access(handle, bh);
if (!fatal && !buffer_uptodate(bh)) {
memset(bh->b_data, 0, inode->i_sb->s_blocksize);
set_buffer_uptodate(bh);
}
unlock_buffer(bh);
BUFFER_TRACE(bh, "call ldiskfs_handle_dirty_metadata");
err = ldiskfs_handle_dirty_metadata(handle, inode, bh);
if (!fatal)
fatal = err;
} else {
Some times NEW bh is uptodate, in this case data is not zeroed, and we get corruption for update_log. |
| Comments |
| Comment by Gerrit Updater [ 26/Jul/19 ] |
|
Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35629 |
| Comment by Gerrit Updater [ 22/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35629/ |
| Comment by Peter Jones [ 23/Oct/19 ] |
|
Landed for 2.13 |
| Comment by Alex Zhuravlev [ 06/Nov/19 ] |
|
Alexandr, have you seen LDISKFS_GET_BLOCKS_ZERO ? looks like it does what we needed? |
| Comment by Alexander Boyko [ 07/Nov/19 ] |
|
Yes, I saw it. It does what we need, but for extents. osd_ldiskfs_write_record is based on another codepath. |
| Comment by Alex Zhuravlev [ 07/Nov/19 ] |
|
osd_ldiskfs_write_record() -> __ldiskfs_bread() -> ldiskfs_bread() -> ldiskfs_getblk() -> ldiskfs_map_blocks() - so far it's common, no? |
| Comment by Gerrit Updater [ 07/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36709 |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36709/ |
| Comment by Andreas Dilger [ 01/Feb/20 ] |
|
It looks like there is a bug in this patch pointed out by Misc Code Checks Robot that it is leaking i_alloc_sem in some error paths that may result in an OST deadlock in some cases, see my comments in https://review.whamcloud.com/#/c/36976/2/lustre/osd-ldiskfs/osd_io.c,unified |
| Comment by Gerrit Updater [ 03/Feb/20 ] |
|
Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/37406 |
| Comment by Gerrit Updater [ 05/Feb/20 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37445 |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37406/ |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37445/ |