[LU-10913] conf-sanity test_106: OST0000-osc-MDT0000: can't cancel record Created: 13/Apr/18  Updated: 08/Apr/19  Resolved: 11/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d0b6dfde-3eaf-11e8-960d-52540065bddc

test_106 failed with the following error:

Timeout occurred after 276 mins, last suite running was conf-sanity, restarting cluster to continue tests

MDS error:

Apr 12 21:13:22 trevis-72vm6 kernel: Lustre: 24256:0:(llog_cat.c:926:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
Apr 12 21:13:29 trevis-72vm6 kernel: LustreError: 24256:0:(llog_cat.c:269:llog_cat_id2handle()) lustre-OST0000-osc-MDT0000: error opening log id [0xa:0x1:0x0]:0: rc = -2
Apr 12 21:13:29 trevis-72vm6 kernel: LustreError: 24256:0:(llog_cat.c:795:llog_cat_cancel_records()) lustre-OST0000-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
Apr 12 21:13:29 trevis-72vm6 kernel: LustreError: 24256:0:(osp_sync.c:1062:osp_sync_process_committed()) lustre-OST0000-osc-MDT0000: can't cancel record: -2

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_106 - Timeout occurred after 276 mins, last suite running was conf-sanity, restarting cluster to continue tests



 Comments   
Comment by Alexander Boyko [ 29/Nov/18 ]

I've seen something similar for 106

 00000004:00080000:2.0:1542884025.062076:0:11650:0:(osp_sync.c:510:osp_sync_request_commit_cb()) commit req ffff8800123e3300, transno 0
00000004:00080000:3.0:1542884025.062076:0:11649:0:(osp_sync.c:510:osp_sync_request_commit_cb()) commit req ffff8800123e3600, transno 0
00000004:00080000:2.0:1542884025.062077:0:11650:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff8800123e3300/1, rc -2, transno 0
00000004:00080000:3.0:1542884025.062077:0:11649:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff8800123e3600/1, rc -2, transno 0
00000004:00001000:2.0:1542884025.062079:0:11650:0:(osp_sync.c:620:osp_sync_interpret()) lustre-OST0000-osc-MDT0000: 7 in flight, 3985 in progress
00000004:00001000:3.0:1542884025.062079:0:11649:0:(osp_sync.c:620:osp_sync_interpret()) lustre-OST0000-osc-MDT0000: 6 in flight, 3985 in progress
00000100:00001000:2.0:1542884025.062081:0:11650:0:(import.c:1695:at_measured()) add 1 to ffff88002a111c00 time=131 v=1 (1 0 0 0)
00000100:00001000:3.0:1542884025.062081:0:11649:0:(import.c:1695:at_measured()) add 1 to ffff88002a111c00 time=131 v=1 (1 0 0 0)
00000100:00001000:2.0:1542884025.062083:0:11650:0:(import.c:1695:at_measured()) add 1 to ffff88002a111bc8 time=131 v=1 (1 0 0 0)
00000100:00001000:3.0:1542884025.062083:0:11649:0:(import.c:1695:at_measured()) add 1 to ffff88002a111bc8 time=131 v=1 (1 0 0 0)
00000004:00080000:2.0:1542884025.062084:0:11650:0:(osp_sync.c:510:osp_sync_request_commit_cb()) commit req ffff8800123e3900, transno 0
00000004:00080000:2.0:1542884025.062085:0:11650:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff8800123e3900/1, rc -2, transno 0
00000004:00080000:3.0:1542884025.062085:0:11649:0:(osp_sync.c:510:osp_sync_request_commit_cb()) commit req ffff8800123e3000, transno 0
00000004:00001000:2.0:1542884025.062086:0:11650:0:(osp_sync.c:620:osp_sync_interpret()) lustre-OST0000-osc-MDT0000: 5 in flight, 3985 in progress
00000004:00080000:3.0:1542884025.062086:0:11649:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff8800123e3000/1, rc -2, transno 0

It means the request with unlink/etc. was sent to OST and OST reported no object exist(ENOENT).
After capturing couple vmcores with different logging options I've found that last catalog record was processed twice.
First time for a normal processing and second time when wrapping.
osp_sync_thread() using opd_sync_catalog_idx to indicate the start point for processing.
rc = llog_cat_process(&env, llh, osp_sync_process_queues, d,
d->opd_sync_last_catalog_idx, 0);
But it is ignored when catalog is wrapped at the llog_cat_process_or_fork(). It starts from the the end of wrapping. For example catalog records is
1 1 0 0 1.
The processing would start from record 5.
But normal processing was stopped at the end of catalog, so we should skip the last record and start from a first one.

Comment by Gerrit Updater [ 29/Nov/18 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/33749
Subject: LU-10913 llog: add startcat for wrapped catalog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6b3eae5b5532e48405fc485f7a025a91089dd2d

Comment by Gerrit Updater [ 11/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33749/
Subject: LU-10913 llog: add startcat for wrapped catalog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8109c9e1718d7f425a7186c31399ac035157acd6

Comment by Peter Jones [ 11/Feb/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 26/Mar/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34506
Subject: LU-10913 llog: add startcat for wrapped catalog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: aed0d8a71803ec19d80e4adaa33bab2b09273a9c

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34506/
Subject: LU-10913 llog: add startcat for wrapped catalog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: fe0f046d85d1c8f415e7744f4ff7de611efdbff0

Generated at Sat Feb 10 02:39:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.