[LU-13102] llog_cat_id2handle()) error opening log id rc = -2 osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) Created: 26/Dec/19  Updated: 20/May/20  Resolved: 05/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During massive IO at cluster we could reproduce the next issue.

Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(llog_cat.c:404:llog_cat_id2handle()) snx11281-OST0000-osc-MDT0001: error opening log id [0x22903:0x1:0x0]:0: rc = -2
Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(osp_sync.c:1293:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 3 changes, 0 in progress, 0 in flight
Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(osp_sync.c:1293:osp_sync_thread()) LBUG
Nov 24 12:25:15 snx11281n003 kernel: Pid: 12929, comm: osp-syn-0-1 3.10.0-957.1.3957.1.3.x3.3.36.x86_64 #1 SMP Tue Nov 19 09:47:32 PST 2019
Nov 24 12:25:15 snx11281n003 kernel: Call Trace:
Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffc0a427cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffc0a4287c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffc18b5a1b>] osp_sync_thread+0xa1b/0xa70 [osp]
Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffa72c1c31>] kthread+0xd1/0xe0
Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffa7976c1d>] ret_from_fork_nospec_begin+0x7/0x21
Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff

And with additional debug.

 llog_cat_set_first_idx())   catlog [0x6:0x1:0x0] first idx 34730, last_idx 34731
 osp_sync_process_queues()) 1 changes, 0 in progress, 0 in flight
 llog_process_thread()) stop processing plain 0x76941:1:0 index 64767 count 1
 llog_process_thread()) index: 34731, lh_last_idx: 34730 synced_idx: 34730 lgh_last_idx: 34731
 llog_cat_process_common()) processing log [0x2780f:0x1:0x0]:0   at index 34731 of catalog [0x6:0x1:0x0]
llog_cat_id2handle()) snx11281-OST0001-osc-MDT0001:  error opening log id [0x2780f:0x1:0x0]:0:rc = -2

RC
The logic for rereading a llog buffer had an exception for a full catalog, when lgh_last_idx = llh_cat_idx and a first processing index is a llh_cat_idx+1. This check is based on a value lh_last_idx, which stays unchanged between buffer read. But llh_cat_idx could go forward, and this lead to a wrong check where reread doesn't happen. As a result Lustre got ENOENT for a record and stoped osp processing.



 Comments   
Comment by Gerrit Updater [ 26/Dec/19 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/37102
Subject: LU-13102 llog: fix processing of a wrapped catalog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b55bbbca0c425926d5fcc04a598a6754297fda8f

Comment by Jian Yu [ 21/Jan/20 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/1160632e-3bb7-11ea-adca-52540065bddc

Comment by Jian Yu [ 28/Jan/20 ]

More failures on master branch:
https://testing.whamcloud.com/test_sets/aae2a638-412f-11ea-9630-52540065bddc
https://testing.whamcloud.com/test_sets/87deb48c-4130-11ea-9847-52540065bddc
https://testing.whamcloud.com/test_sets/7a5d326e-415a-11ea-9847-52540065bddc
https://testing.whamcloud.com/test_sets/894f039c-4164-11ea-9b1e-52540065bddc

Comment by Gerrit Updater [ 05/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37102/
Subject: LU-13102 llog: fix processing of a wrapped catalog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a4f049b96562fd502b1948fb082767351e040a1c

Comment by Peter Jones [ 05/Mar/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 08/May/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38541
Subject: LU-13102 llog: fix processing of a wrapped catalog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4be4f2d6589a740788a3fd1ca545422a4e6e6500

Comment by Gerrit Updater [ 20/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38541/
Subject: LU-13102 llog: fix processing of a wrapped catalog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 21b34dd8a226fbaf2fee52b294f2482aa8670ad2

Generated at Sat Feb 10 02:58:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.