Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.13.0
-
3
-
9223372036854775807
Description
During massive IO at cluster we could reproduce the next issue.
Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(llog_cat.c:404:llog_cat_id2handle()) snx11281-OST0000-osc-MDT0001: error opening log id [0x22903:0x1:0x0]:0: rc = -2 Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(osp_sync.c:1293:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 3 changes, 0 in progress, 0 in flight Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(osp_sync.c:1293:osp_sync_thread()) LBUG Nov 24 12:25:15 snx11281n003 kernel: Pid: 12929, comm: osp-syn-0-1 3.10.0-957.1.3957.1.3.x3.3.36.x86_64 #1 SMP Tue Nov 19 09:47:32 PST 2019 Nov 24 12:25:15 snx11281n003 kernel: Call Trace: Nov 24 12:25:15 snx11281n003 kernel: [<ffffffffc0a427cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] Nov 24 12:25:15 snx11281n003 kernel: [<ffffffffc0a4287c>] lbug_with_loc+0x4c/0xa0 [libcfs] Nov 24 12:25:15 snx11281n003 kernel: [<ffffffffc18b5a1b>] osp_sync_thread+0xa1b/0xa70 [osp] Nov 24 12:25:15 snx11281n003 kernel: [<ffffffffa72c1c31>] kthread+0xd1/0xe0 Nov 24 12:25:15 snx11281n003 kernel: [<ffffffffa7976c1d>] ret_from_fork_nospec_begin+0x7/0x21 Nov 24 12:25:15 snx11281n003 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
And with additional debug.
llog_cat_set_first_idx()) catlog [0x6:0x1:0x0] first idx 34730, last_idx 34731 osp_sync_process_queues()) 1 changes, 0 in progress, 0 in flight llog_process_thread()) stop processing plain 0x76941:1:0 index 64767 count 1 llog_process_thread()) index: 34731, lh_last_idx: 34730 synced_idx: 34730 lgh_last_idx: 34731 llog_cat_process_common()) processing log [0x2780f:0x1:0x0]:0 at index 34731 of catalog [0x6:0x1:0x0] llog_cat_id2handle()) snx11281-OST0001-osc-MDT0001: error opening log id [0x2780f:0x1:0x0]:0:rc = -2
RC
The logic for rereading a llog buffer had an exception for a full catalog, when lgh_last_idx = llh_cat_idx and a first processing index is a llh_cat_idx+1. This check is based on a value lh_last_idx, which stays unchanged between buffer read. But llh_cat_idx could go forward, and this lead to a wrong check where reread doesn't happen. As a result Lustre got ENOENT for a record and stoped osp processing.