[LU-15644] failed llog cancel should not generate an error Created: 12/Mar/22  Updated: 13/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Andreas Dilger
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12985 sanity test_60g: Timeout occurred af... Open
is related to LU-13469 MDS hung during mount Resolved
is related to LU-15645 gap in recovery llog should not be a ... Resolved
is related to LU-15646 fix DOSTID printing of llog_id FIDs Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If llog cancel tries to cancel a record that does not exist (either because the record was already cancelled or because the log has been removed), it generates a lot of console log messages and (potentially) errors on the other servers:

lfs02-n05:
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116

lfs02-n06:
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116

lfs02-n07:
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116
[repeats for all MDS servers]

The -116 (-ESTALE) error occurs because the OUT recovery llog on the MDT was deleted, but its FID->inode record is still in the OI file. The lookup therefore finds the inode, but the inode has i_nlink=0 on disk.

Regardless of that, a failure to cancel an llog record that doesn't exist (e.g. -ENOENT or -ESTALE) should not be treated as an error that is retried. In this case the local record should be cancelled and the cancel should not be retried.
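
A minimal sketch of the proposed behaviour, for illustration only. The helper names and the enum below are hypothetical and are not taken from the Lustre source; the sketch only classifies the return code of a cancel attempt. Treating -EIO (and any other error) as retryable is an assumption, since the description only names -ENOENT and -ESTALE as the "record does not exist" cases.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: what the caller should do after a cancel attempt. */
enum cancel_action {
	CANCEL_DONE,            /* remote cancel succeeded */
	CANCEL_DONE_LOCAL_ONLY, /* target is gone: drop the local record, no retry */
	CANCEL_RETRY            /* real failure: keep the record and retry later */
};

/* Hypothetical helper: true if rc means the record or log no longer exists. */
static bool llog_cancel_rc_is_benign(int rc)
{
	return rc == -ENOENT || rc == -ESTALE;
}

/* Hypothetical helper: map the cancel return code to the next action. */
static enum cancel_action llog_cancel_next_action(int rc)
{
	if (rc == 0)
		return CANCEL_DONE;
	if (llog_cancel_rc_is_benign(rc))
		return CANCEL_DONE_LOCAL_ONLY;
	/* Assumption: everything else, including -EIO, is still retried. */
	return CANCEL_RETRY;
}
```

With a classification like this, the benign case could also be logged at a lower severity than LustreError, so it would not flood the console on every MDS.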



 Comments   
Comment by Andreas Dilger [ 12/Mar/22 ]

It looks like this same problem was also hit in LU-12985 and LU-13469, with -ENOENT, -EIO, and -ESTALE.

It would be useful if the error messages also included the FID of the llog file itself, so that the problematic llog file can be tracked more easily.
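
As a rough illustration of that suggestion, the sketch below formats a cancel failure message that carries the llog FID. The struct, the function name, and the message wording are hypothetical and not the actual Lustre definitions; the FID is assumed to be printed in a bracketed seq:oid:ver layout, and any FID value passed in is a placeholder.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a Lustre FID (sequence, object id, version). */
struct sketch_fid {
	unsigned long long seq;
	unsigned int oid;
	unsigned int ver;
};

/*
 * Hypothetical formatter: same wording as the existing console message,
 * with the FID of the llog file added so the log can be located.
 * Returns the snprintf() result.
 */
static int format_cancel_error(char *buf, size_t len, const char *dev,
			       const struct sketch_fid *fid, int count, int rc)
{
	return snprintf(buf, len,
			"%s: fail to cancel %d llog-records in [0x%llx:0x%x:0x%x]: rc = %d",
			dev, count, fid->seq, fid->oid, fid->ver, rc);
}
```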
