Details
Bug
Resolution: Fixed
Minor
Lustre 2.14.0
None
3
Description
When llog cancel tries to cancel a record that does not exist (either because the record was already cancelled or because the log has been removed), it generates a lot of console messages and (potentially) errors on the other servers:
lfs02-n05: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116
lfs02-n06: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116
lfs02-n07: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116
Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116
[repeats for all MDS servers]
The rc = -116 (-ESTALE) error occurs because the OUT recovery llog on the MDT was deleted, but its FID->inode record is still in the OI file, so the lookup finds the inode, which has i_nlink=0 on disk.
Regardless of that, failing to cancel an llog record that doesn't exist (e.g. -ENOENT or -ESTALE) should not be treated as an error that is retried. In this case the local record should be cancelled and the cancel not retried.
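The behaviour proposed above could be sketched roughly as follows. This is only an illustration, not the actual Lustre patch: `cancel_target_missing()` and `normalize_cancel_rc()` are hypothetical helper names that do not exist in the Lustre tree, and the sketch assumes that the caller retries on any non-zero return code.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a Lustre function): returns true when a cancel
 * failed only because the target record or log no longer exists.
 */
static inline bool cancel_target_missing(int rc)
{
	return rc == -ENOENT || rc == -ESTALE;
}

/*
 * Hypothetical helper: map the result of a remote llog cancel to the value
 * the caller should act on.
 *   0        -> the record is gone (cancelled now or earlier); drop the
 *               local record and do not retry
 *   negative -> a real failure that the caller may log and retry
 */
static inline int normalize_cancel_rc(int rc)
{
	if (cancel_target_missing(rc))
		return 0;	/* nothing left to cancel: treat as success */

	return rc;
}
```

With such a mapping, a cancel that returns -ESTALE or -ENOENT would be handled like a successful cancel, while other errors such as -EIO would still be reported and retried.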