Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.14.0
-
None
-
3
-
9223372036854775807
Description
If llog cancel is cancelling a record that does not exist (either because the record is already cancelled or the log has been removed), this is generating a lot of console logs and (potentially) errors on the other servers:
lfs02-n05: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 lfs02-n06: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 lfs02-n07: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 [repeats for all MDS servers]
The -116=-ESTALE error is because the OUT recovery llog on the MDT was deleted, but the FID->inode record is still in the OI file and it finds the inode, but the inode has i_nlink=0 on disk.
Regardless of that, failure to cancel an llog record that doesn't exist (e.g. -ENOENT or -ESTALE) should not be a cause for an error that is retried. The local record should be cancelled in this case and not retried.
Attachments
Issue Links
Activity
Resolution | New: Fixed [ 1 ] | |
Status | Original: Reopened [ 4 ] | New: Resolved [ 5 ] |
Resolution | Original: Fixed [ 1 ] | |
Status | Original: Resolved [ 5 ] | New: Reopened [ 4 ] |
Fix Version/s | New: Lustre 2.16.0 [ 15190 ] |
Assignee | Original: WC Triage [ wc-triage ] | New: Mikhail Pershin [ tappro ] |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Link | New: This issue is related to EX-9183 [ EX-9183 ] |
Description |
Original:
If llog cancel is cancelling a record that does not exist (either because the record is already cancelled or the log has been removed), this is generating a lot of console logs and (potentially) errors on the other servers:
{noformat} lfs02-n05: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 lfs02-n06: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 slfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 lfs02-n07: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 [repeats for all MDS servers] {noformat} The {{-116=-ESTALE}} error is because the OUT recovery llog on the MDT was deleted, but the FID->inode record is still in the OI file and it finds the inode, but the inode has i_nlink=0 on disk. Regardless of that, failure to cancel an llog record that doesn't exist (e.g. {{-ENOENT}} or {{-ESTALE}}) should not be a cause for an error that is retried. The local record should be cancelled in this case and not retried. |
New:
If llog cancel is cancelling a record that does not exist (either because the record is already cancelled or the log has been removed), this is generating a lot of console logs and (potentially) errors on the other servers:
{noformat} lfs02-n05: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 lfs02-n06: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 lfs02-n07: Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:753:llog_cat_cancel_arr_rec()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 llog-records: rc = -116 Mar 12 14:06:15 lfs02-n30 kernel: LustreError: 28071:0:(llog_cat.c:790:llog_cat_cancel_records()) lfs02-MDT0004-osp-MDT001d: fail to cancel 1 of 1 llog-records: rc = -116 [repeats for all MDS servers] {noformat} The {{\-116=\-ESTALE}} error is because the OUT recovery llog on the MDT was deleted, but the FID->inode record is still in the OI file and it finds the inode, but the inode has i_nlink=0 on disk. Regardless of that, failure to cancel an llog record that doesn't exist (e.g. {{\-ENOENT}} or {{\-ESTALE}}) should not be a cause for an error that is retried. The local record should be cancelled in this case and not retried. |
Affects Version/s | New: Lustre 2.14.0 [ 14490 ] |