Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
Lustre 2.12.3
-
None
-
CentOS 7.6 - Lustre 2.12.3 clients and servers
-
3
-
9223372036854775807
Description
Hello,
We have four MDTs on Fir, and Robinhood kept being blocked on one of them. So we decided to clear/unregister all four changelogs readers and start with a fresh filesystem scan + new readers. While it was ok on three MDTs, with the last one, {fir-MDT0003}, I am not able to successfully runĀ changelog_deregister due to the following errors:
Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 11:39:24 fir-md1-s4 kernel: Lustre: 14109:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
A single changelog_deregister will just hang. If I launch a second one, the first will eventually returns with the following error:
[root@fir-md1-s4 ~]# lctl --device fir-MDT0003 changelog_deregister cl1 error: changelog_deregister: No such file or directory
If I launch two changelog_deregister at the same time, each in an endless loop, I'm able to make progress in the idx number, but not very fast:
Nov 14 09:34:18 fir-md1-s4 kernel: LustreError: 11745:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11428 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 09:34:19 fir-md1-s4 kernel: Lustre: 12314:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11431 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 10:02:42 fir-md1-s4 kernel: Lustre: 12741:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11433 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 10:36:52 fir-md1-s4 kernel: Lustre: 13213:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11435 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 10:53:59 fir-md1-s4 kernel: Lustre: 13452:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11436 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 11:16:39 fir-md1-s4 kernel: Lustre: 13794:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2
We would like to clear all changelogs and start fresh on fir-MDT0003. Because we have been using 2.12 since the beginning and we might have corrupt changelogs there. How to do that properly? Thanks!