[LU-12971] changelog_deregister: fail to cancel 0 of 1 llog-records Created: 14/Nov/19 Updated: 23/Nov/19 Resolved: 23/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Mikhail Pershin |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 - Lustre 2.12.3 clients and servers |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hello, We have four MDTs on Fir, and Robinhood kept being blocked on one of them. So we decided to clear/unregister all four changelogs readers and start with a fresh filesystem scan + new readers. While it was ok on three MDTs, with the last one, {fir-MDT0003}, I am not able to successfully run changelog_deregister due to the following errors: Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 11:39:24 fir-md1-s4 kernel: Lustre: 14109:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero A single changelog_deregister will just hang. If I launch a second one, the first will eventually returns with the following error: [root@fir-md1-s4 ~]# lctl --device fir-MDT0003 changelog_deregister cl1 error: changelog_deregister: No such file or directory If I launch two changelog_deregister at the same time, each in an endless loop, I'm able to make progress in the idx number, but not very fast: Nov 14 09:34:18 fir-md1-s4 kernel: LustreError: 11745:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11428 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 09:34:19 fir-md1-s4 kernel: Lustre: 12314:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11431 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 10:02:42 fir-md1-s4 kernel: Lustre: 12741:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11433 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 10:36:52 fir-md1-s4 kernel: Lustre: 13213:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11435 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 10:53:59 fir-md1-s4 kernel: Lustre: 13452:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11436 of catalog [0x5:0xa:0x0]: rc = -2 Nov 14 11:16:39 fir-md1-s4 kernel: Lustre: 13794:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2 Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2 We would like to clear all changelogs and start fresh on fir-MDT0003. Because we have been using 2.12 since the beginning and we might have corrupt changelogs there. How to do that properly? Thanks! |
| Comments |
| Comment by Stephane Thiell [ 14/Nov/19 ] |
|
TL;DR Can I just quickly remove changelog_catalog and changelog_users at the root level using ldiskfs on the MDT? Or these files have special inodes? Alternatively, can I empty them? |
| Comment by Stephane Thiell [ 14/Nov/19 ] |
|
I found some instructions on how to manually regenerate the changelog files in https://jira.whamcloud.com/browse/EDU-122?focusedCommentId=255248&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-255248 I will wait until tomorrow in case you want to investigate the issue with me before I do that. We would like to resume Robinhood operation while at SC19 as it is used to automatically purge the filesystem. |
| Comment by Peter Jones [ 15/Nov/19 ] |
|
Mike Anything further to add here? Peter |
| Comment by Stephane Thiell [ 23/Nov/19 ] |
|
Issue resolved! Something was wrong as, likely due to this missing file: [root@fir-md1-s4 ldiskfs]# llog_reader O/1/d4/4 llog_reader: Could not open the file O/1/d4/4.: No such file or directory (2) 2019/11/22 14:44:01 [38287/1] STATS | ChangeLog reader #3: 2019/11/22 14:44:01 [38287/1] STATS | fs_name = fir 2019/11/22 14:44:01 [38287/1] STATS | mdt_name = MDT0003 2019/11/22 14:44:01 [38287/1] STATS | reader_id = cl1 2019/11/22 14:44:01 [38287/1] STATS | records read = 217385998 2019/11/22 14:44:01 [38287/1] STATS | interesting records = 185702 2019/11/22 14:44:01 [38287/1] STATS | suppressed records = 217195055 2019/11/22 14:44:01 [38287/1] STATS | records pending = 5560 2019/11/22 14:44:01 [38287/1] STATS | last received: rec_id=217385998, rec_time=2019/11/22 14:43:55.908001, received at 2019/11/22 14:43:57.236896 2019/11/22 14:44:01 [38287/1] STATS | receive speed: 8554.39 rec/sec, log/real time ratio: 0.98 2019/11/22 14:44:01 [38287/1] STATS | last pushed: rec_id=214812359, rec_time=2019/11/22 14:38:58.510981, pushed at 2019/11/22 14:43:58.237611 2019/11/22 14:44:01 [38287/1] STATS | push speed: 3430.45 rec/sec, log/real time ratio: 0.99 2019/11/22 14:44:01 [38287/1] STATS | last committed: rec_id=214812359, rec_time=2019/11/22 14:38:58.510981, committed at 2019/11/22 14:43:58.271717 2019/11/22 14:44:01 [38287/1] STATS | commit speed: 3430.45 rec/sec, log/real time ratio: 0.99 2019/11/22 14:44:01 [38287/1] STATS | last cleared: rec_id=214812359, rec_time=2019/11/22 14:38:58.510981, cleared at 2019/11/22 14:43:58.272185 2019/11/22 14:44:01 [38287/1] STATS | clear speed: 3430.45 rec/sec, log/real time ratio: 0.99 2019/11/22 14:44:01 [38287/1] STATS | ChangeLog stats: 2019/11/22 14:44:01 [38287/1] STATS | MARK: 0, CREAT: 75559, MKDIR: 10260, HLINK: 0, SLINK: 19095, MKNOD: 0, UNLNK: 46381 2019/11/22 14:44:01 [38287/1] STATS | RMDIR: 2439, RENME: 8947, RNMTO: 0, OPEN: 0, CLOSE: 211845911, LYOUT: 10786, TRUNC: 5339182 2019/11/22 14:44:01 [38287/1] STATS | SATTR: 22598, XATTR: 0, HSM: 0, MTIME: 4817, CTIME: 23, ATIME: 0, MIGRT: 0, FLRW: 0 2019/11/22 14:44:01 [38287/1] STATS | RESYNC: 0, GXATR: 0, NOPEN: 0 |
| Comment by Peter Jones [ 23/Nov/19 ] |
|
ok - thanks |