[LU-12971] changelog_deregister: fail to cancel 0 of 1 llog-records Created: 14/Nov/19  Updated: 23/Nov/19  Resolved: 23/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Mikhail Pershin
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

CentOS 7.6 - Lustre 2.12.3 clients and servers


Attachments: File fir-md1-s4_chglog.tar.gz     Text File fir-md1-s4_chglog_clear_notes_20191122.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello,

We have four MDTs on Fir, and Robinhood kept being blocked on one of them. So we decided to clear/unregister all four changelogs readers and start with a fresh filesystem scan + new readers. While it was ok on three MDTs, with the last one, {fir-MDT0003}, I am not able to successfully run changelog_deregister due to the following errors:

Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2
Nov 14 11:39:24 fir-md1-s4 kernel: Lustre: 14109:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero 

A single changelog_deregister will just hang. If I launch a second one, the first will eventually returns with the following error:

[root@fir-md1-s4 ~]#  lctl --device fir-MDT0003 changelog_deregister cl1
error: changelog_deregister: No such file or directory

If I launch two changelog_deregister at the same time, each in an endless loop, I'm able to make progress in the idx number, but not very fast:

Nov 14 09:34:18 fir-md1-s4 kernel: LustreError: 11745:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11428 of catalog [0x5:0xa:0x0]: rc = -2
Nov 14 09:34:19 fir-md1-s4 kernel: Lustre: 12314:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11431 of catalog [0x5:0xa:0x0]: rc = -2
Nov 14 10:02:42 fir-md1-s4 kernel: Lustre: 12741:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11433 of catalog [0x5:0xa:0x0]: rc = -2
Nov 14 10:36:52 fir-md1-s4 kernel: Lustre: 13213:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11435 of catalog [0x5:0xa:0x0]: rc = -2
Nov 14 10:53:59 fir-md1-s4 kernel: Lustre: 13452:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11436 of catalog [0x5:0xa:0x0]: rc = -2
Nov 14 11:16:39 fir-md1-s4 kernel: Lustre: 13794:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2

We would like to clear all changelogs and start fresh on fir-MDT0003. Because we have been using 2.12 since the beginning and we might have corrupt changelogs there. How to do that properly? Thanks!



 Comments   
Comment by Stephane Thiell [ 14/Nov/19 ]

TL;DR Can I just quickly remove changelog_catalog and changelog_users at the root level using ldiskfs on the MDT? Or these files have special inodes? Alternatively, can I empty them?

Comment by Stephane Thiell [ 14/Nov/19 ]

I found some instructions on how to manually regenerate the changelog files in https://jira.whamcloud.com/browse/EDU-122?focusedCommentId=255248&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-255248

I will wait until tomorrow in case you want to investigate the issue with me before I do that. We would like to resume Robinhood operation while at SC19 as it is used to automatically purge the filesystem.

Comment by Peter Jones [ 15/Nov/19 ]

Mike

Anything further to add here?

Peter

Comment by Stephane Thiell [ 23/Nov/19 ]

Issue resolved! Something was wrong as, likely due to this missing file:

[root@fir-md1-s4 ldiskfs]# llog_reader O/1/d4/4
llog_reader: Could not open the file O/1/d4/4.: No such file or directory (2)

 
Notes available as fir-md1-s4_chglog_clear_notes_20191122.txt
Some files put in fir-md1-s4_chglog.tar.gz for reference.

 
Robinhood has been happily reading changelogs from fir-MDT0003 all day:

2019/11/22 14:44:01 [38287/1] STATS | ChangeLog reader #3:
2019/11/22 14:44:01 [38287/1] STATS |    fs_name    =   fir
2019/11/22 14:44:01 [38287/1] STATS |    mdt_name   =   MDT0003
2019/11/22 14:44:01 [38287/1] STATS |    reader_id  =   cl1
2019/11/22 14:44:01 [38287/1] STATS |    records read        = 217385998
2019/11/22 14:44:01 [38287/1] STATS |    interesting records = 185702
2019/11/22 14:44:01 [38287/1] STATS |    suppressed records  = 217195055
2019/11/22 14:44:01 [38287/1] STATS |    records pending     = 5560
2019/11/22 14:44:01 [38287/1] STATS |    last received: rec_id=217385998, rec_time=2019/11/22 14:43:55.908001, received at 2019/11/22 14:43:57.236896
2019/11/22 14:44:01 [38287/1] STATS |        receive speed: 8554.39 rec/sec, log/real time ratio: 0.98
2019/11/22 14:44:01 [38287/1] STATS |    last pushed: rec_id=214812359, rec_time=2019/11/22 14:38:58.510981, pushed at 2019/11/22 14:43:58.237611
2019/11/22 14:44:01 [38287/1] STATS |        push speed: 3430.45 rec/sec, log/real time ratio: 0.99
2019/11/22 14:44:01 [38287/1] STATS |    last committed: rec_id=214812359, rec_time=2019/11/22 14:38:58.510981, committed at 2019/11/22 14:43:58.271717
2019/11/22 14:44:01 [38287/1] STATS |        commit speed: 3430.45 rec/sec, log/real time ratio: 0.99
2019/11/22 14:44:01 [38287/1] STATS |    last cleared: rec_id=214812359, rec_time=2019/11/22 14:38:58.510981, cleared at 2019/11/22 14:43:58.272185
2019/11/22 14:44:01 [38287/1] STATS |        clear speed: 3430.45 rec/sec, log/real time ratio: 0.99
2019/11/22 14:44:01 [38287/1] STATS |    ChangeLog stats:
2019/11/22 14:44:01 [38287/1] STATS |    MARK: 0, CREAT: 75559, MKDIR: 10260, HLINK: 0, SLINK: 19095, MKNOD: 0, UNLNK: 46381
2019/11/22 14:44:01 [38287/1] STATS |    RMDIR: 2439, RENME: 8947, RNMTO: 0, OPEN: 0, CLOSE: 211845911, LYOUT: 10786, TRUNC: 5339182
2019/11/22 14:44:01 [38287/1] STATS |    SATTR: 22598, XATTR: 0, HSM: 0, MTIME: 4817, CTIME: 23, ATIME: 0, MIGRT: 0, FLRW: 0
2019/11/22 14:44:01 [38287/1] STATS |    RESYNC: 0, GXATR: 0, NOPEN: 0
 
Comment by Peter Jones [ 23/Nov/19 ]

ok - thanks

Generated at Sat Feb 10 02:57:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.