[LU-4481] Impossible to start changelogs after corruption Created: 14/Jan/14  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Sebastien Buisson (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File changelog_users_xxd     File debugfs_stat    
Severity: 3
Rank (Obsolete): 12272

 Description   

Hi,

On a customer cluster, changelogs refuse to start, probably because of internal data corruption.
Here are the messages we can see when mounting the MDT:

1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14143:0:(llog_lvfs.c:199:llog_lvfs_read_header()) bad log header magic: 0x10670000 (expected 0x10645539)
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14143:0:(llog_obd.c:320:cat_cancel_cb()) Cannot find handle for log 0x1490186b: -5
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(llog_obd.c:393:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -5
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(llog_obd.c:220:llog_setup_named()) obd mdd_obd-scratch3-MDT0000 ctxt 14 lop_setup=ffffffffa0501cc0 failed -5
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(mds_log.c:218:mds_changelog_llog_init()) changelog users llog setup failed -5
1373184835 2013 Jul 7 10:13:55 bcluster111 kern err kernel LustreError: 14133:0:(mdd_device.c:216:mdd_changelog_llog_init()) no changelog user context
1373184835 2013 Jul 7 10:13:55 bcluster111 kern err kernel LustreError: 14133:0:(mdd_device.c:254:mdd_changelog_init()) Changelog setup during init failed -22
1373184835 2013 Jul 7 10:13:55 bcluster111 kern warning kernel Lustre: scratch3-MDT0000: used disk, loading

So the MDT is started, but without changelogs.

And if we try to look at changelog_users with lctl:

# lctl get_param mdd.scratch3-MDT0000.changelog_users
error: get_param: read('/proc/fs/lustre/mdd/scratch3-MDT0000/changelog_users') failed: No such device or address

The problem is that the customer needs Lustre changelogs: they are consumed by Robinhood to monitor activity on the file system.

So the first thing we need is a way to restart changelogs as soon as possible. We already tried every administrative Lustre command (lfs or lctl) to clean things up, but none of them worked, because the feature never started. We did not try manually cleaning the OBJECTS files, for fear of making the situation even worse.
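
For reference, the usual changelog administration commands on a healthy MDT look like the following (the cl1 user id is only an example); in our case none of them has any effect, since the changelog contexts are never set up:

# On the MDS:
lctl --device scratch3-MDT0000 changelog_register        # allocate a new user id (cl1, cl2, ...)
lctl --device scratch3-MDT0000 changelog_deregister cl1  # drop a user so its records can be purged
lctl get_param mdd.scratch3-MDT0000.changelog_users      # list registered users
# On a client:
lfs changelog scratch3-MDT0000                           # dump pending records
lfs changelog_clear scratch3-MDT0000 cl1 0               # acknowledge all records consumed by cl1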

Once the changelogs are restarted on site, we will need a fix so that changelogs can deal with corrupted data and start afresh in that case.

But again, the very first thing we need is a helping hand to clean things up on site and restart changelogs ASAP.

TIA,
Sebastien.



 Comments   
Comment by Bruno Faccini (Inactive) [ 14/Jan/14 ]

Hello Seb,
Can you run debugfs on the affected MDT and do a "stat changelog_users" and a "stat changelog_catalog" to provide their info?
Then can you mount the MDT as ldiskfs and do an "xxd changelog_users" to provide its corrupted content?
BTW, the wrong "0x10670000" value, vs. the expected LLOG_HDR_MAGIC, looks like CHANGELOG_USER_REC…
In the meantime, I am trying to set up a platform to test a [manual?] way for you to recover from this.
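
Something along these lines (the MDT device path below is just a placeholder, and the location of the changelog_* files may need to be adapted, e.g. CONFIGS/changelog_users):

# On the MDS node, against the MDT backing device:
debugfs -c -R 'stat changelog_users'   /dev/mdt_device
debugfs -c -R 'stat changelog_catalog' /dev/mdt_device
# Then mount the backing filesystem directly and dump the raw llog content:
mount -t ldiskfs /dev/mdt_device /mnt/mdt_ldiskfs
xxd /mnt/mdt_ldiskfs/changelog_users > /tmp/changelog_users_xxd
umount /mnt/mdt_ldiskfs
# For reference: LLOG_HDR_MAGIC is 0x10645539, and 0x10670000 corresponds to a
# CHANGELOG_USER_REC record type, hence my suspicion above.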

Comment by Sebastien Buisson (Inactive) [ 16/Jan/14 ]

Hi Bruno,

Please find attached two files:

  • debugfs_stat: stat of the 2 files changelog_users and changelog_catalog in debugfs;
  • changelog_users_xxd: xxd of the file changelog_users with the MDT mounted in ldiskfs.

HTH,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 16/Jan/14 ]

Bruno,

As there is this message "Cannot find handle for log 0x1490186b: -5" in the logs, people on site have also run stat on this file and taken an od dump:

[root@bcluster111 OBJECTS] # pwd
/mnt/scratch3/mdt/0_ldiskfs/OBJECTS
[root@bcluster111 OBJECTS] # od -tx4 1490186b:dbc122a8 | more
0000000 00000028 00000001 10670000 00000000
0000020 00000001 00000000 b3ef6c29 00000000
0000040 00000028 00000001 00000002 00000058
0000060 00000000 00000005 00000001 00000000
0000100 00000000 00000000 00000000 00000000
*
0000140 00000000 00000000 00000003 00000000
0000160 00000000 00000000 00000000 00000000
*
0020000 00000000 00000000 00002000 00000001
0020020 00000001 00000000 b3ef4519 00000000
0020040 00000028 00000001
0020050

[root@bcluster111 OBJECTS] # od -tx4 149025cb:672131de | head -5
0000000 00002000 00000000 10645539 00000000
0000020 51001a9c 00000000 00000012 00000058
0000040 00000040 00000002 00000001 00000000
0000060 00000000 00000000 00000000 00000000
*

[root@bcluster111 OBJECTS] # stat 1490186b:dbc122a8
  File: `1490186b:dbc122a8'
  Size: 8232 Blocks: 24 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 344987755 Links: 1
Access: (0666/-rw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-07-12 11:59:30.727996524 +0200
Modify: 2012-08-21 14:42:09.320548695 +0200
Change: 2012-08-21 14:42:09.320548695 +0200

Hopefully you will find this helpful.

Sebastien.

Comment by Bruno Faccini (Inactive) [ 16/Jan/14 ]

Yes, thanks, this was the next step I would have requested, since changelog_users is OK and points to OBJECTS/149025cb:672131de!
And it is this last file, 1490186b:dbc122a8, which is corrupted, as indicated by the "Cannot find handle for log 0x1490186b: -5" message you pointed out.

Will try to get back soon with a bypass/reconstruct procedure.

Comment by Bruno Faccini (Inactive) [ 20/Jan/14 ]

Hello Seb,
I did some more work/tests so that you can restart changelogs.
A "normal" OBJECTS file pointed to by changelog_users, with a single id (cl1) registered, looks like:

0000000 00002000 00000000 10645539 00000000
0000020 52dd1c70 00000000 00000002 00000058
0000040 00000000 00000005 00000001 00000000
0000060 00000000 00000000 00000000 00000000
*
0000120 00000000 00000000 00000003 00000000
0000140 00000000 00000000 00000000 00000000
*
0017760 00000000 00000000 00002000 00000001
0020000 00000028 00000001 10670000 00000000
0020020 00000001 00000000 00000dd7 00000000
0020040 00000028 00000001
0020050

So yours is badly corrupted and lacks its full header record!… But BTW, there is no need to try to reconstruct it since, as I understood, you restarted and used the MDT without changelogs enabled, so a full RobinHood scan will be required to re-populate its database from scratch anyway.
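
If you want to double-check on your side, the first 16 bytes are enough to tell whether an llog file still has its header record (paths relative to the ldiskfs-mounted MDT, as in your dumps above):

# healthy llog: the first record is the 8 KiB header, type LLOG_HDR_MAGIC (0x10645539)
od -An -tx4 -N 16 OBJECTS/149025cb:672131de
#  00002000 00000000 10645539 00000000
# corrupted llog: it starts directly with a 0x28-byte CHANGELOG_USER_REC (0x10670000),
# i.e. the header record is missing
od -An -tx4 -N 16 OBJECTS/1490186b:dbc122a8
#  00000028 00000001 10670000 00000000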

Then, to be able to restart changelogs, you will need to umount/stop the MDT, mount it as ldiskfs, move/mv both CONFIGS/changelog_[catalog,users] to new names, re-start/mount the MDT, and re-register the changelog user/id configured in RobinHood.

Moving both CONFIGS/changelog_[catalog,users] to new names (rather than deleting them) is strongly required, since it will allow the related OBJECTS/* files to be cleaned up later.
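
To make the sequence completely explicit, it would look something like this (the MDT device and the mount points below are placeholders to adapt to your site):

# 1. stop the MDT
umount /mnt/scratch3/mdt
# 2. mount the backing device as ldiskfs and move the two catalogs aside
mount -t ldiskfs /dev/mdt_device /mnt/scratch3/mdt/0_ldiskfs
cd /mnt/scratch3/mdt/0_ldiskfs
mv CONFIGS/changelog_catalog CONFIGS/changelog_catalog.old
mv CONFIGS/changelog_users   CONFIGS/changelog_users.old
cd / && umount /mnt/scratch3/mdt/0_ldiskfs
# 3. restart the MDT as usual; new, empty changelog_catalog/changelog_users files are recreated
mount -t lustre /dev/mdt_device /mnt/scratch3/mdt
# 4. re-register a changelog user and configure the returned id (cl1, cl2, ...) in RobinHood
lctl --device scratch3-MDT0000 changelog_register
lctl get_param mdd.scratch3-MDT0000.changelog_users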

Additionally, when you say "we will need a fix so that changelogs can deal with corrupted data and start afresh in that case", do you mean that the procedure I described above should be run automatically during MDT mount/start?

Comment by Sebastien Buisson (Inactive) [ 20/Jan/14 ]

Hi Bruno,

Thanks for the procedure; I have forwarded it to the onsite support team.

Concerning the fix, what I meant was that the MDT should be able to cope with a corrupted OBJECTS file and start the changelog feature even in that case, for instance by ignoring the corrupted file. But after reading your last comment, it seems that this "restart in degraded mode" approach would lead to an implicit reset of the changelog configuration (users and catalog), so it may be a little too drastic...
On the other hand, I think resetting the changelog configuration (via an lctl or lfs command) should be possible even when the feature has not started on the MDT. That would avoid mounting the MDT as ldiskfs and manually moving files.
What do you think?

Thanks!
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 29/Jan/14 ]

Hi,

Concerning the procedure, I have confirmation from the Support team that it worked fine. Changelogs are now functional on site, thanks!

Cheers,
Sebastien.
