[LU-4481] Impossible to start changelogs after corruption Created: 14/Jan/14 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12272 |
| Description |
|
Hi,

On a customer cluster, changelogs refuse to start, probably because of an internal data corruption.

1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14143:0:(llog_lvfs.c:199:llog_lvfs_read_header()) bad log header magic: 0x10670000 (expected 0x10645539)
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14143:0:(llog_obd.c:320:cat_cancel_cb()) Cannot find handle for log 0x1490186b: -5
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(llog_obd.c:393:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -5
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(llog_obd.c:220:llog_setup_named()) obd mdd_obd-scratch3-MDT0000 ctxt 14 lop_setup=ffffffffa0501cc0 failed -5
1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(mds_log.c:218:mds_changelog_llog_init()) changelog users llog setup failed -5
1373184835 2013 Jul 7 10:13:55 bcluster111 kern err kernel LustreError: 14133:0:(mdd_device.c:216:mdd_changelog_llog_init()) no changelog user context
1373184835 2013 Jul 7 10:13:55 bcluster111 kern err kernel LustreError: 14133:0:(mdd_device.c:254:mdd_changelog_init()) Changelog setup during init failed -22
1373184835 2013 Jul 7 10:13:55 bcluster111 kern warning kernel Lustre: scratch3-MDT0000: used disk, loading

So the MDT is started, but without changelogs. And if we try to look at changelog_users with lctl:

# lctl get_param mdd.scratch3-MDT0000.changelog_users
error: get_param: read('/proc/fs/lustre/mdd/scratch3-MDT0000/changelog_users') failed: No such device or address
The problem is that the customer needs Lustre changelogs, because they are consumed by Robinhood to monitor activity on the file system. So the first thing we need is a way to restart changelogs as soon as possible. We have already tried every administrative Lustre command (lfs or lctl) to clean things up, but none of them worked because the changelog feature did not start. We did not try manually cleaning the OBJECTS files, for fear of making the situation even worse. Once changelogs have been restarted on site, we will need a fix so that changelogs can deal with corrupted data and start afresh in that case. But again, the very first thing we need is a helping hand to clean things up on site and restart changelogs ASAP. TIA, |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 14/Jan/14 ] |
|
Hello Seb, |
| Comment by Sebastien Buisson (Inactive) [ 16/Jan/14 ] |
|
Hi Bruno, Please find attached two files:
HTH, |
| Comment by Sebastien Buisson (Inactive) [ 16/Jan/14 ] |
|
Bruno,

As there is this message "Cannot find handle for log 0x1490186b: -5" in the logs, people on site have also run stat on this file and taken an od dump:

[root@bcluster111 OBJECTS] # pwd
/mnt/scratch3/mdt/0_ldiskfs/OBJECTS
[root@bcluster111 OBJECTS] # od -tx4 1490186b:dbc122a8 | more
0000000 00000028 00000001 10670000 00000000
0000020 00000001 00000000 b3ef6c29 00000000
0000040 00000028 00000001 00000002 00000058
0000060 00000000 00000005 00000001 00000000
0000100 00000000 00000000 00000000 00000000
*
0000140 00000000 00000000 00000003 00000000
0000160 00000000 00000000 00000000 00000000
*
0020000 00000000 00000000 00002000 00000001
0020020 00000001 00000000 b3ef4519 00000000
0020040 00000028 00000001
0020050
[root@bcluster111 OBJECTS] # od -tx4 149025cb:672131de | head -5
0000000 00002000 00000000 10645539 00000000
0000020 51001a9c 00000000 00000012 00000058
0000040 00000040 00000002 00000001 00000000
0000060 00000000 00000000 00000000 00000000
*
[root@bcluster111 OBJECTS] # stat 1490186b:dbc122a8
  File: `1490186b:dbc122a8'
  Size: 8232 Blocks: 24 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 344987755 Links: 1
Access: (0666/-rw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-07-12 11:59:30.727996524 +0200
Modify: 2012-08-21 14:42:09.320548695 +0200
Change: 2012-08-21 14:42:09.320548695 +0200

Maybe you could find this helpful. Sebastien. |
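The comparison between the two dumps can also be done programmatically. The following is a minimal sketch, not the kernel's own check: it assumes the file starts with three little-endian 32-bit words and that the third word (byte offset 8) carries the value the kernel compares against 0x10645539, which is what the od output of the healthy file 149025cb:672131de suggests. The function names are hypothetical, and the script only reads the file, so it is best run against a copy or a read-only ldiskfs mount.

```python
import struct

# Expected header magic, as quoted in the "bad log header magic" kernel message.
LLOG_HDR_MAGIC = 0x10645539


def read_llog_header_type(path):
    """Return the third 32-bit word of the file (byte offset 8).

    Assumption: the on-disk words are little-endian, matching what
    `od -tx4` prints on the x86 MDS in this ticket.
    """
    with open(path, "rb") as f:
        raw = f.read(12)
    if len(raw) < 12:
        raise ValueError(f"{path}: too short to hold a log header")
    _rec_len, _rec_index, rec_type = struct.unpack("<III", raw)
    return rec_type


def has_valid_llog_header(path):
    """True if the file starts with the expected llog header magic."""
    return read_llog_header_type(path) == LLOG_HDR_MAGIC
```

With the dumps shown above, such a check would report 0x10670000 for 1490186b:dbc122a8 and 0x10645539 for 149025cb:672131de, matching the kernel message.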
| Comment by Bruno Faccini (Inactive) [ 16/Jan/14 ] |
|
Yes thanks, it was the next step/need I would have requested since changelog_user is ok and pointing to OBJECTS/149025cb:672131de !! Will try to get back soon with a bypass/reconstruct procedure. |
| Comment by Bruno Faccini (Inactive) [ 20/Jan/14 ] |
|
Hello Seb,

0000000 00002000 00000000 10645539 00000000
0000020 52dd1c70 00000000 00000002 00000058
0000040 00000000 00000005 00000001 00000000
0000060 00000000 00000000 00000000 00000000
*
0000120 00000000 00000000 00000003 00000000
0000140 00000000 00000000 00000000 00000000
*
0017760 00000000 00000000 00002000 00000001
0020000 00000028 00000001 10670000 00000000
0020020 00000001 00000000 00000dd7 00000000
0020040 00000028 00000001
0020050

So your is badly corrupted and lacks its full header record !!…

But BTW, there is no need to try to reconstruct it: as I understand it, you restarted and used the MDT without Change-Logs enabled, so a full RobinHood scan will be required to re-populate its database from scratch.

Then, to be able to restart Change-Logs, you will need to umount/stop the MDT, mount it as LDISKFS, move/mv both CONFIGS/changelog_[catalog,users] to new names, re-start/mount the MDT, and re-register the change-log user/id configured in RobinHood. Moving both CONFIGS/changelog_[catalog,users] to new names is strongly required, since it will allow for later clean-up of the related OBJECTS/* files.

Additionally, when you say "we will need a fix so that changelogs can deal with corrupted data and start afresh in that case", do you mean that the procedure I described above should be automatic during MDT mount/start ? |
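The renaming step of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the MDT is already stopped and mounted as ldiskfs at a mount point passed in by the caller, the helper name and the ".old-" suffix scheme are hypothetical, and the other steps (stopping the MDT, remounting it as Lustre, re-registering the changelog user) are not covered.

```python
import os
import time

# The two files named in the procedure, relative to CONFIGS/.
CHANGELOG_FILES = ("changelog_catalog", "changelog_users")


def set_aside_changelog_files(ldiskfs_root, suffix=None):
    """Rename CONFIGS/changelog_{catalog,users} under an ldiskfs mount point.

    Both renames are validated before either is performed, so a missing
    source or an existing target leaves the directory untouched.
    Returns a list of (old_path, new_path) pairs.
    """
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    configs = os.path.join(ldiskfs_root, "CONFIGS")

    planned = []
    for name in CHANGELOG_FILES:
        old = os.path.join(configs, name)
        new = f"{old}.old-{suffix}"  # hypothetical naming scheme
        if not os.path.exists(old):
            raise FileNotFoundError(old)
        if os.path.exists(new):
            raise FileExistsError(new)
        planned.append((old, new))

    for old, new in planned:
        os.rename(old, new)
    return planned
```

Keeping the old files under new names, rather than deleting them, preserves the references needed for the later clean-up of the related OBJECTS/* files mentioned in the procedure.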
| Comment by Sebastien Buisson (Inactive) [ 20/Jan/14 ] |
|
Hi Bruno, Thanks for the procedure, I have forwarded it to the onsite support team. Concerning the fix, what I meant was that the MDT should be able to cope with a corrupted OBJECTS file and start the changelog feature even in that case, for instance by ignoring it. But after reading your last comment, it seems that this "restart in degraded mode" approach would lead to an implicit reset of the changelogs config (users and catalog). So this may be a little bit too strong... Thanks! |
| Comment by Sebastien Buisson (Inactive) [ 29/Jan/14 ] |
|
Hi, Concerning the procedure, I have confirmation from the Support team that it worked fine. Changelogs are now functional on site, thanks! Cheers, |