[LU-6954] LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5 Created: 04/Aug/15 Updated: 19/Feb/16 Resolved: 19/Feb/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64 The mds service on both porter and stout fails to start. Able to import zfs pool on both systems with no problem. The mgs device mounts with no problem but the mdt on both systems fails to mount. Doing a "writeconf" on the stout mds did not help. The following console messages were reported on stout-mds1 console: 2015-08-02 16:38:26 Lustre: Lustre: Build Version: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.16.2.1chaos.ch5.3.x86_64 2015-08-02 16:38:27 Lustre: MGC172.21.1.99@o2ib200: Connection restored to MGS (at 0@lo) 2015-08-02 16:38:28 Lustre: MGS: Logs for fs fsrzb were removed by user request. All servers must be restarted in order to regenerate the logs. 2015-08-02 16:38:30 LustreError: 11-0: fsrzb-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11. 2015-08-02 16:38:31 Lustre: 12934:0:(llog_cat.c:718:llog_cat_reverse_process()) catalog 0x2:10 crosses index zero 2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5 2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:380:mdd_changelog_init()) fsrzb-MDD0000: changelog setup during init failed: rc = -5 2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:963:mdd_prepare()) fsrzb-MDD0000: failed to initialize changelog: rc = -5 2015-08-02 16:38:31 Lustre: fsrzb-MDT0000: Unable to start target: -5 2015-08-02 16:38:31 Lustre: Failing over fsrzb-MDT0000 2015-08-02 16:38:32 Lustre: server umount fsrzb-MDT0000 complete 2015-08-02 16:38:32 LustreError: 12934:0:(obd_mount.c:1331:lustre_fill_super()) Unable to mount (-5) A workaround was found to allow the MDT to mount: |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 04/Aug/15 ] |
|
Hello Olaf, |
| Comment by Olaf Faaland [ 04/Aug/15 ] |
|
Bruno, |
| Comment by Bruno Faccini (Inactive) [ 04/Aug/15 ] |
|
Thanks Olaf, and here is what I can tell after analyzing the changelog_catalog file you have provided. |
| Comment by Christopher Morrone [ 04/Aug/15 ] |
|
I find it exceptionally unlikely that two different filesystems had both independently "just reached" the end of the changelog_catalog and were just about to loop back at exactly the same time. More explanation is needed. |
| Comment by Bruno Faccini (Inactive) [ 04/Aug/15 ] |
|
Hello Chris, thanks to warn me about the fact that 2 FSs are affected, I should have better read the description text for this ticket, sorry about that. But, if the same symptoms/msgs have occurred for both Filesystems failures, I can already confirm that the "crosses index zero" msg is an indication of a Catalog loop-back, and I will also need the 2nd changelog_catalog file for the 2nd filesystem to analyze it. And BTW, I have double-checked the 1st Catalog you have provided and I can also confirm you that it shows the same corruption (Catalog records written past normal end, leading to a Catalog size > header+bitmap+records) than what has been found for Concerning the fact I used the "just reached" comment, this may come from the fact that, for an unexplained reason at the moment, bits at the beginning of the bitmap have been cleared (or may be never set). |
| Comment by Olaf Faaland [ 05/Aug/15 ] |
|
Bruno, I confirmed that both filesystems produced the same sequence of error messages when attempting to start the MDT, including the "crosses index zero" and "changelog init failed" messages, same rc's. We do have the I'll attach the second changelog_catalog. The one you've already seen is from porter. |
| Comment by Olaf Faaland [ 14/Aug/15 ] |
|
Bruno, Do you have any updates on this? I see that the stout catalog file contains 67272 records, and it looks like the bitmap has only 64767 bits for tracking the status of the non- llog_log_hdr records. So it does seem to me that the changelog_catalog file is corrupt. The records that appear after that have indices in the range 12197 - 14701, which seems odd. The code in llog_osd_prev_block() appears to me to assume that the records within a block have monotonically increasing indices, since only lrt_index is generally checked before deciding whether to read another block from disk or not. Am I correct that requirement for increasing indices? thanks, |
| Comment by Bruno Faccini (Inactive) [ 15/Aug/15 ] |
|
Olaf, So the scenario which is likely to have happen for both filesystems is that prior to start running with patch for Am I right when I suspect that problem has occured during the 2nd restart for each FS since upgrading with |
| Comment by Olaf Faaland [ 19/Aug/15 ] |
|
Bruno, Yes, it is likely we encountered this error on the 2nd restart after upgrading with I am looking at the code in llog_osd_write_rec() and your patch for thanks, |
| Comment by Olaf Faaland [ 19/Aug/15 ] |
|
Bruno, I can see how we could end up with the changelog_catalog file corruption if, before we upgraded to However, in the thanks, |
| Comment by Olaf Faaland [ 19/Aug/15 ] |
|
I guess what I'm really asking is, did changelog_catalogs wrap around prior to the thanks, |
| Comment by Olaf Faaland [ 19/Aug/15 ] |
|
Bruno, Looks to me like this code in llog_cat_new_log() implemented wrap-around before 76 bitmap_size = LLOG_BITMAP_SIZE(llh); 77 78 index = (cathandle->lgh_last_idx + 1) % bitmap_size; ... 118 cathandle->lgh_last_idx = index; 119 llh->llh_tail.lrt_index = index; thanks, |
| Comment by Bruno Faccini (Inactive) [ 19/Aug/15 ] |
|
Olaf, |
| Comment by Bruno Faccini (Inactive) [ 08/Dec/15 ] |
|
Hello Olaf, |
| Comment by Olaf Faaland [ 09/Dec/15 ] |
|
Hi Bruno, |
| Comment by John Fuchs-Chesney (Inactive) [ 11/Dec/15 ] |
|
Thanks Bruno and Olaf. |