Details
-
Bug
-
Resolution: Fixed
-
Critical
-
None
-
Lustre 2.5.4
-
lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
-
3
Description
The MDS service on both porter and stout fails to start. The ZFS pool imports on both systems with no problem. The MGS device mounts without issue, but the MDT on both systems fails to mount. Running a "writeconf" on the stout MDS did not help. The following console messages were reported on the stout-mds1 console:
2015-08-02 16:38:26 Lustre: Lustre: Build Version: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.16.2.1chaos.ch5.3.x86_64
2015-08-02 16:38:27 Lustre: MGC172.21.1.99@o2ib200: Connection restored to MGS (at 0@lo)
2015-08-02 16:38:28 Lustre: MGS: Logs for fs fsrzb were removed by user request. All servers must be restarted in order to regenerate the logs.
2015-08-02 16:38:30 LustreError: 11-0: fsrzb-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
2015-08-02 16:38:31 Lustre: 12934:0:(llog_cat.c:718:llog_cat_reverse_process()) catalog 0x2:10 crosses index zero
2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5
2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:380:mdd_changelog_init()) fsrzb-MDD0000: changelog setup during init failed: rc = -5
2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:963:mdd_prepare()) fsrzb-MDD0000: failed to initialize changelog: rc = -5
2015-08-02 16:38:31 Lustre: fsrzb-MDT0000: Unable to start target: -5
2015-08-02 16:38:31 Lustre: Failing over fsrzb-MDT0000
2015-08-02 16:38:32 Lustre: server umount fsrzb-MDT0000 complete
2015-08-02 16:38:32 LustreError: 12934:0:(obd_mount.c:1331:lustre_fill_super()) Unable to mount (-5)
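For anyone reading the console output, the numeric codes appear to be negated Linux errno values, as is usual for kernel code. The sketch below is illustrative only and not part of any Lustre tooling: it pulls the codes out of two of the lines above and looks up their symbolic names. The regular expression and the helper names (`extract_codes`, `describe`) are assumptions made for this example. On Linux, 5 maps to EIO and 11 to EAGAIN.

```python
import errno
import os
import re

# Two lines copied from the console output in this ticket.
LOG_LINES = [
    "LustreError: 11-0: fsrzb-MDT0000-lwp-MDT0000: Communicating with 0@lo, "
    "operation mds_connect failed with -11.",
    "LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) "
    "fsrzb-MDD0000: changelog init failed: rc = -5",
]

# Hypothetical pattern: matches "rc = -N" or "failed with -N" and captures N.
RC_PATTERN = re.compile(r"(?:rc = |failed with )-(\d+)")


def extract_codes(lines):
    """Return the positive errno value found in each matching line."""
    codes = []
    for line in lines:
        match = RC_PATTERN.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


def describe(code):
    """Format a code with its symbolic name and the platform's message."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"-{code} = {name} ({os.strerror(code)})"


for code in extract_codes(LOG_LINES):
    print(describe(code))
```

The symbolic names come from the machine running the script, so the lookup is only meaningful on a Linux host like the servers in question.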
A workaround was found that allows the MDT to mount:
1. Mount the MDT via ZPL.
2. Delete the changelog_catalog and changelog_users files.
3. Unmount.
4. Mount the MDT via Lustre in the normal manner.
Bruno,
Do you have any updates on this? I see that the stout catalog file contains 67272 records, and it looks like the bitmap has only 64767 bits for tracking the status of the non-llog_log_hdr records. So it does seem to me that the changelog_catalog file is corrupt.
The records that appear after that have indices in the range 12197 - 14701, which seems odd. The code in llog_osd_prev_block() appears to me to assume that the records within a block have monotonically increasing indices, since only lrt_index is generally checked before deciding whether to read another block from disk or not. Am I correct about that requirement for increasing indices?
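To make the numbers above concrete, here is a small sketch of the arithmetic together with the kind of monotonicity check being asked about. It is not Lustre code. The function names are made up for illustration, the sample index sequence is invented, and whether the 67272 count includes the header record is not stated, so treat the comparison as suggestive only.

```python
# Figures quoted in this comment.
BITMAP_BITS = 64767          # bits available for non-header records
RECORDS_IN_FILE = 67272      # records found in the stout catalog file
ODD_RANGE = (12197, 14701)   # indices seen on the records past that point


def excess_records(records, capacity):
    """Number of records beyond what the bitmap can track."""
    return max(0, records - capacity)


def range_length(first, last):
    """Count of indices in the inclusive range [first, last]."""
    return last - first + 1


def first_non_increasing(indices):
    """Position of the first index not greater than its predecessor, else None."""
    for pos in range(1, len(indices)):
        if indices[pos] <= indices[pos - 1]:
            return pos
    return None


print(excess_records(RECORDS_IN_FILE, BITMAP_BITS))
print(range_length(*ODD_RANGE))

# Invented sample: indices climb to the bitmap limit, then drop back.
sample = [64765, 64766, 64767, 12197, 12198]
print(first_non_increasing(sample))
```

Both counts come out to 2505, which would be consistent with the extra records being numbered consecutively across 12197 - 14701. The last call reports position 3 for the invented sample, the point where a reader that relies on increasing indices would be misled.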
thanks,
Olaf