
    Description

      It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for LU-1352. Sometime after power-cycling the MDS and letting it go through recovery, clients started getting EFAULT writing to lustre. These failures are accompanied by the following console errors on the MDS.

      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]
      

      I mentioned this in LU-1570, but I figured a new ticket was needed.

    Activity

            [LU-1586] no free catalog slots for log

            It only took about 1.3 million changelog entries to fill the catalog. My test case was something like

            MDSDEV1=/dev/sda llmount.sh
            lctl --device lustre-MDT0000 changelog_register
            while createmany -m /mnt/lustre/%d 1000 ; do
                unlinkmany /mnt/lustre/%d 1000
            done
            

            and it made it through about 670 iterations before failing.
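
            For what it's worth, that count is consistent with two changelog records per file, one for each create and one for each unlink:

            670 iterations x 1000 files x 2 records/file ~= 1.34 million records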

            nedbass Ned Bass (Inactive) added a comment

            Ned, I agree this should be handled more gracefully. I think it is preferable to unregister the oldest consumer as the catalog approaches full, which should cause old records to be released (need to check this). That is IMHO better than setting the mask to zero and no longer recording new events.

            In both cases the consumer will have to do some scanning to find new changes. However, in the first case, it is more likely that the old consumer is no longer in use and no harm is done, while in the second case even a well-behaved consumer is punished.

            On a related note, do you know how many files were created before the catalog was full? In theory about 4B Changelog entries should be possible (approx 64000^2), but this might be reduced by some small factor if there are multiple records per file (e.g. create + setattr).
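
            In the meantime an administrator can check for stale consumers and drop the oldest one by hand. A rough sketch only; the MDT name and the cl1 user ID are examples, and the changelog_users output format varies a little between versions:

            # list registered changelog consumers and the current record index
            lctl get_param mdd.lustre-MDT0000.changelog_users

            # if the oldest consumer (say cl1) is abandoned, deregister it so its
            # already-consumed records can be cancelled and catalog slots reclaimed
            lctl --device lustre-MDT0000 changelog_deregister cl1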

            adilger Andreas Dilger added a comment

            It seems like lots of bad things can happen if the changelog catalog is allowed to become full: LU-2843 LU-2844 LU-2845. Besides these crashes the MDS service fails to start due to EINVAL errors from mdd_changelog_llog_init(), and the only way I've found to recover is manually deleting the changelog_catalog file.

            I'm interested in adding safety mechanisms to prevent this situation. Perhaps the MDS could automatically unregister changelog users or set the changelog mask to zero based on a tunable threshold of unprocessed records. Does anyone have other ideas for how to handle this more gracefully?
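
            Until something like that exists server-side, a crude external watchdog could approximate the tunable threshold. This is only a sketch; the MDT name, THRESHOLD value, and the parsing of changelog_users are assumptions about the local setup:

            #!/bin/bash
            # warn when the gap between the newest changelog record and the
            # slowest consumer exceeds a threshold, before the catalog fills up
            MDT=lustre-MDT0000
            THRESHOLD=1000000

            users=$(lctl get_param -n mdd.$MDT.changelog_users)
            current=$(echo "$users" | awk '/current index/ {print $3}')
            oldest=$(echo "$users" | awk '$1 ~ /^cl/ {print $2}' | sort -n | head -1)

            if [ -n "$current" ] && [ -n "$oldest" ] && [ $((current - oldest)) -gt "$THRESHOLD" ]; then
                echo "changelog backlog on $MDT exceeds $THRESHOLD records" | wall
            fi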

            nedbass Ned Bass (Inactive) added a comment

            Adding those involved with HSM for comment

            pjones Peter Jones added a comment

            Yes, the changelogs could definitely be a factor. Once there is a registered changelog user, the changelogs are kept on disk until they are consumed. That ensures that if e.g. Robinhood crashes, or has some other problem for a few days, it won't have to do a full scan just to recover the state again.

            However, if the ChangeLog user is not unregistered, the changelogs will be kept until they run out of space. I suspect that is the root cause here, and should be investigated further. This bug should be CC'd to Jinshan and Aurelien Degremont, who are working on HSM these days.
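
            For completeness, records are only released once the registered user clears them; roughly as below, where the MDT name, the cl1 user ID, and the record number 123456 are placeholders:

            # dump pending changelog records for this MDT
            lfs changelog lustre-MDT0000

            # acknowledge records up to 123456 for consumer cl1, letting the MDS
            # cancel them and free catalog space
            lfs changelog_clear lustre-MDT0000 cl1 123456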

            Cheers, Andreas

            adilger Andreas Dilger added a comment

            -17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

            Even more strangely, the EEXIST errors persisted after reformatting the MDT. To test whether something on disk was to blame, we formatted a loopback device with the same filesystem name, and still got EEXIST mounting the MDT. Formatting the loopback device with a different filesystem name worked as expected. Rebooting the MDS node cleared the problem (I suspect reloading the module stack would have as well). There must have been some in-memory state left over from the original filesystem (we double-checked in /proc/mounts that the old MDT was not mounted).

            In any case, this seems like something to track in a separate (but possibly related) issue.

            it seems there is something at your site that is consuming more llogs than normal.

            By any chance, do you suppose running Robinhood could be a factor? I just learned one of our admins was evaluating Robinhood on this filesystem, but my understanding is that it just reads the Changelogs, so I don't suspect a connection.

            Thanks for the test suggestions. We'll give them a try when the filesystem is free again.

            nedbass Ned Bass (Inactive) added a comment

            -17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

            You are correct that bringing the OSTs back online should cause the OST recovery logs to be cleaned up. Along with the message in LU-1570, it seems there is something at your site that is consuming more llogs than normal. Each llog should allow up to 64k unlinks to be stored for recovery, and up to 64k llogs in a catalog PER OST, though new llog files are started for each boot. That means 4B unlinks or 64k reboots, or combinations thereof, per OST before the catalog wraps back to zero (per LU-1570).

            The logs should be deleted sequentially after the MDT->OST orphan recovery is completed when the OST reconnects, freeing up their slot in the catalog file. It is possible that something was broken in this code in 2.x and it hasn't been noticed until now, since it would take a long time to see the symptoms.

            A simple test to reproduce this would be to create & delete files in a loop (~1M) on a specific OST (using "mkdir $LUSTRE/test_ostN; lfs setstripe -i N $LUSTRE/test_ostN; createmany -o $LUSTRE/test_ostN/f 1000000", where "N" is some OST number) and see if the number of llog files in the MDT OBJECTS/ directory is increasing steadily over time (beyond 2 or 3 files per OST). I don't recall specifically, but it may need an unmount and remount of the OST for the llog files to be cleaned up.

            Failing that test, try creating a large number of files (~1M) in $LUSTRE/test_ostN, and then unmount OST N and delete all the files. This should succeed without error, but there will be many llog entries stored in the llog file. The llog files should be cleaned when this OST is mounted again.
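
            Spelled out as a script, the first test would look roughly like this. The OST index, iteration counts, $LUSTRE client mount point, and the ldiskfs mount path for the MDT are all just examples:

            # create and delete files striped to a single OST, then watch llog growth on the MDT
            N=0                                  # OST index to target
            mkdir -p $LUSTRE/test_ost$N
            lfs setstripe -i $N $LUSTRE/test_ost$N
            for i in $(seq 1 1000); do           # ~1M creates/unlinks in total
                createmany -o $LUSTRE/test_ost$N/f 1000
                unlinkmany $LUSTRE/test_ost$N/f 1000
            done
            # on the MDS (MDT mounted as ldiskfs), see if llog files keep accumulating
            ls /mnt/mdt-ldiskfs/OBJECTS | wc -l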

            adilger Andreas Dilger added a comment

            Creating and deleting a lot of files, but having OSTs offline could do this.

            I still don't really understand this failure mode. Shouldn't the llogs be replayed and removed when the OSTs reconnect to the MDS?

            Also, we found that removing the CATALOGS file didn't resolve this. We even went so far as to clear the OBJECTS and CONFIGS directories and issue a writeconf, but were unable to resuscitate the MDS.

            Lustre: MGS: Logs for fs lc2 were removed by user request.  All servers must be restarted in order to regenerate the logs.
            Lustre: Setting parameter lc2-MDT0000-mdtlov.lov.stripecount in log lc2-MDT0000
            Lustre: Setting parameter lc2-clilov.lov.stripecount in log lc2-client
            LustreError: 5569:0:(genops.c:304:class_newdev()) Device lc2-MDT0000-mdtlov already exists, won't add
            LustreError: 5569:0:(obd_config.c:327:class_attach()) Cannot create device lc2-MDT0000-mdtlov of type lov : -17
            LustreError: 5569:0:(obd_config.c:1363:class_config_llog_handler()) Err -17 on cfg command:
            Lustre:    cmd=cf001 0:lc2-MDT0000-mdtlov  1:lov  2:lc2-MDT0000-mdtlov_UUID  
            LustreError: 15c-8: MGC10.1.1.212@o2ib9: The configuration from log 'lc2-MDT0000' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            LustreError: 5513:0:(obd_mount.c:1192:server_start_targets()) failed to start server lc2-MDT0000: -17
            LustreError: 5513:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -17
            LustreError: 5513:0:(obd_mount.c:1512:server_put_super()) no obd lc2-MDT0000
            

            We needed the filesystem for other testing, so we gave up and reformatted it. However, we're concerned that this failure may occur in production and we don't have a recovery process. So we'd really like to understand better what happened and how to fix it.
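
            For anyone retrying this, the generic writeconf sequence is roughly as follows; device names and mount points are placeholders, and this is the standard procedure rather than necessarily the exact steps we used:

            # stop all clients and servers first
            umount /mnt/mdt                               # and every OST mount point
            tunefs.lustre --writeconf /dev/mdt_device     # MGS/MDT
            tunefs.lustre --writeconf /dev/ost_device     # repeat for each OST
            # restart the MGS/MDT first, then OSTs, then clients, to regenerate the config logs
            mount -t lustre /dev/mdt_device /mnt/mdt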

            nedbass Ned Bass (Inactive) added a comment

            is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files, but having OSTs offline could do this.

            It's very possible that's what happened. The OSTs were down overnight while our I/O SWL tests were running on the clients. That may have included mdtest which does create lots of files. I thought the MDS was down too, but I'd have to go back over the logs to reconstruct what really happened.

            It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files.

            # zwicky-mds2 > ls OBJECTS/ | wc -l
            65428

            nedbass Ned Bass (Inactive) added a comment

            If this is a production system, it would be possible to delete the CATALOGS file on the MDS.

            That said, is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files, but having OSTs offline could do this.

            It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files.
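
            On an ldiskfs MDT that listing can be taken with the target mounted as ldiskfs; the device and mount point below are examples:

            mount -t ldiskfs -o ro /dev/mdt_device /mnt/mdt-ldiskfs
            ls /mnt/mdt-ldiskfs/OBJECTS | wc -l
            ls -l /mnt/mdt-ldiskfs/CATALOGS
            umount /mnt/mdt-ldiskfs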

            adilger Andreas Dilger added a comment

            Indeed, these are the types of issues we're hoping to shake out before we even consider putting that patch in production. I should mention that the "catlog 0x27500007 crosses index zero" error message from LU-1570 began appearing in the MDS logs before we installed the LU-1352 patch.

            nedbass Ned Bass (Inactive) added a comment

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 17
