Details


    Description

      It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for LU-1352. Sometime after power-cycling the MDS and letting it go through recovery, clients started getting EFAULT when writing to Lustre. These failures are accompanied by the following console errors on the MDS.

      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]
      

      I mentioned this in LU-1570, but I figured a new ticket was needed.

          Activity

            [LU-1586] no free catalog slots for log

            Yes, the changelogs could definitely be a factor. Once there is a registered changelog user, the changelogs are kept on disk until they are consumed. That ensures that if e.g. Robinhood crashes, or has some other problem for a day or four, it won't have to do a full scan just to recover the state again.

            However, if the ChangeLog user is not unregistered, the changelogs will be kept until they run out of space. I suspect that is the root cause here, and should be investigated further. This bug should be CC'd to Jinshan and Aurelien Degremont, who are working on HSM these days.
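
            As a concrete check (a minimal sketch; the MDT name lc2-MDT0000 matches the console logs elsewhere in this ticket, and the changelog user id cl1 is only an example):

            # on the MDS: list registered changelog users and the record each has cleared up to
            lctl get_param mdd.lc2-MDT0000.changelog_users

            # if a user (e.g. cl1) is registered but no longer consuming records,
            # deregistering it allows the accumulated changelog records to be purged
            lctl --device lc2-MDT0000 changelog_deregister cl1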

            Cheers, Andreas

            adilger Andreas Dilger added a comment

            -17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

            Even more strangely, the EEXIST errors persisted after reformatting the MDT. To test whether something on disk was to blame, we formatted a loopback device with the same filesystem name, and still got EEXIST mounting the MDT. Formatting the loopback device with a different filesystem name worked as expected. Rebooting the MDS node cleared the problem (I suspect reloading the module stack would have as well). There must have been some in-memory state left over from the original filesystem (we double-checked in /proc/mounts that the old MDT was not mounted).
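
            A rough way to check for and clear that kind of leftover state (a sketch; it assumes no Lustre targets or clients are still mounted on the node):

            # list obd devices still registered in memory; a stale lc2-MDT0000-mdtlov
            # entry here would explain the EEXIST on the next mount attempt
            lctl dl

            # unload the whole Lustre/LNet module stack, discarding any such devices
            lustre_rmmod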

            In any case, this seems like something to track in a separate (but possibly related) issue.

            it seems there is something at your site that is consuming more llogs than normal.

            By any chance, do you suppose running Robinhood could be a factor? I just learned one of our admins was evaluating Robinhood on this filesystem, but my understanding is that it just reads the Changelogs, so I don't suspect a connection.

            Thanks for the test suggestions. We'll give them a try when the filesystem is free again.

            nedbass Ned Bass (Inactive) added a comment

            -17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

            You are correct that bringing the OSTs back online should cause the OST recovery logs to be cleaned up. Along with the message in LU-1570, it seems there is something at your site that is consuming more llogs than normal. Each llog should allow up to 64k unlinks to be stored for recovery, and up to 64k llogs in a catalog PER OST, though new llog files are started for each boot. That means 4B unlinks or 64k reboots, or combinations thereof, per OST before the catalog wraps back to zero (per LU-1570).
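
            Spelled out, that estimate is simply (a back-of-the-envelope figure that ignores llog header records and other overhead):

            # ~64k plain llog records per log file x ~64k catalog slots per OST
            echo $((65536 * 65536))    # 4294967296, i.e. roughly 4 billion unlinks per OST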

            The logs should be deleted sequentially after the MDT->OST orphan recovery is completed when the OST reconnects, freeing up their slot in the catalog file. It is possible that something was broken in this code in 2.x and it hasn't been noticed until now, since it would take a long time to see the symptoms.

            A simple test to reproduce this would be to create & delete files in a loop (~1M) on a specific OST (using "mkdir $LUSTRE/test_ostN; lfs setstripe -i N $LUSTRE/test_ostN; createmany -o $LUSTRE/test_ostN/f 1000000", where "N" is some OST number) and see if the number of llog files in the MDT OBJECTS/ directory is increasing steadily over time (beyond 2 or 3 files per OST). I don't recall specifically, but it may need an unmount and remount of the OST for the llog files to be cleaned up.
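
            A slightly expanded sketch of that test (the OST index, paths, and MDT device name below are placeholders):

            # on a client: create and then delete ~1M files confined to OST N
            N=3
            mkdir -p $LUSTRE/test_ost$N
            lfs setstripe -i $N $LUSTRE/test_ost$N
            createmany -o $LUSTRE/test_ost$N/f 1000000
            unlinkmany $LUSTRE/test_ost$N/f 1000000

            # on the MDS: rough count of llog files in OBJECTS/ while the MDT stays mounted
            # (debugfs in catastrophic/read-only mode; /dev/mapper/mdt is a placeholder)
            debugfs -c -R 'ls -l OBJECTS' /dev/mapper/mdt 2>/dev/null | wc -l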

            Failing that test, try creating a large number of files (~1M) in $LUSTRE/test_ostN, and then unmount OST N and delete all the files. This should succeed without error, but there will be many llog entries stored in the llog file. The llog files should be cleaned when this OST is mounted again.
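
            And the second test in the same form (again, device paths and mount points are placeholders):

            # on a client: populate the OST-N directory while OST N is still online
            createmany -o $LUSTRE/test_ost$N/f 1000000

            # on the OSS: take OST N offline
            umount /mnt/ost$N

            # on the client: unlink everything; each object destroy should be queued
            # as an llog record on the MDT because OST N cannot process it yet
            unlinkmany $LUSTRE/test_ost$N/f 1000000

            # on the OSS: bring OST N back; after orphan recovery the MDT's llog
            # files for this OST should be processed and removed
            mount -t lustre /dev/mapper/ost$N /mnt/ost$N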

            adilger Andreas Dilger added a comment

            Creating and deleting a lot of files, but having OSTs offline could do this.

            I still don't really understand this failure mode. Shouldn't the llogs be replayed and removed when the OSTs reconnect to the MDS?

            Also, we found that removing the CATALOGS file didn't resolve this. We even went so far as to clear the OBJECTS and CONFIGS directories and issue a writeconf, but were unable to resuscitate the MDS.
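
            The writeconf part of that, in rough outline (the device and mount paths here are illustrative, not the exact ones used); the console errors that followed are shown below:

            umount /mnt/mdt                                        # stop the MDT
            mount -t ldiskfs /dev/mapper/mdt /mnt/ldiskfs
            rm -rf /mnt/ldiskfs/OBJECTS/* /mnt/ldiskfs/CONFIGS/*   # clear the llog and config files
            umount /mnt/ldiskfs
            tunefs.lustre --writeconf /dev/mapper/mdt              # regenerate config logs on next mount
            mount -t lustre /dev/mapper/mdt /mnt/mdt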

            Lustre: MGS: Logs for fs lc2 were removed by user request.  All servers must be restarted in order to regenerate the logs.
            Lustre: Setting parameter lc2-MDT0000-mdtlov.lov.stripecount in log lc2-MDT0000
            Lustre: Setting parameter lc2-clilov.lov.stripecount in log lc2-client
            LustreError: 5569:0:(genops.c:304:class_newdev()) Device lc2-MDT0000-mdtlov already exists, won't add
            LustreError: 5569:0:(obd_config.c:327:class_attach()) Cannot create device lc2-MDT0000-mdtlov of type lov : -17
            LustreError: 5569:0:(obd_config.c:1363:class_config_llog_handler()) Err -17 on cfg command:
            Lustre:    cmd=cf001 0:lc2-MDT0000-mdtlov  1:lov  2:lc2-MDT0000-mdtlov_UUID  
            LustreError: 15c-8: MGC10.1.1.212@o2ib9: The configuration from log 'lc2-MDT0000' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            LustreError: 5513:0:(obd_mount.c:1192:server_start_targets()) failed to start server lc2-MDT0000: -17
            LustreError: 5513:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -17
            LustreError: 5513:0:(obd_mount.c:1512:server_put_super()) no obd lc2-MDT0000
            

            We needed the filesystem for other testing, so we gave up and reformatted it. However, we're concerned that this failure may occur in production and we don't have a recovery process. So we'd really like to understand better what happened and how to fix it.

            nedbass Ned Bass (Inactive) added a comment

            is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files, but having OSTs offline could do this.

            It's very possible that's what happened. The OSTs were down overnight while our I/O SWL tests were running on the clients. That may have included mdtest, which creates lots of files. I thought the MDS was down too, but I'd have to go back over the logs to reconstruct what really happened.

            It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files.

            # zwicky-mds2 > ls OBJECTS/ | wc -l
            65428

            nedbass Ned Bass (Inactive) added a comment

            If this is a production system, it would be possible to delete the CATALOGS file on the MDS.
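
            A minimal sketch of that operation (the MDT must be stopped first; the device path is a placeholder, and keeping a backup of the file is assumed to be prudent):

            umount /mnt/mdt                                  # stop the MDT
            mount -t ldiskfs /dev/mapper/mdt /mnt/ldiskfs    # mount the MDT backing filesystem directly
            cp -a /mnt/ldiskfs/CATALOGS /root/CATALOGS.bak   # keep a copy, just in case
            rm /mnt/ldiskfs/CATALOGS                         # a fresh catalog is created on the next mount
            umount /mnt/ldiskfs
            mount -t lustre /dev/mapper/mdt /mnt/mdt         # restart the MDT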

            That said, is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files, but having OSTs offline could do this.

            It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files.

            adilger Andreas Dilger added a comment

            Indeed, these are the types of issues we're hoping to shake out before we even consider putting that patch in production. I should mention that the "catlog 0x27500007 crosses index zero" error message from LU-1570 began appearing in the MDS logs before we installed the LU-1352 patch.

            nedbass Ned Bass (Inactive) added a comment

            This problem almost certainly has a common underlying cause with LU-1570. Both come down to damage in the catalog. Don't plan to close this as a dup until we complete our analysis.

            Ned,
            You say this happened while you were testing a fix from LU-1352. As far as I know that backport to b2_1 has not yet been fully tested and accepted. Kind of suspicious.

            bogl Bob Glossman (Inactive) added a comment

            People

              bogl Bob Glossman (Inactive)
              nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 17
