    Description

      It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for LU-1352. Sometime after power-cycling the MDS and letting it go through recovery, clients started getting EFAULT writing to lustre. These failures are accompanied by the following console errors on the MDS.

      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]
      

      I mentioned this in LU-1570, but I figured a new ticket was needed.
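
      One way to confirm whether the catalog itself is full (offered only as a sketch, assuming an ldiskfs MDT on /dev/sda as in the test setup below; adjust the device name) is to dump the changelog_catalog file with debugfs and decode it with llog_reader:

      # ideally with the MDT stopped, dump the catalog file and decode its records
      debugfs -c -R 'dump /changelog_catalog /tmp/changelog_catalog' /dev/sda
      llog_reader /tmp/changelog_catalog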

          Activity

            [LU-1586] no free catalog slots for log

            adilger Andreas Dilger added a comment -

            Close as a duplicate of LU-7340.
            kilian Kilian Cavalotti added a comment - edited

            As a matter of fact, it happened to us on a production filesystem. I wouldn't say the workload is non-pathological, though.

            Anyway, we noticed at some point that a metadata operation such as "chown" could lead to ENOSPC:

            # chown userA /scratch/users/userA
            chown: changing ownership of `/scratch/users/userA/': No space left on device
            

            The related MDS messages are:

            LustreError: 8130:0:(llog_cat.c:82:llog_cat_new_log()) no free catalog slots for log...
            LustreError: 8130:0:(mdd_dir.c:783:mdd_changelog_ns_store()) changelog failed: rc=-28, op1 test c[0x20000b197:0x108d0:0x0] p[0x200002efb:0x155d5:0x0]
            

            Any tip on how to solve this? Would consuming (or clearing) the changelogs be sufficient?

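            As a rough sketch of those two options, assuming the registered user is cl1 and the MDT is lustre-MDT0000 (both hypothetical here; check mdd.*.changelog_users for the real ID and index):

            # list registered changelog users and how far behind they are
            lctl get_param mdd.lustre-MDT0000.changelog_users

            # consume: acknowledge records for the user (an endrec of 0 means the current last record)
            lfs changelog_clear lustre-MDT0000 cl1 0

            # or, if the user is no longer needed, deregistering it releases all of its pending records
            lctl --device lustre-MDT0000 changelog_deregister cl1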

            nedbass Ned Bass (Inactive) added a comment -

            Aurelien, we're concerned about filling the changelog catalog, not the device. We actually had that happen on our test system when Robinhood was down and I was testing metadata performance (hence this Jira issue). It's far less likely on a production system with non-pathological workloads, but not outside the realm of possibility.

            adegremont Aurelien Degremont (Inactive) added a comment -

            FYI, we had Robinhood set up on a filesystem with 100 million inodes, and an MDS RPC rate between 1k/s and 30k/s at peak. Robinhood was stopped for days and there were millions of changelog records to be consumed. It also took days to close the gap, but the MDS was very, very far from being filled (the MDS size was 2 TB). I think we did not consume even 1% of the device.
            Do not worry.

            nedbass Ned Bass (Inactive) added a comment -

            Sorry, I was filling the device, not the changelog catalog. I specified MDSDEV1=/dev/sda thinking it would use the whole device, but I also need to set MDSSIZE. So it will take days, not minutes, to hit this limit, making it less worrisome but still something that should be addressed.

            The reason I'm now picking this thread up again is that we have plans to enable changelogs on our production systems for use by Robinhood. We're concerned about being exposed to the problems under discussion here if Robinhood goes down for an extended period.

            nedbass Ned Bass (Inactive) added a comment -

            It only took about 1.3 million changelog entries to fill the catalog. My test case was something like

            MDSDEV1=/dev/sda llmount.sh
            lctl --device lustre-MDT0000 changelog_register
            while createmany -m /mnt/lustre/%d 1000 ; do
                unlinkmany /mnt/lustre/%d 1000
            done

            and it made it through about 670 iterations before failing.

            adilger Andreas Dilger added a comment -

            Ned, I agree this should be handled more gracefully. I think it is preferable to unregister the oldest consumer as the catalog approaches full, which should cause old records to be released (need to check this). That is IMHO better than setting the mask to zero and no longer recording new events.

            In both cases the consumer will have to do some scanning to find new changes. However, in the first case, it is more likely that the old consumer is no longer in use and no harm is done, while in the second case even a well-behaved consumer is punished.

            On a related note, do you know how many files were created before the catalog was full? In theory about 4B Changelog entries should be possible (approx 64000^2), but this might be reduced by some small factor if there are multiple records per file (e.g. create + setattr).
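
            Until something like that exists in the MDS, a rough manual equivalent might look like the sketch below; the fs name, the threshold, and the changelog_users output format are all assumptions to be verified against the running version:

            # deregister changelog users more than THRESHOLD records behind the current index
            THRESHOLD=1000000
            USERS=$(lctl get_param -n mdd.lustre-MDT0000.changelog_users)
            CURRENT=$(echo "$USERS" | awk '/current index/ {print $NF}')
            echo "$USERS" | awk '$1 ~ /^cl[0-9]+$/ {print $1, $2}' | while read ID INDEX; do
                if [ $((CURRENT - INDEX)) -gt "$THRESHOLD" ]; then
                    echo "deregistering $ID, $((CURRENT - INDEX)) records behind"
                    lctl --device lustre-MDT0000 changelog_deregister "$ID"
                fi
            done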

            nedbass Ned Bass (Inactive) added a comment -

            It seems like lots of bad things can happen if the changelog catalog is allowed to become full: LU-2843, LU-2844, LU-2845. Besides these crashes, the MDS service fails to start due to EINVAL errors from mdd_changelog_llog_init(), and the only way I've found to recover is manually deleting the changelog_catalog file.

            I'm interested in adding safety mechanisms to prevent this situation. Perhaps the MDS could automatically unregister changelog users or set the changelog mask to zero based on a tunable threshold of unprocessed records. Does anyone have other ideas for how to handle this more gracefully?
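
            For reference, that manual recovery looks roughly like the following on an ldiskfs MDT; the device and mount point are hypothetical, and this discards all unconsumed changelog records:

            umount /mnt/mdt                      # stop the MDT
            mount -t ldiskfs /dev/sda /mnt/mdt   # mount the backing filesystem directly
            rm /mnt/mdt/changelog_catalog        # remove the full catalog
            umount /mnt/mdt
            mount -t lustre /dev/sda /mnt/mdt    # restart the MDT
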
            pjones Peter Jones added a comment -

            Adding those involved with HSM for comment


            adilger Andreas Dilger added a comment -

            Yes, the changelogs could definitely be a factor. Once there is a registered changelog user, the changelogs are kept on disk until they are consumed. That ensures that if e.g. Robinhood crashes, or has some other problem for a day or four, it won't have to do a full scan just to recover the state again.

            However, if the ChangeLog user is not unregistered, the changelogs will be kept until they run out of space. I suspect that is the root cause here, and it should be investigated further. This bug should be CC'd to Jinshan and Aurelien Degremont, who are working on HSM these days.

            Cheers, Andreas
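
            As a concrete sketch of that consumer lifecycle (the fs name and the returned user ID, e.g. cl2, are hypothetical):

            lctl --device lustre-MDT0000 changelog_register     # start retaining records for a new user
            lfs changelog lustre-MDT0000                        # read the retained records
            lfs changelog_clear lustre-MDT0000 cl2 0            # consume them so the MDT can reclaim the space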

            People

              bogl Bob Glossman (Inactive)
              nedbass Ned Bass (Inactive)