[LU-5778] MDS not creating files on OSTs properly

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.7.0
    • Affects Version: Lustre 2.5.2
    • Environment: CentOS 6.5, kernel 2.6.32-431.17.1.el6_lustre.x86_64
    • 3
    • 16216

    Description

      One of our Stampede filesystems running Lustre 2.5.2 has an OST offline due to a different problem described in another ticket. While the OST has been offline, the MDS server crashed with an LBUG and was restarted last Friday. Since the restart, the MDS no longer automatically creates files on any OSTs beyond the offline one. In our case OST0010 is offline, so the MDS will only create files on the first 16 OSTs unless we manually specify the stripe offset in lfs setstripe. This is overloading the servers backing those OSTs while the others sit idle. If we deactivate the first 16 OSTs on the MDS, then all files are created with the first stripe on the lowest-numbered active OST.

      Can you suggest any way to force the MDS to use all the other OSTs through lctl set_param options? Getting the offline OST back online is not currently an option: it is corrupted, an e2fsck is still running, and it cannot be mounted. Manually setting the stripe is also not an option; we need this to work automatically as it should. Could we set some QOS options to have it balance file creation across the OSTs?
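(Editorial sketch, not from the ticket: the QOS allocator tunables that exist on 2.x MDSes are a natural thing to try for the balancing question above. The parameter names and values below are assumptions to verify against your build; qos_threshold_rr is the free-space imbalance, in percent, above which the allocator switches from round-robin to weighted allocation, and qos_prio_free controls how strongly free space drives OST choice.)

```shell
# Hedged sketch of the QOS tunables one could try on the MDS.
# Parameter names/paths are assumptions to verify on your build.
print_qos_tuning_cmds() {
  echo 'lctl set_param lov.*.qos_threshold_rr=0'  # always use weighted (QOS) allocation
  echo 'lctl set_param lov.*.qos_prio_free=100'   # weight OST choice entirely by free space
}
print_qos_tuning_cmds  # prints the commands to run on the MDS
```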

      Attachments

        1. lctl_state.out
          44 kB
        2. lctl_target_obd.out
          11 kB
        3. LU-5778_file_create_getstripe.out.gz
          12 kB
        4. LU-5778.debug_filtered.bz2
          30 kB
        5. mds5_prealloc.out
          128 kB

        Issue Links

          Activity

            [LU-5778] MDS not creating files on OSTs properly

            Attached is the debug log filtered for the lov and ost subsystems. There is no osp subsystem on our current /scratch MDS. I cleared the logs before running the test, set a mark, and then created 500 files on the filesystem; hopefully you can find them in the attached log.

            minyard Tommy Minyard (Inactive) added a comment
            green Oleg Drokin added a comment -

            Hm, thanks for this extra bit of info.
            lwp really should only be used for quota and some fld stuff that should not impact allocations, certainly not only on some OSTs. The lwp code does live in the OSP codebase, so it should be caught with the osp mask; except I just checked, and it somehow registered itself under the ost mask instead, which is weird.
            You might wish to revise that debug subsystem line to: echo "osp ost lod" > /..../debug_subsystem

            Old servers did not really have an LWP config record, but I think we tried to connect anyway (there was even a compatibility issue about that in the past, which we have since fixed, but we'll need to go back and check how this was implemented).

            I think catching that bit of debug should be useful just in case.


            Thanks Oleg, we're going to try to quiet down the system a bit (it is partially drained waiting to schedule a 32K-core job) and collect the MDS trace with the OSTs active again to see if that provides more debug information.

            We compared our test filesystem with the current /scratch filesystem and found one significant difference: the test filesystem has an lwp service running that /scratch does not (there are additional lwp entries in lctl dl on the MDS and OSSes). The test filesystem went through the same 2.1.5 -> 2.5.2 upgrade process; however, we also ran tunefs.lustre --writeconf on the test filesystem, and we did not run it on /scratch in case we encountered a major issue and needed to roll back to the previous version. That appears to be the only difference in the upgrade process between the two filesystems. I didn't find much information about what lwp does from a quick search and some googling, so I'm not sure whether having this lwp service running would affect file layout and creation on the OSTs.

            In answer to your question, we use a default stripe count of 2, a stripe size of 1MB, and offset -1. We have not enabled OST pools.
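(Editorial sketch: the lwp comparison above can be scripted so the two filesystems' device lists are easy to diff. The device-type column position in `lctl dl` output is an assumption about its line format.)

```shell
# Filter LWP (Light Weight Proxy) devices out of `lctl dl` output, so
# the test filesystem and /scratch can be compared quickly. Assumes the
# device type is the third whitespace-separated column of `lctl dl`.
list_lwp_devices() {
  awk '$3 == "lwp" { print }'
}
# On a live server: lctl dl | list_lwp_devices
```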

            minyard Tommy Minyard (Inactive) added a comment
            green Oleg Drokin added a comment -

            Just a quick note.
            Even if there's a lot of noise from other users, you can grab some traces, and since users are likely to allocate files during that time,
            we can still gather some useful info.
            Sadly, it looks like the QOS_DEBUG output is compiled out by default, but perhaps you can still limit the scope of the traces to just lod with
            echo "lod osp" >/proc/sys/lnet/debug_subsystem
            echo -1 >/proc/sys/lnet/debug
            lctl dk >/dev/null #to clear the log

            on your MDS (please cat those files first, note the values, and restore the contents after gathering the info).

            Let it run for a brief time in normal offset -1 mode, and definitely do a couple of creations manually that expose the problem, then gather the log with lctl dk >/tmp/lustre.log. (Just run it long enough for your creates to run their course.)

            Also, what is your default striping with respect to default stripe count? How about OST pools, do you use those?
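(Editorial sketch: the save/narrow/clear/dump/restore sequence described above could be wrapped in one helper. The default paths are the /proc locations named in the comment; the file-creation step in the middle is a placeholder, and the helper itself is not from the ticket.)

```shell
# Sketch of the trace-gathering sequence: save the current debug settings,
# narrow the subsystem mask to lod/osp, gather a trace, then restore.
capture_mds_debug() {
  subsys_file=${1:-/proc/sys/lnet/debug_subsystem}
  debug_file=${2:-/proc/sys/lnet/debug}
  out_file=${3:-/tmp/lustre.log}

  saved_subsys=$(cat "$subsys_file")   # remember current values so we can
  saved_debug=$(cat "$debug_file")     # restore them afterwards

  echo "lod osp" > "$subsys_file"
  echo "-1" > "$debug_file"
  lctl dk > /dev/null                  # clear the existing debug log

  # ...create files here in normal offset -1 mode to expose the problem...

  lctl dk > "$out_file"                # dump the gathered trace

  echo "$saved_subsys" > "$subsys_file"
  echo "$saved_debug" > "$debug_file"
}
```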


            Andreas,
            I also tried to reproduce the problem on some test hardware by creating a filesystem with the exact same 2.5.2 version of Lustre installed on our /scratch filesystem, and I was unable to reproduce it as well. There must be something else going on with our /scratch filesystem, either due to its large scale (348 OSTs) or the upgrade from the 2.1.5 version we were running, so I'm going to compare the setup of the two filesystems and see if I can find any differences.

            In regard to the debug output, we could not wait to put the system back into production, so we developed a manual process to distribute files by setting the stripe offset to a random OST for active user directories. We are cycling the first two active OSTs so that files created in directories where the stripe_offset is still -1 get distributed as well. It is not efficient and performance suffers, but at least it lets users run jobs and distributes files across all the OSTs. I can certainly generate the debug output, but I'm afraid it would be polluted with activity from all the users. In addition, we had to deactivate the first 16 OSTs since they exceeded 93% capacity. We have a maintenance window scheduled for next Tuesday and can collect data on a quiet system then. I've attached the prealloc output in case it is useful. I noticed two OSTs had -5 as the prealloc_status; those OSTs are in the list of inactive OSTs, which is in the attached file as well. Looking through the prealloc output, I found these three sets of messages corresponding to those OSTs:

            Oct 22 00:43:29 mds5 kernel: Lustre: setting import scratch-OST001d_UUID INACTIVE by administrator request
            Oct 22 00:43:29 mds5 kernel: Lustre: Skipped 8 previous similar messages
            Oct 22 00:43:29 mds5 kernel: LustreError: 22062:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST001d-osc-MDT0000: can't precreate: rc = -5
            Oct 22 00:43:29 mds5 kernel: LustreError: 22062:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST001d-osc-MDT0000: cannot precreate objects: rc = -5

            Oct 22 01:04:06 mds5 kernel: Lustre: setting import scratch-OST0021_UUID INACTIVE by administrator request
            Oct 22 01:04:06 mds5 kernel: LustreError: 22070:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST0021-osc-MDT0000: can't precreate: rc = -5
            Oct 22 01:04:06 mds5 kernel: LustreError: 22070:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST0021-osc-MDT0000: cannot precreate objects: rc = -5

            Oct 22 15:07:21 mds5 kernel: Lustre: setting import scratch-OST0024_UUID INACTIVE by administrator request
            Oct 22 15:07:21 mds5 kernel: Lustre: Skipped 5 previous similar messages
            Oct 22 15:07:21 mds5 kernel: LustreError: 22084:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST0026-osc-MDT0000: can't precreate: rc = -5
            Oct 22 15:07:21 mds5 kernel: LustreError: 22084:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST0026-osc-MDT0000: cannot precreate objects: rc = -5

            minyard Tommy Minyard (Inactive) added a comment

            I'm unable to reproduce the problem with 2.5.2 using "-c 2 -i -1". It does imbalance the object allocations somewhat - with OST0002 disabled on the MDS, OST0000 and OST0003 seem to get chosen as the starting OST index much less often than others (of 1000 files, 2000 objects), but it still chooses OSTs beyond the deactivated OST, and the total number of objects allocated on each OST isn't as imbalanced:

            OST_idx     #start   #total
               0           36       251
               1          283       319
               3           39       322
               4          175       214
               5          147       322
               6          105       252
               7          215       320
            

            Tommy, can you please enable full debugging via lctl set_param debug=-1 debug_mb=512 on the MDS, then create maybe 50 files (enough that some of them should land beyond OST0010), then dump the debug log with lctl dk /tmp/LU-5779.debug; bzip2 -9 /tmp/LU-5779.debug and attach that log to the ticket here. Getting the lctl get_param osp.*.prealloc* info would also be useful (sorry, the Jira markup turned my * into bold in my first comment).

            adilger Andreas Dilger added a comment
            green Oleg Drokin added a comment -

            It should be noted that the deactivated state, as indicated by Andreas, is different from disconnected - the state the system enters when it cannot connect to an OST on its own (and in which it keeps retrying).

            It will be interesting to see what Andreas's testing of 2.5.2 shows, I guess.


            Andreas,
            The device is definitely deactivated on the MDS; since the OST is offline and the MDS has been restarted, it could never activate anyway, but I have deactivated it again for good measure. The MDS is still choosing to create files only on the first 16 OSTs.

            Also, it is not the stripe_count that is the problem; as I have been saying, it is the stripe_offset set to -1, where the MDS should choose OSTs in a semi-random fashion. Can you try your test again with -c set to 2 (our default stripe count), -s set to 1MB, and -i set to -1?
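(Editorial sketch of the retest requested above, not from the ticket: stripe count 2, 1 MiB stripe size, offset -1. The directory path and file count are placeholders, and -S is used for stripe size, assuming a recent lfs; older versions spelled it -s.)

```shell
# Sketch: set a default layout of -c 2 -S 1M -i -1 on a test directory,
# create files, and inspect where the stripes landed.
run_stripe_test() {
  dir=$1
  nfiles=${2:-500}
  lfs setstripe -c 2 -S 1M -i -1 "$dir"   # directory default layout
  i=1
  while [ "$i" -le "$nfiles" ]; do
    touch "$dir/file.$i"                  # each create picks a starting OST
    i=$((i + 1))
  done
  lfs getstripe "$dir"                    # show the resulting layouts
}
# e.g. run_stripe_test /mnt/testfs/stripetest 500
```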

            minyard Tommy Minyard (Inactive) added a comment
            adilger Andreas Dilger added a comment - edited

            I don't want to restate the obvious, but just to be sure that we don't have a simple workaround here, have you actually deactivated the failed OST on the MDS? Something like the following:

            mds# lctl --device  %scratch-OST0010-osc-MDT0000 deactivate
            

            This should produce a message in the MDS console logs like:

            Lustre: setting import scratch-OST0010_UUID INACTIVE by administrator request
            

            I've done local testing with master and am unable to reproduce this (lfs setstripe -c -1 creates objects on all available stripes when one in the middle is deactivated). I'm going to build and run with 2.5.2 + patch to see if that shows a similar problem (maybe it has already been fixed in later releases).
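(Editorial sketch: a tally like the #start column in the table above can be produced by post-processing lfs getstripe output. This assumes the classic getstripe layout where object rows follow an obdidx header line and the first column of each object row is the OST index.)

```shell
# Count how often each OST index appears as a file's *first* stripe,
# reading `lfs getstripe <dir>/*`-style output on stdin.
count_start_ost() {
  awk '/^[ \t]*[0-9]+[ \t]/ { if (first) cnt[$1]++; first = 0 }
       /obdidx/ { first = 1 }
       END { for (i in cnt) printf "%s %d\n", i, cnt[i] }' |
  sort -n
}
# Usage on a live client: lfs getstripe /scratch/testdir/* | count_start_ost
```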


            Oleg,
            We have one cherry-picked patch applied to resolve crashes we were experiencing shortly after our upgrade to 2.5.2. We used patch 0020cc44.diff from LU-5040 to resolve those crashes.

            In the current situation, the MDS will create files on any of the first 16 OSTs as expected with a stripe_offset of -1 while those OSTs are active on the MDS. If we deactivate those OSTs on the MDS, then the first active OST index is used for all file creates; if we then deactivate that one, it moves to the next one in the index.

            minyard Tommy Minyard (Inactive) added a comment

            Niu,
            The sequence of events to reproduce the problem is as follows: deactivate the OST on the MDS, then unmount the deactivated OST on the OSS (ours can't be mounted due to corruption), then restart the MDS. We are currently in that state; the OST is unmounted and offline, so it can't check in with the MDS when the MDS restarts. It was only after the MDS restart that we started to see file creates only on the first 16 OSTs.

            minyard Tommy Minyard (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              minyard Tommy Minyard (Inactive)
              Votes: 0
              Watchers: 15
