LU-5778

MDS not creating files on OSTs properly

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.5.2
    • Environment: CentOS 6.5, kernel 2.6.32-431.17.1.el6_lustre.x86_64
    • Severity: 3
    • 16216

    Description

      One of our Stampede filesystems running Lustre 2.5.2 has an OST offline due to a different problem described in another ticket. Since the OST went offline, the MDS server crashed with an LBUG and was restarted last Friday. After the restart, the MDS no longer automatically creates files on any OST after the offline one. In our case, OST0010 is offline, so the MDS will only create files on the first 16 OSTs unless we manually specify the stripe offset with lfs setstripe. This is overloading the servers backing those OSTs while the others do nothing. If we deactivate the first 16 OSTs on the MDS, then all files are created with the first stripe on the lowest-numbered active OST.

      Can you suggest any way to force the MDS to use all the other OSTs through any lctl set_param options? Getting the offline OST back online is not currently an option: due to corruption and an ongoing e2fsck, it cannot be mounted. Manually setting the stripe is also not an option; we need this to work automatically as it should. Could we set some QOS options to have it balance file creation across the OSTs?
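
      For reference, the manual workaround we are currently using looks roughly like this (the directory path and OST index are examples only, not our real layout):

        # force the first stripe of new files in this directory onto a
        # specific OST index beyond the first 16 (index 20 is just an example)
        lfs setstripe -i 20 /scratch/example_dir

        # the behaviour we want back: -1 lets the MDS pick the starting OST
        lfs setstripe -i -1 /scratch/example_dir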

      Attachments

        1. lctl_state.out
          44 kB
        2. lctl_target_obd.out
          11 kB
        3. LU-5778_file_create_getstripe.out.gz
          12 kB
        4. LU-5778.debug_filtered.bz2
          30 kB
        5. mds5_prealloc.out
          128 kB

        Issue Links

          Activity

            [LU-5778] MDS not creating files on OSTs properly

            niu Niu Yawei (Inactive) added a comment -

            Niu, can you take a look and confirm?

            I agree that the 'break' here is inappropriate, but I don't see why this would lead to objects always being allocated on the OSTs before the deactivated one.

            In lod_alloc_qos(), statfs will be performed on all the OSPs again:

                    /* Find all the OSTs that are valid stripe candidates */
                    for (i = 0; i < osts->op_count; i++) {
                            if (!cfs_bitmap_check(m->lod_ost_bitmap, osts->op_array[i]))
                                    continue;
            
                            rc = lod_statfs_and_check(env, m, osts->op_array[i], sfs);
                            if (rc) {
                                    /* this OSP doesn't feel well */
                                    continue;
                            }
            

            adegremont Aurelien Degremont (Inactive) added a comment -

            As far as I can see in the current master branch, the problem is still there.


            bzzz Alex Zhuravlev added a comment -

            IIRC, originally osp_statfs() wasn't returning an error; instead it claimed an "empty" OST if the connection was down.


            minyard Tommy Minyard added a comment -

            Thanks for the workaround, Aurelien, we will keep it in mind if we encounter this situation again. Do you know if this problem is still in the main branch of the source tree, or is it just in the 2.5.x versions like we are running? If so, it should be a simple fix, as you noted.

            Niu, can you take a look and confirm?


            adegremont Aurelien Degremont (Inactive) added a comment -

            We faced the same issue at CEA. We analyzed the problem and tracked it down to lod_qos_statfs_update(). Indeed, in this function each OST is checked in turn. If an OST has active=0, lod_statfs_and_check() will return -ENOTCONN and the for-loop will break. The following OSTs won't be checked: their statfs data will not be updated (it stays at zero) and they will not be used when allocating file objects.

                    for (i = 0; i < osts->op_count; i++) {
                            idx = osts->op_array[i];
                            avail = OST_TGT(lod,idx)->ltd_statfs.os_bavail;
                            rc = lod_statfs_and_check(env, lod, idx,
                                                      &OST_TGT(lod,idx)->ltd_statfs);
                            if (rc)
                                    break;
                            if (OST_TGT(lod,idx)->ltd_statfs.os_bavail != avail)
                                    /* recalculate weigths */
                                    lod->lod_qos.lq_dirty = 1;
                    }
            

            A simple workaround for this bug is to disable QOS allocation by setting qos_threshold_rr to 100, which falls back to simple round-robin allocation.
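
            A minimal sketch of applying that workaround on the MDS, assuming the 2.5.x parameter path (the tunable sits under the MDT's lov device there; on newer releases it lives under lod, so the exact name may differ):

                # raise the round-robin threshold to 100% so QOS weighting is bypassed
                lctl set_param lov.*.qos_threshold_rr=100

                # confirm the new value
                lctl get_param lov.*.qos_threshold_rr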

            A fix seems to be as simple as replacing the break with continue.
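
            A minimal sketch of what that change might look like in lod_qos_statfs_update(), based on the loop quoted above (not the actual landed patch):

                    for (i = 0; i < osts->op_count; i++) {
                            idx = osts->op_array[i];
                            avail = OST_TGT(lod, idx)->ltd_statfs.os_bavail;
                            rc = lod_statfs_and_check(env, lod, idx,
                                                      &OST_TGT(lod, idx)->ltd_statfs);
                            if (rc) {
                                    /* skip an inactive/unreachable OST instead
                                     * of aborting the scan, so the remaining
                                     * OSTs still get fresh statfs data */
                                    continue;
                            }
                            if (OST_TGT(lod, idx)->ltd_statfs.os_bavail != avail)
                                    /* recalculate weights */
                                    lod->lod_qos.lq_dirty = 1;
                    }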


            minyard Tommy Minyard added a comment -

            Oleg,
            In regard to the file creation, we did have some directories set with a manual stripe offset rather than -1, so stripe creation was being forced onto specific OSTs. I ran lfs getstripe on the 500 files created and they all landed on obdidx values less than 16. I've attached the output from that file-create run. As you noted, a large number of OSTs were inactive due to us trying to distribute files across the other OSTs, but in this case all files still had their stripes on the first 16 OSTs.

            This ticket is no longer critical for us since we were able to recreate OST0010 over the weekend; once it was mounted and had checked in with the MDS, file creation started across all the active OSTs again, like it should. This specific issue must have been due to the OST being offline and unable to check in with the MDS when the MDS had to be restarted after the LBUG crash. We certainly do not get into this state often, as usually all OSTs are available and not in a state where they cannot be mounted.

            green Oleg Drokin added a comment -

            QOS_DEBUG is compiled out by default, which is really unfortunate:

            #if 0
            #define QOS_DEBUG(fmt, ...)     CDEBUG(D_OTHER, fmt, ## __VA_ARGS__)
            #define QOS_CONSOLE(fmt, ...)   LCONSOLE(D_OTHER, fmt, ## __VA_ARGS__)
            #else
            #define QOS_DEBUG(fmt, ...)
            #define QOS_CONSOLE(fmt, ...)
            #endif
            

            Whoever thought that was a great idea was not right.

            Niu, can you please open a separate ticket for this and perhaps add a patch?
            Also, D_OTHER might not be a great mask for it either, so please see if something else would be better here.
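
            For illustration only, a sketch of the sort of patch that could keep these macros compiled in (the mask choice is the open question noted above, so D_OTHER is kept here merely as a placeholder):

            /* always compiled in; the #if 0 guard is dropped */
            #define QOS_DEBUG(fmt, ...)     CDEBUG(D_OTHER, fmt, ## __VA_ARGS__)
            #define QOS_CONSOLE(fmt, ...)   LCONSOLE(D_OTHER, fmt, ## __VA_ARGS__)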


            niu Niu Yawei (Inactive) added a comment -

            I'm wondering why the QOS_DEBUG() messages in lod_alloc_qos() weren't printed?

            green Oleg Drokin added a comment - edited

            Looking at the log: it contains 157 create requests (I would have thought it had reached the internal log sizing limit, but with the default at 1M per CPU and your total log size at 1.1M, that's certainly not the case).

            Other observations: the log says there are 348 OSTs, of which 248 are not really connected (the check status returns -107).
            The first 47 OSTs in the list are connected (with only one in the middle of these 47 being down, namely OST 16).
            OSTs 47 to 292 are down, then OSTs 293, 341 and 342 are also down.

            Of the 157 creates, only 16 went into the alloc_qos path and had any chance of random allocation. The other 141 took the alloc_specific path, which is chosen when the starting offset is below the number of OSTs in the system (in other words, it was set to something other than -1).
            Did you have other processes in the system creating files with forced striping while this was run?
            Sadly our level of debug does not allow us to see which OSTs were chosen in the end in the alloc_qos case, but if you have retained those 500 files, it might be interesting to see lfs getstripe output on them.
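
            A minimal sketch of one way to summarize the lfs getstripe output over those files (the path and file names are hypothetical):

                # count how often each OST index (obdidx) appears among the
                # objects of the test files
                for f in /scratch/test/file.*; do
                        lfs getstripe "$f"
                done | awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -n | uniq -c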


            minyard Tommy Minyard added a comment -

            Attached is the debug log with the data filtered for the lov and ost subsystems. There is no osp subsystem on our current /scratch MDS. I cleared the logs prior to running the test, set a mark, and then 500 files were created on the filesystem; hopefully you can find them in the attached log.

            green Oleg Drokin added a comment -

            Hm, thanks for this extra bit of info.
            lwp really should only be used for quota and some fld stuff that should not impact the allocations, certainly not on only some OSTs. lwp does live in the OSP codebase, so it should be caught with the osp mask, except I just checked and it has somehow registered itself under the ost mask instead, which is weird.
            You might wish to revise that echo debug-subsystem line to: echo "osp ost lod" > /..../debug_subsystem

            Old servers did not really have an LWP config record, but I think we tried to connect anyway (there was even a compat issue about that in the past that we have since fixed, but we'll need to go back and check how this was implemented).

            I think catching that bit of debug should be useful just in case.


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: minyard Tommy Minyard
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: