[LU-11227] client process hangs when lod_sync accesses deactivated OSTs - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Labels:
None
Environment:
x86_64, ldiskfs, not DNE

Severity:
3

Description

Hi,

We have an ldiskfs Lustre install where one OST is permanently deactivated with:

lctl conf_param lustre-OST005a.osc.active=0

We were running IEEL and saw no issues there, but have now moved to 2.10.4 to try to get better compatibility for file transfers to the new cluster.

The problem is that in 2.10.4 a chgrp on the client hangs forever because it retries indefinitely. MDS load is also significant.

While the chgrp hangs, the MDS logs thousands of these messages per second:

Aug  7 19:45:41 metadata01 kernel: LustreError: 4502:0:(lod_dev.c:1400:lod_sync()) lustre-MDT0000-mdtlov: can't sync ost 90: -107

Looking at the lod_sync() code, it doesn't check whether an OST has been deactivated. The patch below seems to work as a quick fix so that we don't have to reboot back into IEEL.

diff --git a/lustre/lod/lod_dev.c b/lustre/lod/lod_dev.c
index d61ad2d..2194110 100644
--- a/lustre/lod/lod_dev.c
+++ b/lustre/lod/lod_dev.c
@@ -1409,6 +1409,8 @@ static int lod_sync(const struct lu_env *env, struct dt_device *dev)
        lod_foreach_ost(lod, i) {
                ost = OST_TGT(lod, i);
                LASSERT(ost && ost->ltd_ost);
+               if (!ost->ltd_active)
+                       continue;
                rc = dt_sync(env, ost->ltd_ost);
                if (rc) {
                        CERROR("%s: can't sync ost %u: %d\n",

The same fix would also seem appropriate for a deactivated MDT a few lines further down.
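To make the effect of the check in the diff easier to see outside the kernel, here is a minimal user-space sketch. Everything in it is hypothetical and is not Lustre code: `toy_target`, `toy_sync()` and `toy_sync_all()` are stand-ins for the LOD target descriptor, dt_sync() and lod_sync(), and returning on the first error is an assumption made for the sketch. The only parts taken from this report are the idea of skipping targets whose active flag is clear and the -107 error, which is -ENOTCONN on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a LOD target; "active" plays the role of
 * ltd_active in the diff, "connected" models whether a sync can succeed. */
struct toy_target {
	unsigned int index;
	bool active;     /* false once the target has been deactivated */
	bool connected;  /* false: a sync attempt fails with -ENOTCONN */
};

/* Stand-in for dt_sync(): fails with -ENOTCONN on a disconnected target. */
static int toy_sync(const struct toy_target *t)
{
	return t->connected ? 0 : -ENOTCONN;
}

/* Sync every target in the array. With skip_inactive set, deactivated
 * targets are skipped, as in the proposed patch. Returns 0 on success or
 * the first error seen (an assumption of this sketch). */
static int toy_sync_all(const struct toy_target *tgts, size_t n,
			bool skip_inactive)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_target *t = &tgts[i];
		int rc;

		assert(t != NULL);
		if (skip_inactive && !t->active)
			continue;

		rc = toy_sync(t);
		if (rc != 0) {
			fprintf(stderr, "can't sync ost %u: %d\n",
				t->index, rc);
			return rc;
		}
	}
	return 0;
}
```

With an array that contains one deactivated, disconnected target, `toy_sync_all(tgts, n, false)` returns -ENOTCONN every time it is called, which is the situation a caller that retries on error can never get out of. With `skip_inactive` set to true, the same call returns 0.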

Please let me know if this is totally the wrong thing to do or, alternatively, if it's useful and you'd like me to upload a patch to Gerrit.

Also, a dry run of lfsck in 2.10.4 wants to correct the layout for 15% of MDT files, and the namespace (linkea_inconsistent) for about 15k files out of 225M, presumably because this fs is old and/or a bit damaged from bugs and hardware failures and/or was created under IEEL. The OSTs all look OK in lfsck. We have not run lfsck for real because this all looks too intrusive.

cheers,
robin

Issue Links

is related to

LU-11236 client MDT OST ENOTCONN loops

Open

LU-11119 A 'mv' of a file from a local file system to a lustre file system hangs

Resolved

is related to

LU-5152 Can't enforce block quota when unprivileged user change group

Resolved

LU-11303 slow chgrp as user when quotas are enabled

Resolved

People

Assignee: Robin Humble (Inactive)

Reporter: Robin Humble (Inactive)

Votes: 0

Watchers: 7

Dates

Created: 08/Aug/18 8:01 AM

Updated: 11/Sep/18 8:44 PM

Resolved: 23/Aug/18 1:01 PM