[LU-14277] any create blocked due any OST fail Created: 23/Dec/20  Updated: 12/Jun/23  Resolved: 26/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Blocker
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-13073 Multiple MDS deadlocks (in lod_qos_pr... Resolved
Related
is related to LU-7853 Fixes bitfield in lod qos code Resolved
is related to LU-15393 object allocation when OST is lost Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

multiple MDT threads blocked by one declare_create call due OST fail.
rwsem_down_write_failed
call_rwsem_down_write_failed
lod_qos_prep_create
lod_declare_striped_object
lod_declare_object_create
mdd_declare_object_create_internal
mdd_declare_create
mdd_create
mdo_create
mdt_reint_open
mdt_reint_rec
mdt_reint_internal
mdt_intent_reint
mdt_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
tgt_enqueue
tgt_request_handle
ptlrpc_main
kthread
vs
osp_precreate_reserve
osp_declare_object_create
lod_qos_declare_object_on
lod_alloc_rr.clone.2
lod_qos_prep_create
lod_declare_striped_object
lod_declare_object_create
mdd_declare_object_create_internal
mdd_declare_create
mdd_create
mdo_create
mdt_reint_open
mdt_reint_rec
mdt_reint_internal
mdt_intent_reint
mdt_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
tgt_enqueue
tgt_request_handle
ptlrpc_main
kthread
kernel_thread



 Comments   
Comment by Alex Zhuravlev [ 23/Dec/20 ]

I think this is a dup of LU-13073

Comment by Alexey Lyashkov [ 23/Dec/20 ]

It probably. But my fix much much simple - just revert some code landed with LOV -> LOD transmission.
Let's looks to it.

Comment by Alexey Lyashkov [ 23/Dec/20 ]

one more bug introduced by last changes in replace LQ_DIRTY with bitfield.

c1d0a355a6a (Lai Siyao          2019-08-05 02:08:02 +0800  327)         up_write(&ltd->ltd_qos.lq_rw_sem);
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  328)
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  329)         if (placed != real_count) {
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  330)                 /* This should never happen */
c1d0a355a6a (Lai Siyao          2019-08-05 02:08:02 +0800  331)                 LCONSOLE_ERROR_MSG(0x14e, "Failed to place all tgts in the "
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  332)                                    "round-robin list (%d of %d).\n",
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  333)                                    placed, real_count);
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  334)                 for (i = 0; i < lqr->lqr_pool.op_count; i++) {
c1d0a355a6a (Lai Siyao          2019-08-05 02:08:02 +0800  335)                         LCONSOLE(D_WARNING, "rr #%d tgt idx=%d\n", i,
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  336)                                  lqr->lqr_pool.op_array[i]);
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  337)                 }
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  338)                 lqr->lqr_dirty = 1; 
c7f2e70a27e (Alex Zhuravlev     2012-09-20 15:56:31 +0400  339)                 RETURN(-EAGAIN);

lqr_dirty was without lock held.

Comment by Andreas Dilger [ 23/Dec/20 ]

But my fix much much simple - just revert some code landed with LOV -> LOD transmission.

Why not push that patch under LU-13073?

lqr_dirty was without lock held.

That lqr_dirty = 1 is in a "should never happen" code path with a very clear error message that is not being seen here, so it doesn't seem to be related to this problem.

Comment by Andreas Dilger [ 23/Dec/20 ]

one more bug introduced by last changes in replace LQ_DIRTY with bitfield.

I don't see any patch like that... There no reference to "LQ_DIRTY" in any of the patches in Git. The very first patch introducing lqr_dirty (replacing lq_dirty_rr) is using a bitfield, and it has been a bitfield since that time, even though the code was moved around to different files a few times:

commit 665e36b780faa2144cecccd29a0d8a8196a76903
Author:     Nathan Rutman <nathan@clusterfs.com>
AuthorDate: Mon Sep 15 23:25:34 2008 +0000

    b=14836
    i=nathan
    i=adilger
    OST pools on HEAD, comprehensive patch including 17054:19007;
    16935:18918,19012,19089,19128; 16978:18872

:
:
+/* Round-robin allocator data */
+struct lov_qos_rr {
+        __u32               lqr_start_idx;   /* start index of new inode */
+        __u32               lqr_offset_idx;  /* aliasing for start_idx  */
+        int                 lqr_start_count; /* reseed counter */
+        struct ost_pool     lqr_pool;        /* round-robin optimized list */
+        unsigned long       lqr_dirty:1;     /* recalc round-robin list */
+};
+
+/* Stripe placement optimization */
 struct lov_qos {
         struct list_head    lq_oss_list;    /* list of OSSs that targets use */
         struct rw_semaphore lq_rw_sem;
         __u32               lq_active_oss_count;
-        __u32              *lq_rr_array;    /* round-robin optimized list */
-        unsigned int        lq_rr_size;     /* rr array size */
         unsigned int        lq_prio_free;   /* priority for free space */
+        struct lov_qos_rr   lq_rr;          /* round robin qos data */
         unsigned long       lq_dirty:1,     /* recalc qos data */
-                            lq_dirty_rr:1,  /* recalc round-robin list */
                             lq_same_space:1,/* the ost's all have approx.
                                                the same space avail */
                             lq_reset:1;     /* zero current penalties */

The patch that introduced lq_dirty_rr was also using a bitfield:

commit 113303973ec9f8484eb2355a1a6ef3c4c7fd6a56
Author:     Nathan Rutman <nathan@clusterfs.com>
AuthorDate: Sat Feb 10 06:33:41 2007 +0000

    land b1_5 onto HEAD

:
:
+struct lov_qos {
+        struct list_head    lq_oss_list;    /* list of OSSs that targets use */
+        struct rw_semaphore lq_rw_sem;
+        __u32               lq_active_oss_count;
+        __u32              *lq_rr_array;    /* round-robin optimized list */
+        unsigned int        lq_rr_size;     /* rr array size */
+        unsigned int        lq_prio_free;   /* priority for free space */
+        unsigned int        lq_dirty:1,     /* recalc qos data */
+                            lq_dirty_rr:1,  /* recalc round-robin list */
+                            lq_same_space:1,/* the ost's all have approx.
+                                               the same space avail */
+                            lq_reset:1;     /* zero current penalties */
+};

I'm happy to see a simpler fix and look forward to seeing your patch.

Comment by Alexey Lyashkov [ 24/Dec/20 ]

Andreas, lets open a lustre 2.0..2.4 code. Before an LOD was introduced.

[root@lustre lov]# grep _DIRTY *
lov_obd.c:	cfs_set_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);
lov_qos.c:	cfs_set_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);
lov_qos.c:	cfs_set_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);
lov_qos.c:        if (!cfs_test_bit(LQ_DIRTY, &lov->lov_qos.lq_flags))
lov_qos.c:        cfs_clear_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);
lov_qos.c:        if (!cfs_test_bit(LQ_DIRTY, &lov->lov_qos.lq_flags) &&
lov_qos.c:        if (cfs_test_bit(LQ_DIRTY, &lov->lov_qos.lq_flags))
lov_qos.c:        cfs_set_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);
lproc_lov.c:	cfs_set_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);
lproc_lov.c:	cfs_set_bit(LQ_DIRTY, &lov->lov_qos.lq_flags);

But i talk about LQ_SF_PROGRESS bit.

void qos_statfs_update(struct obd_device *obd, __u64 max_age, int wait)
{
        struct lov_obd         *lov = &obd->u.lov;
        struct obd_info        *oinfo;
        int                     rc = 0;
        struct ptlrpc_request_set *set = NULL;
        ENTRY;

        if (cfs_time_beforeq_64(max_age, obd->obd_osfs_age))
                /* statfs data are quite recent, don't need to refresh it */
                RETURN_EXIT;

        /* Check if statfs already in progress */
        if (cfs_test_and_set_bit(LQ_SF_PROGRESS, &lov->lov_qos.lq_flags))
                GOTO(out, rc = 0);

LQ_SF_PROGRESS was gone in LOV -> LOD transmission, but LQ_DIRTY still exist in 2.7..2.12.
And gone with QOS move from lod to the lu_qos.

for 2.12 code this code is safe.

--- a/lustre/lod/lod_internal.h
+++ b/lustre/lod/lod_internal.h
@@ -83,7 +83,7 @@ enum lq_flag {
        LQ_SAME_SPACE   = 1, /* the ost's all have approx.
                                 the same space avail */
        LQ_RESET        = 2, /* zero current penalties */
-
+       LQ_SF_PROGRESS  = 3  /* statfs op in progress */
 };

 struct lod_qos {
diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
index 1c377422b1..be0c10f0e8 100644
--- a/lustre/lod/lod_qos.c
+++ b/lustre/lod/lod_qos.c
@@ -291,11 +291,11 @@ void lod_qos_statfs_update(const struct lu_env *env, struct lod_device *lod)
                /* statfs data are quite recent, don't need to refresh it */
                RETURN_EXIT;

-       down_write(&lod->lod_qos.lq_rw_sem);
-
-       if (obd->obd_osfs_age > max_age)
-               goto out;
+       /* Check if statfs already in progress */
+       if (test_and_set_bit(LQ_SF_PROGRESS, &lod->lod_qos.lq_flags))
+               RETURN_EXIT;

+       down_read(&osts->op_rw_sem);
        for (i = 0; i < osts->op_count; i++) {
                idx = osts->op_array[i];
                avail = OST_TGT(lod,idx)->ltd_statfs.os_bavail;
@@ -306,10 +306,12 @@ void lod_qos_statfs_update(const struct lu_env *env, struct lod_device *lod)
                        /* recalculate weigths */
                        set_bit(LQ_DIRTY, &lod->lod_qos.lq_flags);
        }
+       up_read(&osts->op_rw_sem);
+
        obd->obd_osfs_age = ktime_get_seconds();

-out:
-       up_write(&lod->lod_qos.lq_rw_sem);
+       clear_bit(LQ_SF_PROGRESS, &lod->lod_qos.lq_flags);
+
        EXIT;
 }

read semaphore is good protection from the global pool modification and set dirty bit is good atomic operation.
But for current master i stick with porting. Because i want a revert so much code to have working it fine.

Comment by Alexey Lyashkov [ 24/Dec/20 ]

Hm.. it looks this patch just don't accepted by WC in past :-/
https://jira.whamcloud.com/browse/LU-7853

Comment by Alexander Zarochentsev [ 24/Dec/20 ]

there was an attempt to convert those bitfields to atomic bits : LU-7853 , needs to be updated after lod_qos code reorganization.

Comment by Andreas Dilger [ 24/Dec/20 ]

Hm.. it looks this patch just don't accepted by WC in past :-/

The original version of LU-7853 didn't give any reason why the patch was needed, so I asked for an explanation and eventually got one. Then the patch was failing testing repeatedly until it was fixed. I've reviewed that patch many times, and approved it. The last version finished testing the day before another batch of patches were landed by Oleg, but it didn't get refreshed since that time.

I don't think that equals "don't accepted by WC". Everyone has to follow the same process for landing patches. I have lots of patches that I would like landed, but I know the work is mine to update them and to get them past testing before they can land.

Comment by Alex Zhuravlev [ 24/Dec/20 ]

I think the rule is very simple: patch accepted - you're lucky, grab a cake; not yet accepted - press rebase.

Comment by Alexey Lyashkov [ 24/Dec/20 ]

Andreas, this patch is needs because of large number points where unprotected set used.
If any parallel set to same 'long' will used this will be lost one of modifications due read-modify-set operation.
set_bit is prevent such bugs.
Anyway, Thanks Zam to the explanation why i seen this patch on any own tree for any 2.x based and it looks time to rebase and fix last issues with him. Like lqr_dirty and lq_dirty is same semantic but it one of them don't set sometimes (like with add target to pool).

Alex, sorry about reverting notice.

Comment by Alex Zhuravlev [ 24/Dec/20 ]

shadowI'm fine, I'm just trying to say that the list of the patches (ready to land!) can be very long and sometimes it takes months to get through - because of priorities, luck, etc. I do have patches hanging around with +2 for months, but basically master can't accept too many in short time, it's just a way to keep it more-less stable. the good thing is that rebase button simplifies efforts to maintain patch uptodate (not guaranteed, of course). and a good test demonstrating the issue helps, obviously.

Comment by Alexey Lyashkov [ 24/Dec/20 ]

Ok. i will pickup an LU-7853 rebase it, and fix an problem with qos_rr lqr dirty. It looks it need to be covered to bit flag also.

Comment by Cory Spitz [ 21/Dec/21 ]

https://review.whamcloud.com/#/c/41497/

Comment by Gerrit Updater [ 26/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41497/
Subject: LU-14277 lod: statfs should not block a create
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6cdb08a0c598b20a4b0ce27f7fe4763b0ada3118

Comment by Peter Jones [ 26/Jan/22 ]

Landed for 2.15

Generated at Sat Feb 10 03:08:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.