[LU-1013] recovery-mds lu_object.c:116:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed Created: 19/Jan/12  Updated: 27/Mar/12  Resolved: 13/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Hyperion/LLNL chaos5


Attachments: Text File mds.syslog.txt    
Issue Links:
Duplicate
is duplicated by LU-1086 several crash triggered in key_fini r... Resolved
Severity: 3
Rank (Obsolete): 4743

 Description   

Running recovery-mds-scale fails after 10-15 failovers. Log attached, also uploaded to Maloo. There are several failures in Maloo to choose from.



 Comments   
Comment by Peter Jones [ 20/Jan/12 ]

Oleg

Can you look into this one please?

Thanks

Peter

Comment by Oleg Drokin [ 27/Jan/12 ]

So the attached log is not very useful, and the many Maloo reports with this failure have no logs at all?

Comment by Cliff White (Inactive) [ 30/Jan/12 ]

Yes, panic_on_lbug was set. I will attempt to replicate without this on 2.1.55.

Comment by Leon Kos [ 30/Jan/12 ]

I am getting these crashes on 2.1.55 when I try to remove some user directories with a glob:

[root@mds home]# rm -rf mdular/ #works
[root@mds home]# rm -rf lsfadmin/ #works
[root@mds home]# rm -rf * # LBUG immediately

Message from syslogd@ at Mon Jan 30 20:16:21 2012 ...
mds kernel: LustreError: 28264:0:(lu_object.c:116:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed

Message from syslogd@ at Mon Jan 30 20:16:21 2012 ...
mds kernel: LustreError: 28264:0:(lu_object.c:116:lu_object_put()) LBUG

Comment by nasf (Inactive) [ 07/Feb/12 ]

I have hit similar issues in my OI Scrub branch, and they blocked my OI Scrub tests. I found that there is a race condition between lu_object_find_try() and lu_object_put(). For example:

static struct lu_object *lu_object_find_try(const struct lu_env *env,
                                            struct lu_device *dev,
                                            const struct lu_fid *f,
                                            const struct lu_object_conf *conf,
                                            cfs_waitlink_t *waiter)
{
...
        o = lu_object_alloc(env, dev, f, conf);                 /* step1 */
        if (unlikely(IS_ERR(o)))
                return o;

        LASSERT(lu_fid_eq(lu_object_fid(o), f));

        cfs_hash_bd_lock(hs, &bd, 1);

        shadow = htable_lookup(s, &bd, f, waiter, &version);
        if (likely(shadow == NULL)) {
                struct lu_site_bkt_data *bkt;

                bkt = cfs_hash_bd_extra_get(hs, &bd);
                cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
                bkt->lsb_busy++;
                cfs_hash_bd_unlock(hs, &bd, 1);
                return o;                                       /* step2 */
        }

        lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
        cfs_hash_bd_unlock(hs, &bd, 1);
        lu_object_free(env, o);
        return shadow;                                          /* step3 */
}

void lu_object_put(const struct lu_env *env, struct lu_object *o)
{
...
        if (!lu_object_is_dying(top)) {
                LASSERT(cfs_list_empty(&top->loh_lru));         /* step4 */
                cfs_list_add_tail(&top->loh_lru, &bkt->lsb_lru); /* step5 */
                cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
                return;
        }
...
}

Thread1 and Thread2 try to find some object with the same FID concurrently, and the object is not allocated in memory yet. Consider the following sequence:

1) Thread1 step1
2) Thread2 step1
3) Thread1 step2
4) Thread1 step4
5) Thread1 step5
6) Thread2 step3
7) Thread2 step4

So Thread2 will fail at step4:

mds kernel: LustreError: 28264:0:(lu_object.c:116:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed
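
The seven-step sequence can be replayed deterministically in user space, without real threads. What follows is only a minimal sketch under stated assumptions, not the Lustre code: `toy_header`, `toy_site`, `toy_find()`, `toy_put()` and `replay_race_trips_assertion()` are hypothetical names, the hash table is reduced to a single slot, locking is omitted, and a lookup hit is assumed to take a reference while leaving the LRU linkage untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of cfs_list_t. */
struct list_node {
        struct list_node *prev, *next;
};

static void list_init(struct list_node *n)
{
        n->prev = n->next = n;
}

static bool list_empty(const struct list_node *n)
{
        return n->next == n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
}

/* Hypothetical stand-in for lu_object_header. */
struct toy_header {
        int refs;
        struct list_node lru;           /* plays the role of loh_lru */
};

/* Hypothetical stand-in for one hash bucket of a site. */
struct toy_site {
        struct toy_header *cached;      /* one-slot "hash table" */
        struct list_node lru;           /* plays the role of lsb_lru */
};

/* Unpatched find path: a hit returns the shadow and ignores its LRU link. */
static struct toy_header *toy_find(struct toy_site *s, struct toy_header *fresh)
{
        if (s->cached == NULL) {
                s->cached = fresh;
                fresh->refs = 1;
                return fresh;           /* step2 */
        }
        s->cached->refs++;              /* assumed: a hit takes a reference */
        return s->cached;               /* step3; "fresh" would be freed */
}

/* Put path; returns false where LASSERT(cfs_list_empty(...)) would fire. */
static bool toy_put(struct toy_site *s, struct toy_header *h)
{
        if (--h->refs > 0)
                return true;
        if (!list_empty(&h->lru))       /* step4 */
                return false;
        list_add_tail(&h->lru, &s->lru);        /* step5 */
        return true;
}

/* Replays the seven-step sequence; true means Thread2 trips the assertion. */
bool replay_race_trips_assertion(void)
{
        struct toy_site site = { .cached = NULL };
        struct toy_header a = { 0 }, b = { 0 };  /* 1), 2): both threads, step1 */
        struct toy_header *t1, *t2;
        bool t1_ok, t2_ok;

        list_init(&site.lru);
        list_init(&a.lru);
        list_init(&b.lru);

        t1 = toy_find(&site, &a);       /* 3) Thread1 step2 */
        t1_ok = toy_put(&site, t1);     /* 4), 5) Thread1 step4, step5 */
        t2 = toy_find(&site, &b);       /* 6) Thread2 step3 */
        t2_ok = toy_put(&site, t2);     /* 7) Thread2 step4 */

        return t1_ok && t2 == &a && !t2_ok;
}
```

In this model Thread1's put succeeds and leaves the object on the LRU, so Thread2's put of the same object finds the LRU link already in use, which is the condition the LASSERT rejects.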

I have made the following patch to fix the race. If you do not mind, I will push it to Gerrit for review.

===========================
diff --git a/lustre/obdclass/lu_object.c b/lustre/obdclass/lu_object.c
index 2ad22f0..f26c534 100644
--- a/lustre/obdclass/lu_object.c
+++ b/lustre/obdclass/lu_object.c
@@ -619,7 +624,7 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
         cfs_hash_bd_lock(hs, &bd, 1);

         shadow = htable_lookup(s, &bd, f, waiter, &version);
-        if (likely(shadow == NULL)) {
+        if (shadow == NULL) {
                 struct lu_site_bkt_data *bkt;

                 bkt = cfs_hash_bd_extra_get(hs, &bd);
@@ -627,12 +632,14 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
                 bkt->lsb_busy++;
                 cfs_hash_bd_unlock(hs, &bd, 1);
                 return o;
+        } else {
+                if (!cfs_list_empty(&shadow->lo_header->loh_lru))
+                        cfs_list_del_init(&shadow->lo_header->loh_lru);
+                lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
+                cfs_hash_bd_unlock(hs, &bd, 1);
+                lu_object_free(env, o);
+                return shadow;
         }
-
-        lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
-        cfs_hash_bd_unlock(hs, &bd, 1);
-        lu_object_free(env, o);
-        return shadow;
 }

 /**
===========================
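
The effect of the new else-branch can be checked with a small user-space model. This is a hedged sketch rather than the kernel code: `toy_header`, `toy_site`, `toy_find_patched()`, `toy_put()` and `replay_with_patch_is_clean()` are hypothetical names, the hash table is a single slot, locking is omitted, and a lookup hit is assumed to take a reference. The only behaviour taken from the patch is that a shadow object found on the LRU is unlinked before it is returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of cfs_list_t. */
struct list_node {
        struct list_node *prev, *next;
};

static void list_init(struct list_node *n)
{
        n->prev = n->next = n;
}

static bool list_empty(const struct list_node *n)
{
        return n->next == n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
}

static void list_del_init(struct list_node *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
        list_init(n);
}

/* Hypothetical stand-ins for lu_object_header and one site bucket. */
struct toy_header {
        int refs;
        struct list_node lru;           /* plays the role of loh_lru */
};

struct toy_site {
        struct toy_header *cached;      /* one-slot "hash table" */
        struct list_node lru;           /* plays the role of lsb_lru */
};

/* Find path modelled on the patch: unlink a cached shadow from the LRU. */
static struct toy_header *toy_find_patched(struct toy_site *s,
                                           struct toy_header *fresh)
{
        if (s->cached == NULL) {
                s->cached = fresh;
                fresh->refs = 1;
                return fresh;
        }
        if (!list_empty(&s->cached->lru))
                list_del_init(&s->cached->lru);
        s->cached->refs++;              /* assumed: a hit takes a reference */
        return s->cached;               /* "fresh" would be freed here */
}

/* Put path; returns false where LASSERT(cfs_list_empty(...)) would fire. */
static bool toy_put(struct toy_site *s, struct toy_header *h)
{
        if (--h->refs > 0)
                return true;
        if (!list_empty(&h->lru))
                return false;
        list_add_tail(&h->lru, &s->lru);
        return true;
}

/* Same interleaving as the race; true means no modelled assertion fires. */
bool replay_with_patch_is_clean(void)
{
        struct toy_site site = { .cached = NULL };
        struct toy_header a = { 0 }, b = { 0 };
        struct toy_header *t1, *t2;
        bool t1_ok, unlinked, t2_ok;

        list_init(&site.lru);
        list_init(&a.lru);
        list_init(&b.lru);

        t1 = toy_find_patched(&site, &a);       /* Thread1 inserts its object */
        t1_ok = toy_put(&site, t1);             /* last ref dropped, on LRU */
        t2 = toy_find_patched(&site, &b);       /* Thread2 gets the shadow */
        unlinked = list_empty(&t2->lru) && list_empty(&site.lru);
        t2_ok = toy_put(&site, t2);             /* back onto the LRU */

        return t1_ok && t2 == &a && unlinked && t2_ok && !list_empty(&site.lru);
}
```

With the unlink in place, the object is off the LRU while Thread2 holds its reference and goes back onto the LRU on the final put, so the emptiness check in the put path holds in this model.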
Comment by Peter Jones [ 10/Feb/12 ]

FanYong

Yes, please push your patch to Gerrit. You do not need to ask permission in future.

Thanks

Peter

Comment by Oleg Drokin [ 10/Feb/12 ]

Nice find.

Kind of strange, though: how come we got the shadow object without a refcount, so that thread1 was able to release the last reference and put it into the LRU?

Comment by nasf (Inactive) [ 10/Feb/12 ]

In fact, before thread2 got the "shadow" object, thread1 had already released the last reference to it, and because the "shadow" object was not dying, it was put into the LRU. Thread2 then found the "shadow" object with a non-empty "loh_lru".

Comment by Peter Jones [ 10/Feb/12 ]

http://review.whamcloud.com/#change,2134

Comment by Build Master (Inactive) [ 11/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #464
LU-1013 obdclass: lu_object_find miss to unlink object from LRU (Revision b9ccecd1453c5c76fe135048c39f149c241650c6)

Result = SUCCESS
Oleg Drokin : b9ccecd1453c5c76fe135048c39f149c241650c6
Files :

  • lustre/obdclass/lu_object.c

Comment by Build Master (Inactive) [ 11/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #464
Integrated in lustre-master » x86_64,client,el6,inkernel #464
Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #464
Integrated in lustre-master » x86_64,client,sles11,inkernel #464
Integrated in lustre-master » x86_64,server,el5,ofa #464
Integrated in lustre-master » x86_64,server,el5,inkernel #464
Integrated in lustre-master » i686,server,el6,inkernel #464
Integrated in lustre-master » x86_64,server,el6,inkernel #464
Integrated in lustre-master » i686,client,el6,inkernel #464
Integrated in lustre-master » i686,server,el5,ofa #464
Integrated in lustre-master » i686,server,el5,inkernel #464
Integrated in lustre-master » i686,client,el5,inkernel #464
Integrated in lustre-master » i686,client,el5,ofa #464
LU-1013 obdclass: lu_object_find miss to unlink object from LRU (Revision b9ccecd1453c5c76fe135048c39f149c241650c6)

Result = SUCCESS
Oleg Drokin : b9ccecd1453c5c76fe135048c39f149c241650c6
Files :

  • lustre/obdclass/lu_object.c

Comment by Peter Jones [ 13/Feb/12 ]

Patch landed for 2.2. Please reopen this ticket if the issue still manifests itself with the patch applied.

Comment by Build Master (Inactive) [ 16/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #475
Revert "LU-1013 obdclass: lu_object_find miss to unlink object from LRU" (Revision 7eef7d96bd0c4463ab4e90657d9e2bf706995c05)

Result = SUCCESS
Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
Files :

  • lustre/obdclass/lu_object.c

Comment by Build Master (Inactive) [ 16/Feb/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #475
Integrated in lustre-master » x86_64,client,el5,inkernel #475
Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #475
Integrated in lustre-master » x86_64,client,el5,ofa #475
Integrated in lustre-master » x86_64,server,el5,inkernel #475
Integrated in lustre-master » i686,client,el5,ofa #475
Integrated in lustre-master » i686,server,el6,inkernel #475
Integrated in lustre-master » x86_64,client,el6,inkernel #475
Integrated in lustre-master » i686,client,el5,inkernel #475
Integrated in lustre-master » x86_64,server,el6,inkernel #475
Integrated in lustre-master » i686,server,el5,ofa #475
Integrated in lustre-master » i686,client,el6,inkernel #475
Integrated in lustre-master » i686,server,el5,inkernel #475
Revert "LU-1013 obdclass: lu_object_find miss to unlink object from LRU" (Revision 7eef7d96bd0c4463ab4e90657d9e2bf706995c05)

Result = SUCCESS
Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
Files :

  • lustre/obdclass/lu_object.c

Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #480
LU-1013 obdclass: lu_object_find miss to unlink object from LRU (Revision b9ccecd1453c5c76fe135048c39f149c241650c6)
Revert "LU-1013 obdclass: lu_object_find miss to unlink object from LRU" (Revision 7eef7d96bd0c4463ab4e90657d9e2bf706995c05)

Result = FAILURE
Oleg Drokin : b9ccecd1453c5c76fe135048c39f149c241650c6
Files :

  • lustre/obdclass/lu_object.c

Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
Files :

  • lustre/obdclass/lu_object.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #480
LU-1013 obdclass: lu_object_find miss to unlink object from LRU (Revision b9ccecd1453c5c76fe135048c39f149c241650c6)
Revert "LU-1013 obdclass: lu_object_find miss to unlink object from LRU" (Revision 7eef7d96bd0c4463ab4e90657d9e2bf706995c05)

Result = FAILURE
Oleg Drokin : b9ccecd1453c5c76fe135048c39f149c241650c6
Files :

  • lustre/obdclass/lu_object.c

Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
Files :

  • lustre/obdclass/lu_object.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #480
LU-1013 obdclass: lu_object_find miss to unlink object from LRU (Revision b9ccecd1453c5c76fe135048c39f149c241650c6)
Revert "LU-1013 obdclass: lu_object_find miss to unlink object from LRU" (Revision 7eef7d96bd0c4463ab4e90657d9e2bf706995c05)

Result = ABORTED
Oleg Drokin : b9ccecd1453c5c76fe135048c39f149c241650c6
Files :

  • lustre/obdclass/lu_object.c

Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
Files :

  • lustre/obdclass/lu_object.c