[LU-1013] recovery-mds lu_object.c:116:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed
Created: 19/Jan/12 | Updated: 27/Mar/12 | Resolved: 13/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Hyperion/LLNL chaos5 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 4743 |
| Description |
|
Running recovery-mds-scale fails after 10-15 failovers. Log attached, also uploaded to Maloo. There are several failures in Maloo to choose from. |
| Comments |
| Comment by Peter Jones [ 20/Jan/12 ] |
|
Oleg, can you look into this one please? Thanks. Peter |
| Comment by Oleg Drokin [ 27/Jan/12 ] |
|
So the attached log is not very useful, and in the many Maloo reports with this failure there are no logs at all? |
| Comment by Cliff White (Inactive) [ 30/Jan/12 ] |
|
Yes, panic_on_lbug was set. I will attempt to replicate without this on 2.1.55. |
| Comment by Leon Kos [ 30/Jan/12 ] |
|
I am getting these crashes on 2.1.55 when I try to remove some user directories with a glob:

[root@mds home]# rm -rf mdular/   # works

Message from syslogd@ at Mon Jan 30 20:16:21 2012 ...
Message from syslogd@ at Mon Jan 30 20:16:21 2012 ... |
| Comment by nasf (Inactive) [ 07/Feb/12 ] |
|
I have met similar issues in my branch for OI Scrub, and they blocked my OI Scrub test. I found there is a race condition between lu_object_find_try() and lu_object_put(). For example:

static struct lu_object *lu_object_find_try(const struct lu_env *env,
                                            struct lu_device *dev,
                                            const struct lu_fid *f,
                                            const struct lu_object_conf *conf,
                                            cfs_waitlink_t *waiter)
{
        ...
(step1) o = lu_object_alloc(env, dev, f, conf);
        if (unlikely(IS_ERR(o)))
                return o;

        LASSERT(lu_fid_eq(lu_object_fid(o), f));

        cfs_hash_bd_lock(hs, &bd, 1);

        shadow = htable_lookup(s, &bd, f, waiter, &version);
        if (likely(shadow == NULL)) {
                struct lu_site_bkt_data *bkt;

                bkt = cfs_hash_bd_extra_get(hs, &bd);
                cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
                bkt->lsb_busy++;
                cfs_hash_bd_unlock(hs, &bd, 1);
(step2)         return o;
        }

        lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
        cfs_hash_bd_unlock(hs, &bd, 1);
        lu_object_free(env, o);
(step3) return shadow;
}

void lu_object_put(const struct lu_env *env, struct lu_object *o)
{
        ...
        if (!lu_object_is_dying(top)) {
(step4)         LASSERT(cfs_list_empty(&top->loh_lru));
(step5)         cfs_list_add_tail(&top->loh_lru, &bkt->lsb_lru);
                cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
                return;
        }
        ...
}

Thread1 and Thread2 try to find some object with the same FID concurrently, and the object is not allocated in memory yet. Consider the following sequence:

1) Thread1 step1: allocates the object.
2) Thread2 step1: allocates its own copy of the object, since it is not in the cache yet.
3) Thread1 step2: inserts its object into the site hash table and returns it.
4) Thread1 releases its last reference via lu_object_put(); the object is not dying, so at step5 it is added to the LRU list.
5) Thread2 step3: htable_lookup() finds Thread1's object as "shadow" (now sitting on the LRU list), frees its own copy, and returns the shadow.
6) Thread2 later releases its reference on the shadow via lu_object_put().

So Thread2 will fail at step4:

mds kernel: LustreError: 28264:0:(lu_object.c:116:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed

I have made the following patch to fix the race; if you do not mind, I will push the patch to gerrit for review.

===========================
diff --git a/lustre/obdclass/lu_object.c b/lustre/obdclass/lu_object.c
index 2ad22f0..f26c534 100644
--- a/lustre/obdclass/lu_object.c
+++ b/lustre/obdclass/lu_object.c
@@ -619,7 +624,7 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
         cfs_hash_bd_lock(hs, &bd, 1);
 
         shadow = htable_lookup(s, &bd, f, waiter, &version);
-        if (likely(shadow == NULL)) {
+        if (shadow == NULL) {
                 struct lu_site_bkt_data *bkt;
 
                 bkt = cfs_hash_bd_extra_get(hs, &bd);
@@ -627,12 +632,14 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
                 bkt->lsb_busy++;
                 cfs_hash_bd_unlock(hs, &bd, 1);
                 return o;
+        } else {
+                if (!cfs_list_empty(&shadow->lo_header->loh_lru))
+                        cfs_list_del_init(&shadow->lo_header->loh_lru);
+                lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
+                cfs_hash_bd_unlock(hs, &bd, 1);
+                lu_object_free(env, o);
+                return shadow;
+        }
-        }
-
-        lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
-        cfs_hash_bd_unlock(hs, &bd, 1);
-        lu_object_free(env, o);
-        return shadow;
 }
 
 /**
=========================== |
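|
To see the failing invariant in isolation, here is a minimal userspace C sketch of the same interleaving. This is a hedged illustration, not Lustre code: struct obj, obj_put(), and obj_find() are hypothetical, simplified stand-ins for lu_object_header, lu_object_put(), and the cache-hit path of lu_object_find_try(), and the list helpers stand in for cfs_list_*. The interleaving is replayed deterministically in one thread: the put of the last reference parks the object on the LRU, a later cache hit revives it, and the next put trips the assertion unless the cache-hit path first removes the object from the LRU, which is what the patch above does.

#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for cfs_list_*: a circular doubly-linked list. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h)        { h->prev = h->next = h; }
static int  list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
}

static void list_del_init(struct list_head *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
        list_init(n);
}

/* Stand-in for lu_object_header: a refcount plus the loh_lru linkage. */
struct obj { int refs; struct list_head lru; };

static struct list_head site_lru;        /* stand-in for lsb_lru */

/* Mirrors the non-dying path of lu_object_put(): dropping the last
 * reference parks the object on the LRU, asserting it is not already
 * there (the failing LASSERT). */
static void obj_put(struct obj *o)
{
        if (--o->refs == 0) {
                assert(list_empty(&o->lru));
                list_add_tail(&o->lru, &site_lru);
        }
}

/* Mirrors the cache-hit ("shadow") path of lu_object_find_try().
 * With fixed == 1 it applies the patch: a revived object is removed
 * from the LRU before being handed back. */
static struct obj *obj_find(struct obj *cached, int fixed)
{
        if (fixed && !list_empty(&cached->lru))
                list_del_init(&cached->lru);
        cached->refs++;
        return cached;
}

int main(void)
{
        struct obj o = { .refs = 1 };
        struct obj *shadow;

        list_init(&o.lru);
        list_init(&site_lru);

        obj_put(&o);              /* Thread1: last ref, object -> LRU */
        shadow = obj_find(&o, 1); /* Thread2: cache hit revives it    */
        obj_put(shadow);          /* with fixed == 0 this asserts     */

        printf("no assertion failure with the fix applied\n");
        return 0;
}

Changing the second argument of obj_find() to 0 reproduces the reported ASSERTION(cfs_list_empty(&top->loh_lru)) failure in miniature. |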
| Comment by Peter Jones [ 10/Feb/12 ] |
|
FanYong, yes, please push your patch to gerrit - and you do not need to ask permission in future. Thanks. Peter |
| Comment by Oleg Drokin [ 10/Feb/12 ] |
|
Nice find. Kind of strange, though: how come we got the shadow object without a refcount, so that thread1 was able to release the last reference and put it onto the LRU? |
| Comment by nasf (Inactive) [ 10/Feb/12 ] |
|
In fact, before thread2 got the "shadow" object, thread1 had already released the last reference on "shadow", and because the "shadow" object was not dying, it was put onto the LRU. Then thread2 found the "shadow" object with a non-empty "loh_lru". |
| Comment by Peter Jones [ 10/Feb/12 ] |
| Comment by Build Master (Inactive) [ 11/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 13/Feb/12 ] |
|
Patch landed for 2.2. Please reopen this ticket if the issue still manifests itself with the patch applied. |
| Comment by Build Master (Inactive) [ 16/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
Oleg Drokin : 7eef7d96bd0c4463ab4e90657d9e2bf706995c05
|