[LU-6337] threads stuck at ldlm_completion_ast Created: 05/Mar/15  Updated: 16/Oct/15  Resolved: 16/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None

Attachments: HTML File service100    
Severity: 3
Rank (Obsolete): 17749

 Description   

Looks like we have hit this issue 3 times in the past 48 hours. Lots of threads are stuck at ldlm_completion_ast. We are running with 2.4.3.

see attached console logs



 Comments   
Comment by Jay Lan (Inactive) [ 05/Mar/15 ]

This ticket looks like LU-5497. Moving to 2.5.3 seems like a possible solution, but we had problems upgrading yesterday. While we investigate the upgrade problems (and honestly need more testing before putting it in production), we need working patches for 2.4.3 from Intel.

Comment by Oleg Drokin [ 06/Mar/15 ]

The symptoms you are seeing are too broad. The most likely cause is a lock that is not being released by some party.
In the past the major contributor to this was LU-2827 (also seen as LU-5497 at LLNL), but for it to manifest you need to have over 45 OSTs in your system OR your network must regularly drop RPCs to/from the MDS.
If either of those is true, then applying http://review.whamcloud.com/#/c/6511/ and http://review.whamcloud.com/#/c/9488/ should help you.

Also please note that 2.5.3 does not contain fixes for this problem (but the tip of b2_5 does).
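
If it is useful, here is a rough sketch of pulling those two changes from Gerrit into a local 2.4.3 branch. The fs/lustre-release project path, the placeholder patchset numbers (N), and the branch names are assumptions; take the exact refs/changes/... refs from the "Download" links on each Gerrit change page.

    # Assumed project path and placeholder patchset numbers; copy the exact
    # refs from the Gerrit "Download" links for each change.
    git checkout -b ldlm-fixes your-2.4.3-branch
    git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/11/6511/N
    git cherry-pick FETCH_HEAD
    git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/88/9488/N
    git cherry-pick FETCH_HEAD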

Comment by Jay Lan (Inactive) [ 06/Mar/15 ]

Our nas-2.5.3 branch as of today is very close to the tip of b2_5. We are at
LU-5912 libcfs: use vfs api for fsync calls

However, I do not see the LU-5497 patch in b2_5. Which commit is it, if the tip of b2_5 contains the fix?

Now, if I want to cherry pick #6511 and #9488 to nas-2.4.3, do I also need
http://review.whamcloud.com/#/c/10601/ in addition to those two?

Comment by Oleg Drokin [ 06/Mar/15 ]

In b2_5 there is a proper fix for this issue under the banner of LU-2827, and then a number of follow-on patches, from LU-2827 to LU-5579 and everything in between.

Patch 10601 is purely informational and does not really improve the actual hanging situation, so it's OK to skip it for 2.4.3, but it also might be a good idea to add it should this matter require more investigation.
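
If you want to see what that series looks like upstream, something like the following should list it. This is only a sketch: it assumes a remote named origin tracking the upstream b2_5 branch, and the --grep pattern only catches the endpoints of the series, so the commits in between still need a quick scan of the surrounding log.

    # List upstream b2_5 commits mentioning the endpoints of the series;
    # review the nearby log entries for everything in between.
    git fetch origin
    git log --oneline --extended-regexp --grep='LU-(2827|5579)' origin/b2_5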

Comment by Jay Lan (Inactive) [ 06/Mar/15 ]

I had a conflict in applying #9488/9:

<<<<<<< HEAD
        /* If the client does not require open lock, it does not need to
         * search lock in exp_lock_hash, since the server thread will
         * make sure the lock will be released, and the resend request
         * can always re-enqueue the lock */
        if ((opcode != MDT_IT_OPEN) || (opcode == MDT_IT_OPEN &&
            info->mti_spec.sp_cr_flags & MDS_OPEN_LOCK)) {
                /* In the function below, .hs_keycmp resolves to
                 * ldlm_export_lock_keycmp() */
                /* coverity[overrun-buffer-val] */
                lock = cfs_hash_lookup(exp->exp_lock_hash, &remote_hdl);
                if (lock) {
                        lock_res_and_lock(lock);
                        if (lock != new_lock) {
                                lh->mlh_reg_lh.cookie = lock->l_handle.h_cookie;
                                lh->mlh_reg_mode = lock->l_granted_mode;
                                LDLM_DEBUG(lock, "Restoring lock cookie");
                                DEBUG_REQ(D_DLMTRACE, req,
                                          "restoring lock cookie "LPX64,
                                          lh->mlh_reg_lh.cookie);
                                if (old_lock)
                                        *old_lock = LDLM_LOCK_GET(lock);
                                cfs_hash_put(exp->exp_lock_hash,
                                             &lock->l_exp_hash);
                                unlock_res_and_lock(lock);
                                return;
                        }
                        cfs_hash_put(exp->exp_lock_hash, &lock->l_exp_hash);
                        unlock_res_and_lock(lock);
                }
        }
=======
        /* In the function below, .hs_keycmp resolves to
         * ldlm_export_lock_keycmp() */
        /* coverity[overrun-buffer-val] */
        /* Look for first lock found in hash for key that is not new_lock.
         * There should only be 2 upon resend, new_lock and the
         * first/original one. */
        data.skip_lock = new_lock;
        cfs_hash_for_each_key(exp->exp_lock_hash, &remote_hdl,
                              not_skip_lock, &data);
        lock = data.found_lock;
        if (lock != NULL) {
                lh->mlh_reg_lh.cookie = lock->l_handle.h_cookie;
                lh->mlh_reg_mode = lock->l_granted_mode;
                LDLM_DEBUG(lock, "Restoring lock cookie");
                DEBUG_REQ(D_DLMTRACE, req, "restoring lock cookie "LPX64,
                          lh->mlh_reg_lh.cookie);
                if (old_lock)
                        *old_lock = LDLM_LOCK_GET(lock);
                cfs_hash_put(exp->exp_lock_hash, &lock->l_exp_hash);
                return;
        }
>>>>>>> c695980... LU-4584 mdt: ensure orig lock is found in hash upon resend

Does this ring any bell to you? I guess the code in HEAD came from another patch we cherry-picked earlier, or I missed another patch.

Comment by Jay Lan (Inactive) [ 06/Mar/15 ]

Geez, the formatting really got screwed up when displaying the "*" characters and indentation.

The display will be correct when you enter edit mode.

Comment by Oleg Drokin [ 06/Mar/15 ]

Patch 9488 is against b2_4, so there should not really be a conflict. (BTW, you can use {code}...{code} tags to disable formatting on a piece of code when making a comment with it.)

I see that in your case it's the LU-4403 patch you are carrying that messed things up.

Sadly it has a ton of whitespace changes, but if you do git show -b 08b397f5bf2561f2294315a9039b1930ce0695d5 on it, you can see the real change.
Also, I seem to have a passing memory that this patch was not really needed and only arose due to a bug fixed by patch 6511 (it was backed out in b2_5 as part of the LU-2827 series of patches).
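
In case it helps, a small sketch of inspecting that commit while ignoring the whitespace noise; the branch name below is just a placeholder for whatever branch carries the LU-4403 patch.

    # Show the carried LU-4403 commit with whitespace-only changes suppressed.
    git show -b 08b397f5bf2561f2294315a9039b1930ce0695d5

    # Find which commits on the carried branch reference LU-4403
    # ("your-branch" is a placeholder).
    git log --oneline --grep='LU-4403' your-branch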

Comment by Mahmoud Hanafi [ 16/Oct/15 ]

Please close this issue

Comment by Peter Jones [ 16/Oct/15 ]

ok - thanks Mahmoud
