[LU-3684] LBUG/"ldlm_lock_decref_internal_nolock()) ASSERTION(lock->l_readers > 0) failed" running Bull's NFS locktests Created: 01/Aug/13  Updated: 16/Apr/14  Resolved: 03/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Task Priority: Major
Reporter: Supporto Lustre Jnet2000 (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Client Lustre 1.8.7 jenkins-wc1--PRISTINE-2.6.18-274.3.1.el5 RHEL 5.7
Server Lustre 1.8.7 jenkins-wc1--PRISTINE-2.6.18-274.3.1.el5_lustre.g9500ebf RHEL 5.7


Attachments: File locktests.tar.gz     File lustre_log.txt.gz    
Rank (Obsolete): 9511

 Description   

We need the bug reported at https://jira.hpdd.intel.com/browse/LU-1126 to be fixed before installing SAS Grid Manager.

The Lustre filesystem is mounted on the client using the -o flock option.



 Comments   
Comment by Peter Jones [ 01/Aug/13 ]

Thanks for the report. We are looking into the best option

Comment by Bruno Faccini (Inactive) [ 02/Sep/13 ]

Hello,
Could you describe in more detail the issue you experience when running SAS Grid Manager on top of Lustre 1.8? This would help me better qualify this ticket and change its title accordingly.

Since your feeling is that your problem is still the one originally addressed by LU-1126, and you may already be aware of my latest work/update there, I would like your help in determining whether it is the original issue described in LU-1126 (wrong lock mode used versus the readers/writers counters), likely reproducible with the provided flock.c program, or the one I reported as still present in master (a race during lock destroy upon overlap detection), which is easily reproducible with "Bull's NFS Locktests".

Do you think you can again provide the Lustre debug log of a new occurrence? (The way Oleg described in LU-1126 would be best, since it captures a full trace, but at least with "dlmtrace" enabled.)

Thanks in advance for your help.

Comment by Supporto Lustre Jnet2000 (Inactive) [ 02/Sep/13 ]

Dear Bruno, currently we don't have SAS Grid Manager installed on our system. Before installing it, the SAS support team requires that the bug reported at https://jira.hpdd.intel.com/browse/LU-1126 be fixed.
We do, however, definitely have the bug that can be reproduced with "BULL's NFS Locktests".
If you wish, we can provide the Lustre log of a BULL test run.
Regards

Comment by Bruno Faccini (Inactive) [ 02/Sep/13 ]

It would be great if I could get the Lustre debug log taken during a "BULL's NFS Locktests" run at your site! Thanks in advance.

Comment by Supporto Lustre Jnet2000 (Inactive) [ 06/Sep/13 ]

Dear Bruno, we have attached the log of the Lustre client that crashed during execution of the BULL test.
Regards

Comment by Bruno Faccini (Inactive) [ 09/Sep/13 ]

I checked the Lustre log you provided, and it is definitely the same problem triggered by "BULL's NFS Locktests" (and not the original one in LU-1126 with the custom reproducer) that I fixed with change http://review.whamcloud.com/7134 in master, as already described in LU-1126.

Comment by Bruno Faccini (Inactive) [ 10/Sep/13 ]

Since this ticket definitely addresses a different scenario (even if the LBUG/"ldlm_lock_decref_internal_nolock()) ASSERTION(lock->l_readers > 0) failed" is the same!!) than the one originally reported in LU-1126, where it was also already described, I would like to make this ticket the main/tracking one for this particular case.

Just to be complete about the differences between the two problems, LU-1126 versus this ticket:

_ LU-1126 was opened for a particular race/scenario where a transient lock, returned as the result of an F_GETLK request, must be destroyed due to overlap, but this happens during the very short window where its changed mode (PR/PW) has become inconsistent with its counters (l_readers/l_writers). The LBUG occurs because the wrong counter is decremented. This particular problem only shows up with the custom reproducer (flock.c) provided for LU-1126.

_ The scenario for this ticket's problem is different: a race can occur between two threads that both want to destroy the same lock (one to finish processing the corresponding request, the other due to overlap rules), mainly during the handling of multiple concurrent F_UNLCK requests. The LBUG occurs because the second thread finds the counter already set to 0. This particular problem shows up very easily when running, as you experienced, "Bull's NFS Locktests". The test is available at http://nfsv4.bullopensource.org/tools/tests/locktest.php, and is also provided in the "locktests.tar.gz" archive I attached here. An easy way to reproduce is to run it in pthread mode, e.g. "locktests -n 10 -T -f <Lustre-File>", on a single, full Lustre node (i.e., after installing Lustre and running "llmount.sh").

As I said, the problem has been fixed in master with http://review.whamcloud.com/7134; the b1_8 patch is now at http://review.whamcloud.com/7420.

Also, I would like to change this ticket's title, as it is definitely not the same problem/race as the one addressed in LU-1126: something like "[ldlm_lock_decref_internal_nolock()) ASSERTION(lock->l_readers > 0) failed] running Bull's NFS locktests".

I will also add a reference to this ticket in LU-1126 to complete the split between the two different problems.

Comment by Gabriele Paciucci (Inactive) [ 28/Feb/14 ]

The customer is currently in the process of upgrading to 1.8.9 plus the patch, so please close this ticket.

Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ]

The b2_4 patch version is at http://review.whamcloud.com/9968.

Generated at Sat Feb 10 01:36:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.