[LU-1126] Client file locking issue. Assertion triggered when decrementing a read lock on an item that has no existing read locks. Created: 21/Feb/12  Updated: 21/Oct/13  Resolved: 21/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: Peter Piela (Inactive) Assignee: Oleg Drokin
Resolution: Fixed Votes: 2
Labels: None
Environment:

Client: Lustre 1.8.7-wc1, in-kernel IB, RHEL 5.6
Server: Lustre 1.8.4, CentOS 5.5, Terascala appliance


Attachments: File debug_logs.tar.bz2     File flock.c     File log_lustre.3     File messages    
Severity: 1
Rank (Obsolete): 3992

 Description   

The Lustre filesystem is mounted on the client using the -o flock option. Without this option the customer's application will not run. The application uses fcntl for file locking. The application does not explicitly release locks; it relies on the file close operation to do that.
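For illustration, here is a minimal sketch of the locking pattern described above (not the customer's application; the path and data are made up): a whole-file POSIX write lock is taken with fcntl() and never explicitly released, so the close() is what drops it.

/* Minimal sketch of the locking pattern described in this ticket --
 * not the customer's application. A whole-file POSIX write lock is
 * taken with fcntl() and never explicitly unlocked; close() releases
 * it, which is the path that ends in locks_remove_posix(). */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/lustre/shared.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
                return 1;

        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type   = F_WRLCK;   /* exclusive (write) lock */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;         /* 0 = lock the whole file */

        if (fcntl(fd, F_SETLKW, &fl) == 0)
                write(fd, "data", 4);  /* update the shared dataset */

        /* No explicit F_UNLCK: the lock is dropped by close(). */
        close(fd);
        return 0;
}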

The client hardware configuration is a single HP DL980 G7 server with 8 x 8-core Nehalem-EX CPUs and 512 GB RAM.

The application workload consists of a number of processes writing to a small number of shared datasets.

Below is an example of the traceback.

Feb 15 21:08:43 pt980a kernel: LustreError: 23926:0:(ldlm_lock.c:599:ldlm_lock_decref_internal_nolock()) ASSERTION(lock->l_readers > 0) failed
Feb 15 21:08:43 pt980a kernel: LustreError: 23926:0:(ldlm_lock.c:599:ldlm_lock_decref_internal_nolock()) LBUG
Feb 15 21:08:43 pt980a kernel: Pid: 23926, comm: sas
Feb 15 21:08:43 pt980a kernel:
Feb 15 21:08:43 pt980a kernel: Call Trace:
Feb 15 21:08:43 pt980a kernel: [<ffffffff889fa6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Feb 15 21:08:43 pt980a kernel: [<ffffffff889fabda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88a02ff0>] tracefile_init+0x0/0x110 [libcfs]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b2161f>] ldlm_lock_decref_internal_nolock+0x7f/0x100 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b492d9>] ldlm_process_flock_lock+0x1089/0x18a0 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88a4c33d>] LNetMDUnlink+0xcd/0xf0 [lnet]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b1fd59>] ldlm_grant_lock+0x4e9/0x550 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b4a5fb>] ldlm_flock_completion_ast+0xa0b/0xaf0 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b23729>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b3bc8b>] ldlm_cli_enqueue_fini+0xa5b/0xbc0 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88ab567d>] class_handle_hash+0x16d/0x250 [obdclass]
Feb 15 21:08:43 pt980a kernel: [<ffffffff8008e7f7>] default_wake_function+0x0/0xe
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b3d7cf>] ldlm_cli_enqueue+0x63f/0x700 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b3e0a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88d20acf>] ll_file_flock+0x57f/0x680 [lustre]
Feb 15 21:08:43 pt980a kernel: [<ffffffff88b49bf0>] ldlm_flock_completion_ast+0x0/0xaf0 [ptlrpc]
Feb 15 21:08:43 pt980a kernel: [<ffffffff8003063e>] locks_remove_posix+0x84/0xa8
Feb 15 21:08:43 pt980a kernel: [<ffffffff8003007e>] __up_write+0x27/0xf2
Feb 15 21:08:43 pt980a kernel: [<ffffffff80023da7>] filp_close+0x54/0x64
Feb 15 21:08:43 pt980a kernel: [<ffffffff8001e211>] sys_close+0x88/0xbd
Feb 15 21:08:43 pt980a kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0



 Comments   
Comment by Oleg Drokin [ 21/Feb/12 ]

Since dlmtrace is no longer in the default debug mask, can you please reproduce the problem and then attach the debug log dumped during the LBUG here?

It appears that the problem is due to some sort of race: once the file is closed, Lustre attempts to wipe all flock locks, but one of the locks found has somehow already been freed.

Additionally, if the application needs flock functionality but it is not important for consistency (e.g. because the app writes to non-intersecting regions of the file and hence does not need real flock semantics on Lustre to work), you can significantly speed up its operations by mounting with the -o localflock option. This provides node-local flock consistency, but not cluster-wide consistency (the savings come from not needing to send lock RPCs to the MDS for every flock operation).

Comment by Peter Piela (Inactive) [ 28/Feb/12 ]

Problem reproduced with lock tracing enabled

Comment by Peter Piela (Inactive) [ 28/Feb/12 ]

Customer was able to reproduce the problem with tracing enabled. The log file is attached.

Comment by Peter Piela (Inactive) [ 29/Feb/12 ]

The generated log does not contain the trace information that I expected. I will review the steps used to create it. The log was generated using the steps below. Please advise if they are not correct.

lctl> filter all_types
lctl> show dlmtrace
lctl> debug_kernel [filename]

Comment by Oleg Drokin [ 29/Feb/12 ]

Can you please tell me how flock is used by the app to guarantee consistency?

I think the best way to get a very good log would be:
on the node where you run the app and expect it to crash, please do:
echo -1 >/proc/sys/lnet/debug
echo 200 >/proc/sys/lnet/debug_mb
echo 0 >/proc/sys/lnet/panic_on_lbug # This prevents a kernel panic on LBUG so the log can be gathered after the crash
lctl dk >/dev/null # This is to clear current debug buffer
Now run the app.
After the LBUG is hit, you will see a message in the kernel log about "dumping log to /tmp/lustre....". Get that log (and make a copy first!) and run:
lctl df /tmp/lustre... >/tmp/lustre...txt

Please attach the resulting LBUG message and the decoded log file (the one with txt at the end you got in the last step) here.

Comment by Ben Evans (Inactive) [ 02/Mar/12 ]

Debug logs from test run.

Comment by Oleg Drokin [ 05/Mar/12 ]

Did the backtrace from the LBUG (in dmesg/kernel log) look the same as the one at the start of this bug, i.e. did it start at sys_close and then pass through locks_remove_posix into the Lustre flock code?
You forgot to put the crash backtrace into the archive with the log.

Comment by Oleg Drokin [ 05/Mar/12 ]

From the log we can see that the kernel made a GETLK request for a write lock covering the entire file.
The server found a blocking lock present and returned it; the blocking lock happened to be a read lock. The code attempted to release the ldlm lock but somehow used the wrong mode (even though we remember the correct one in the file_lock structure).

Now, even ignoring the wrong unlock mode for a moment, if we look into locks_remove_posix, we can see that it does not issue any GETLK-style requests at all. It does a blanket UNLCK over the whole file with the SETLK command.
So how did this happen?
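For reference, a rough userspace equivalent of the cleanup that locks_remove_posix performs at close time is sketched below; this is a simplified illustration, not the kernel code, and the helper name is made up.

#include <fcntl.h>
#include <unistd.h>

/* Rough userspace equivalent of the close-time cleanup done by
 * locks_remove_posix(): a blanket F_UNLCK over the whole file issued
 * as a SETLK command. There is no F_GETLK anywhere in this path. */
static void drop_posix_locks(int fd)
{
        struct flock fl = {
                .l_type   = F_UNLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,   /* 0 = the whole file */
        };
        fcntl(fd, F_SETLK, &fl);
}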

Do you have any patches on top of your rhel5 kernel? Please attach fs/locks.c from the kernel you use.
Do you have any patches on top of the 1.8.7-wc1 that you use on the clients?

Comment by Peter Piela (Inactive) [ 05/Mar/12 ]

Do you have any patches on top of your rhel5 kernel? No

[root@pt980a ~]# uname -a
Linux pt980a.ptest.sas.com 2.6.18-274.3.1.el5 #1 SMP Fri Aug 26 18:49:02 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@pt980a ~]#

Do you have any patches on top of the 1.8.7-wc1 that you use on the clients? No

Please attach fs/locks.c from the kernel you use. The customer does not have access to the source file; can you locate it from the version information above?

Comment by Oleg Drokin [ 05/Mar/12 ]

Well, I have looked at the file for that kernel version and its content is as I expected, i.e. it does not do a GETLK there.

So, if the backtrace from the crash for which you provided the logs is the same as in the bug description, I don't have a coherent explanation of what has happened.
In that case I wonder if you can capture a kernel crash dump from the event so that we can take a look inside of the kernel structures.
So far the information I see in the code does not match the information in the debug logs.

I wonder if you can share the reproducing app with us, and at what sort of scale does it need to be run to actually hit the issue?

Comment by Peng Tao [ 17/Mar/12 ]

Reproducer program.

Comment by Peng Tao [ 17/Mar/12 ]

I've spent some time looking into this and found a reproducer for it. As discussed with Oleg, we identified the root cause as follows:

1. For a GETLK/LCK_PW operation, the client creates a struct ldlm_lock with PW mode and reader(0)/writer(1). If the server returns LCK_PR to indicate the lock is held by another process (lines 466845 and 466846 both show the server returning a PR lock owned by PID 22497), the client changes the lock mode to PR but doesn't touch the reader/writer counts. Hence there is a window where the lock sits in the resource's lr_granted list, saved in PR mode but with reader(0) and writer(1). Normally this is OK because the lock will be destroyed in ldlm_flock_completion_ast() with the PW mode saved in getlk->fl_type, but it is vulnerable because other processes can find it in the resource's lr_granted list.

2. In the log, another process entered ldlm_process_flock_lock() looking for the PR lock owned by PID 22497 in lr_granted (line 468326). It then found the locks mentioned in 1 and decided to destroy them because of the lock range overlap. It destroyed them using req->l_req_mode, which is PR, but the found locks, although saved in PR mode, still had reader(0) and writer(1), so decrementing the nonexistent read reference caused the LBUG as reported.

Comment by Peng Tao [ 18/Mar/12 ]

A patch to fix it by exchanging the lock refcounts when the lock mode changes has been submitted to Gerrit:
http://review.whamcloud.com/2343
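For illustration only, a conceptual sketch of that idea follows; this is not the actual patch, and the struct and field names here are simplified stand-ins for the real struct ldlm_lock fields (l_readers, l_writers, granted mode).

/* Conceptual sketch only -- not the patch at
 * http://review.whamcloud.com/2343. It illustrates the fix idea:
 * when the granted mode of a flock lock changes (e.g. a PW request
 * comes back as PR after a GETLK), exchange the reader and writer
 * reference counts along with the mode, so a PR-mode lock is never
 * left in lr_granted with readers == 0 and writers == 1. */
struct lock_sketch {
        int mode;        /* granted lock mode (placeholder values) */
        int l_readers;   /* read references held */
        int l_writers;   /* write references held */
};

static void lock_mode_change(struct lock_sketch *lock, int new_mode)
{
        if (lock->mode == new_mode)
                return;

        /* Swap the refcounts together with the mode; otherwise a later
         * decref in the new mode hits a zero counter and trips
         * ASSERTION(lock->l_readers > 0). */
        int tmp = lock->l_readers;
        lock->l_readers = lock->l_writers;
        lock->l_writers = tmp;
        lock->mode = new_mode;
}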

Comment by HP Slovakia team (Inactive) [ 03/Aug/12 ]

Our customer also faced this issue on Lustre 1.8.8. When can we expect a patch that fixes it?

Comment by Ben Allen [ 24/Sep/12 ]

This bug also appears to affect version 2.1.3 installed from Whamcloud's official RPMs and the pre-built kernel 2.6.32-279.2.1.el6_lustre.gc46c389.x86_64.

An easy way to reproduce this issue is with BULL's NFS Locktests: http://nfsv4.bullopensource.org/tools/tests/locktest.php. When run with the thread option, it will cause the LBUG nearly immediately on a client.

Comment by Supporto Lustre Jnet2000 (Inactive) [ 24/Jul/13 ]

Dear sirs, we are planning to install SAS with LSF on our production server with
lustre: 1.8.7
kernel: patchless_client
build: jenkins-wc1--PRISTINE-2.6.18-274.3.1.el5_lustre.g9500ebf

Before installing, we need this bug to be fixed. Is the patch reported here, http://review.whamcloud.com/#/c/2343/, ready for a production environment?
If not, when do you plan to release it?
Thanks a lot.

Comment by Bruno Faccini (Inactive) [ 01/Aug/13 ]

As part of LU-2839, likely a dup of this ticket, I tried to rerun both provided reproducers over the latest master; here are the results:

_ unable to reproduce the LBUG with the attached "flock.c".
_ able to reproduce (quite easily too!) with "BULL's NFS Locktests". But the scenario found in the Lustre debug log and crash dump is not the one described here (lock mode versus readers/writers counters) but a simpler race during lock destroy upon overlap detection.

Thus, I have been able to successfully run "BULL's NFS Locktests" for hours over the latest master after applying the simpler patch http://review.whamcloud.com/7134.

I will check whether I can confirm this running with the latest 1.8.

Maybe I should open a specific ticket against master too?

Comment by Bruno Faccini (Inactive) [ 10/Sep/13 ]

The specific case/scenario of the LBUG when running "BULL's NFS Locktests" is now addressed by ticket LU-3684.

Comment by Jodi Levi (Inactive) [ 21/Oct/13 ]

Patch landed to Master.
