[LU-1061] LBUG on cl_locks_prune() Created: 01/Feb/12  Updated: 17/Feb/12  Resolved: 13/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: Lustre 2.2.0

Type: Bug Priority: Blocker
Reporter: Niu Yawei (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File oops    
Severity: 3
Rank (Obsolete): 4738

 Description   

Jan 31 14:13:43 spoon02 kernel: Lustre: DEBUG MARKER: == racer racer.sh test complete, duration 659 sec ==================================================== 14:13:43 (1328037223)
Jan 31 14:29:31 spoon02 kernel: Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
Jan 31 14:29:31 spoon02 kernel: LustreError: 25339:0:(file.c:157:ll_close_inode_openhandle()) inode 144115205255725063 mdc close failed: rc = -4
Jan 31 14:29:31 spoon02 kernel: Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request
Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(file.c:157:ll_close_inode_openhandle()) inode 144115205255725059 mdc close failed: rc = -108
Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(file.c:157:ll_close_inode_openhandle()) Skipped 1 previous similar message
Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(cl_lock.c:2082:cl_locks_prune()) ASSERTION(lock->cll_users == 0) failed
Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(cl_lock.c:2082:cl_locks_prune()) LBUG
Jan 31 14:29:31 spoon02 kernel: Pid: 25321, comm: dd
Jan 31 14:29:31 spoon02 kernel:
Jan 31 14:29:31 spoon02 kernel: Call Trace:
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8882372f>] libcfs_debug_dumpstack+0x5f/0x80 [libcfs]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88823c5f>] lbug_with_loc+0x7f/0xd0 [libcfs]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8882f191>] libcfs_assertion_failed+0x61/0x70 [libcfs]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88954e7c>] cl_locks_prune+0x14c/0x210 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88831aec>] cfs_hash_bd_from_key+0x3c/0xc0 [libcfs]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8894a992>] cl_object_kill+0x82/0x90 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88bba781>] lov_delete_raid0+0x141/0x300 [lov]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88943fa1>] lu_obj_hop_hash+0x131/0x240 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88bbb7df>] lov_object_delete+0xcf/0x150 [lov]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88942709>] lu_object_free+0x89/0x190 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88833c8d>] cfs_hash_hd_hnode_del+0xd/0x50 [libcfs]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88943613>] lu_object_put+0x1c3/0x1e0 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88949629>] cl_object_put+0x9/0x10 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c8b133>] cl_inode_fini+0x1d3/0x240 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88cf30cf>] lmv_change_cbdata+0x55f/0x580 [lmv]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c53a60>] null_if_equal+0x0/0x50 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c63248>] ll_clear_inode+0x838/0xc70 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c56c10>] ll_delete_inode+0x0/0x600 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff80023be7>] clear_inode+0xda/0x12d
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c5718a>] ll_delete_inode+0x57a/0x600 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c56c10>] ll_delete_inode+0x0/0x600 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff800309d8>] generic_delete_inode+0xc9/0x147
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8003b777>] generic_drop_inode+0x15/0x15f
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8002c25c>] iput+0x85/0x8a
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88c289ab>] ll_d_iput+0x4b/0x60 [lustre]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff80036aa8>] dentry_iput+0x8c/0xae
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8000d890>] dput+0xf7/0x115
Jan 31 14:29:31 spoon02 kernel: [<ffffffff800131c5>] __fput+0x19c/0x1bd
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8002e001>] fput+0x14/0x16
Jan 31 14:29:31 spoon02 kernel: [<ffffffff800249d3>] filp_close+0x65/0x70
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8003aa87>] put_files_struct+0x6b/0xb3
Jan 31 14:29:31 spoon02 kernel: [<ffffffff80015fb2>] do_exit+0x38c/0x9c0
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8004b974>] cpuset_exit+0x0/0x8f
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8002c6d6>] get_signal_to_deliver+0x475/0x4a7
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8006119f>] sysret_signal+0x1c/0x27
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8005e0d0>] do_notify_resume+0xa4/0x84e
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8016685c>] list_add+0xc/0xe
Jan 31 14:29:31 spoon02 kernel: [<ffffffff88948d68>] cl_env_put+0x288/0x2c0 [obdclass]
Jan 31 14:29:31 spoon02 kernel: [<ffffffff80017267>] vfs_write+0xcf/0x175
Jan 31 14:29:31 spoon02 kernel: [<ffffffff8006119f>] sysret_signal+0x1c/0x27
Jan 31 14:29:31 spoon02 kernel: [<ffffffff80061427>] ptregscall_common+0x67/0xac
Jan 31 14:29:31 spoon02 kernel:

> Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(file.c:157:ll_close_inode_openhandle()) Skipped 1 previous similar message
> Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(cl_lock.c:2082:cl_locks_prune()) ASSERTION(lock->cll_users == 0) failed
> Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(cl_lock.c:2082:cl_locks_prune()) LBUG

It may be caused by AGL (async glimpse lock), which holds a user count on the lock for later processing of the AGL RPC reply.
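
For illustration only, here is a minimal, self-contained C sketch of the suspected sequence. The identifiers cl_lock, cll_users, cl_locks_prune() and LASSERT appear in the stack trace above; the struct layout and function bodies are simplified stand-ins, not the actual Lustre implementation.

#include <assert.h>

#define LASSERT(cond) assert(cond)      /* stand-in for the libcfs macro */

struct cl_lock {
        int cll_users;                  /* active users of this lock */
};

/* AGL path: the async glimpse takes a user reference before the RPC is
 * sent; it is only dropped later, by the reply upcall (osc_lock_upcall()). */
static void agl_enqueue(struct cl_lock *lock)
{
        lock->cll_users++;              /* held across the async RPC */
        /* ... send the glimpse RPC; the reply upcall does cll_users-- ... */
}

/* Prune path, reached from the last iput() via cl_object_kill(): it
 * assumes no lock on the object is still in active use. */
static void cl_locks_prune(struct cl_lock *lock)
{
        LASSERT(lock->cll_users == 0);  /* the assertion at cl_lock.c:2082 */
        /* ... destroy the lock ... */
}

int main(void)
{
        struct cl_lock lock = { .cll_users = 0 };

        agl_enqueue(&lock);     /* AGL reply is still pending ... */
        cl_locks_prune(&lock);  /* ... so the assertion fires, as in the LBUG */
        return 0;
}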



 Comments   
Comment by nasf (Inactive) [ 02/Feb/12 ]

The patch for this bug:

http://review.whamcloud.com/#change,2079

The AGL sponsor holds a user reference count on the cl_lock before triggering the AGL RPC. That user reference count is released later by the AGL RPC reply upcall (osc_lock_upcall()). This AGL mechanism conflicts with cl_locks_prune(), which requires that no lock be in active use when the last iput() is called.

So a new cl_lock operation, cl_lock_operations::clo_abort(), is introduced to resolve the above conflict as follows:

1) If cl_locks_prune() finds a lock that is still in use, it calls clo_abort() to abort the other pending cl_lock operations on that lock. At present this mainly handles the pending AGL RPC reply upcall.

2) The AGL RPC reply upcall first checks whether the lock has been aborted. If so, it does nothing; otherwise, it sets a flag so that cl_locks_prune() can no longer abort it, and then processes the upcall as normal (see the sketch below).
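
For illustration only, a minimal C sketch of that handshake. The operation cl_lock_operations::clo_abort() and the upcall osc_lock_upcall() are named above; the flag names, struct layout and function bodies are assumptions made for the sketch (locking omitted), not the code from the landed patch.

enum ols_flags {
        OLS_AGL_ABORTED = 1 << 0,       /* hypothetical: pruned, skip upcall */
        OLS_IN_UPCALL   = 1 << 1,       /* hypothetical: upcall in progress  */
};

struct osc_lock {
        int ols_flags;
};

struct cl_lock_operations {
        /* new op: abort other pending cl_lock operations on the lock */
        void (*clo_abort)(struct osc_lock *ols);
};

/* 1) Prune side: instead of asserting cll_users == 0, abort the pending
 *    AGL reply upcall, unless it has already started running. */
static void cl_locks_prune_one(struct osc_lock *ols,
                               const struct cl_lock_operations *ops)
{
        if (!(ols->ols_flags & OLS_IN_UPCALL))
                ops->clo_abort(ols);
}

static void osc_lock_abort(struct osc_lock *ols)
{
        ols->ols_flags |= OLS_AGL_ABORTED;      /* tell the upcall to bail out */
}

/* 2) Upcall side: check for abort first; otherwise flag the upcall as
 *    running so cl_locks_prune() can no longer abort it. */
static int osc_lock_upcall(struct osc_lock *ols, int errcode)
{
        if (ols->ols_flags & OLS_AGL_ABORTED)
                return 0;                       /* pruned meanwhile: do nothing */

        ols->ols_flags |= OLS_IN_UPCALL;
        /* ... process the glimpse reply as normal, then drop cll_users ... */
        return errcode;
}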

Comment by James A Simmons [ 03/Feb/12 ]

I'm in the process of testing your patch.

Comment by James A Simmons [ 03/Feb/12 ]

The first problem I saw was a regression in sanity test 132, which now fails. The second problem is that racer now fails in another way: this time I'm seeing client nodes just rebooting. I attached a file called oops with the details of the run. The server side seems normal; only the clients had problems.

Comment by nasf (Inactive) [ 05/Feb/12 ]

Can you paste the test_132 failure log? I cannot reproduce the failure myself.

As for the client reboot, from the "oops" log it is difficult to say what caused the rebooting. I suspect there was some memory corruption on the client, but it was not captured in the "oops" file. Would you please collect more crash logs through crash or a serial port configuration?

Comment by James A Simmons [ 07/Feb/12 ]

With your latest patch I still see the sanity test 132 failure most of the time; once in a while it does pass. As for racer, I stopped seeing the oops and the test does complete, but now the various dd processes on all the nodes are still running, which prevents the test suite from going on. I'm going to let it run for several hours tonight to see if the dd processes ever quit. If they don't, then we have a locking issue.

Comment by nasf (Inactive) [ 08/Feb/12 ]

Can you test Lustre without my patch in the same test environment to verify whether sanity test_132 still fails? If it still fails, please open a new Jira ticket for the test_132 failure.

Comment by James A Simmons [ 08/Feb/12 ]

Yes, the test 132 bug still exists without the patch. I will reopen LU-582.

Comment by Peter Jones [ 13/Feb/12 ]

Landed for 2.2

Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/include/cl_object.h
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/include/cl_object.h
  • lustre/osc/osc_request.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/ldlm/ldlm_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/cl_object.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_request.c
  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/cl_object.h
  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/osc/osc_request.c
  • lustre/include/lustre_dlm.h
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/osc/osc_request.c
  • lustre/osc/osc_lock.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/include/cl_object.h
  • lustre/include/lustre_dlm.h
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/include/cl_object.h
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/cl_object.h
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/include/cl_object.h
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/lustre_dlm.h
  • lustre/include/cl_object.h
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/osc/osc_request.c
  • lustre/include/cl_object.h
  • lustre/obdclass/cl_lock.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/osc/osc_lock.c
  • lustre/include/lustre_dlm.h
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/osc/osc_request.c
  • lustre/include/cl_object.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/osc/osc_request.c
  • lustre/include/cl_object.h
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » i686,server,el5,ofa #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/osc/osc_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/include/cl_object.h
  • lustre/obdclass/cl_lock.c
  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 13/Feb/12 ]

Integrated in lustre-master » i686,client,el5,ofa #468
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = SUCCESS
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/cl_object.h
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #480
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = FAILURE
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/include/cl_object.h
  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #480
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = FAILURE
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/osc/osc_request.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/obdclass/cl_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/include/cl_object.h
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #480
LU-1061 agl: cl_locks_prune() waits for the last user (Revision f0d608786a27dfb8dddf06d6b086b491749557f1)

Result = ABORTED
Oleg Drokin : f0d608786a27dfb8dddf06d6b086b491749557f1
Files :

  • lustre/osc/osc_request.c
  • lustre/obdclass/cl_lock.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/include/lustre_dlm.h
  • lustre/osc/osc_lock.c
  • lustre/include/cl_object.h