[LU-736] LBUG and kernel panic on client unmount Created: 04/Oct/11  Updated: 22/Jan/16  Resolved: 22/Jan/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Christopher Morrone
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

1.8.5.0-5chaos. https://github.com/chaos/lustre/tree/1.8.5.0-5chaos


Attachments: Text File sierra32_console.txt    
Issue Links:
Related
Severity: 3
Bugzilla ID: 23861
Rank (Obsolete): 9743

 Description   

We recently had a few hundred clients all LBUG and then kernel panic on unmount of a Lustre filesystem. All the ones that I checked have the same backtrace. See the attached sierra32_console.txt.

It looks like others have hit this in earlier 1.8 versions. See bugzilla.lustre.org bug 23861.



 Comments   
Comment by Christopher Morrone [ 04/Oct/11 ]

To make this issue more searchable, the LBUG is here:

2011-09-29 07:51:30 LustreError: 19065:0:(ldlm_lock.c:1568:ldlm_lock_cancel()) ### lock still has references ns: lsa-MDT0000-mdc-ffff810332040400 lock: ffff810263a92e00/0xb23761f5d085be87 lrc: 4/0,1 mode: PW/PW res: 578792285/4020328757 rrc: 2 type: FLK pid: 21451 [0->9223372036854775807] flags: 0x22002890 remote: 0x1f055096a089059 expref: -99 pid: 21451 timeout: 0
2011-09-29 07:51:30 LustreError: 19065:0:(ldlm_lock.c:1569:ldlm_lock_cancel()) LBUG

Comment by Peter Jones [ 04/Oct/11 ]

HongChao

Could you please look into this one?

Thanks

Peter

Comment by Peter Jones [ 13/Oct/11 ]

Hongchao

Could you please provide a status update?

Thanks

Peter

Comment by Hongchao Zhang [ 18/Oct/11 ]

The readers/writers count of an flock's LDLM lock does not drop to zero until the lock is canceled by an unlock request; during cleanup the reference is only dropped by "ldlm_flock_completion_ast" (called from "cleanup_resource") when the "LDLM_FL_LOCAL_ONLY|LDLM_FL_FAILED" flags are set on the LDLM lock.

In this case the lock's flags are "0x22002890", which contains LDLM_FL_FAILED but not LDLM_FL_LOCAL_ONLY, and during umount LDLM_FL_LOCAL_ONLY is set only if obd->obd_force is set.

So if flock LDLM locks are still held during umount and obd->obd_force is not set, this issue will be triggered.
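To make the failure mode easier to follow, here is a small standalone model of the decision described above. This is an illustrative sketch only: the flag bit values, the struct, and the function body are invented for the model and are not taken from the 1.8 source; only the decision logic follows the description.

/* model_lbug.c -- illustrative model of the umount cleanup decision.
 * NOT the Lustre source; flag values and types are placeholders. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_FL_FAILED      (1u << 0)   /* placeholder bit, not the real LDLM_FL_FAILED */
#define MODEL_FL_LOCAL_ONLY  (1u << 1)   /* placeholder bit, not the real LDLM_FL_LOCAL_ONLY */

struct model_lock {
    unsigned flags;
    int readers, writers;   /* an flock lock keeps a writer reference until unlocked */
};

/* Models cleanup_resource() during umount: LOCAL_ONLY is added only when
 * obd->obd_force is set (i.e. "umount -f"). */
static void model_cleanup_resource(struct model_lock *lk, bool obd_force)
{
    lk->flags |= MODEL_FL_FAILED;
    if (obd_force)
        lk->flags |= MODEL_FL_LOCAL_ONLY;

    /* The flock completion AST drops the reference only for LOCAL_ONLY|FAILED. */
    if ((lk->flags & (MODEL_FL_LOCAL_ONLY | MODEL_FL_FAILED)) ==
        (MODEL_FL_LOCAL_ONLY | MODEL_FL_FAILED))
        lk->writers = 0;

    /* Models the check in ldlm_lock_cancel(): "lock still has references" -> LBUG. */
    assert(lk->readers == 0 && lk->writers == 0);
}

int main(void)
{
    struct model_lock lk = { .flags = 0, .readers = 0, .writers = 1 };
    /* Plain umount: LOCAL_ONLY is never set, the assertion fires -- this is the LBUG.
     * Passing true (umount -f) instead would drop the reference and pass. */
    model_cleanup_resource(&lk, false);
    printf("no LBUG\n");
    return 0;
}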

Hi Chris,
Did you add the "-f" flag when you unmounted the Lustre client? Thanks.

Comment by Hongchao Zhang [ 18/Oct/11 ]

Hi Chris,
Could you please also check whether an application running on the Lustre client left some flocks unlocked?
I tested locally by deliberately leaving an flock unlocked and then unmounting Lustre, which triggered this LBUG.
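For reference, a minimal userspace sketch of that scenario follows. The file path, the build notes, and the exact unmount step are assumptions on my part, not something recorded in this ticket; it assumes a client mounted with the "flock" option.

/* flock_hold.c -- sketch of "leaving an flock unlocked" on a Lustre client.
 * Build:  gcc -o flock_hold flock_hold.c
 * Usage:  run on a client mounted with -o flock, then unmount the client
 *         (without -f) while the lock is outstanding. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/lustre/flock_test";   /* hypothetical path */

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Whole-file write lock, matching the PW [0->EOF] FLK lock in the LBUG message. */
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,          /* 0 == lock to end of file */
    };
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("fcntl(F_SETLKW)");
        return EXIT_FAILURE;
    }

    /* Exit without ever issuing F_UNLCK: the lock is never explicitly released. */
    printf("write lock taken on %s; exiting without unlocking\n", path);
    return EXIT_SUCCESS;
}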

Thanks

Comment by Christopher Morrone [ 20/Oct/11 ]

I will find out what the admins did to unmount Lustre.

It is going to be rather difficult to track down whether any of the various applications are using flock, and how. Most of our users won't know the answer to that, even if their application IS using flock.

Perhaps it is relevant that we are mounting with the "flock" option enabled.

Comment by Christopher Morrone [ 20/Oct/11 ]

As far as they can recall, they did not use the umount -f option.

Comment by Hongchao Zhang [ 13/Apr/12 ]

The initial patch is tracked at http://review.whamcloud.com/#change,2535

Comment by Christopher Morrone [ 13/Apr/12 ]

Thanks. FYI unless this is also a problem for 2.1, this ticket is very low priority compared to our many 2.1 bugs. We do not plan to fix any 1.8 bugs in production.

Comment by Hongchao Zhang [ 21/Jan/16 ]

Hi Chris,
Do you need any more work on this ticket? Or are we OK to close it? Thanks

Comment by Christopher Morrone [ 21/Jan/16 ]

This is so old that I think you can close it with resolution "Won't Fix".

Comment by Hongchao Zhang [ 22/Jan/16 ]

Chris, Thanks!
