[LU-9428] ASSERTION( de->d_op == &ll_d_ops) Created: 02/May/17  Updated: 29/Jun/17  Resolved: 29/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frederik Ferner (Inactive) Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None
Environment:

RHEL6 server


Issue Links:
Duplicate
duplicates LU-9421 minor improvement on the implementati... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have recently seen frequent occurrences of the LBUG below.

The affected machines are all exporting our Lustre file system via NFS to other Linux machines.

May  2 06:59:03 i05-storage1 kernel: LustreError: 3023:0:(dcache.c:236:ll_d_init()) ASSERTION( de->d_op == &ll_d_ops ) failed: 
May  2 06:59:03 i05-storage1 kernel: LustreError: 3023:0:(dcache.c:236:ll_d_init()) LBUG
May  2 06:59:03 i05-storage1 kernel: Pid: 3023, comm: nfsd
May  2 06:59:03 i05-storage1 kernel: 
May  2 06:59:03 i05-storage1 kernel: Call Trace:
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0383895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0383e97>] lbug_with_loc+0x47/0xb0 [libcfs]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa097e69f>] ll_d_init+0x2ff/0x540 [lustre]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa09c1b5b>] ll_iget_for_nfs+0x20b/0x300 [lustre]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa09c1d89>] ll_fh_to_dentry+0x99/0xa0 [lustre]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0b3871c>] exportfs_decode_fh+0x5c/0x2bc [exportfs]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bcc8e0>] ? nfsd_acceptable+0x0/0x120 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0b56da0>] ? cache_check+0x60/0x370 [sunrpc]
May  2 06:59:03 i05-storage1 kernel: [<ffffffff8117f76b>] ? cache_alloc_refill+0x15b/0x240
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bccdda>] fh_verify+0x32a/0x640 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bcfda1>] nfsd_open+0x31/0x240 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bd022b>] nfsd_commit+0x3b/0xa0 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffff810aff24>] ? groups_free+0x54/0x60
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bd769d>] nfsd3_proc_commit+0x9d/0x100 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bc9405>] nfsd_dispatch+0xe5/0x230 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0b4ccf4>] svc_process_common+0x344/0x640 [sunrpc]
May  2 06:59:03 i05-storage1 kernel: [<ffffffff8106c500>] ? default_wake_function+0x0/0x20
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0b4d390>] svc_process+0x110/0x160 [sunrpc]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bc9c82>] nfsd+0xc2/0x160 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffffa0bc9bc0>] ? nfsd+0x0/0x160 [nfsd]
May  2 06:59:03 i05-storage1 kernel: [<ffffffff810a640e>] kthread+0x9e/0xc0
May  2 06:59:03 i05-storage1 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
May  2 06:59:03 i05-storage1 kernel: [<ffffffff810a6370>] ? kthread+0x0/0xc0
May  2 06:59:03 i05-storage1 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
May  2 06:59:03 i05-storage1 kernel: 

This looks similar to LU-9241, but the stack trace is not quite the same. Also, the patch there is against master while we are running b2_7_fe, so we would need a fix for that branch.

We are still investigating the events leading to the crash and hope to find a reproducer.
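
While looking for a pattern, a small script along the following lines could tally the LBUG occurrences per host and pull out the reliable call-trace frames for comparison between crashes. This is only a sketch: the function name summarize_lbugs and the regular expressions are illustrative assumptions derived from the syslog format shown above, not anything shipped with Lustre, and they may need adjusting for other log formats.

```python
import re
from collections import Counter

# Assumed format, based on the log above, e.g.:
# "May  2 06:59:03 i05-storage1 kernel: LustreError: 3023:0:(dcache.c:236:ll_d_init()) LBUG"
LBUG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LustreError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+LBUG\s*$"
)

# Call-trace frames, e.g.:
# "... kernel: [<ffffffffa097e69f>] ll_d_init+0x2ff/0x540 [lustre]"
# Frames prefixed with "?" are unreliable and are skipped.
FRAME_RE = re.compile(
    r"kernel:\s+\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<sym>[\w.]+)\+0x"
)


def summarize_lbugs(lines):
    """Return (Counter of LBUGs per host, list of reliable frame lists)."""
    per_host = Counter()
    traces = []
    current = None  # frame list of the trace being collected, if any
    for line in lines:
        m = LBUG_RE.match(line)
        if m:
            per_host[m.group("host")] += 1
            current = []
            traces.append(current)
            continue
        if current is None:
            continue
        f = FRAME_RE.search(line)
        if f:
            if not f.group("unreliable"):
                current.append(f.group("sym"))
        elif current:
            # First non-frame line after at least one frame ends the trace.
            current = None
    return per_host, traces
```

It could be run over a syslog file opened as an iterable of lines (for example /var/log/messages, assuming the default RHEL6 syslog location) and the resulting frame lists compared between occurrences.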



 Comments   
Comment by Peter Jones [ 02/May/17 ]

Lai

Can you please assist with this issue?

Thanks

Peter

Comment by Frederik Ferner (Inactive) [ 11/May/17 ]

Are there any updates on this? We are still seeing this frequently. Unfortunately we haven't been able to detect a pattern or develop a reproducer yet, but it is definitely affecting our users.

Thanks,
Frederik

Comment by Frederik Ferner (Inactive) [ 18/May/17 ]

I noticed that the patch in LU-9241 has been merged and appears to apply cleanly to our b2.7 tree.

Can we get feedback on whether it is safe to cherry-pick this commit and test it on our clients?

thanks,
Frederik

Comment by Lai Siyao [ 18/May/17 ]

Yes, it's safe to cherry-pick to 2.7. It's a trivial fix to client code.

Comment by Frederik Ferner (Inactive) [ 19/May/17 ]

Thanks for confirming. We have rebuilt our client with this patch applied and have started testing.

As we don't have a known reproducer and it seems quite unpredictable when it happens, it will take a while until we can be confident that this has fixed our problem. We'll report back.

Frederik

Comment by Peter Jones [ 21/Jun/17 ]

Frederik

Has this been long enough to ascertain whether the fix has helped?

Peter

Comment by Frederik Ferner (Inactive) [ 29/Jun/17 ]

Peter, All,

Apologies for the delay; I've been away.

Without a clear reproducer it is always going to be hard to be absolutely sure, and the problem seems to come and go in waves. However, we have so far not seen this problem on an NFS server running the patched version. So I feel confident in saying it's looking good so far; the patch certainly seems to have helped.

Thanks,
Frederik

Comment by Peter Jones [ 29/Jun/17 ]

Thanks, Frederik. Let's close out this ticket for now then, and open a new one if you do ever get a recurrence.
