[LU-14822] Panic at dnode.c leading to LNet service thread inactive Created: 06/Jul/21  Updated: 12/Jul/21  Resolved: 12/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Dneg (Inactive) Assignee: Peter Jones
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

EL7


Attachments: File lustre-logs.tar.gz     HTML File stack-trace    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On our MDS with a ZFS backing store, we're seeing a frequent issue that hangs the clients and produces the attached stack trace. I'll also attach the lustre-logs for the relevant time period of this morning's instance.

Once this has happened, it seems the only option to reconnect the client is a reboot of the MDS.

This may be related to the MDT filling up - we changed the zpool topology to increase the size of the MDT, and all seemed well for a few days before these issues started to occur.

I'm running an lfsck, which has so far made a large number of namespace repairs, but the problem has occurred again while it was running.

Any help, as always, much appreciated.



 Comments   
Comment by Dneg (Inactive) [ 06/Jul/21 ]

and so you're not just talking to a generic company name, Stephen here

Comment by Peter Jones [ 06/Jul/21 ]

Stephen

What ZFS version are you running? Could this be LU-13536? If so, moving to a newer ZFS version could address this issue for you.

Peter

Comment by Dneg (Inactive) [ 06/Jul/21 ]

Browsing around, it seems this might be related to the issue seen in LU-13536?

Is there a recommended package set for upgrading to ZFS 0.8.3 or newer (or a different version)? I may of course be barking up the wrong tree here.

Comment by Dneg (Inactive) [ 06/Jul/21 ]

Hi Peter. Your comment showed up as I clicked 'Add'. As mentioned, I'm happy to upgrade ZFS. Do you have a recommended method for that, or should I just grab the packages from the OpenZFS project?

Comment by Dneg (Inactive) [ 06/Jul/21 ]

For reference:

emds1 /tmp # rpm -qa | grep zfs
lustre-osd-zfs-mount-2.12.6-1.el7.x86_64
zfs-0.7.13-1.el7.x86_64
kmod-zfs-3.10.0-1160.2.1.el7_lustre.x86_64-0.7.13-1.el7.x86_64
libzfs2-0.7.13-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.6-1.el7.x86_64 
Comment by Peter Jones [ 06/Jul/21 ]

Yes - just grab the updated ZFS version from the ZoL site and rebuild. Several sites have done this successfully.
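
As a rough sketch of what that looks like (the 0.8.x version number and URL below are only an example, so check the release page for the one you actually want):

# Download and unpack a 0.8.x release tarball (version here is an assumption)
wget https://github.com/openzfs/zfs/releases/download/zfs-0.8.6/zfs-0.8.6.tar.gz
tar xzf zfs-0.8.6.tar.gz && cd zfs-0.8.6

# Release tarballs ship their own spec files, so the in-tree rpm targets can be
# used rather than hand-editing the old 0.7.13 spec
./configure
make -j$(nproc) rpm-utils rpm-kmod    # or "make rpm-dkms" for the dkms route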

Comment by Dneg (Inactive) [ 06/Jul/21 ]

Would I be correct in assuming that there's nothing particularly version-specific about the two lustre packages in that list (lustre-osd-zfs-mount and kmod-lustre-osd-zfs)?

I'll just replace zfs and libzfs2 with the newer ones and swap out the Lustre-provided kmod-zfs for zfs-dkms?

Comment by Dneg (Inactive) [ 06/Jul/21 ]

I've had a first pass at this and I'm afraid I'm going to have to ask for some step-by-step guidance here.

A quick rpmbuild -ba of the ZFS spec, with a simple swap of the version number in the spec file and the ZFS tar.gz source, is taking me down a road of missing files and so on. Do you have appropriate spec files for the newer versions? I also notice that the SPL source is no longer separate; I haven't tried rebuilding that yet.

Comment by Andreas Dilger [ 06/Jul/21 ]

Per the comments in LU-13536, there are two approaches to solving the crashes in that ticket - updating to ZFS 0.8.x, or patching ZFS 0.7.13 with the two referenced patches:

    78e213946 Fix dnode_hold() freeing dnode behavior
    58769a4eb Don’t allow dnode allocation if dn_holds != 0

Due to changes in ZFS between 0.7 and 0.8, if you do a ZFS upgrade you would need to rebuild all of the Lustre RPMs to get a new kmod-lustre-osd-zfs, since it links directly with the ZFS module. If you apply the patches to ZFS 0.7.13 you could very likely keep the existing Lustre RPMs, since that change is internal only.
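
A minimal sketch of that patch-and-rebuild route, assuming a git checkout (the branch name is made up, and the cherry-picks may need minor conflict resolution against the 0.7.13 code):

# Check out the 0.7.13 tag and apply the two dnode fixes listed above
git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git checkout -b zfs-0.7.13-dnode-fix zfs-0.7.13
git cherry-pick 78e213946 58769a4eb

# Rebuild the ZFS packages with the patches applied (a git checkout needs
# autogen.sh first; 0.7.x may also need --with-spl pointing at the SPL source)
./autogen.sh && ./configure
make -j$(nproc) rpm-utils rpm-kmod    # or "make rpm-dkms" for the dkms route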

Comment by Dneg (Inactive) [ 06/Jul/21 ]

Right.  Thanks.  Patching for now then.

Comment by Dneg (Inactive) [ 06/Jul/21 ]

Rebuilt zfs and zfs-dkms with those two patches. I'll schedule a reboot soon to pick up the patched versions. Happy for you to close this and I can re-open if necessary, or leave it as is for a few days while it's soak-tested - whichever is your preference.

Comment by Peter Jones [ 06/Jul/21 ]

Great. Why not just let us know after the weekend if there is a noticeable improvement (or sooner of course if there are still problems)?

Comment by Dneg (Inactive) [ 12/Jul/21 ]

Looking good.  Thanks for your help.

Comment by Peter Jones [ 12/Jul/21 ]

Excellent - thanks for the update
