[LU-14822] Panic at dnode.c leading to LNet service thread inactive Created: 06/Jul/21 Updated: 12/Jul/21 Resolved: 12/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Dneg (Inactive) | Assignee: | Peter Jones |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
EL7 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On our MDS with a ZFS backend, we're seeing a frequent issue that hangs the clients and shows the attached stack trace. I'll also attach the lustre-logs for the relevant time period of this morning's instance. Once this has happened, it seems the only option to reconnect the client is a reboot of the MDS. This may be related to the MDT filling up: we changed the zpool topology to increase the size of the MDT, and all seemed well for a few days before these issues started to occur. I'm running an lfsck, which has so far repaired a large number of namespaces, but the problem has occurred again while that was running. Any help is, as always, much appreciated. |
| Comments |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
and so you're not just talking to a generic company name, Stephen here |
| Comment by Peter Jones [ 06/Jul/21 ] |
|
Stephen What ZFS version are you running? Could this be Peter |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
Browsing around it seems this might be related to the issue seen in Is there a recommended package set to upgrade to ZFS 0.8.3 or newer/different? I may of course be barking up the wrong tree here. |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
Hi Peter. Your comment showed up as I clicked 'Add' |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
For reference:
emds1 /tmp # rpm -qa | grep zfs
lustre-osd-zfs-mount-2.12.6-1.el7.x86_64
zfs-0.7.13-1.el7.x86_64
kmod-zfs-3.10.0-1160.2.1.el7_lustre.x86_64-0.7.13-1.el7.x86_64
libzfs2-0.7.13-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.6-1.el7.x86_64 |
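As a rough illustration only, the package list in the comment above can be checked against the 0.8.3 threshold mentioned earlier in the thread. The helper names `zfs_version` and `older_than` are hypothetical, and the sketch assumes the userland package is named `zfs-<major>.<minor>.<patch>-...` as in this listing.

```python
import re

# Package names copied from the `rpm -qa | grep zfs` output in the comment above.
RPM_LINES = """\
lustre-osd-zfs-mount-2.12.6-1.el7.x86_64
zfs-0.7.13-1.el7.x86_64
kmod-zfs-3.10.0-1160.2.1.el7_lustre.x86_64-0.7.13-1.el7.x86_64
libzfs2-0.7.13-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.6-1.el7.x86_64
""".splitlines()


def zfs_version(lines):
    """Return the version of the userland `zfs` package as a tuple of ints.

    Only lines that start with `zfs-` are considered, so `kmod-zfs-...` and
    `libzfs2-...` are ignored. Returns None if no such line is found.
    """
    for line in lines:
        match = re.match(r"^zfs-(\d+)\.(\d+)\.(\d+)-", line.strip())
        if match:
            return tuple(int(part) for part in match.groups())
    return None


def older_than(version, minimum=(0, 8, 3)):
    """True if `version` sorts before `minimum` (tuple comparison)."""
    return version < minimum


if __name__ == "__main__":
    installed = zfs_version(RPM_LINES)
    print(installed, "older than 0.8.3:", older_than(installed))
```

Comparing versions as integer tuples avoids the string-comparison trap where "0.7.13" would sort before "0.7.2".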
| Comment by Peter Jones [ 06/Jul/21 ] |
|
Yes - just grab the updated ZFS version from the ZoL site and rebuild. Several sites have done this successfully. |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
Would I be correct in assuming that there's nothing particularly version specific about the two lustre packages in that list (lustre-osd-zfs-mount and kmod-lustre-osd-zfs)? I'll just replace zfs, libzfs2 with the newer ones and swap out the Lustre provided kmod-zfs with zfs-dkms? |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
Had a first pass at this and I'm afraid I think I'm gonna have to ask for some step-by-step here. A quick rpmbuild -ba of the ZFS spec with a simple swap out of the spec file version number and ZFS tar.gz source is taking me down a road of missing files and so on. Do you have appropriate spec files for the newer versions? I notice also that the SPL source is no longer separate. Haven't tried rebuilding that yet. |
| Comment by Andreas Dilger [ 06/Jul/21 ] |
|
Per the comments in 78e213946 Fix dnode_hold() freeing dnode behavior
58769a4eb Don’t allow dnode allocation if dn_holds != 0
Due to changes in ZFS between 0.7 and 0.8, a ZFS upgrade would require rebuilding all of the Lustre RPMs to get a new kmod-lustre-osd-zfs, since it links directly with the ZFS module. If you instead apply the patches to ZFS 0.7.13, you could very likely keep the existing Lustre RPMs, since that change is internal only. |
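The rule of thumb in the comment above can be written down as a small check. This is a simplification rather than an authoritative compatibility test: it assumes that a change of ZFS release series (for example 0.7 to 0.8) is what forces a rebuild of kmod-lustre-osd-zfs, and that a patched build of the same release does not. The function name `needs_lustre_rebuild` is hypothetical.

```python
def needs_lustre_rebuild(current, target):
    """Return True when the ZFS release series (major, minor) changes.

    `current` and `target` are version tuples such as (0, 7, 13).
    Assumption: staying on the same series with internal-only patches
    lets the existing Lustre RPMs keep working; moving to another series
    means kmod-lustre-osd-zfs must be rebuilt against the new ZFS module.
    """
    return current[:2] != target[:2]


if __name__ == "__main__":
    # Upgrading 0.7.13 -> 0.8.3 crosses a release series.
    print(needs_lustre_rebuild((0, 7, 13), (0, 8, 3)))
    # Patching 0.7.13 in place stays on the same series.
    print(needs_lustre_rebuild((0, 7, 13), (0, 7, 13)))
```

In practice the patched build should still be soak tested before the existing Lustre RPMs are trusted with it.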
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
Right. Thanks. Patching for now then. |
| Comment by Dneg (Inactive) [ 06/Jul/21 ] |
|
Rebuilt zfs and zfs-dkms with those two patches. I'll schedule a reboot soon to pick up the patched versions. Happy for you to close this and I can re-open if necessary or leave it as is for a few days while it's soak tested - whichever is your preference. |
| Comment by Peter Jones [ 06/Jul/21 ] |
|
Great. Why not just let us know after the weekend if there is a noticeable improvement (or sooner of course if there are still problems)? |
| Comment by Dneg (Inactive) [ 12/Jul/21 ] |
|
Looking good. Thanks for your help. |
| Comment by Peter Jones [ 12/Jul/21 ] |
|
Excellent - thanks for the update |