[LU-13189] ASSERTION( obj->oo_with_projid ) failed with 2.12.3 Created: 02/Feb/20 Updated: 05/Apr/23 Resolved: 11/Jul/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3, Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shane Nehring | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
rhel 7.7 zfs-0.8.2 kernel 3.10.0-1062.9.1.el7.x86_64 |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Seeing a crash fairly frequently on one of our OSSes:
[Feb 1 19:45] Lustre: work2-OST0002: Recovery over after 0:56, of 20 clients 20 recovered and 0 were evicted.
Message from syslogd@rit-ost1.las.iastate.edu at Feb 1 19:45:04 ...
Not sure what exactly is causing it. The stack trace is from after the server reboots; the crash happens as soon as recovery finishes and I/O starts again. Originally I thought it was related to the recovery process and that aborting recovery would work around it, but it still occurs. I'm not really sure if it's a particular file or an I/O pattern that's leading to it, and I've not been able to narrow it down to a specific job in our environment.
|
| Comments |
| Comment by Shane Nehring [ 12/Feb/20 ] |
|
I ended up undefining ZFS_PROJINHERIT and recompiling so I could get the oss to stay up. It doesn't look like this code was touched in 2.12.4 (I had tried RC1 when I was running into this issue and it still occurred). |
| Comment by Shane Nehring [ 20/Feb/20 ] |
|
Please let me know if you need any more information |
| Comment by Shane Nehring [ 12/Aug/20 ] |
|
Had this start showing up on another OSS/OST. Implemented the same workaround there. |
| Comment by Darby Vicker [ 30/Sep/21 ] |
|
Shane, we are running into this same issue with our lustre file system - we are running 2.14. At what level did you do the #undef ZFS_PROJINHERIT? Just in that source file or for the whole Lustre build? |
| Comment by Shane Nehring [ 30/Sep/21 ] |
|
What I've done is add #undef ZFS_PROJINHERIT to lustre/osd-zfs/osd_internal.h
diff --git a/lustre/osd-zfs/osd_internal.h b/lustre/osd-zfs/osd_internal.h
index ae21447..58ef131 100644
--- a/lustre/osd-zfs/osd_internal.h
+++ b/lustre/osd-zfs/osd_internal.h
@@ -55,6 +55,7 @@
 #include <sys/dbuf.h>
 #include <sys/dmu_objset.h>
 #include <lustre_scrub.h>
+#undef ZFS_PROJINHERIT

 /**
  * By design including kmem.h overrides the Linux slab interfaces to provide
That keeps things up. It will at the very least make project quotas non-functional, I believe. |
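For readers unfamiliar with the mechanism, the workaround can be illustrated with a minimal standalone sketch. It assumes, as the workaround implies, that the project ID paths in osd-zfs are guarded by `#ifdef ZFS_PROJINHERIT`. This is not Lustre code: the macro value and the helper `projid_code_compiled_in` are hypothetical stand-ins.

```c
/* Stand-in for the definition that normally comes from the ZFS headers.
 * The value is hypothetical; only the fact that it is defined matters. */
#define ZFS_PROJINHERIT 1

/* The workaround: drop the macro after the headers have been included. */
#undef ZFS_PROJINHERIT

/* Hypothetical helper: reports whether project ID code was compiled in. */
static int projid_code_compiled_in(void)
{
#ifdef ZFS_PROJINHERIT
	return 1;	/* guarded project ID paths would be built */
#else
	return 0;	/* guarded paths are removed by the preprocessor */
#endif
}
```

With the `#undef` in place the helper returns 0, which mirrors why the assertion can no longer be reached and also why project quotas stop working.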
| Comment by Robert Redl [ 10/Mar/22 ] |
|
We also ran into the same issue with 2.14:
kernel:LustreError: 24469:0:(osd_object.c:1353:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
kernel:LustreError: 24469:0:(osd_object.c:1353:osd_attr_set()) LBUG
Is there any known workaround without recompiling? |
| Comment by Darby Vicker [ 10/Mar/22 ] |
|
We never found another workaround besides recompiling. I tried reaching out on the mailing list but didn't get any responses.
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2021-September/017791.html
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2021-October/017794.html
|
| Comment by Shane Nehring [ 10/Mar/22 ] |
|
Had either of you ever enabled project quotas? |
| Comment by Robert Redl [ 10/Mar/22 ] |
|
We migrated recently from 2.12.8 to 2.14.0. On this occasion we activated project quotas but have not actually used them yet. My current solution is:
That broke the cycle of reboot, recovery, and kernel panic. During the last few hours everything worked fine. But we actually would like to use project quotas in future. |
| Comment by Shane Nehring [ 10/Mar/22 ] |
|
Do you have the kernel patches for lustre applied on the oss? I believe, at least in my case, that this is the result of enabling project quotas with kernel version < 4.5 without the lustre kernel patches. |
| Comment by Robert Redl [ 11/Mar/22 ] |
|
We are using the ZFS backend for MDT and OST. All servers are installed with the packages from the official repository. Versions:
On a second, identical system, project quotas have already been in use for a few weeks without any problems. |
| Comment by Darby Vicker [ 11/Mar/22 ] |
|
No, we are not patching the kernel. I realized after the fact that project quota won't work without the patched kernel. I'm still a little concerned as to why this would panic the OSS. I would like to know how to clear the project IDs from our OSS so we could go back to the unmodified Lustre source. We are also using ZFS for our MDT and OSTs. Our servers are CentOS 7.9 with kernel 3.10.0-1160.31.1.el7.x86_64, lustre-2.14.0_1.el7, zfs 2.0.5. |
| Comment by Shane Nehring [ 11/Mar/22 ] |
|
Hmm, that you're seeing this on a rhel 8 kernel kinda shoots that idea down. Unless you're hitting it by some other means.
I've recently explicitly disabled project quotas, I'll be doing an upgrade to 2.12.8 on Monday where I plan to not include my workaround to see if that resolves it. It's difficult for us to tell though, as it can take a day or two for someone to start hitting a file that has this problem. |
| Comment by Shane Nehring [ 14/Mar/22 ] |
|
I actually just thought to look at the configuration logs for the OSTs, and it doesn't look like I ever actually enabled project quotas. So this may be something waiting in the wings that an FS can hit regardless of whether those were ever enabled or not. |
| Comment by Robert Redl [ 16/Mar/22 ] |
|
Two new observations:
|
| Comment by Robert Redl [ 05/Jun/22 ] |
|
The problem unfortunately persists with Lustre 2.15.0-RC5 and ZFS 2.0.7:
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 809221:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 809016:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 808404:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 808404:0:(osd_object.c:1300:osd_attr_set()) LBUG
Jun 05 17:03:04 z-ha-oss02b kernel: Pid: 808404, comm: ll_ost02_001 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022
Jun 05 17:03:04 z-ha-oss02b kernel: Call Trace TBD:
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] osd_attr_set+0xe3f/0xed0 [osd_zfs]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ofd_attr_set+0x638/0x1080 [ofd]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ofd_setattr_hdl+0x454/0x8d0 [ofd]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] kthread+0x10a/0x120
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ret_from_fork+0x35/0x40
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: dumping log to /tmp/lustre-log.1654441384.808404
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 809221:0:(osd_object.c:1300:osd_attr_set()) LBUG
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 822248:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 822248:0:(osd_object.c:1300:osd_attr_set()) LBUG |
| Comment by Peter Bortas [ 22/Jun/22 ] |
|
NSC hit this bug yesterday. We have been running Rocky 8, Lustre 2.14 with ZFS 2.1.x on a few filesystems since March because we wanted dRAID support on the OSSs. We are not using project quota. When upgrading the remaining filesystems to 2.14 with OpenZFS 2.1.4 yesterday, it ran for 6h before five of the newly upgraded OSSs PANICed, and then re-PANICed pretty quickly after reboot. After applying Shane's fix things remained stable overnight. Today one OSS for another filesystem PANICed, so we decided to apply the fix to all remaining servers, including the MDSs. This is just to reaffirm that more people are seeing this and to offer thanks to Shane for sharing the workaround! |
| Comment by Gerrit Updater [ 23/Jun/22 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47709 |
| Comment by Robert Redl [ 23/Jun/22 ] |
|
Thanks a lot for the patch! Would this now only result in not hitting the LASSERT with an old object anymore, or would this now also result in the old object being updated on disk with an included project ID? |
| Comment by Fredrik Nyström [ 23/Jun/22 ] |
|
Thanks for the patch. Running tests on a non production filesystem at NSC (Rocky 8.6 + ZFS 2.1.5 + Lustre 2.14.0). |
| Comment by Dongyang Li [ 24/Jun/22 ] |
|
I've updated patch https://review.whamcloud.com/47709 to add project id for old objects as well. |
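The behaviour change described in this comment can be modelled roughly as follows. This is an illustrative sketch only, not the code from https://review.whamcloud.com/47709: the struct, the function names, and the in-memory "upgrade" step are hypothetical stand-ins, and it assumes the assertion guards the project ID update in osd_attr_set(). It contrasts rejecting a project ID update on an old object with adding the project ID to that object first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an osd-zfs object. */
struct mock_obj {
	bool     oo_with_projid;	/* does the object carry a project ID? */
	uint32_t projid;		/* valid only when oo_with_projid is true */
};

/* Old behaviour, modelled as an error return instead of an LBUG:
 * objects without a project ID are rejected. */
static int set_projid_strict(struct mock_obj *obj, uint32_t projid)
{
	if (!obj->oo_with_projid)
		return -1;	/* models where ASSERTION( obj->oo_with_projid ) fired */
	obj->projid = projid;
	return 0;
}

/* Behaviour as described in the thread: add the project ID to the
 * old object first, then apply the update. */
static int set_projid_upgrading(struct mock_obj *obj, uint32_t projid)
{
	if (!obj->oo_with_projid)
		obj->oo_with_projid = true;	/* stands in for the on-disk upgrade */
	obj->projid = projid;
	return 0;
}
```

In this model an object created before project quota support fails the strict path but succeeds on the upgrading path, which matches the later reports that setting the project ID on old files and directories no longer fails.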
| Comment by Robert Redl [ 24/Jun/22 ] |
|
Is the update of old objects also done for objects on a ZFS based MDT? Would https://jira.whamcloud.com/browse/LU-15640 also be solved by this patch? |
| Comment by Dongyang Li [ 24/Jun/22 ] |
|
Yes, it's for both MDT and OST. I think it should let you set the project ID on old dirs now. If you could test and give some feedback, that would be great. BTW, was zpool upgrade used during the ZFS upgrade? What does zpool status -v show? |
| Comment by Robert Redl [ 24/Jun/22 ] |
|
Thank you very much! I will report back after a test. About the zpool: it was created on new hardware on ZFS 2.0.0, so it did have project quotas enabled by default. But the datasets have been copied over from old hardware with zfs send/recv and did not have project quotas before. I tried to upgrade the datasets with zfs upgrade, but that did not have any effect. |
| Comment by Robert Redl [ 27/Jun/22 ] |
|
After applying the patch to MDTs and OSTs, project quotas work as expected on a system that has been migrated with zfs send/recv. Setting the project ID on old directories and files that were there before the migration no longer fails. Thanks a lot, @dongyang! |
| Comment by Dongyang Li [ 27/Jun/22 ] |
|
Thanks for the feedback, Robert. Good to know setting the project ID is not failing. If you get the project ID after setting it on old files and dirs, it shows the expected one, right? Could you also verify that, after setting the project ID on the old files/dirs, the project quota accounting shows the expected numbers? They should reflect the old dirs/files. Cheers, Dongyang |
| Comment by Robert Redl [ 27/Jun/22 ] |
|
Yes, I can confirm that after setting the project ID it is also correctly shown by lfs project and also by lsattr -p. New files created in an old directory with the inherit flag set are also correctly inheriting the project ID. The project quota also shows expected values. |
| Comment by Dongyang Li [ 28/Jun/22 ] |
|
Great, thanks for the update Robert. |
| Comment by Gerrit Updater [ 30/Jun/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47846 |
| Comment by Gerrit Updater [ 11/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47709/ |
| Comment by Peter Jones [ 11/Jul/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 11/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47846/ |
| Comment by Kaizaad Bilimorya [ 11/Jul/22 ] |
|
We just hit this today and Shane's #undef ZFS_PROJINHERIT patch seemed to fix it (thanks so much Shane!). Note we don't have project quotas enabled.
CentOS Linux release 7.9.2009 (Core)
We have only been running with these versions for ~3 weeks. The OSTs were upgraded from zfs 0.7.13 and we did run "zpool upgrade ostpool"
-k |