[LU-13189] ASSERTION( obj->oo_with_projid ) failed with 2.12.3 Created: 02/Feb/20  Updated: 05/Apr/23  Resolved: 11/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3, Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.1

Type: Bug Priority: Major
Reporter: Shane Nehring Assignee: Dongyang Li
Resolution: Fixed Votes: 1
Labels: None
Environment:

rhel 7.7 zfs-0.8.2 kernel 3.10.0-1062.9.1.el7.x86_64


Issue Links:
Related
is related to LU-15640 Setting project ID for existing direc... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are seeing a crash fairly frequently on one of our OSSes:

[Feb 1 19:45] Lustre: work2-OST0002: Recovery over after 0:56, of 20 clients 20 recovered and 0 were evicted.
[ +0.000279] Lustre: work2-OST0002: deleting orphan objects from 0x0:268076412 to 0x0:268081537
[ +0.198454] LustreError: 14123:0:(osd_object.c:1345:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
[ +0.000046] LustreError: 14123:0:(osd_object.c:1345:osd_attr_set()) LBUG
[ +0.000064] Pid: 14123, comm: ll_ost_io01_013 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019
[ +0.000035] Call Trace:
[ +0.000018] [<ffffffffc10e87cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ +0.001388] [<ffffffffc10e887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ +0.001275] [<ffffffffc179b458>] osd_attr_set+0xdd8/0xe50 [osd_zfs]

Message from syslogd@rit-ost1.las.iastate.edu at Feb 1 19:45:04 ...
kernel:LustreError: 14123:0:(osd_object.c:1345:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed:
[ +0.001272] [<ffffffffc190e622>] ofd_commitrw_write+0x13c2/0x1d40 [ofd]
[ +0.001274] [<ffffffffc191212c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[ +0.001255] [<ffffffffc15ad0fa>] tgt_brw_write+0x10ba/0x1ce0 [ptlrpc]
[ +0.001586] [<ffffffffc15ab2ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ +0.001574] [<ffffffffc155029b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ +0.001555] [<ffffffffc1553bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[ +0.001560] [<ffffffffb28c61f1>] kthread+0xd1/0xe0
[ +0.001500] [<ffffffffb2f8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[ +0.001487] [<ffffffffffffffff>] 0xffffffffffffffff
[ +0.001498] Kernel panic - not syncing: LBUG

 

Not sure what exactly is causing it. The stack trace is from after the server reboots: as soon as recovery finishes and I/O starts again, it happens. Originally I thought it was related to the recovery process and that aborting recovery would work around it, but it still occurs. I'm not sure whether a particular file or an I/O pattern is triggering it, and I've not been able to narrow it down to a specific job in our environment.

 Comments   
Comment by Shane Nehring [ 12/Feb/20 ]

I ended up undefining ZFS_PROJINHERIT and recompiling so I could get the oss to stay up. It doesn't look like this code was touched in 2.12.4 (I had tried RC1 when I was running into this issue and it still occurred).

Comment by Shane Nehring [ 20/Feb/20 ]

Please let me know if you need any more information

Comment by Shane Nehring [ 12/Aug/20 ]

Had this start showing up on another OSS/OST. Implemented the same workaround there.

Comment by Darby Vicker [ 30/Sep/21 ]

Shane, we are running into this same issue with our lustre file system - we are running 2.14.  At what level did you do the #undef ZFS_PROJINHERIT?  Just in that source file or for the whole Lustre build?

Comment by Shane Nehring [ 30/Sep/21 ]

What I've done is add #undef ZFS_PROJINHERIT to lustre/osd-zfs/osd_internal.h

 

diff --git a/lustre/osd-zfs/osd_internal.h b/lustre/osd-zfs/osd_internal.h
index ae21447..58ef131 100644
--- a/lustre/osd-zfs/osd_internal.h
+++ b/lustre/osd-zfs/osd_internal.h
@@ -55,6 +55,7 @@
 #include <sys/dbuf.h>
 #include <sys/dmu_objset.h>
 #include <lustre_scrub.h>
+#undef ZFS_PROJINHERIT
 
 /**
  * By design including kmem.h overrides the Linux slab interfaces to provide 

 

Which keeps things up. It will at the very least make project quotas non functional, I believe.
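As a rough illustration of why this one-line change is enough: the workaround only helps if the project-ID handling in osd-zfs is wrapped in `#ifdef ZFS_PROJINHERIT`, which is an assumption here and not something quoted from the Lustre source. The sketch below is a stand-alone model of that mechanism; the macro value and the `projid_path_compiled` helper are hypothetical.

```c
/* Stand-in for the flag the ZFS headers define when project quota
 * support is available. The value is illustrative only. */
#define ZFS_PROJINHERIT 1

/* The workaround: drop the macro before any code that tests it. */
#undef ZFS_PROJINHERIT

/* Hypothetical helper, not a Lustre function: reports whether the
 * project-ID code path was compiled in. */
static int projid_path_compiled(void)
{
#ifdef ZFS_PROJINHERIT
	return 1;	/* project-ID handling (and its assertion) is built */
#else
	return 0;	/* project-ID handling is compiled out entirely */
#endif
}
```

With the `#undef` in place the helper returns 0, which models both the assertion disappearing and project quotas no longer functioning.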

Comment by Robert Redl [ 10/Mar/22 ]

We also ran into the same issue with 2.14:

kernel:LustreError: 24469:0:(osd_object.c:1353:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed: 
kernel:LustreError: 24469:0:(osd_object.c:1353:osd_attr_set()) LBUG

Is there any known workaround without recompiling?
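For anyone scanning logs for this failure: the reported line number shifts between releases (1345 in the 2.12.3 trace above, 1353 here), so matching on the function name and assertion text is more reliable than matching the whole line. A minimal C sketch of that idea follows; `struct lbug_site` and `parse_lbug_line` are hypothetical names, not part of Lustre, and the meaning of the second numeric field is not assumed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for the location embedded in a LustreError line. */
struct lbug_site {
	int  pid;
	int  line;
	char file[64];
	char func[64];
};

/* Returns 1 if msg is an oo_with_projid assertion failure and fills
 * site; returns 0 otherwise. Any prefix before "LustreError:" (such as
 * a timestamp or host name) is skipped. */
static int parse_lbug_line(const char *msg, struct lbug_site *site)
{
	const char *p = strstr(msg, "LustreError: ");
	int ignored;	/* second numeric field, not interpreted here */

	if (p == NULL)
		return 0;
	if (sscanf(p, "LustreError: %d:%d:(%63[^:]:%d:%63[^(]",
		   &site->pid, &ignored, site->file, &site->line,
		   site->func) != 5)
		return 0;
	return strstr(p, "ASSERTION( obj->oo_with_projid ) failed") != NULL;
}
```

The companion `LBUG` line shares the same location prefix but deliberately does not match, so each crash is counted once.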

Comment by Darby Vicker [ 10/Mar/22 ]

We never found another workaround besides recompiling.  I tried reaching out on the mailing list but didn't get any responses.  

 

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2021-September/017791.html

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2021-October/017794.html

 

 

Comment by Shane Nehring [ 10/Mar/22 ]

Had either of you ever enabled project quotas?

Comment by Robert Redl [ 10/Mar/22 ]

We migrated recently from 2.12.8 to 2.14.0. On this occasion we activated project quotas but have not actually used them yet. My current solution is:

  • project quotas disabled
  • all clients evicted

That broke the cycle of reboot, recovery, and kernel panic. During the last few hours everything worked fine. But we actually would like to use project quotas in future.

Comment by Shane Nehring [ 10/Mar/22 ]

Do you have the kernel patches for lustre applied on the oss?

I believe, at least in my case, that this is the result of enabling project quotas with kernel version < 4.5 without the lustre kernel patches.

Comment by Robert Redl [ 11/Mar/22 ]

We are using the ZFS backend for MDT and OST. All servers are installed with the packages from the official repository. Versions:

  • Kernel: 4.18.0-240.1.1.el8_lustre.x86_64
  • Lustre: 2.14.0-1.el8 (kmod)
  • ZFS: 2.0.0-1.el8 (kmod)

On a second identical system, project quotas have been in use for a few weeks without any problems.

Comment by Darby Vicker [ 11/Mar/22 ]

No, we are not patching the kernel. I realized after the fact that project quota won't work without the patched kernel. I'm still a little concerned as to why this would panic the OSS. I would like to know how to clear the project IDs from our OSS so we could go back to the unmodified Lustre source.

We are also using ZFS for our MDT and OSTs. Our servers are CentOS 7.9 with kernel 3.10.0-1160.31.1.el7.x86_64, lustre-2.14.0_1.el7, zfs 2.0.5.

Comment by Shane Nehring [ 11/Mar/22 ]

Hmm, that you're seeing this on a rhel 8 kernel kinda shoots that idea down. Unless you're hitting it by some other means.

 

I've recently explicitly disabled project quotas. I'll be doing an upgrade to 2.12.8 on Monday, where I plan to leave out my workaround to see if that resolves it. It's difficult for us to tell though, as it can take a day or two for someone to start hitting a file that has this problem.

Comment by Shane Nehring [ 14/Mar/22 ]

I actually just thought to look at the configuration logs for the OSTs, and it doesn't look like I ever actually enabled project quotas. So this may be something waiting in the wings that a filesystem can hit regardless of whether project quotas were ever enabled.

Comment by Robert Redl [ 16/Mar/22 ]

Two new observations:

  • project quotas work fine on the same server for a different OST.
  • disabling project quotas does not help. The system ran fine with project quotas disabled for the last five days, but today the same issue occurred again.

Comment by Robert Redl [ 05/Jun/22 ]

The problem unfortunately persists with Lustre 2.15.0-RC5 and ZFS 2.0.7:

Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 809221:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed: 
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 809016:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed: 
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 808404:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed: 
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 808404:0:(osd_object.c:1300:osd_attr_set()) LBUG
Jun 05 17:03:04 z-ha-oss02b kernel: Pid: 808404, comm: ll_ost02_001 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022
Jun 05 17:03:04 z-ha-oss02b kernel: Call Trace TBD:
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] osd_attr_set+0xe3f/0xed0 [osd_zfs]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ofd_attr_set+0x638/0x1080 [ofd]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ofd_setattr_hdl+0x454/0x8d0 [ofd]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] kthread+0x10a/0x120
Jun 05 17:03:04 z-ha-oss02b kernel: [<0>] ret_from_fork+0x35/0x40
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: dumping log to /tmp/lustre-log.1654441384.808404
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 809221:0:(osd_object.c:1300:osd_attr_set()) LBUG
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 822248:0:(osd_object.c:1300:osd_attr_set()) ASSERTION( obj->oo_with_projid ) failed: 
Jun 05 17:03:04 z-ha-oss02b kernel: LustreError: 822248:0:(osd_object.c:1300:osd_attr_set()) LBUG

Comment by Peter Bortas [ 22/Jun/22 ]

NSC hit this bug yesterday. We have been running Rocky 8, Lustre 2.14 with ZFS 2.1.x on a few filesystems since March because we wanted dRAID support on the OSSs. We are not using project quota.

When upgrading the remaining filesystems to 2.14 with OpenZFS 2.1.4 yesterday, it ran for 6h before five of the newly upgraded OSSs PANICed, and then re-PANICed pretty quickly after reboot. After applying Shane's fix things remained stable overnight.

Today one OSS for another filesystem PANICed, so we decided to apply the fix to all remaining servers, including the MDSs.

This is just to reaffirm that more people are seeing this and to offer thanks to Shane for sharing the workaround!

Comment by Gerrit Updater [ 23/Jun/22 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47709
Subject: LU-13189 osd-zfs: fix assert on oo_with_projid
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7d631c2f7f45caa36fbdc73b9d83bd98b43edd42

Comment by Robert Redl [ 23/Jun/22 ]

Thanks a lot for the patch!

Would this now only result in not hitting the LASSERT with an old object anymore, or would this now also result in the old object being updated on disk with an included project ID?

Comment by Fredrik Nyström [ 23/Jun/22 ]

Thanks for the patch.

Running tests on a non production filesystem at NSC (Rocky 8.6 + ZFS 2.1.5 + Lustre 2.14.0).

Comment by Dongyang Li [ 24/Jun/22 ]

I've updated patch https://review.whamcloud.com/47709

to add project id for old objects as well.

Comment by Robert Redl [ 24/Jun/22 ]

Is the update of old objects also done for objects on a ZFS based MDT? Would https://jira.whamcloud.com/browse/LU-15640 also be solved by this patch?

Comment by Dongyang Li [ 24/Jun/22 ]

Yes it's for both MDT and OST.

I think it should let you set the project ID on old dirs now. If you could test and give some feedback, that would be great.

BTW, was zpool upgrade used during the ZFS upgrade? What does zpool status -v show?

Comment by Robert Redl [ 24/Jun/22 ]

Thank you very much! I will report back after a test.

About the zpool: it was created on new hardware on ZFS 2.0.0, so it did have project quotas enabled by default. But the datasets were copied over from old hardware with zfs send/recv and did not have project quotas before. I tried to upgrade the datasets with zfs upgrade, but that did not have any effect.

Comment by Robert Redl [ 27/Jun/22 ]

After applying the patch to MDTs and OSTs, project quotas work as expected on a system that was migrated with zfs send/recv. Setting the project ID on old directories and files that existed before the migration no longer fails.

Thanks a lot, @dongyang!

Comment by Dongyang Li [ 27/Jun/22 ]

Thanks for the feedback Robert.

Good to know setting project ID is not failing. If you get the project id after setting on old files and dirs, it's showing the expected one right?

Could you also verify after setting the project id on the old files/dirs, the project quota accounting is showing the expected numbers - they should reflect the old dirs/files?

Cheers

Dongyang

Comment by Robert Redl [ 27/Jun/22 ]

Yes, I can confirm that after setting the project ID it is correctly shown by both lfs project and lsattr -p. New files created in an old directory with the inherit flag set also correctly inherit the project ID.

The project quota also shows expected values.

Comment by Dongyang Li [ 28/Jun/22 ]

Great, thanks for the update Robert.

Comment by Gerrit Updater [ 30/Jun/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47846
Subject: LU-13189 osd-zfs: add project id for old objects without ZFS_PROJID
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 26539cc3744155c6b6ad89fc0b5ef1413a8beb14

Comment by Gerrit Updater [ 11/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47709/
Subject: LU-13189 osd-zfs: add project id for old objects without ZFS_PROJID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ec79791a7cda5b66649200b16a70167d86059e65

Comment by Peter Jones [ 11/Jul/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 11/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47846/
Subject: LU-13189 osd-zfs: add project id for old objects without ZFS_PROJID
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 5a5dad1bc0147b63f377168dde3fe799156a5abd

Comment by Kaizaad Bilimorya [ 11/Jul/22 ]

We just hit this today and Shane's

#undef ZFS_PROJINHERIT

patch seemed to fix it (thanks so much Shane!). Note that we don't have project quotas enabled.

CentOS Linux release 7.9.2009 (Core)
Kernel 3.10.0-1160.49.1.el7_lustre.x86_64
Lustre 2.12.9
MDT - ldiskfs
OSTs - zfs-0.8.6

We have only been running with these versions for ~3 weeks. The OSTs were upgraded from zfs 0.7.13 and we did run "zpool upgrade ostpool".

 

-k
