[LU-15635] ext4_(inc|dec)_count removed handle_t arg breaking 5.10 server Created: 09/Mar/22 Updated: 05/Jul/22 Resolved: 11/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shaun Tancheff | Assignee: | Shaun Tancheff |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Linux v5.9-rc7-8-g15ed2851b0f4 removed the handle_t argument from ext4_inc_count() and ext4_dec_count(). This breaks when the 'handle' is treated as the inode, with random crashes like the following:
PID: 1901 TASK: ffff8a4d4151c740 CPU: 0 COMMAND: "mount.lustre"
#0 [ffffafe480aef6b0] panic at ffffffffb0f54b17
/home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/kernel/panic.c: 360
#1 [ffffafe480aef750] no_context at ffffffffb066a2f9
/home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/arch/x86/mm/fault.c: 747
#2 [ffffafe480aef7b8] exc_page_fault at ffffffffb0f953c3
/home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/arch/x86/mm/fault.c: 1320
#3 [ffffafe480aef810] asm_exc_page_fault at ffffffffb1000ade
/home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/./arch/x86/include/asm/idtentry.h: 583
[exception RIP: inc_nlink+32]
RIP: ffffffffb0934f80 RSP: ffffafe480aef8c0 RFLAGS: 00010202
RAX: 0000000100037655 RBX: ffff8a4d4467ddc8 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff8a4d44792bd0 RDI: ffff8a4d4467ddc8
RBP: ffff8a4d44792bd0 R8: 000000000000004c R9: 0000000000000003
R10: 0000000000000000 R11: ffff8a4d41ba8700 R12: ffff8a4d41ba8700
R13: ffffafe480aefb40 R14: ffff8a4d43961800 R15: ffff8a4d41ba8b00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
/home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/./arch/x86/include/asm/atomic64_64.h: 102
#4 [ffffafe480aef8c0] ldiskfs_inc_count at ffffffffc0a4ca9e [ldiskfs]
#5 [ffffafe480aef8d0] osd_ref_add at ffffffffc15fcd65 [osd_ldiskfs]
#6 [ffffafe480aef8f8] __local_file_create at ffffffffc0cff324 [obdclass]
#7 [ffffafe480aef950] local_file_find_or_create at ffffffffc0cffd37 [obdclass]
#8 [ffffafe480aef9a0] mgs_fs_setup at ffffffffc1686712 [mgs]
#9 [ffffafe480aefa00] mgs_init0 at ffffffffc16821ad [mgs]
#10 [ffffafe480aefae0] mgs_device_alloc at ffffffffc1682c7a [mgs]
|
| Comments |
| Comment by James A Simmons [ 09/Mar/22 ] |
|
Yep I saw this with my UbuntuLTS 5.11 kernel testing. Do you have a fix? I haven't had time to work out a fix. |
| Comment by Andreas Dilger [ 09/Mar/22 ] |
|
It looks like a fundamental source of this bug is that "ldiskfs_inc_count()" is declared in lustre/osd-ldiskfs/osd_internal.h, while the function itself is exported from the ldiskfs module in ext4-misc.patch without a declaration. It would be better to remove the osd_internal.h declaration and put it into ext4.h in the patch, so that it is sure to remain consistent. It should never happen that functions are declared in a file that is not itself included where the function is implemented, exactly to catch issues like this at compile time rather than run time. I had a quick look through the rest of osd_internal.h and didn't see any other ldiskfs_* function declarations (though there are some inline functions and macros, but those are OK). |
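As a minimal sketch of the point above (not the actual ldiskfs or osd-ldiskfs code), the pattern below keeps a single declaration that is visible to both the implementation and the caller. All names here (demo_inode, demo_inc_count, demo_ref_add) are hypothetical stand-ins, and the three parts would normally live in a shared header, the implementing module, and the calling module respectively. They are shown in one file only to keep the example self-contained.

```c
/* Hypothetical stand-in for the kernel's struct inode. */
struct demo_inode {
	unsigned int i_nlink;
};

/*
 * Part 1: the shared declaration. In the arrangement suggested above this
 * would sit in the header that the implementation itself includes.
 */
void demo_inc_count(struct demo_inode *inode);

/*
 * Part 2: the implementation. Because the declaration above is in scope,
 * the compiler checks this definition against it. A stale declaration such
 * as
 *     void demo_inc_count(void *handle, struct demo_inode *inode);
 * visible here would be rejected with a "conflicting types" error at
 * compile time instead of failing at run time.
 */
void demo_inc_count(struct demo_inode *inode)
{
	inode->i_nlink++;
}

/*
 * Part 3: a caller, loosely modelled on the role osd_ref_add plays in the
 * backtrace. It sees the same declaration, so passing an extra leading
 * argument would not compile.
 */
unsigned int demo_ref_add(struct demo_inode *inode)
{
	demo_inc_count(inode);
	return inode->i_nlink;
}
```

When the caller instead carries its own private two-argument declaration, nothing checks it against the one-argument definition, and the first argument (the handle) ends up being used as the inode pointer, which matches the failure described in this ticket.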
| Comment by Gerrit Updater [ 10/Mar/22 ] |
|
"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46775 |
| Comment by Shaun Tancheff [ 10/Mar/22 ] |
|
I will rework the patch to fix ext4-misc.patch, probably for the 'newer' kernels first, reworking the older kernels later (there are 12 versions of this patch, some of the targets are a bit old, and I do not have active images to test some of them). |
| Comment by Andreas Dilger [ 10/Mar/22 ] |
|
IMHO the high-value targets are the newer kernels and the recent distro kernels - RHEL8.5/7.9, Ubuntu 20, SLES15. It is unlikely that older kernels would break from such a simple change, but at the same time moving the function declaration is mostly to avoid problems in the future, so there isn't a strict need to do it for all kernels. |
| Comment by Gerrit Updater [ 11/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46775/ |
| Comment by Peter Jones [ 11/Jun/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 24/Jun/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47731 |
| Comment by Gerrit Updater [ 05/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47731/ |