[LU-15635] ext4_(inc|dec)_count removed handle_t arg breaking 5.10 server Created: 09/Mar/22  Updated: 05/Jul/22  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.1

Type: Bug Priority: Minor
Reporter: Shaun Tancheff Assignee: Shaun Tancheff
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14195 Support for linux kernel version 5.10 Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Linux v5.9-rc7-8-g15ed2851b0f4
ext4: remove unused argument from ext4_(inc|dec)_count

Callers built against the old two-argument prototype still pass the 'handle' first, so it is treated as the inode, causing random crashes like the following:

PID: 1901   TASK: ffff8a4d4151c740  CPU: 0   COMMAND: "mount.lustre"
 #0 [ffffafe480aef6b0] panic at ffffffffb0f54b17
    /home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/kernel/panic.c: 360
 #1 [ffffafe480aef750] no_context at ffffffffb066a2f9
    /home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/arch/x86/mm/fault.c: 747
 #2 [ffffafe480aef7b8] exc_page_fault at ffffffffb0f953c3
    /home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/arch/x86/mm/fault.c: 1320
 #3 [ffffafe480aef810] asm_exc_page_fault at ffffffffb1000ade
    /home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/./arch/x86/include/asm/idtentry.h: 583
    [exception RIP: inc_nlink+32]
    RIP: ffffffffb0934f80  RSP: ffffafe480aef8c0  RFLAGS: 00010202
    RAX: 0000000100037655  RBX: ffff8a4d4467ddc8  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: ffff8a4d44792bd0  RDI: ffff8a4d4467ddc8
    RBP: ffff8a4d44792bd0   R8: 000000000000004c   R9: 0000000000000003
    R10: 0000000000000000  R11: ffff8a4d41ba8700  R12: ffff8a4d41ba8700
    R13: ffffafe480aefb40  R14: ffff8a4d43961800  R15: ffff8a4d41ba8b00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    /home/shaun/rpmbuild/BUILD/kernel-5.10.9/linux-5.10.9-1.ldiskfs.el8.x86_64/./arch/x86/include/asm/atomic64_64.h: 102
 #4 [ffffafe480aef8c0] ldiskfs_inc_count at ffffffffc0a4ca9e [ldiskfs]
 #5 [ffffafe480aef8d0] osd_ref_add at ffffffffc15fcd65 [osd_ldiskfs]
 #6 [ffffafe480aef8f8] __local_file_create at ffffffffc0cff324 [obdclass]
 #7 [ffffafe480aef950] local_file_find_or_create at ffffffffc0cffd37 [obdclass]
 #8 [ffffafe480aef9a0] mgs_fs_setup at ffffffffc1686712 [mgs]
 #9 [ffffafe480aefa00] mgs_init0 at ffffffffc16821ad [mgs]
#10 [ffffafe480aefae0] mgs_device_alloc at ffffffffc1682c7a [mgs]


 Comments   
Comment by James A Simmons [ 09/Mar/22 ]

Yep, I saw this in my Ubuntu LTS 5.11 kernel testing. Do you have a fix? I haven't had time to work one out.

Comment by Andreas Dilger [ 09/Mar/22 ]

It looks like a fundamental source of this bug is that "ldiskfs_inc_count()" is declared in lustre/osd-ldiskfs/osd_internal.h, while the function itself is exported from the ldiskfs module in ext4-misc.patch without a declaration. It would be better to remove the osd_internal.h declaration and put it into ext4.h in the patch, so that it is sure to remain consistent.

Functions should never be declared in a header that is not itself included where the function is implemented; including it there is exactly what catches issues like this at compile time rather than at run time.

I had a quick look through the rest of osd_internal.h and didn't see any other ldiskfs_* function declarations (there are some inline functions and macros, but those are OK).
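
For illustration only, here is a minimal sketch of the "one shared declaration" idea described above, collapsed into a single file. It is not the code from the patch: the struct inode and handle_t stand-ins, demo_inc_count() and demo_ref_add() are all hypothetical names, and the one-argument form only models the post-change ext4_inc_count() interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types, for illustration only. */
struct inode { unsigned int i_nlink; };
typedef struct demo_handle handle_t;	/* opaque; never dereferenced here */

/*
 * "Shared header" part: the single declaration that both the implementation
 * and its callers include. demo_inc_count() is a made-up name that models
 * the one-argument form of the function.
 */
void demo_inc_count(struct inode *inode);

/*
 * "Implementation" part. Because the declaration above is in scope, a stale
 * prototype such as
 *     void demo_inc_count(handle_t *handle, struct inode *inode);
 * would be rejected here with a conflicting-types error at compile time,
 * instead of misbehaving at run time.
 */
void demo_inc_count(struct inode *inode)
{
	assert(inode != NULL);
	inode->i_nlink++;
}

/*
 * "Caller" part: a hypothetical wrapper that keeps the old call shape for
 * existing callers but forwards only the inode.
 */
static inline void demo_ref_add(handle_t *handle, struct inode *inode)
{
	(void)handle;	/* no longer part of the callee's interface */
	demo_inc_count(inode);
}
```

A caller would use it as demo_ref_add(handle, inode); if the declaration and the definition ever drift apart again, the build fails rather than the mount.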

Comment by Gerrit Updater [ 10/Mar/22 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46775
Subject: LU-15635 ldiskfs: Interface change fix server v5.10
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5a6c0b92df59ae411134c89fcf405e12ff3fac8c

Comment by Shaun Tancheff [ 10/Mar/22 ]

I will rework the patch to fix ext4-misc.patch, probably for the 'newer' kernels first, and rework the older kernels later. There are 12 versions of this patch, some of the targets are a bit old, and I do not have active images to test some of them.

Comment by Andreas Dilger [ 10/Mar/22 ]

IMHO the high-value targets are the newer kernels and the recent distro kernels - RHEL8.5/7.9, Ubuntu 20, SLES15. It is unlikely that older kernels would break from such a simple change, but at the same time moving the function declaration is mostly to avoid problems in the future, so there isn't a strict need to do it for all kernels.

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46775/
Subject: LU-15635 ldiskfs: Interface change fix server v5.10
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 68d96d2f650a6d9ae04e48eac9c66b2cd4be0a23

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 24/Jun/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47731
Subject: LU-15635 ldiskfs: Interface change fix server v5.10
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: b5ea75bd1660fe8f6d4cba611b72af7c3568b6c2

Comment by Gerrit Updater [ 05/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47731/
Subject: LU-15635 ldiskfs: Interface change fix server v5.10
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 401b5002230d8a2fcc8a4cbe77fa81eac7605c38
