Details

Type: Bug
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: Lustre 2.15.6
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

lod_sub_recovery_thread can be called from two different threads. One is the lod_prepare thread, which is called during mount.lustre. The other is the class_config_llog_handler thread, which calls lod_sub_recovery_thread when adding a remote MDT.

lut_tdtd is referenced within lod_sub_recovery_thread, but it is initialized only by the lod_prepare thread. So if the other thread calls lod_sub_recovery_thread before that initialization has happened, it dereferences a NULL pointer.

In this case, within the lod_prepare thread, a function, lod_update_log_dir_gc, has been added before the function in which lut_tdtd is initialized. From the call trace we saw, this function can be slow on ZFS I/O, which allowed the other thread to call lod_sub_recovery_thread before lut_tdtd was initialized.
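
To illustrate the ordering problem described above, here is a minimal user-space sketch in C. It is not Lustre code and not a proposed patch: fake_tdtd, fake_target, fake_publish_tdtd, fake_recovery_step and FAKE_NOT_READY are all hypothetical names invented for this example. It assumes one possible mitigation, in which the setup path publishes the pointer only once it is fully initialized and the recovery path checks the pointer before using it, rather than dereferencing it unconditionally.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the object that lut_tdtd points to. */
struct fake_tdtd {
    int recovery_runs;
};

/* Hypothetical stand-in for the target that owns the pointer. */
struct fake_target {
    _Atomic(struct fake_tdtd *) tdtd; /* NULL until the setup path publishes it */
};

/* Hypothetical return code: the pointer has not been published yet. */
#define FAKE_NOT_READY (-1)

static void fake_target_init(struct fake_target *t)
{
    atomic_init(&t->tdtd, NULL);
}

/* Setup path (the role lod_prepare plays in the report): publish the
 * fully initialized object. Release ordering makes the object's contents
 * visible to any thread that later observes the pointer. */
static void fake_publish_tdtd(struct fake_target *t, struct fake_tdtd *tdtd)
{
    assert(tdtd != NULL);
    atomic_store_explicit(&t->tdtd, tdtd, memory_order_release);
}

/* Recovery path (the role lod_sub_recovery_thread plays in the report):
 * load the pointer once and refuse to proceed while it is still NULL,
 * so the caller can retry or wait instead of crashing. */
static int fake_recovery_step(struct fake_target *t)
{
    struct fake_tdtd *tdtd =
        atomic_load_explicit(&t->tdtd, memory_order_acquire);

    if (tdtd == NULL)
        return FAKE_NOT_READY;

    tdtd->recovery_runs++;
    return tdtd->recovery_runs;
}
```

In this sketch, calling fake_recovery_step before fake_publish_tdtd returns FAKE_NOT_READY; after publication it proceeds normally. Whether the real fix should retry, block until initialization completes, or reorder the setup steps is a design decision for the Lustre code and is not settled here.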

[ 1168.052917] osd_zfs: module uses symbols from proprietary module zfs, inheriting taint.
[ 1181.311009] LustreError: 137-5: 4kn4rbev-MDT0005_UUID: not available for connect from 198.19.28.43@tcp1 (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 1182.450843] Lustre: 4kn4rbev-MDT0005: Not available for connect from 198.19.1.78@tcp1 (not set up)
[ 1187.081798] Lustre: 7469:0:(mdt_lproc.c:310:identity_upcall_store()) 4kn4rbev-MDT0005: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS"
[ 1187.523598] Lustre: 4kn4rbev-MDT0005: Imperative Recovery not enabled, recovery window 300-500
[ 1188.380497] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b8
[ 1188.381896] Mem abort info:
[ 1188.382348]   ESR = 0x96000005
[ 1188.382841]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1188.383692]   SET = 0, FnV = 0
[ 1188.384185]   EA = 0, S1PTW = 0
[ 1188.384694] Data abort info:
[ 1188.385160]   ISV = 0, ISS = 0x00000005
[ 1188.385770]   CM = 0, WnR = 0
[ 1188.386251] user pgtable: 4k pages, 48-bit VAs, pgdp=000000041ee4d000
[ 1188.387260] [00000000000000b8] pgd=000000040d532003, p4d=000000040d532003, pud=0000000000000000
[ 1188.388620] Internal error: Oops: 0000000096000005 [#1] SMP
[ 1188.389498] Modules linked in: osp(OE) lod(OE) mdt(OE) mdd(OE) lfsck(OE) mgc(OE) osd_zfs(POE) lquota(OE) af_packet_diag udp_diag tcp_diag inet_diag lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) ip6table_filter nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter xt_mark iptable_mangle bpfilter lnet(OE) crc32_generic libcfs(OE) sunrpc vfat fat dm_mirror dm_region_hash dm_log dm_mod ghash_ce sha2_ce sha256_arm64 sha1_ce ptp_vmclock zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) binfmt_misc ena ptp pps_core
[ 1188.397749] CPU: 1 PID: 7587 Comm: lod0005_rec000f Kdump: loaded Tainted: P           OE     5.10.245-241.976.amzn2.aarch64 #1
[ 1188.399526] Hardware name: Amazon EC2 c6gn.4xlarge/, BIOS 1.0 11/1/2018
[ 1188.400573] pstate: 80c00005 (Nzcv daif +PAN +UAO -TCO BTYPE=--)
[ 1188.401536] pc : lod_sub_recovery_thread+0x98/0x1000 [lod]
[ 1188.402406] lr : kthread+0x118/0x120
[ 1188.402980] sp : ffff8000256c3d70
[ 1188.403510] x29: ffff8000256c3df0 x28: ffff0004e65f3000 
[ 1188.404348] x27: ffff0004e61ff4c8 x26: ffff8000012eb000 
[ 1188.405190] x25: ffff0004a21a0800 x24: ffff800001f926c8 
[ 1188.406029] x23: 0000000000003df0 x22: ffff0004a21a0820 
[ 1188.406867] x21: ffff80002557b788 x20: ffff0004a21a0280 
[ 1188.407704] x19: ffff0004a21a0800 x18: ffff0004a60d8635 
[ 1188.408543] x17: 00000000ffffffff x16: fffffe001253f2c0 
[ 1188.409381] x15: 0000000000000000 x14: ffff80000907c5b0 
[ 1188.410220] x13: 0000000000000001 x12: ffffffffffffffff 
[ 1188.411057] x11: ffffffff00000010 x10: 0000000000000d30 
[ 1188.411899] x9 : ffff8000080ce3b8 x8 : ffff0004ea9d8d90 
[ 1188.412744] x7 : 00000000000003c0 x6 : 00000021dcb4c6ea 
[ 1188.413588] x5 : 00000000410fd0c0 x4 : 0000000000000000 
[ 1188.414430] x3 : ffff0004ea9d89fc x2 : 00000000000000b8 
[ 1188.415271] x1 : 0000000000000001 x0 : 0000000000000000 
[ 1188.416110] Call trace:
[ 1188.416527]  lod_sub_recovery_thread+0x98/0x1000 [lod]
[ 1188.417344]  kthread+0x118/0x120
[ 1188.417867] Code: 9102e002 d503201f d503201f 52800021 (b821005f) 
[ 1188.418827] SMP: stopping secondary CPUs
[ 1188.422315] Starting crashdump kernel...
[ 1188.422944] Bye!

Crash on lod_sub_recovery_thread due to NULL pointer dereference
