Details
Type: Bug
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: Lustre 2.15.6
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
lod_sub_recovery_thread can be invoked from two different threads. One is the lod_prepare thread, which runs as part of mount.lustre. The other is class_config_llog_handler, which invokes lod_sub_recovery_thread when a remote MDT is added.
lut_tdtd is dereferenced in lod_sub_recovery_thread, but it is initialized only by the lod_prepare thread. If the other thread invokes lod_sub_recovery_thread before that initialization has happened, the result is a NULL pointer dereference.
In this case, the function lod_update_log_dir_gc has been added to the lod_prepare thread ahead of the function that initializes lut_tdtd. The call traces we collected show that this function can be slow on ZFS I/O, which allows the other thread to invoke lod_sub_recovery_thread before lut_tdtd is initialized.
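The ordering problem described above can be modelled in user space. The following is a minimal sketch with POSIX threads, not Lustre code and not a proposed patch: `struct lut_model`, `struct tdtd_model`, `prepare_thread`, `recovery_thread`, and `run_model` are all hypothetical names, and the condition variable is only one possible way to make the consumer wait for the pointer to be published. It starts the "recovery" thread first, delays the "prepare" thread to imitate the slow step, and shows that the dereference is safe once the consumer waits for initialization.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the real Lustre structures. */
struct tdtd_model {
	int recovery_threads;
};

struct lut_model {
	pthread_mutex_t lock;
	pthread_cond_t ready;
	struct tdtd_model *tdtd;	/* plays the role of lut_tdtd */
};

static struct tdtd_model tdtd_storage;

/* Models lod_prepare: a slow step runs before the pointer is published. */
static void *prepare_thread(void *arg)
{
	struct lut_model *lut = arg;
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 };

	nanosleep(&delay, NULL);	/* stands in for the slow step on ZFS */

	pthread_mutex_lock(&lut->lock);
	lut->tdtd = &tdtd_storage;
	pthread_cond_broadcast(&lut->ready);
	pthread_mutex_unlock(&lut->lock);
	return NULL;
}

/* Models the recovery thread: wait for publication, then dereference. */
static void *recovery_thread(void *arg)
{
	struct lut_model *lut = arg;

	pthread_mutex_lock(&lut->lock);
	while (lut->tdtd == NULL)
		pthread_cond_wait(&lut->ready, &lut->lock);
	assert(lut->tdtd != NULL);
	lut->tdtd->recovery_threads++;	/* would fault without the wait above */
	pthread_mutex_unlock(&lut->lock);
	return NULL;
}

/* Starts the recovery thread first, as in the failing ordering. */
static int run_model(void)
{
	struct lut_model lut = { .tdtd = NULL };
	pthread_t rec, prep;
	int have_prep;

	pthread_mutex_init(&lut.lock, NULL);
	pthread_cond_init(&lut.ready, NULL);
	tdtd_storage.recovery_threads = 0;

	if (pthread_create(&rec, NULL, recovery_thread, &lut) != 0)
		return -1;
	have_prep = pthread_create(&prep, NULL, prepare_thread, &lut) == 0;
	if (!have_prep)
		prepare_thread(&lut);	/* publish inline so the waiter can finish */

	pthread_join(rec, NULL);
	if (have_prep)
		pthread_join(prep, NULL);

	pthread_cond_destroy(&lut.ready);
	pthread_mutex_destroy(&lut.lock);
	return tdtd_storage.recovery_threads;
}
```

Calling `run_model()` returns the counter value after both threads finish. Removing the `while` loop in `recovery_thread` reproduces the failure mode in this model: the increment would then run against a NULL pointer whenever the prepare side is slower.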
[ 1168.052917] osd_zfs: module uses symbols from proprietary module zfs, inheriting taint.
[ 1181.311009] LustreError: 137-5: 4kn4rbev-MDT0005_UUID: not available for connect from 198.19.28.43@tcp1 (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 1182.450843] Lustre: 4kn4rbev-MDT0005: Not available for connect from 198.19.1.78@tcp1 (not set up)
[ 1187.081798] Lustre: 7469:0:(mdt_lproc.c:310:identity_upcall_store()) 4kn4rbev-MDT0005: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS"
[ 1187.523598] Lustre: 4kn4rbev-MDT0005: Imperative Recovery not enabled, recovery window 300-500
[ 1188.380497] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b8
[ 1188.381896] Mem abort info:
[ 1188.382348] ESR = 0x96000005
[ 1188.382841] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1188.383692] SET = 0, FnV = 0
[ 1188.384185] EA = 0, S1PTW = 0
[ 1188.384694] Data abort info:
[ 1188.385160] ISV = 0, ISS = 0x00000005
[ 1188.385770] CM = 0, WnR = 0
[ 1188.386251] user pgtable: 4k pages, 48-bit VAs, pgdp=000000041ee4d000
[ 1188.387260] [00000000000000b8] pgd=000000040d532003, p4d=000000040d532003, pud=0000000000000000
[ 1188.388620] Internal error: Oops: 0000000096000005 [#1] SMP
[ 1188.389498] Modules linked in: osp(OE) lod(OE) mdt(OE) mdd(OE) lfsck(OE) mgc(OE) osd_zfs(POE) lquota(OE) af_packet_diag udp_diag tcp_diag inet_diag lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) ip6table_filter nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter xt_mark iptable_mangle bpfilter lnet(OE) crc32_generic libcfs(OE) sunrpc vfat fat dm_mirror dm_region_hash dm_log dm_mod ghash_ce sha2_ce sha256_arm64 sha1_ce ptp_vmclock zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) binfmt_misc ena ptp pps_core
[ 1188.397749] CPU: 1 PID: 7587 Comm: lod0005_rec000f Kdump: loaded Tainted: P OE 5.10.245-241.976.amzn2.aarch64 #1
[ 1188.399526] Hardware name: Amazon EC2 c6gn.4xlarge/, BIOS 1.0 11/1/2018
[ 1188.400573] pstate: 80c00005 (Nzcv daif +PAN +UAO -TCO BTYPE=--)
[ 1188.401536] pc : lod_sub_recovery_thread+0x98/0x1000 [lod]
[ 1188.402406] lr : kthread+0x118/0x120
[ 1188.402980] sp : ffff8000256c3d70
[ 1188.403510] x29: ffff8000256c3df0 x28: ffff0004e65f3000
[ 1188.404348] x27: ffff0004e61ff4c8 x26: ffff8000012eb000
[ 1188.405190] x25: ffff0004a21a0800 x24: ffff800001f926c8
[ 1188.406029] x23: 0000000000003df0 x22: ffff0004a21a0820
[ 1188.406867] x21: ffff80002557b788 x20: ffff0004a21a0280
[ 1188.407704] x19: ffff0004a21a0800 x18: ffff0004a60d8635
[ 1188.408543] x17: 00000000ffffffff x16: fffffe001253f2c0
[ 1188.409381] x15: 0000000000000000 x14: ffff80000907c5b0
[ 1188.410220] x13: 0000000000000001 x12: ffffffffffffffff
[ 1188.411057] x11: ffffffff00000010 x10: 0000000000000d30
[ 1188.411899] x9 : ffff8000080ce3b8 x8 : ffff0004ea9d8d90
[ 1188.412744] x7 : 00000000000003c0 x6 : 00000021dcb4c6ea
[ 1188.413588] x5 : 00000000410fd0c0 x4 : 0000000000000000
[ 1188.414430] x3 : ffff0004ea9d89fc x2 : 00000000000000b8
[ 1188.415271] x1 : 0000000000000001 x0 : 0000000000000000
[ 1188.416110] Call trace:
[ 1188.416527] lod_sub_recovery_thread+0x98/0x1000 [lod]
[ 1188.417344] kthread+0x118/0x120
[ 1188.417867] Code: 9102e002 d503201f d503201f 52800021 (b821005f)
[ 1188.418827] SMP: stopping secondary CPUs
[ 1188.422315] Starting crashdump kernel...
[ 1188.422944] Bye!
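In the register dump, x0 is 0 and x2 is 0xb8, and the faulting virtual address is also 0xb8. That pattern is what one would expect when code takes the address of a member at offset 0xb8 through a NULL base pointer. The sketch below only illustrates that arithmetic. The structure `tdtd_layout_model`, its `counter` member, and the helper `member_address` are hypothetical, and the trace alone does not identify which real field sits at that offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the offset of the member matters here. */
struct tdtd_layout_model {
	unsigned char earlier_fields[0xb8];	/* placeholder for preceding fields */
	int counter;				/* hypothetical member at offset 0xb8 */
};

/*
 * Address that would be accessed for &base->counter, computed
 * arithmetically so that nothing is actually dereferenced.
 */
static uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct tdtd_layout_model, counter);
}
```

With a base of 0, `member_address` yields 0xb8, matching the fault address reported in the oops.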