[LU-15513] crash in lod_fill_mirrors() with sparse OSTs + PFL Created: 02/Feb/22  Updated: 27/Jun/22  Resolved: 27/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Andreas Dilger Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-14996 select preferred mirror using non-rot... Resolved
Severity: 3

 Description   

MDS can crash in lod_fill_mirrors() if a default PFL layout is set on a directory, and the filesystem has sparse OST indices:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8
IP: [<ffffffffc17a0e6e>] lod_fill_mirrors+0x17e/0x490 [lod]
Oops: 0000 [#1] SMP 
CPU: 8 PID: 16061 Comm: mdt02_001 Kdump: loaded 3.10.0-1160.49.1.el7_lustre.x86_64 #1
Call Trace:
 lod_striped_create+0x3d7/0x690 [lod]
 lod_layout_change+0x3f/0x50 [lod]
 mdd_layout_change+0xaea/0xef0 [mdd]
 mdt_layout_change+0x2df/0x480 [mdt]
 mdt_intent_layout+0x8a0/0xe00 [mdt]
 mdt_intent_policy+0x435/0xd80 [mdt]
 ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
 ldlm_handle_enqueue0+0xaa6/0x1630 [ptlrpc]
 tgt_enqueue+0x62/0x210 [ptlrpc]
 tgt_request_handle+0xaee/0x15f0 [ptlrpc]
 ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
 ptlrpc_main+0xb34/0x1470 [ptlrpc]

Sparse OST indices leave the lod_tgt_osts[] array sparse, so it contains NULL pointers, and these are dereferenced when checking for non-rotational OSTs.
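
The failure mode described above can be sketched in isolation. The following is a minimal illustration only, not the patch that landed for this ticket: the toy_* types, TOY_STATE_NONROT, and toy_count_nonrot() are hypothetical stand-ins, and the layouts are not the real Lustre ones. It shows a sparse pointer table being walked by stripe index, with empty slots skipped instead of dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real Lustre definitions or layouts. */
#define TOY_STATE_NONROT 0x1u	/* stand-in for OS_STATE_NONROT */

struct toy_statfs {
	uint32_t os_state;
};

struct toy_tgt_desc {
	struct toy_statfs ltd_statfs;
};

/*
 * Count non-rotational targets among the given stripe indices.
 * tgts[] is a sparse table: slots for unused OST indices are NULL.
 */
static int toy_count_nonrot(struct toy_tgt_desc *const *tgts, size_t tgt_count,
			    const uint32_t *indices, size_t nr_indices)
{
	int pref = 0;
	size_t j;

	assert(tgts != NULL && indices != NULL);

	for (j = 0; j < nr_indices; j++) {
		uint32_t idx = indices[j];

		/* Skip out-of-range indices and empty (NULL) slots. */
		if (idx >= tgt_count || tgts[idx] == NULL)
			continue;

		if (tgts[idx]->ltd_statfs.os_state & TOY_STATE_NONROT)
			pref++;
	}

	return pref;
}
```

Without the NULL test, the load of os_state through an empty slot would fault at a small non-zero address, which matches the shape of the oops above.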

(gdb) list *(lod_fill_mirrors+0x17e)
0xfe9e is in lod_fill_mirrors (/usr/src/lustre-exa-52/lustre/lod/lod_lov.c:768).
757             for (i = 0; i < lo->ldo_comp_cnt; i++, lod_comp++) {
758                     int stale = !!(lod_comp->llc_flags & LCME_FL_STALE);
759                     int preferred = !!(lod_comp->llc_flags & LCME_FL_PREF_WR);
760                     int j;
761
762                     pref = 0;
763                     /* calculate component preference over all used OSTs */
764                     for (j = 0; j < lod_comp->llc_stripes_allocated; j++) {
765                             int idx = lod_comp->llc_ost_indices[j];
766                             struct obd_statfs *osfs = &OST_TGT(lod,idx)->ltd_statfs;
767
768                             if (osfs->os_state & OS_STATE_NONROT)
769                                     pref++;
770                     }
771
772                     if (mirror_id_of(lod_comp->llc_id) == mirror_id) {
(gdb) p &((struct lod_tgt_desc *)0)->ltd_statfs + &((struct obd_statfs *)0)->os_state
$1 = 0xe8
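
The gdb expression above sums two member offsets, which is why a NULL struct lod_tgt_desc pointer faults at 0xe8 rather than at 0. The same arithmetic can be sketched with offsetof(). The demo_* structures and demo_fault_address() below are hypothetical, with layouts chosen only for illustration, so the value they produce is not 0xe8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts; NOT the real Lustre structures. */
struct demo_statfs {
	uint64_t os_blocks;
	uint32_t os_state;
};

struct demo_tgt_desc {
	char ltd_other[64];		/* placeholder for preceding members */
	struct demo_statfs ltd_statfs;
};

/*
 * Address touched by a load of ptr->ltd_statfs.os_state when ptr has
 * the numeric value 'base'. With base == 0 (a NULL pointer) this is
 * the faulting address the kernel would report.
 */
static uintptr_t demo_fault_address(uintptr_t base)
{
	return base + offsetof(struct demo_tgt_desc, ltd_statfs)
		    + offsetof(struct demo_statfs, os_state);
}
```

Calling demo_fault_address(0) gives the combined offset for these toy layouts; against the real structures, the equivalent sum is what gdb printed as $1.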


 Comments   
Comment by Gerrit Updater [ 02/Feb/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46435
Subject: LU-15513 lod: skip uninit component in lod_fill_mirror
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 649af2fdc2ce23ecec4bf9379397267a37afe1b2

Comment by Gerrit Updater [ 05/Mar/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46435/
Subject: LU-15513 lod: skip uninit component in lod_fill_mirrors
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 591a990c617f9b953d2e838427d45fa1de061a83

Comment by Peter Jones [ 05/Mar/22 ]

Landed for 2.15

Comment by Oleg Drokin [ 30/Mar/22 ]

Looks like this is not really fixed.

With this patch included, I still hit it twice in the boilpot in conf-sanity test 56a:

http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-24-2022-03-29-05:55:42/

http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-99-2022-03-29-13:45:45/

Comment by Gerrit Updater [ 11/Apr/22 ]

"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/47028
Subject: LU-15513 lod: iterate initialized stripe
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 82ea8ec26e72f4edd7efbbb53a66716a5ac3845f

Comment by Zhenyu Xu [ 21/Apr/22 ]

Oleg,

Can you hit this consistently? I tried to reproduce the issue for a test case, but have not managed to (I cannot hit it in conf-sanity test_56a either).

Comment by Oleg Drokin [ 11/Jun/22 ]

Yes, this hits reliably in boilpot on b2_15 and master. You can see ongoing master hits here (with crashdumps and all): https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?newid=67460

The b2_15 hits match against the existing bug report, so they are not collected.

Comment by Gerrit Updater [ 27/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47028/
Subject: LU-15513 lod: iterate initialized stripe
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 20318e34907d90d76759aee6f0cd609640bbb5aa

Comment by Peter Jones [ 27/Jun/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:18:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.