[LU-15513] crash in lod_fill_mirrors() with sparse OSTs + PFL Created: 02/Feb/22 Updated: 27/Jun/22 Resolved: 27/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
MDS can crash in lod_fill_mirrors() if a default PFL layout is set on a directory, and the filesystem has sparse OST indices:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8
    IP: [<ffffffffc17a0e6e>] lod_fill_mirrors+0x17e/0x490 [lod]
    Oops: 0000 [#1] SMP
    CPU: 8 PID: 16061 Comm: mdt02_001 Kdump: loaded 3.10.0-1160.49.1.el7_lustre.x86_64 #1
    Call Trace:
     lod_striped_create+0x3d7/0x690 [lod]
     lod_layout_change+0x3f/0x50 [lod]
     mdd_layout_change+0xaea/0xef0 [mdd]
     mdt_layout_change+0x2df/0x480 [mdt]
     mdt_intent_layout+0x8a0/0xe00 [mdt]
     mdt_intent_policy+0x435/0xd80 [mdt]
     ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
     ldlm_handle_enqueue0+0xaa6/0x1630 [ptlrpc]
     tgt_enqueue+0x62/0x210 [ptlrpc]
     tgt_request_handle+0xaee/0x15f0 [ptlrpc]
     ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
     ptlrpc_main+0xb34/0x1470 [ptlrpc]

The sparse OST indices result in the lod_tgt_osts[] array being sparse and containing NULL pointers, which are accessed when checking for non-rotational OSTs.

    (gdb) list *(lod_fill_mirrors+0x17e)
    0xfe9e is in lod_fill_mirrors (/usr/src/lustre-exa-52/lustre/lod/lod_lov.c:768).
    757         for (i = 0; i < lo->ldo_comp_cnt; i++, lod_comp++) {
    758                 int stale = !!(lod_comp->llc_flags & LCME_FL_STALE);
    759                 int preferred = !!(lod_comp->llc_flags & LCME_FL_PREF_WR);
    760                 int j;
    761
    762                 pref = 0;
    763                 /* calculate component preference over all used OSTs */
    764                 for (j = 0; j < lod_comp->llc_stripes_allocated; j++) {
    765                         int idx = lod_comp->llc_ost_indices[j];
    766                         struct obd_statfs *osfs = &OST_TGT(lod,idx)->ltd_statfs;
    767
    768                         if (osfs->os_state & OS_STATE_NONROT)
    769                                 pref++;
    770                 }
    771
    772                 if (mirror_id_of(lod_comp->llc_id) == mirror_id) {
    (gdb) p &((struct lod_tgt_desc *)0)->ltd_statfs + &((struct obd_statfs *)0)->os_state
    $1 = 0xe8 |
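The failure mode in the description can be modelled in userspace C. The sketch below is only an illustration of guarding a lookup into a sparse target table; it is not the change made in either Gerrit patch, and every name in it (demo_tgt, demo_tgt_table, demo_tgt_lookup, demo_count_nonrot, DEMO_STATE_NONROT) is a hypothetical stand-in rather than a real Lustre definition. It assumes the table is indexed by OST index and that unused indices hold NULL, which matches the gdb output: the faulting address 0xe8 equals the offset of os_state from a NULL target descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real Lustre structures. */
#define DEMO_STATE_NONROT 0x1u

struct demo_statfs {
	uint32_t os_state;
};

struct demo_tgt {
	int index;
	struct demo_statfs statfs;
};

/* Table indexed by OST index; slots for unused indices are NULL. */
struct demo_tgt_table {
	struct demo_tgt **slots;
	size_t nslots;
};

/* Return the target at idx, or NULL if idx is out of range or unused. */
static const struct demo_tgt *
demo_tgt_lookup(const struct demo_tgt_table *tbl, size_t idx)
{
	assert(tbl != NULL);
	if (idx >= tbl->nslots)
		return NULL;
	return tbl->slots[idx];
}

/* Count non-rotational targets among the given indices, skipping holes. */
static int
demo_count_nonrot(const struct demo_tgt_table *tbl,
		  const size_t *indices, size_t n)
{
	int pref = 0;

	for (size_t j = 0; j < n; j++) {
		const struct demo_tgt *tgt = demo_tgt_lookup(tbl, indices[j]);

		/* Without this check, a hole is dereferenced at a small
		 * offset from NULL, as in the oops above. */
		if (tgt == NULL)
			continue;
		if (tgt->statfs.os_state & DEMO_STATE_NONROT)
			pref++;
	}
	return pref;
}
```

A caller would build a table with holes at the unused OST indices and pass the list of stripe indices for one component; indices that map to a hole or fall outside the table are simply not counted toward the preference.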
| Comments |
| Comment by Gerrit Updater [ 02/Feb/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46435 |
| Comment by Gerrit Updater [ 05/Mar/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46435/ |
| Comment by Peter Jones [ 05/Mar/22 ] |
|
Landed for 2.15 |
| Comment by Oleg Drokin [ 30/Mar/22 ] |
|
Looks like this is not really fixed.
With this patch included, I still hit it twice in boilpot in conf-sanity test 56a:
http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-24-2022-03-29-05:55:42/
http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-99-2022-03-29-13:45:45/ |
| Comment by Gerrit Updater [ 11/Apr/22 ] |
|
"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/47028 |
| Comment by Zhenyu Xu [ 21/Apr/22 ] |
|
Oleg, can you re-hit it consistently? I tried to reproduce the issue for a test case, but haven't managed to (I cannot re-hit it in conf-sanity test_56a either). |
| Comment by Oleg Drokin [ 11/Jun/22 ] |
|
Yes, this hits reliably in boilpot on b2_15 and master. You can see ongoing master hits here (with crashdumps and all): https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?newid=67460 but the b2_15 hits are matching against an existing bug report, so they are not collected. |
| Comment by Gerrit Updater [ 27/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47028/ |
| Comment by Peter Jones [ 27/Jun/22 ] |
|
Landed for 2.16 |