[LU-14389] crash in lov_delete_composite() with racer+migrate Created: 31/Jan/21  Updated: 25/Oct/23  Resolved: 08/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: racer

Issue Links:
Related
is related to LU-13602 Skip unknown FLR component types Resolved
is related to LU-7073 racer with OST object migration hangs... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running racer with patch https://review.whamcloud.com/13669 "LU-7073 tests: Add file migration to racer" causes repeated client crashes:
https://testing-archive.whamcloud.com/gerrit-janitor/13887/testresults/racer-special4-ldiskfs-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13887/testresults/racer-special6-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13887/testresults/racer-special10-ldiskfs-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13887/testresults/racer-special8-ldiskfs-centos7_x86_64-centos7_x86_64/

It looks like layout parsing fails in lov_init_composite() (the "DOM entries with different sizes" error in the log below). The caller, lov_layout_change(), then tries to clean up after the error by calling lov_delete_composite(), which crashes on a NULL pointer dereference:

[  385.587338] LustreError: 20227:0:(lov_object.c:680:lov_init_composite()) lustre-clilov-ffff8800d65a3000: DOM entries with different sizes
[  385.590193] LustreError: 20227:0:(lov_ea.c:617:dump_lsm()) lsm ffff8800c2482280, objid 0x0:0, maxbytes 0x400000fe000, magic 0x0BD60BD0, refc: 2, entry: 4, layout_gen 4
[  385.592938] LustreError: 20227:0:(lov_ea.c:639:dump_lsm()) [0x0, 0x80000): id: 65537, flags: 10, magic 0x0BD10BD0, layout_gen 0, stripe count 0, sstripe size 524288, pool: []
[  385.596090] LustreError: 20227:0:(lov_ea.c:639:dump_lsm()) [0x80000, 0xffffffffffffffff): id: 65538, flags: 10, magic 0x0BD10BD0, layout_gen 0, stripe count 1, sstripe size 1048576, pool: []
[  385.602686] LustreError: 20227:0:(lov_ea.c:649:dump_lsm())    oinfo:ffff8800c2482180: ostid: 0x0:1893 ost idx: 1 gen: 0
[  385.607168] LustreError: 20227:0:(lov_ea.c:639:dump_lsm()) [0x0, 0x100000): id: 131073, flags: 11, magic 0x0BD10BD0, layout_gen 0, stripe count 0, sstripe size 1048576, pool: []
[  385.610768] LustreError: 20227:0:(lov_ea.c:639:dump_lsm()) [0x100000, 0xffffffffffffffff): id: 131074, flags: 0, magic 0x0BD10BD0, layout_gen 65535, stripe count 1, sstripe size 1048576, pool: []
[  385.615671] LustreError: 20227:0:(lov_object.c:1298:lov_layout_change()) lustre-clilov-ffff8800d65a3000: cannot apply new layout on [0x200000402:0x3e6a:0x0] : rc = -22
[  385.620214] BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
[  385.622675] IP: [<ffffffffa08baef4>] lov_delete_composite+0x104/0x540 [lov]
[  385.627878] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  385.642379] CPU: 1 PID: 20227 Comm: ln Kdump: loaded 
[  385.649327] RIP: 0010:[<ffffffffa08baef4>] lov_delete_composite+0x104/0x540 [lov]
[  385.669870] Call Trace:
[  385.670394]  [<ffffffffa08bf96a>] lov_conf_set+0x8ca/0xaa0 [lov]
[  385.672880]  [<ffffffffa0333950>] cl_conf_set+0x60/0x120 [obdclass]
[  385.675008]  [<ffffffffa0de6a9b>] cl_file_inode_init+0x12b/0x390 [lustre]
[  385.677377]  [<ffffffffa0dbaae5>] ll_update_inode+0x365/0x670 [lustre]
[  385.688379]  [<ffffffffa0dcdef3>] ll_iget+0x253/0x350 [lustre]
[  385.689648]  [<ffffffffa0dbf90d>] ll_prep_inode+0x20d/0x9b0 [lustre]
[  385.697886]  [<ffffffffa0dce90c>] ll_lookup_it_finish.isra.24+0xbc/0xe60 [lustre]
[  385.702800]  [<ffffffffa0dd001b>] ll_lookup_it.constprop.26+0x96b/0x1400 [lustre]
[  385.705598]  [<ffffffffa0dd0b97>] ll_lookup_nd+0xe7/0x1c0 [lustre]
[  385.706979]  [<ffffffff8124f2dd>] lookup_real+0x1d/0x50
[  385.708098]  [<ffffffff8124fdc2>] __lookup_hash+0x42/0x60
[  385.709437]  [<ffffffff817d5ff3>] lookup_slow+0x42/0xa7
[  385.710981]  [<ffffffff8125565e>] path_lookupat+0x89e/0x8d0
[  385.714547]  [<ffffffff812556bb>] filename_lookup+0x2b/0xc0
[  385.715670]  [<ffffffff812575b7>] user_path_at_empty+0x67/0xc0
[  385.717213]  [<ffffffff81257621>] user_path_at+0x11/0x20
[  385.718617]  [<ffffffff81249df3>] vfs_fstatat+0x63/0xc0
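
The failure pattern described above (initialization aborts partway through, then the cleanup path walks entries that were never set up) can be illustrated with a minimal standalone sketch. This is not the Lustre code or the landed patch: struct demo_composite, struct demo_entry, demo_init() and demo_delete() are hypothetical stand-ins, and the only point is that a cleanup routine shared with the error path needs to tolerate NULL members left behind by a partial init.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real Lustre structures. */
struct demo_entry {
	int *sub;			/* per-entry state, allocated during init */
};

struct demo_composite {
	unsigned int count;
	struct demo_entry *entries;	/* NULL until demo_init() allocates it */
};

/*
 * Set up entries one by one. Index fail_at simulates a layout validation
 * error: init returns -EINVAL there, leaving that entry and all later
 * entries with sub == NULL.
 */
int demo_init(struct demo_composite *comp, unsigned int count,
	      unsigned int fail_at)
{
	unsigned int i;

	assert(count > 0);
	comp->count = count;
	/* calloc zero-fills, so sub starts out NULL on the usual platforms */
	comp->entries = calloc(count, sizeof(*comp->entries));
	if (comp->entries == NULL)
		return -ENOMEM;

	for (i = 0; i < count; i++) {
		if (i == fail_at)
			return -EINVAL;	/* caller is expected to clean up */
		comp->entries[i].sub = malloc(sizeof(int));
		if (comp->entries[i].sub == NULL)
			return -ENOMEM;
		*comp->entries[i].sub = (int)i;
	}
	return 0;
}

/*
 * Cleanup that is safe after a partial init: it checks the array and each
 * per-entry pointer before touching them. Returns the number of entries
 * that were actually released.
 */
unsigned int demo_delete(struct demo_composite *comp)
{
	unsigned int released = 0;
	unsigned int i;

	if (comp->entries == NULL)
		return 0;

	for (i = 0; i < comp->count; i++) {
		if (comp->entries[i].sub == NULL)
			continue;	/* never initialised, nothing to free */
		free(comp->entries[i].sub);
		comp->entries[i].sub = NULL;
		released++;
	}
	free(comp->entries);
	comp->entries = NULL;
	comp->count = 0;
	return released;
}
```

Calling demo_init() with a fail_at inside the range and then demo_delete() on the same object mirrors the lov_layout_change() error path in the trace. Without the NULL checks in the cleanup loop, the sketch would fault in the same way the client does.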


 Comments   
Comment by Gerrit Updater [ 01/Feb/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41377
Subject: LU-14389 lov: avoid NULL dereference in cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8bbfc5971c3628207db78c88304fc38752fa53e2

Comment by Gerrit Updater [ 02/Feb/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41398
Subject: LU-14389 lov: avoid NULL dereference in cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4043c45ffdea580c17d4333ccb3b153e33577883

Comment by Gerrit Updater [ 08/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41398/
Subject: LU-14389 lov: avoid NULL dereference in cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5da049d9ef1d26e606a333812630f87c29aa1a35

Comment by Peter Jones [ 08/Feb/21 ]

Landed for 2.14
