[LU-17448] LBUG in racer with layout change Created: 19/Jan/24  Updated: 28/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15913 rename stress test leads to REMOTE_PA... Reopened
is related to LU-17459 lod_statfs_and_check()) ASSERTION( de... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hit a crash with a new racer rename test in v4 of patch https://review.whamcloud.com/47643 "LU-15913 tests: add rename stress test via racer" that was still using complex file layouts (I've removed most complex layouts in v5 to focus on actual rename testing for LU-17426). It looks like an ldiskfs credits issue with the large file layout (note xattr_set: 2/103/0), as I don't see any crashes with ZFS.
https://testing.whamcloud.com/gerrit-janitor/38859/results.html

Lustre: lod_parse_striping()) lustre-MDT0000-mdtlov: EXTENSION flags=40 set on component[2]=1 of non-SEL file [0x200000402:0x2402:0x0] with magic=0xbd60bd0
[few seconds pass, but included to show SEL is definitely in use]
Lustre: osd_trans_exec_check()) lustre-MDT0000: opcode 7: used 14, used now 2, reserved 12
Lustre: osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
Lustre: osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/103/0
Lustre: osd_trans_dump_creds())   write: 3/12/14, punch: 0/0/0, quota 1/3/2
Lustre: osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
Lustre: osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
LustreError: osd_trans_exec_check()) LBUG
Pid: 14124, comm: mdt00_019 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
Call Trace:
 libcfs_call_trace+0x90/0xf0 [libcfs]
 lbug_with_loc+0x4c/0xa0 [libcfs]
 osd_write+0x929/0xcc0 [osd_ldiskfs]
 dt_record_write+0x33/0x130 [obdclass]
 llog_osd_write_rec+0xfdb/0x1c80 [obdclass]
 llog_write_rec+0x290/0x590 [obdclass]
 llog_cat_add_rec+0x1d9/0xa50 [obdclass]
 llog_add+0x17f/0x1f0 [obdclass]
 osp_sync_add+0x1f9/0x760 [osp]
 osp_attr_set+0x3cd/0x680 [osp]
 lod_sub_attr_set+0x1d7/0x500 [lod]
 lod_obj_stripe_attr_set_cb+0x40/0x100 [lod]
 lod_obj_for_each_stripe+0x12d/0x310 [lod]
 lod_attr_set+0x51a/0xb60 [lod]
 lod_layout_change+0x367/0x3f0 [lod]
 dt_layout_change+0x20/0xc0 [mdd]
 mdd_layout_change+0x1712/0x1db0 [mdd]
 mdt_layout_change+0x2bf/0x450 [mdt]
 mdt_intent_layout+0x910/0xeb0 [mdt]
 mdt_intent_opc+0x1c8/0xc50 [mdt]
 mdt_intent_policy+0xfa/0x460 [mdt]
 ldlm_lock_enqueue+0x3b1/0xbb0 [ptlrpc]
 ldlm_handle_enqueue+0x359/0x17c0 [ptlrpc]
 tgt_enqueue+0x68/0x240 [ptlrpc]
 tgt_request_handle+0x74e/0x19d0 [ptlrpc]
 ptlrpc_server_handle_request+0x26e/0xcf0 [ptlrpc]
 ptlrpc_main+0xc37/0x16d0 [ptlrpc]


 Comments   
Comment by Andreas Dilger [ 22/Jan/24 ]

Hit this again on racer test_1, so it isn't just related to the new rename test_2 I added. It also has to do with a layout change:
https://testing.whamcloud.com/gerrit-janitor/38926/testresults/racer-special3-ldiskfs-centos7_x86_64-centos7_x86_64/

Comment by Alex Zhuravlev [ 28/Jan/24 ]

also, sometimes I see very strange "deadlock":

schedule,rwsem_down_write_slowpath,do_rmdir
	PIDs(2): "rm":41084 "rm":41766 

schedule,schedule_preempt_disabled,__mutex_lock,lock_rename,do_renameat2,__x64_sys_rename
	PIDs(7): "mrename":41114 "mrename":41184 "mrename":41248 "mrename":41291 "mrename":41406 "mrename":41519 "mrename":41556 

schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,lookup_slow,walk_component,path_lookupat,filename_lookup,do_readlinkat,__x64_sys_readlink
	PIDs(1): "lfs":41134 

schedule,rwsem_down_write_slowpath,vfs_rename,do_renameat2,__x64_sys_rename
	PIDs(1): "mrename":41171 

schedule,rwsem_down_write_slowpath,lock_rename,do_renameat2,__x64_sys_rename
	PIDs(4): "mrename":41177 "mrename":41192 "mrename":41198 "mrename":41202 

schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,path_openat,do_filp_open,do_sys_openat2,do_sys_open
	PIDs(2): "lfs":41186 "lfs":41213 

schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,lookup_slow,walk_component,path_lookupat,filename_lookup,vfs_statx,__se_sys_newlstat
	PIDs(2): "lfs":41895 "lfs":47609 

schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,lookup_slow,walk_component,path_lookupat,filename_lookup,vfs_statx,__se_sys_newfstatat
	PIDs(1): "rm":42331 

schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,lookup_slow,walk_component,path_lookupat,filename_lookup,vfs_statx,__se_sys_statx
	PIDs(1): "lfs":43870 

schedule,rwsem_down_write_slowpath,filename_create,do_mkdirat
	PIDs(3): "mkdir":44262 "mkdir":44273 "mkdir":47557 
Comment by Andreas Dilger [ 28/Jan/24 ]

Yes, I've seen this several times in my racer rename test:
https://review.whamcloud.com/47643
It looks like a client-side deadlock, because none of the server threads have a stack trace of interest, AFAICS.

Generated at Sat Feb 10 03:35:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.