[LU-14326] sanity-dom test_fsx: crash in osc_extent_make_ready() Created: 12/Jan/21  Updated: 23/Jan/21  Resolved: 22/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11463 DOM: osc_extent_make_ready()) ASSERTI... Reopened
is related to LU-14286 fallocate() does not zero on-disk data Resolved
is related to LU-14333 ASSERTION( oinfo->loi_lvb.lvb_size >=... Resolved
is related to LU-10407 osc_cache.c:1141:osc_extent_make_read... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/393c6362-298e-472f-be95-f23a1892e9df

test_fsx failed with the following error:

trevis-209vm6 crashed during sanity-dom test_fsx

[14491.172043] LustreError: 813:0:(osc_cache.c:1113:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: 
[14491.173797] LustreError: 813:0:(osc_cache.c:1113:osc_extent_make_ready()) LBUG
[14491.174924] Pid: 813, comm: ldlm_bl_06 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
[14491.176409] Call Trace:
[14491.176839]  [<ffffffffc06e867c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[14491.177945]  [<ffffffffc06e899c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[14491.178941]  [<ffffffffc0cdf284>] osc_extent_make_ready+0xb64/0xe50 [osc]
[14491.180097]  [<ffffffffc0ce228b>] osc_io_unplug0+0xeeb/0x1900 [osc]
[14491.181117]  [<ffffffffc0ce6930>] osc_cache_writeback_range+0x9a0/0xfd0 [osc]
[14491.182254]  [<ffffffffc0cd1925>] osc_lock_flush+0x195/0x290 [osc]
[14491.183251]  [<ffffffffc0cd1df8>] osc_ldlm_blocking_ast+0x2f8/0x3e0 [osc]
[14491.184337]  [<ffffffffc0ac713a>] ldlm_cancel_callback+0x8a/0x310 [ptlrpc]
[14491.185607]  [<ffffffffc0ad2dd1>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
[14491.186747]  [<ffffffffc0ad8fec>] ldlm_cli_cancel+0x10c/0x550 [ptlrpc]
[14491.187822]  [<ffffffffc0cd1c7a>] osc_ldlm_blocking_ast+0x17a/0x3e0 [osc]
[14491.189143]  [<ffffffffc0add6e8>] ldlm_handle_bl_callback+0xc8/0x3e0 [ptlrpc]
[14491.190361]  [<ffffffffc0addf9f>] ldlm_bl_thread_main+0x59f/0xaa0 [ptlrpc]

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-dom test_fsx - trevis-209vm6 crashed during sanity-dom test_fsx



 Comments   
Comment by John Hammond [ 12/Jan/21 ]

This assertion can be reproduced outside of sanity-dom:

${LUSTRE}/tests/llmount.sh
touch /mnt/lustre/f0
fsx -c 50 -p 100 -N 1000 -l 1048576 -S 0 -d -d /mnt/lustre/f0

On my system it takes fewer ~5 runs of fsx.

[ 1456.917867] LustreError: 6997:0:(osc_cache.c:1113:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: 
[ 1456.920595] LustreError: 6997:0:(osc_cache.c:1113:osc_extent_make_ready()) LBUG
[ 1456.922471] Pid: 6997, comm: fsx 3.10.0-1062.9.1.el7.x86_64.debug #1 SMP Mon Mar 16 12:44:56 CDT 2020
[ 1456.924801] Call Trace:
[ 1456.925471]  [<ffffffffc06c2d4c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 1456.927163]  [<ffffffffc06c306c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 1456.928725]  [<ffffffffc0e025e4>] osc_extent_make_ready+0xb64/0xe50 [osc]
[ 1456.930547]  [<ffffffffc0e05603>] osc_io_unplug0+0xee3/0x1910 [osc]
[ 1456.932230]  [<ffffffffc0e09d10>] osc_cache_writeback_range+0x9a0/0xfd0 [osc]
[ 1456.934128]  [<ffffffffc0df8745>] osc_io_fsync_start+0x85/0x1a0 [osc]
[ 1456.935852]  [<ffffffffc08efa10>] cl_io_start+0x70/0x140 [obdclass]
[ 1456.937585]  [<ffffffffc0e49e67>] lov_io_call.isra.7+0x87/0x140 [lov]
[ 1456.939251]  [<ffffffffc0e4a026>] lov_io_start+0x56/0x150 [lov]
[ 1456.940703]  [<ffffffffc08efa10>] cl_io_start+0x70/0x140 [obdclass]
[ 1456.942419]  [<ffffffffc08f1eff>] cl_io_loop+0x9f/0x200 [obdclass]
[ 1456.943941]  [<ffffffffc14bd10b>] cl_sync_file_range+0x2db/0x380 [lustre]
[ 1456.945905]  [<ffffffffc14e081a>] ll_writepages+0x7a/0x200 [lustre]
[ 1456.947975]  [<ffffffffa1215b04>] do_writepages+0x24/0x50
[ 1456.949314]  [<ffffffffa1207e35>] __filemap_fdatawrite_range+0x65/0x80
[ 1456.950776]  [<ffffffffa1207f94>] filemap_write_and_wait_range+0x44/0x90
[ 1456.952472]  [<ffffffffc14bd240>] ll_fsync+0x90/0x540 [lustre]
[ 1456.953904]  [<ffffffffa12e5770>] vfs_fsync_range+0x20/0x30
[ 1456.955335]  [<ffffffffa124e18e>] SyS_msync+0x1fe/0x250
[ 1456.962318]  [<ffffffffa1887d4a>] tracesys+0xa6/0xcc
[ 1456.963771]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 1456.965134] Kernel panic - not syncing: LBUG
Comment by Andreas Dilger [ 14/Jan/21 ]

Bobijam, could you please take a look. This is being hit pretty regularly since fallocate() checking has been fixed in the client, and fsx is now properly testing that functionality.

Is it possible that this LASSERT() is just wrong for fallocate() when there are no data pages being sent in the RPC?

Comment by Andreas Dilger [ 14/Jan/21 ]

It looks like there was a patch https://review.whamcloud.com/30848 "LU-10407 osc: update size before queue page" pushed a couple of years ago and never landed, due to Jinshan giving it a -1. That might give some ideas where to start.

Comment by Oleg Drokin [ 14/Jan/21 ]

This is a major sanity-benchmark crashes source (in fsx) for me.

Comment by Oleg Drokin [ 14/Jan/21 ]

I just also filed LU-14333 that seems to be related, it hits if this one did not hit, at a later point in time in cleanup.

Comment by Andreas Dilger [ 14/Jan/21 ]

Oleg reports that the 30848 patch does not solve the fsx test issues being seen.

Comment by Zhenyu Xu [ 15/Jan/21 ]

my test shows that #30848 only reduced the hit probability, and I got one hit may sheds some light.

000814 1610683006.953568 write      0x09385c thru 0x098924	(0x0550c9 bytes)
000817 1610683006.954846 fallocate  0x092fd1 thru 0x09bfb3	(0x058fe3 bytes)

then the issue got hit with some debug info shows

[ 1361.292883] LustreError: 16917:0:(osc_cache.c:1287:osc_refresh_count()) kms 602065(0x92fd1) page offset 622592(0x98000)
[ 1361.293036] LustreError: 16917:0:(osc_cache.c:1113:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: last_oap_count 0

the file's kms and fallocate offset coincides (0x092fd1), the last sync page looks from the write (0x98000 -> 0x98924).

Comment by Gerrit Updater [ 15/Jan/21 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/41233
Subject: LU-14326 osc: correctly update size/kms for fallocate
Project: fs/lustre-release
Branch: master
Current Patch Set: 3
Commit: a202c4dbbd8dc6f763b6672a9fe604a9179f209a

Comment by Zhenyu Xu [ 15/Jan/21 ]

Ran the fsx test for 1000 times with the patch, no LBUG happened again. 

Comment by Oleg Drokin [ 15/Jan/21 ]

please see my comment in the gerrit. it shifted failures to LU-14333 style crashes and also now sanityn fsx tests in test_16[abc] are 100% failure because fsx sees wrong size.

Comment by Zhenyu Xu [ 15/Jan/21 ]

I think the patch set 3 is a correct fix.

Comment by James A Simmons [ 15/Jan/21 ]

I don't think this is fallocate issue. This problem has been around a long time. Its just fallocate really exposes this problem.

Comment by Gerrit Updater [ 20/Jan/21 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/41272
Subject: LU-14326 osc: correctly update size/kms for fallocate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 114140000a298b9ee0e774fc1785fed034961afb

Comment by Gerrit Updater [ 22/Jan/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41272/
Subject: LU-14326 osc: correctly update size/kms for fallocate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43979e4e257e78d3d94ea3f21b65fdebbbdc690f

Comment by Peter Jones [ 22/Jan/21 ]

Landed for 2.14

Generated at Sat Feb 10 03:08:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.