[LU-14326] sanity-dom test_fsx: crash in osc_extent_make_ready() Created: 12/Jan/21 Updated: 23/Jan/21 Resolved: 22/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/393c6362-298e-472f-be95-f23a1892e9df test_fsx failed with the following error: trevis-209vm6 crashed during sanity-dom test_fsx [14491.172043] LustreError: 813:0:(osc_cache.c:1113:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: [14491.173797] LustreError: 813:0:(osc_cache.c:1113:osc_extent_make_ready()) LBUG [14491.174924] Pid: 813, comm: ldlm_bl_06 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 [14491.176409] Call Trace: [14491.176839] [<ffffffffc06e867c>] libcfs_call_trace+0x8c/0xc0 [libcfs] [14491.177945] [<ffffffffc06e899c>] lbug_with_loc+0x4c/0xa0 [libcfs] [14491.178941] [<ffffffffc0cdf284>] osc_extent_make_ready+0xb64/0xe50 [osc] [14491.180097] [<ffffffffc0ce228b>] osc_io_unplug0+0xeeb/0x1900 [osc] [14491.181117] [<ffffffffc0ce6930>] osc_cache_writeback_range+0x9a0/0xfd0 [osc] [14491.182254] [<ffffffffc0cd1925>] osc_lock_flush+0x195/0x290 [osc] [14491.183251] [<ffffffffc0cd1df8>] osc_ldlm_blocking_ast+0x2f8/0x3e0 [osc] [14491.184337] [<ffffffffc0ac713a>] ldlm_cancel_callback+0x8a/0x310 [ptlrpc] [14491.185607] [<ffffffffc0ad2dd1>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc] [14491.186747] [<ffffffffc0ad8fec>] ldlm_cli_cancel+0x10c/0x550 [ptlrpc] [14491.187822] [<ffffffffc0cd1c7a>] osc_ldlm_blocking_ast+0x17a/0x3e0 [osc] [14491.189143] [<ffffffffc0add6e8>] ldlm_handle_bl_callback+0xc8/0x3e0 [ptlrpc] [14491.190361] [<ffffffffc0addf9f>] ldlm_bl_thread_main+0x59f/0xaa0 [ptlrpc] VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by John Hammond [ 12/Jan/21 ] |
|
This assertion can be reproduced outside of sanity-dom: ${LUSTRE}/tests/llmount.sh
touch /mnt/lustre/f0
fsx -c 50 -p 100 -N 1000 -l 1048576 -S 0 -d -d /mnt/lustre/f0
On my system it takes fewer ~5 runs of fsx. [ 1456.917867] LustreError: 6997:0:(osc_cache.c:1113:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: [ 1456.920595] LustreError: 6997:0:(osc_cache.c:1113:osc_extent_make_ready()) LBUG [ 1456.922471] Pid: 6997, comm: fsx 3.10.0-1062.9.1.el7.x86_64.debug #1 SMP Mon Mar 16 12:44:56 CDT 2020 [ 1456.924801] Call Trace: [ 1456.925471] [<ffffffffc06c2d4c>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 1456.927163] [<ffffffffc06c306c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 1456.928725] [<ffffffffc0e025e4>] osc_extent_make_ready+0xb64/0xe50 [osc] [ 1456.930547] [<ffffffffc0e05603>] osc_io_unplug0+0xee3/0x1910 [osc] [ 1456.932230] [<ffffffffc0e09d10>] osc_cache_writeback_range+0x9a0/0xfd0 [osc] [ 1456.934128] [<ffffffffc0df8745>] osc_io_fsync_start+0x85/0x1a0 [osc] [ 1456.935852] [<ffffffffc08efa10>] cl_io_start+0x70/0x140 [obdclass] [ 1456.937585] [<ffffffffc0e49e67>] lov_io_call.isra.7+0x87/0x140 [lov] [ 1456.939251] [<ffffffffc0e4a026>] lov_io_start+0x56/0x150 [lov] [ 1456.940703] [<ffffffffc08efa10>] cl_io_start+0x70/0x140 [obdclass] [ 1456.942419] [<ffffffffc08f1eff>] cl_io_loop+0x9f/0x200 [obdclass] [ 1456.943941] [<ffffffffc14bd10b>] cl_sync_file_range+0x2db/0x380 [lustre] [ 1456.945905] [<ffffffffc14e081a>] ll_writepages+0x7a/0x200 [lustre] [ 1456.947975] [<ffffffffa1215b04>] do_writepages+0x24/0x50 [ 1456.949314] [<ffffffffa1207e35>] __filemap_fdatawrite_range+0x65/0x80 [ 1456.950776] [<ffffffffa1207f94>] filemap_write_and_wait_range+0x44/0x90 [ 1456.952472] [<ffffffffc14bd240>] ll_fsync+0x90/0x540 [lustre] [ 1456.953904] [<ffffffffa12e5770>] vfs_fsync_range+0x20/0x30 [ 1456.955335] [<ffffffffa124e18e>] SyS_msync+0x1fe/0x250 [ 1456.962318] [<ffffffffa1887d4a>] tracesys+0xa6/0xcc [ 1456.963771] [<ffffffffffffffff>] 0xffffffffffffffff [ 1456.965134] Kernel panic - not syncing: LBUG |
| Comment by Andreas Dilger [ 14/Jan/21 ] |
|
Bobijam, could you please take a look. This is being hit pretty regularly since fallocate() checking has been fixed in the client, and fsx is now properly testing that functionality. Is it possible that this LASSERT() is just wrong for fallocate() when there are no data pages being sent in the RPC? |
| Comment by Andreas Dilger [ 14/Jan/21 ] |
|
It looks like there was a patch https://review.whamcloud.com/30848 " |
| Comment by Oleg Drokin [ 14/Jan/21 ] |
|
This is a major sanity-benchmark crashes source (in fsx) for me. |
| Comment by Oleg Drokin [ 14/Jan/21 ] |
|
I just also filed |
| Comment by Andreas Dilger [ 14/Jan/21 ] |
|
Oleg reports that the 30848 patch does not solve the fsx test issues being seen. |
| Comment by Zhenyu Xu [ 15/Jan/21 ] |
|
my test shows that #30848 only reduced the hit probability, and I got one hit may sheds some light. 000814 1610683006.953568 write 0x09385c thru 0x098924 (0x0550c9 bytes) 000817 1610683006.954846 fallocate 0x092fd1 thru 0x09bfb3 (0x058fe3 bytes) then the issue got hit with some debug info shows [ 1361.292883] LustreError: 16917:0:(osc_cache.c:1287:osc_refresh_count()) kms 602065(0x92fd1) page offset 622592(0x98000) [ 1361.293036] LustreError: 16917:0:(osc_cache.c:1113:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: last_oap_count 0 the file's kms and fallocate offset coincides (0x092fd1), the last sync page looks from the write (0x98000 -> 0x98924). |
| Comment by Gerrit Updater [ 15/Jan/21 ] |
|
|
| Comment by Zhenyu Xu [ 15/Jan/21 ] |
|
Ran the fsx test for 1000 times with the patch, no LBUG happened again. |
| Comment by Oleg Drokin [ 15/Jan/21 ] |
|
please see my comment in the gerrit. it shifted failures to |
| Comment by Zhenyu Xu [ 15/Jan/21 ] |
|
I think the patch set 3 is a correct fix. |
| Comment by James A Simmons [ 15/Jan/21 ] |
|
I don't think this is fallocate issue. This problem has been around a long time. Its just fallocate really exposes this problem. |
| Comment by Gerrit Updater [ 20/Jan/21 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/41272 |
| Comment by Gerrit Updater [ 22/Jan/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41272/ |
| Comment by Peter Jones [ 22/Jan/21 ] |
|
Landed for 2.14 |