[LU-12810] replay-single test 20b fails with 'after 180548 > before N + 50' Created: 26/Sep/19 Updated: 14/Apr/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | DNE |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_20b fails for ldiskfs with errors similar to 'after 180548 > before 25792 + 50', and for ZFS with errors similar to 'after 21504 > before 3072 + 2048'. We find examples of this error in master and b2_12 since at least June 2019 for the failover test group with DNE configured.

In all of these cases, recovery completes, and the test then forces a sync and checks whether space has been freed, up to three times. If the space is still not freed, the test exits with an error.

Looking at https://testing.whamcloud.com/test_sets/a5f46eb0-d40e-11e9-97d5-52540065bddc, we see the sync/test in the client test_log:

trevis-40vm8: *.lustre-MDT0000.recovery_status status: COMPLETE
Waiting for local destroys to complete
CMD: trevis-40vm8 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-40vm6 lctl set_param -n osd*.*OS*.force_sync=1
before 25800, after 1784144
CMD: trevis-40vm8 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-40vm6 lctl set_param -n osd*.*OS*.force_sync=1
before 25800, after 1784144
CMD: trevis-40vm8 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-40vm6 lctl set_param -n osd*.*OS*.force_sync=1
before 25800, after 1784144
replay-single test_20b: @@@@@@ FAIL: after 1784144 > before 25800 + 50
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5829:error()
= /usr/lib64/lustre/tests/replay-single.sh:513:test_20b()

In the client test logs for all of these failures, we see a ‘Transport endpoint’ error at the beginning of the test when trying to set force_sync. For example, for https://testing.whamcloud.com/test_sets/b73e377a-d349-11e9-9fc9-52540065bddc, we see

== replay-single test 20b: write, unlink, eviction, replay (test mds_cleanup_orphans) ================ 19:21:45 (1568056905)
CMD: trevis-35vm11 lctl set_param -n os[cd]*.*MDT*.force_sync=1
trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0001-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0002-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0003-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0004-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0005-osc-MDT0000/force_sync=1: Transport endpoint is not connected
CMD: trevis-35vm10 lctl set_param -n osd*.*OS*.force_sync=1

In one case, we see the ‘Transport endpoint’ error during the final syncs before calling error():

Waiting for local destroys to complete
CMD: trevis-23vm12 lctl set_param -n os[cd]*.*MDT*.force_sync=1
trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0001-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0002-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0003-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0004-osc-MDT0000/force_sync=1: Transport endpoint is not connected
trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0005-osc-MDT0000/force_sync=1: Transport endpoint is not connected
CMD: trevis-23vm10 lctl set_param -n osd*.*OS*.force_sync=1
before 25832, after 180548
CMD: trevis-23vm12 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-23vm10 lctl set_param -n osd*.*OS*.force_sync=1
before 25832, after 180548
CMD: trevis-23vm12 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-23vm10 lctl set_param -n osd*.*OS*.force_sync=1
before 25832, after 180548
replay-single test_20b: @@@@@@ FAIL: after 180548 > before 25832 + 50

It seems that the forced syncs may not be taking place, or that fewer of them take place than the test attempts.

Logs for more recent failures are at |
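
For triaging further logs, the pass/fail condition described above (the test passes only if some attempt shows `after` no larger than `before` plus a margin) can be approximated with a small log checker. This is a rough sketch and not the code in replay-single.sh: the function names and the `MARGIN` table are hypothetical, and the margins (50 for ldiskfs, 2048 for ZFS) are taken only from the failure messages quoted in this ticket.

```python
import re

# Margins as they appear in the quoted FAIL messages (assumption: these are
# the only two values in use; they are not read from replay-single.sh).
MARGIN = {"ldiskfs": 50, "zfs": 2048}

# Matches the per-attempt lines such as "before 25800, after 1784144".
# It does not match the final "FAIL: after X > before Y + 50" line.
PAIR_RE = re.compile(r"before (\d+), after (\d+)")


def parse_attempts(log_text):
    """Return (before, after) pairs in the order they appear in the log."""
    return [(int(b), int(a)) for b, a in PAIR_RE.findall(log_text)]


def space_freed(before, after, fstype):
    """True when usage is back within the margin of the starting value."""
    return after <= before + MARGIN[fstype]


def check_log(log_text, fstype="ldiskfs"):
    """Approximate the test outcome: pass if any attempt shows freed space."""
    return any(space_freed(b, a, fstype) for b, a in parse_attempts(log_text))


sample = """\
before 25800, after 1784144
before 25800, after 1784144
before 25800, after 1784144
"""
print(parse_attempts(sample))
print(check_log(sample))
```

With the three attempts from the first log above, every `after` value exceeds `before + 50`, so `check_log` reports a failure, matching the FAIL line in the test output.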