[LU-14963] sanity-flr test_33a: timeout with no errors Created: 24/Aug/21  Updated: 24/Aug/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c3779d8f-6a21-4801-b14b-6f81a72baade

test_33a failed with the following error:

Timeout occurred after 163 mins, last suite running was sanity-flr
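
For context, sanity-flr test_33a appears to exercise reading a mirrored (FLR) file while the OST backing one of its mirrors is failed over. A rough sketch of that pattern, pieced together from the logs below; the file name, mount points, and exact lfs options are illustrative, not the actual test body:

# Hedged sketch of the failover-and-read pattern suggested by the logs.
file=/mnt/lustre/f33a

# Create a file with two mirrors, one per OST, and write some data.
lfs mirror create -N -i 0 -N -i 1 $file
dd if=/dev/urandom of=$file bs=1M count=1 conv=fsync

# Take down the OST backing the second mirror (matches the OSS log below).
umount -d /mnt/lustre-ost2

# The read should be served from the surviving mirror; in this run it
# instead blocked indefinitely (client stack trace below).
cat $file > /dev/null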

The client does not show any errors or timeouts in this case; the cat task is simply blocked waiting for read I/O to complete (from the client dmesg log):

[ 6322.413011] Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec
[ 6322.507218] bash (262976): drop_caches: 3
[ 9912.891960] sysrq: SysRq : Changing Loglevel
[ 9912.892752] sysrq: Loglevel set to 8
[ 9912.894583] sysrq: SysRq : Show State
:
:
[ 9915.051898] cat             I    0 264902 264901 0x00004080
[ 9915.052798] Call Trace:
[ 9915.053236]  __schedule+0x2c4/0x700
[ 9915.053828]  schedule+0x38/0xa0
[ 9915.054418]  cl_sync_io_wait+0x211/0x2f0 [obdclass]
[ 9915.055886]  ll_io_read_page+0x443/0x820 [lustre]
[ 9915.056681]  ll_readpage+0xf8/0x820 [lustre]
[ 9915.057391]  generic_file_buffered_read+0x623/0xb00
[ 9915.058908]  vvp_io_read_start+0x4c6/0x6e0 [lustre]
[ 9915.059739]  cl_io_start+0x59/0x110 [obdclass]
[ 9915.060500]  cl_io_loop+0x9a/0x1e0 [obdclass]
[ 9915.061231]  ll_file_io_generic+0x447/0xc40 [lustre]
[ 9915.062884]  ll_file_read_iter+0x246/0x2e0 [lustre]
[ 9915.063693]  new_sync_read+0x121/0x170
[ 9915.064321]  vfs_read+0x91/0x140
[ 9915.064875]  ksys_read+0x4f/0xb0
[ 9915.065429]  do_syscall_64+0x5b/0x1a0
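
The dump above was produced through the kernel SysRq interface ("Changing Loglevel" and "Show State" in the log). For reference, the same information can be gathered by hand on a hung client; this uses only the standard Linux sysrq-trigger interface and needs root:

# Raise the console log level so the dump is emitted in full (SysRq "8"),
# then dump the state of every task (SysRq "t"), as in the log above.
echo 8 > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger
# Or dump only blocked tasks for a smaller report:
echo w > /proc/sysrq-trigger
dmesg | tail -n 200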

It looks like the OST0000 remount did not finish recovery before OST0001 was unmounted, and OST0001 was never remounted because the cat process never completed (from the OSS dmesg log; see the recovery-wait sketch after it):

[ 6219.486214] Lustre: lustre-OST0000: deleting orphan objects from 0x0:1027 to 0x0:1089
[ 6219.605486] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
[ 6220.265117] Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost2
[ 6220.605623] Lustre: Failing over lustre-OST0001
[ 6220.606433] Lustre: Skipped 3 previous similar messages
[ 6220.618067] Lustre: server umount lustre-OST0001 complete
[ 6224.603047] Lustre: lustre-OST0000: already connected client lustre-MDT0002-mdtlov_UUID (at 10.9.7.34@tcp) with handle 0x6b4562569fa55212. Rejecting client with the same UUID trying to reconnect with handle 0xd5828fc44237bcac
[ 6224.606376] Lustre: Skipped 2 previous similar messages
[ 6224.832608] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-207vm1.trevis.whamcloud.com: executing wait_import_state \(DISCONN\|IDLE\) osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid 40
[ 6225.231972] Lustre: DEBUG MARKER: trevis-207vm1.trevis.whamcloud.com: executing wait_import_state (DISCONN|IDLE) osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid 40
[ 6226.704321] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec
[ 6227.113264] Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec
[ 6229.721530] Lustre: lustre-OST0000: already connected client lustre-MDT0002-mdtlov_UUID (at 10.9.7.34@tcp) with handle 0x6b4562569fa55212. Rejecting client with the same UUID trying to reconnect with handle 0xd5828fc44237bcac
[ 6229.729916] Lustre: lustre-OST0000: deleting orphan objects from 0x300000401:642 to 0x300000401:1089
[ 6239.965362] Lustre: lustre-OST0000: already connected client lustre-MDT0003-mdtlov_UUID (at 10.9.7.39@tcp) with handle 0xb40b5df2a03e11e5. Rejecting client with the same UUID trying to reconnect with handle 0x5296d9516634a720
[ 6239.965824] Lustre: lustre-OST0000: deleting orphan objects from 0x300000402:1187 to 0x300000402:1313
[ 6239.968458] Lustre: Skipped 3 previous similar messages
[ 6250.208413] Lustre: lustre-OST0000: deleting orphan objects from 0x300000400:1188 to 0x300000400:1313
[ 6357.721301] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.7.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 6357.724292] LustreError: Skipped 142 previous similar messages
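
If that reading is right, the test (or a local reproducer) could guard against it by waiting for the remounted OST to finish recovery before failing over the next one; the test framework's wait_recovery_complete helper serves this purpose. A minimal standalone sketch, assuming the standard obdfilter recovery_status parameter; the helper name and 300-second timeout are illustrative:

# Poll recovery status on the OSS until the target reports COMPLETE
# (or INACTIVE), giving up after 300 seconds.
wait_ost_recovery() {
    local target=$1 deadline=$((SECONDS + 300))
    while (( SECONDS < deadline )); do
        case $(lctl get_param -n obdfilter.$target.recovery_status 2>/dev/null |
               awk '/^status:/ {print $2}') in
            COMPLETE|INACTIVE) return 0 ;;
        esac
        sleep 5
    done
    return 1
}

wait_ost_recovery lustre-OST0000 || echo "OST0000 recovery did not complete"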

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-flr test_33a - Timeout occurred after 163 mins, last suite running was sanity-flr

