Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
None
-
None
-
3
-
9223372036854775807
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c3779d8f-6a21-4801-b14b-6f81a72baade
test_33a failed with the following error:
Timeout occurred after 163 mins, last suite running was sanity-flr
The client does not show any errors/timeouts in this case (from client dmesg log):
[ 6322.413011] Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec [ 6322.507218] bash (262976): drop_caches: 3 [ 9912.891960] sysrq: SysRq : Changing Loglevel [ 9912.892752] sysrq: Loglevel set to 8 [ 9912.894583] sysrq: SysRq : Show State : : [ 9915.051898] cat I 0 264902 264901 0x00004080 [ 9915.052798] Call Trace: [ 9915.053236] __schedule+0x2c4/0x700 [ 9915.053828] schedule+0x38/0xa0 [ 9915.054418] cl_sync_io_wait+0x211/0x2f0 [obdclass] [ 9915.055886] ll_io_read_page+0x443/0x820 [lustre] [ 9915.056681] ll_readpage+0xf8/0x820 [lustre] [ 9915.057391] generic_file_buffered_read+0x623/0xb00 [ 9915.058908] vvp_io_read_start+0x4c6/0x6e0 [lustre] [ 9915.059739] cl_io_start+0x59/0x110 [obdclass] [ 9915.060500] cl_io_loop+0x9a/0x1e0 [obdclass] [ 9915.061231] ll_file_io_generic+0x447/0xc40 [lustre] [ 9915.062884] ll_file_read_iter+0x246/0x2e0 [lustre] [ 9915.063693] new_sync_read+0x121/0x170 [ 9915.064321] vfs_read+0x91/0x140 [ 9915.064875] ksys_read+0x4f/0xb0 [ 9915.065429] do_syscall_64+0x5b/0x1a0
It looks like the OST0000 remount did not finish recovery before OST0001 was unmounted, and OST0001 was never remounted because the cat did not complete (from OSS dmesg log):
[ 6219.486214] Lustre: lustre-OST0000: deleting orphan objects from 0x0:1027 to 0x0:1089 [ 6219.605486] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true [ 6220.265117] Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost2 [ 6220.605623] Lustre: Failing over lustre-OST0001 [ 6220.606433] Lustre: Skipped 3 previous similar messages [ 6220.618067] Lustre: server umount lustre-OST0001 complete [ 6224.603047] Lustre: lustre-OST0000: already connected client lustre-MDT0002-mdtlov_UUID (at 10.9.7.34@tcp) with handle 0x6b4562569fa55212. Rejecting client with the same UUID trying to reconnect with handle 0xd5828fc44237bcac [ 6224.606376] Lustre: Skipped 2 previous similar messages [ 6224.832608] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-207vm1.trevis.whamcloud.com: executing wait_import_state \(DISCONN\|IDLE\) osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid 40 [ 6225.231972] Lustre: DEBUG MARKER: trevis-207vm1.trevis.whamcloud.com: executing wait_import_state (DISCONN|IDLE) osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid 40 [ 6226.704321] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec [ 6227.113264] Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec [ 6229.721530] Lustre: lustre-OST0000: already connected client lustre-MDT0002-mdtlov_UUID (at 10.9.7.34@tcp) with handle 0x6b4562569fa55212. Rejecting client with the same UUID trying to reconnect with handle 0xd5828fc44237bcac [ 6229.729916] Lustre: lustre-OST0000: deleting orphan objects from 0x300000401:642 to 0x300000401:1089 [ 6239.965362] Lustre: lustre-OST0000: already connected client lustre-MDT0003-mdtlov_UUID (at 10.9.7.39@tcp) with handle 0xb40b5df2a03e11e5. Rejecting client with the same UUID trying to reconnect with handle 0x5296d9516634a720 [ 6239.965824] Lustre: lustre-OST0000: deleting orphan objects from 0x300000402:1187 to 0x300000402:1313 [ 6239.968458] Lustre: Skipped 3 previous similar messages [ 6250.208413] Lustre: lustre-OST0000: deleting orphan objects from 0x300000400:1188 to 0x300000400:1313 [ 6357.721301] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.7.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. [ 6357.724292] LustreError: Skipped 142 previous similar messages
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-flr test_33a - Timeout occurred after 163 mins, last suite running was sanity-flr