[LU-11789] recovery-small test 134 fails with 'mv failed' Created: 17/Dec/18  Updated: 17/Dec/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: failover

Issue Links:
Related
is related to LU-11560 recovery-small test 134 fails with ‘r... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-small test_134 fails with 'mv failed' in failover test sessions, with three slightly different error messages.

In the client test_log for the test results at https://testing.whamcloud.com/test_sets/8e8c4c2c-ff0f-11e8-b970-52540065bddc, we see an error when the test tries to move a file:

Started lustre-MDT0000
trevis-18vm7: mv: cannot move ‘/mnt/lustre/d134.recovery-small/2/f134.recovery-small’ to ‘/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2’: Input/output error

In the Client 2 (vm7) dmesg log, we see the move command and the client eviction message:

 [ 6809.839939] Lustre: DEBUG MARKER: trevis-18vm7.trevis.whamcloud.com: executing set_default_debug -1 all 4
[ 6810.338678] Lustre: DEBUG MARKER: mv /mnt/lustre/d134.recovery-small/2/f134.recovery-small /mnt/lustre/d134.recovery-small/2/f134.recovery-small_2
[ 6819.972537] Lustre: 4372:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544690327/real 1544690327]  req@ffff88facf421800 x1619718087405600/t0(0) o400->MGC10.9.4.221@tcp@10.9.4.222@tcp:26/25 lens 224/224 e 0 to 1 dl 1544690334 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6819.975188] Lustre: 4372:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[ 6819.976154] LustreError: 166-1: MGC10.9.4.221@tcp: Connection to MGS (at 10.9.4.222@tcp) was lost; in progress operations using this service will fail
[ 6819.977455] LustreError: Skipped 1 previous similar message
[ 6825.071794] Lustre: lustre-MDT0000-mdc-ffff88faf903b800: Connection to lustre-MDT0000 (at 10.9.4.222@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6825.074196] Lustre: Skipped 2 previous similar messages
[ 6871.792922] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-18vm11.trevis.whamcloud.com: executing set_default_debug -1 all 4
[ 6871.985510] Lustre: DEBUG MARKER: trevis-18vm11.trevis.whamcloud.com: executing set_default_debug -1 all 4
[ 6895.189707] Lustre: Evicted from MGS (at 10.9.4.221@tcp) after server handle changed from 0x5bd95132c9c3f175 to 0x20082f9d6cc0548d
[ 6895.190991] Lustre: Skipped 1 previous similar message
[ 6895.192169] Lustre: MGC10.9.4.221@tcp: Connection restored to 10.9.4.221@tcp (at 10.9.4.221@tcp)
[ 6895.193082] Lustre: Skipped 10 previous similar messages
[ 6900.197799] LustreError: 167-0: lustre-MDT0000-mdc-ffff88faf903b800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.

LU-11560 is similar to this failure, but in LU-11560 the file removal also fails before the file move fails.

Additional logs for these failures are at:
https://testing.whamcloud.com/test_sets/9aa46956-fe8b-11e8-a97c-52540065bddc
https://testing.whamcloud.com/test_sets/4e5dea2a-fd1b-11e8-8512-52540065bddc
https://testing.whamcloud.com/test_sets/3a27684a-fbb0-11e8-8a18-52540065bddc
https://testing.whamcloud.com/test_sets/60009774-f617-11e8-b67f-52540065bddc

Other failures of this test show a slightly different error message:

trevis-11vm7: error: invalid path '/mnt/lustre': Input/output error
trevis-11vm7: mv: cannot stat '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Input/output error

Logs for these failures are at:
https://testing.whamcloud.com/test_sets/b2069cbc-eeab-11e8-86c0-52540065bddc
https://testing.whamcloud.com/test_sets/523a7a66-ef54-11e8-b67f-52540065bddc

Still other failures of this test show a slightly different error message and look more like LU-11560:

Started lustre-MDT0000
trevis-7vm3: mv: cannot stat '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Input/output error
trevis-7vm3: error: invalid path '/mnt/lustre': Input/output error

Logs for these failures are at:
https://testing.whamcloud.com/test_sets/666b91b0-e959-11e8-815b-52540065bddc


Generated at Sat Feb 10 02:46:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.