[LU-11560] recovery-small test 134 fails with ‘rm failed’ Created: 23/Oct/18  Updated: 08/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11789 recovery-small test 134 fails with 'm... Open
is related to LU-12210 Test failures associated with DNE ind... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-small test_134 is failing to remove and/or move files. Looking at the client test_log for https://testing.whamcloud.com/test_sets/18761bbc-d05c-11e8-82f2-52540065bddc, we see errors on remove and move:

Started lustre-MDT0000
rm: cannot remove '/mnt/lustre/d134.recovery-small/1/f134.recovery-small': Input/output error
onyx-39vm5: error: invalid path '/mnt/lustre': Input/output error
onyx-39vm7: error: invalid path '/mnt/lustre': Input/output error
onyx-39vm7: mv: cannot stat '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Input/output error
onyx-39vm8: error: invalid path '/mnt/lustre': Input/output error
CMD: onyx-39vm5,onyx-39vm7,onyx-39vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
onyx-39vm7: onyx-39vm7: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm8: onyx-39vm8: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm5: onyx-39vm5: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm8: CMD: onyx-39vm8 lctl get_param -n at_max
onyx-39vm8: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-39vm7: CMD: onyx-39vm7 lctl get_param -n at_max
onyx-39vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-39vm5: CMD: onyx-39vm5 lctl get_param -n at_max
onyx-39vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
 recovery-small test_134: @@@@@@ FAIL: rm failed 
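The log above shows each client polling its MDC import until it reaches FULL or IDLE. The real helper lives in the test framework (invoked via rpc.sh as wait_import_state_mount); the sketch below only illustrates the shape of that polling loop, with a stubbed get_import_state standing in for the actual `lctl get_param -n mdc.lustre-MDT0000-mdc-*.mds_server_uuid` call so it runs without a Lustre client.

```shell
#!/bin/sh
# Simplified sketch of the wait_import_state_mount polling loop.
# get_import_state is a stub here; on a real client it would run
# "lctl get_param -n mdc.lustre-MDT0000-mdc-*.mds_server_uuid".
get_import_state() { echo "FULL"; }

wait_import_state() {
  expected="$1" deadline="$2" elapsed=0
  while [ "$elapsed" -le "$deadline" ]; do
    state=$(get_import_state)
    case "$state" in
      # Either state means the import is usable again after recovery.
      FULL|IDLE) echo "import in $state state after $elapsed sec"; return 0 ;;
    esac
    sleep 1; elapsed=$((elapsed + 1))
  done
  echo "import never reached $expected within $deadline sec" >&2
  return 1
}

wait_import_state "(FULL|IDLE)" 10   # prints: import in FULL state after 0 sec
```

With the stub always answering FULL, the loop succeeds on the first poll, which is why the log lines above report "in FULL state after 0 sec" even though the rm/mv operations had already failed.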

In the client1 (vm5) console log, we see the client try to remove the file and get an error:

[181353.732914] Lustre: DEBUG MARKER: rm /mnt/lustre/d134.recovery-small/1/f134.recovery-small
[181364.712997] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867]  req@ffff88007862b3c0 x1614152568603904/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[181364.713002] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
[181364.713034] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
[181364.713035] LustreError: Skipped 1 previous similar message
[181421.136794] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[181421.136798] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) Skipped 26 previous similar messages
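The rc = -5 in the revalidate message is -EIO, which matches the "Input/output error" that rm and mv reported at the user level (and rc = -22, seen in a later comment, is -EINVAL). A minimal sketch of decoding these codes, using a hand-rolled table covering just the values in this ticket (this is an illustration, not a Lustre utility):

```shell
#!/bin/sh
# Map the negative rc values seen in Lustre console logs to errno names.
# Only the codes relevant to this ticket are tabulated.
rc_name() {
  case "${1#-}" in
    5)  echo EIO    ;;  # Input/output error - the revalidate FID failures
    22) echo EINVAL ;;  # Invalid argument - the update-log failure below
    *)  echo "errno ${1#-}" ;;
  esac
}

rc_name -5    # prints: EIO
rc_name -22   # prints: EINVAL
```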

In another client's (vm7) console log, we see the client try to move a file and hit the same error:

[ 6695.690885] Lustre: DEBUG MARKER: mv /mnt/lustre/d134.recovery-small/2/f134.recovery-small /mnt/lustre/d134.recovery-small/2/f134.recovery-small_2
[ 6706.654708] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867]  req@ffff88006e3946c0 x1614335408916880/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6706.654741] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[ 6706.654775] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
[ 6706.654776] LustreError: Skipped 1 previous similar message
[ 6709.002741] Lustre: lustre-MDT0000-mdc-ffff880070293800: Connection to lustre-MDT0000 (at 10.2.8.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6709.002751] Lustre: Skipped 2 previous similar messages
[ 6774.109416] LustreError: 167-0: lustre-MDT0000-mdc-ffff880070293800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[ 6774.109423] LustreError: Skipped 1 previous similar message
[ 6774.109684] LustreError: 23597:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[ 6774.115931] Lustre: Evicted from MGS (at 10.2.8.116@tcp) after server handle changed from 0xc36de71764adb58 to 0xc256e14d64d7ddc7

We see recovery-small test 134 fail frequently with this error message, but the client logs do not always contain the 'revalidate FID' error, so I'm not sure all of the failures share the same cause. Here's an example of recovery-small test 134 failing without the 'revalidate FID' error:
https://testing.whamcloud.com/test_sets/1067f8b8-d6c0-11e8-b589-52540065bddc
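One way to split a batch of collected console logs into the two variants described above is a simple grep pass. The sketch below fabricates a sample log from a line quoted in this ticket purely to make the example self-contained; the file path and loop are illustrative, not part of any test framework.

```shell
#!/bin/sh
# Triage sketch: classify console logs by whether they contain the
# 'revalidate FID' error. The sample file is built from a log line
# quoted in this ticket so the script runs anywhere.
mkdir -p /tmp/lu11560-logs
cat > /tmp/lu11560-logs/console.vm5.log <<'EOF'
LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
EOF

classify_log() {
  if grep -q 'revalidate FID' "$1"; then
    echo "$1: revalidate-FID variant"
  else
    echo "$1: other variant"
  fi
}

for log in /tmp/lu11560-logs/*.log; do
  classify_log "$log"
done
```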



 Comments   
Comment by James Nunez (Inactive) [ 09/Apr/19 ]

We've seen recovery-small test 134 fail with 'rm failed' for DNE testing at a high rate since 8 April 2019. Some recent failures are at:
https://testing.whamcloud.com/test_sets/29425f24-5a99-11e9-9720-52540065bddc
https://testing.whamcloud.com/test_sets/4e97765e-5a78-11e9-9646-52540065bddc
https://testing.whamcloud.com/test_sets/56a401c4-5a4a-11e9-b98a-52540065bddc

Comment by Patrick Farrell (Inactive) [ 15/Apr/19 ]

[21070.939051] LustreError: 18395:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22

That's the same error as in LU-12175; I suspect the same underlying cause.

Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ]

The recent spate of failures was related to LU-12175, but the original report A) is still fairly recent and B) predates the landing of the problematic patch (LU-11636), so let's leave this one open.

Generated at Sat Feb 10 02:44:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.