[LU-11560] recovery-small test 134 fails with 'rm failed'

| Created: | 23/Oct/18 | Updated: | 08/Oct/20 |
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
recovery-small test_134 is failing to remove and/or move files. Looking at the client test_log for https://testing.whamcloud.com/test_sets/18761bbc-d05c-11e8-82f2-52540065bddc , we see errors on remove and copy:

Started lustre-MDT0000
rm: cannot remove '/mnt/lustre/d134.recovery-small/1/f134.recovery-small': Input/output error
onyx-39vm5: error: invalid path '/mnt/lustre': Input/output error
onyx-39vm7: error: invalid path '/mnt/lustre': Input/output error
onyx-39vm7: mv: cannot stat '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Input/output error
onyx-39vm8: error: invalid path '/mnt/lustre': Input/output error
CMD: onyx-39vm5,onyx-39vm7,onyx-39vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm7: onyx-39vm7: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm8: onyx-39vm8: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm5: onyx-39vm5: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm8: CMD: onyx-39vm8 lctl get_param -n at_max
onyx-39vm8: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-39vm7: CMD: onyx-39vm7 lctl get_param -n at_max
onyx-39vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-39vm5: CMD: onyx-39vm5 lctl get_param -n at_max
onyx-39vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
recovery-small test_134: @@@@@@ FAIL: rm failed

On the client1 (vm5) console log, we see the client trying to remove the file and getting an error:

[181353.732914] Lustre: DEBUG MARKER: rm /mnt/lustre/d134.recovery-small/1/f134.recovery-small
[181364.712997] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867] req@ffff88007862b3c0 x1614152568603904/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[181364.713002] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
[181364.713034] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
[181364.713035] LustreError: Skipped 1 previous similar message
[181421.136794] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[181421.136798] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) Skipped 26 previous similar messages

(rc = -5 is -EIO, matching the 'Input/output error' reported by rm and mv in the test log.)

On another client's (vm7) console log, we see the client trying to move a file and getting the same error:

[ 6695.690885] Lustre: DEBUG MARKER: mv /mnt/lustre/d134.recovery-small/2/f134.recovery-small /mnt/lustre/d134.recovery-small/2/f134.recovery-small_2
[ 6706.654708] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867] req@ffff88006e3946c0 x1614335408916880/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6706.654741] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[ 6706.654775] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
[ 6706.654776] LustreError: Skipped 1 previous similar message
[ 6709.002741] Lustre: lustre-MDT0000-mdc-ffff880070293800: Connection to lustre-MDT0000 (at 10.2.8.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6709.002751] Lustre: Skipped 2 previous similar messages
[ 6774.109416] LustreError: 167-0: lustre-MDT0000-mdc-ffff880070293800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[ 6774.109423] LustreError: Skipped 1 previous similar message
[ 6774.109684] LustreError: 23597:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[ 6774.115931] Lustre: Evicted from MGS (at 10.2.8.116@tcp) after server handle changed from 0xc36de71764adb58 to 0xc256e14d64d7ddc7

We see recovery-small test 134 fail frequently with this error message, but the client logs do not always contain the 'revalidate FID' error, so I'm not sure all of the failures have the same cause. Here's an example of recovery-small test 134 failing without the 'revalidate FID' error.
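For context, the wait_import_state_mount call in the test log above polls the MDC import state on each client. A minimal sketch of doing the same check by hand (assumptions: run directly on one of the affected clients with lctl in PATH; the device name pattern is taken from the logs above):

# Print the current MDC import state; a healthy client reports FULL
# (or IDLE on 2.12+ with idle disconnect enabled), while an evicted or
# disconnected client shows DISCONN/CONNECTING instead.
lctl get_param mdc.lustre-MDT0000-mdc-*.import | awk '/state:/ {print $2}'

# Where available, the import state history shows whether the client
# passed through EVICTED during the test window.
lctl get_param mdc.lustre-MDT0000-mdc-*.state

Note that in the test log above the imports do report FULL after 0 sec, so a point-in-time state check would not catch the problem; the state history is the more useful place to look for an intervening eviction.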
| Comments |
| Comment by James Nunez (Inactive) [ 09/Apr/19 ] |
We've seen recovery-small test 134 fail with 'rm failed' for DNE testing at a high rate since 8 April 2019. Some recent failures are at:
| Comment by Patrick Farrell (Inactive) [ 15/Apr/19 ] |
[21070.939051] LustreError: 18395:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22

That's the same error as in LU-12175; I suspect the same underlying cause...
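As a triage aid, a hypothetical grep (the log location is an assumption, not from the ticket) to check whether a given failure instance carries this LU-12175 signature; rc = -22 is -EINVAL:

# Hypothetical path: point this at wherever the MDS console logs were collected.
grep -rE 'lod_sub_recovery_thread.*get update log failed: rc = -22' /path/to/console-logs/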
| Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ] |
The recent spate of failures was related to LU-12175, but the original report is still A) fairly recent, and B) predates the landing of the problematic patch (