Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4
- Severity: 3
Description
recovery-small test_134 is failing to remove and/or move files. Looking at the client test_log for https://testing.whamcloud.com/test_sets/18761bbc-d05c-11e8-82f2-52540065bddc , we see errors on remove and move:
Started lustre-MDT0000
rm: cannot remove '/mnt/lustre/d134.recovery-small/1/f134.recovery-small': Input/output error
onyx-39vm5: error: invalid path '/mnt/lustre': Input/output error
onyx-39vm7: error: invalid path '/mnt/lustre': Input/output error
onyx-39vm7: mv: cannot stat '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Input/output error
onyx-39vm8: error: invalid path '/mnt/lustre': Input/output error
CMD: onyx-39vm5,onyx-39vm7,onyx-39vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm7: onyx-39vm7: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm8: onyx-39vm8: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm5: onyx-39vm5: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-39vm8: CMD: onyx-39vm8 lctl get_param -n at_max
onyx-39vm8: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-39vm7: CMD: onyx-39vm7 lctl get_param -n at_max
onyx-39vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-39vm5: CMD: onyx-39vm5 lctl get_param -n at_max
onyx-39vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
recovery-small test_134: @@@@@@ FAIL: rm failed
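For reference, the failing steps in the log above amount to a plain rm on one client and a mv on another against the shared test directory. The following is only a hypothetical sketch of that sequence for reasoning about which operation hits EIO, with hostnames and paths copied from the log; the real test_134 lives in lustre/tests/recovery-small.sh and does more setup around the MDT restart:

#!/bin/bash
# Hypothetical sketch of the failing operations only; not the actual
# recovery-small test_134 logic.
MNT=/mnt/lustre
DIR=$MNT/d134.recovery-small

# Client 1 (onyx-39vm5): the rm that returned "Input/output error".
ssh onyx-39vm5 "rm $DIR/1/f134.recovery-small" \
    || echo "rm failed on onyx-39vm5"

# Client 2 (onyx-39vm7): the mv whose stat of the source also returned EIO.
ssh onyx-39vm7 "mv $DIR/2/f134.recovery-small $DIR/2/f134.recovery-small_2" \
    || echo "mv failed on onyx-39vm7"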
In the client1 (vm5) console log, we see the client trying to remove the file and getting an error:
[181353.732914] Lustre: DEBUG MARKER: rm /mnt/lustre/d134.recovery-small/1/f134.recovery-small
[181364.712997] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867] req@ffff88007862b3c0 x1614152568603904/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[181364.713002] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
[181364.713034] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
[181364.713035] LustreError: Skipped 1 previous similar message
[181421.136794] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[181421.136798] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) Skipped 26 previous similar messages
In another client’s (vm7) console log, we see the client trying to move a file and getting the same error:
[ 6695.690885] Lustre: DEBUG MARKER: mv /mnt/lustre/d134.recovery-small/2/f134.recovery-small /mnt/lustre/d134.recovery-small/2/f134.recovery-small_2
[ 6706.654708] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867] req@ffff88006e3946c0 x1614335408916880/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6706.654741] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[ 6706.654775] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
[ 6706.654776] LustreError: Skipped 1 previous similar message
[ 6709.002741] Lustre: lustre-MDT0000-mdc-ffff880070293800: Connection to lustre-MDT0000 (at 10.2.8.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6709.002751] Lustre: Skipped 2 previous similar messages
[ 6774.109416] LustreError: 167-0: lustre-MDT0000-mdc-ffff880070293800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[ 6774.109423] LustreError: Skipped 1 previous similar message
[ 6774.109684] LustreError: 23597:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[ 6774.115931] Lustre: Evicted from MGS (at 10.2.8.116@tcp) after server handle changed from 0xc36de71764adb58 to 0xc256e14d64d7ddc7
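The rc = -5 in the revalidate messages is -EIO, matching the 'Input/output error' that rm and mv reported: the client was evicted by lustre-MDT0000, and operations issued before the import recovered fail. As a hedged side note (not how the framework itself necessarily does it), the MDC import state can be watched by hand on an affected client via lctl get_param, along these lines:

# Hypothetical manual check on a client: poll the MDC import until it leaves
# DISCONN/EVICTED and reaches FULL (or IDLE). The parameter name assumes a
# single MDT named lustre-MDT0000, as in the logs above.
while :; do
    state=$(lctl get_param -n 'mdc.lustre-MDT0000-mdc-*.import' \
            | awk '/^ *state:/ {print $2; exit}')
    echo "MDC import state: ${state:-unknown}"
    case "$state" in
        FULL|IDLE) break ;;
    esac
    sleep 1
done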
We see recovery-small test 134 fail frequently with this error message, but the client logs do not always contain the ‘revalidate FID’ error, so I’m not sure all of the failures have the same cause. Here’s an example of recovery-small test 134 failing without the ‘revalidate FID’ error:
https://testing.whamcloud.com/test_sets/1067f8b8-d6c0-11e8-b589-52540065bddc
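Since the 'revalidate FID' message is not present in every failing run, one quick way to sort the runs is to grep the collected client console logs for that signature. A small sketch, assuming the console logs from the failed sessions have been downloaded locally (the ./console-logs/*.log filenames are hypothetical):

# Hypothetical triage helper: split locally saved client console logs by
# whether they contain the 'revalidate FID ... rc = -5' signature.
for log in ./console-logs/*.log; do
    if grep -q "revalidate FID" "$log"; then
        echo "has revalidate-FID error: $log"
    else
        echo "no revalidate-FID error:  $log"
    fi
done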