LU-11560: recovery-small test 134 fails with ‘rm failed’


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4
    • Severity: 3

    Description

      recovery-small test_134 is failing to remove and/or move files. Looking at the client test log for https://testing.whamcloud.com/test_sets/18761bbc-d05c-11e8-82f2-52540065bddc, we see errors on the remove and the move:

      Started lustre-MDT0000
      rm: cannot remove '/mnt/lustre/d134.recovery-small/1/f134.recovery-small': Input/output error
      onyx-39vm5: error: invalid path '/mnt/lustre': Input/output error
      onyx-39vm7: error: invalid path '/mnt/lustre': Input/output error
      onyx-39vm7: mv: cannot stat '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Input/output error
      onyx-39vm8: error: invalid path '/mnt/lustre': Input/output error
      CMD: onyx-39vm5,onyx-39vm7,onyx-39vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
      onyx-39vm7: onyx-39vm7: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-39vm8: onyx-39vm8: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-39vm5: onyx-39vm5: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-39vm8: CMD: onyx-39vm8 lctl get_param -n at_max
      onyx-39vm8: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      onyx-39vm7: CMD: onyx-39vm7 lctl get_param -n at_max
      onyx-39vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      onyx-39vm5: CMD: onyx-39vm5 lctl get_param -n at_max
      onyx-39vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
       recovery-small test_134: @@@@@@ FAIL: rm failed 
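
      For context, the (FULL|IDLE) wait at the end of the log is the test framework confirming that each client’s MDC import reconnected after the MDT restart. Below is a minimal stand-alone sketch of that polling idea; it is an illustration, not the actual wait_import_state_mount code from test-framework.sh, and it assumes the *_server_uuid parameter prints “<uuid> <state>”:

      #!/bin/bash
      # Sketch only: poll an MDC import until it reaches an expected state,
      # the same idea as wait_import_state_mount in test-framework.sh.
      param="mdc.lustre-MDT0000-mdc-*.mds_server_uuid"
      expected="FULL|IDLE"
      max=$(lctl get_param -n at_max)   # same bound the log shows being read
      for ((i = 0; i < max; i++)); do
          # assumption: the parameter prints "<uuid> <state>"; take field 2
          state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}')
          [[ "$state" =~ ^($expected)$ ]] &&
              { echo "$param in $state state after $i sec"; exit 0; }
          sleep 1
      done
      echo "import never reached $expected within $max sec" >&2
      exit 1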
      

      In the console log of client 1 (vm5), we see the client try to remove the file and get an error:

      [181353.732914] Lustre: DEBUG MARKER: rm /mnt/lustre/d134.recovery-small/1/f134.recovery-small
      [181364.712997] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867]  req@ffff88007862b3c0 x1614152568603904/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [181364.713002] Lustre: 3269:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
      [181364.713034] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
      [181364.713035] LustreError: Skipped 1 previous similar message
      [181421.136794] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
      [181421.136798] LustreError: 26336:0:(file.c:4383:ll_inode_revalidate_fini()) Skipped 26 previous similar messages
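
      The rc = -5 in ll_inode_revalidate_fini() is -EIO, which is exactly the ‘Input/output error’ that rm reports: the connection to the MDS was lost, the revalidate of the FID fails, and the error propagates back to userspace. A hedged diagnostic one could run on the failing client while the test is stuck (lctl get_param and the mdc import file are standard; treat the exact device name as an assumption):

      # Show the MDC import state (FULL, DISCONN, EVICTED, ...):
      lctl get_param mdc.lustre-MDT0000-mdc-*.import | awk '/state:/ {print $2}'
      # rc = -5 is -EIO ("Input/output error"); count revalidate failures:
      dmesg | grep -c ll_inode_revalidate_fini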
      

      In the console log of another client (vm7), we see the client try to move a file and get the same error:

      [ 6695.690885] Lustre: DEBUG MARKER: mv /mnt/lustre/d134.recovery-small/2/f134.recovery-small /mnt/lustre/d134.recovery-small/2/f134.recovery-small_2
      [ 6706.654708] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539556867/real 1539556867]  req@ffff88006e3946c0 x1614335408916880/t0(0) o400->MGC10.2.8.116@tcp@10.2.8.117@tcp:26/25 lens 224/224 e 0 to 1 dl 1539556874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 6706.654741] Lustre: 1779:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      [ 6706.654775] LustreError: 166-1: MGC10.2.8.116@tcp: Connection to MGS (at 10.2.8.117@tcp) was lost; in progress operations using this service will fail
      [ 6706.654776] LustreError: Skipped 1 previous similar message
      [ 6709.002741] Lustre: lustre-MDT0000-mdc-ffff880070293800: Connection to lustre-MDT0000 (at 10.2.8.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 6709.002751] Lustre: Skipped 2 previous similar messages
      [ 6774.109416] LustreError: 167-0: lustre-MDT0000-mdc-ffff880070293800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [ 6774.109423] LustreError: Skipped 1 previous similar message
      [ 6774.109684] LustreError: 23597:0:(file.c:4383:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
      [ 6774.115931] Lustre: Evicted from MGS (at 10.2.8.116@tcp) after server handle changed from 0xc36de71764adb58 to 0xc256e14d64d7ddc7
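
      The sequence here (MGS connection lost, MDT connection lost, client evicted, revalidate rc = -5) suggests the EIO is a direct consequence of the eviction. A hedged sketch for classifying a saved console log along those lines ($log is a hypothetical path):

      # Sketch: classify one client's saved console log; $log is hypothetical.
      log=/path/to/client-console.log
      if grep -q 'This client was evicted' "$log"; then
          echo "EIO explained by an MDT eviction"
      elif grep -q 'Connection to MGS' "$log"; then
          echo "MGS connection lost, but no MDT eviction logged"
      else
          echo "no eviction or MGS loss found"
      fi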
      

      We see recovery-small test 134 fail frequently with this error message, but the client logs do not always contain the ‘revalidate FID’ error, so I’m not sure all of the failures share the same cause. Here’s an example of recovery-small test 134 failing without the ‘revalidate FID’ error:
      https://testing.whamcloud.com/test_sets/1067f8b8-d6c0-11e8-b589-52540065bddc
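
      One way to bucket a batch of these failures would be to grep the collected console logs for that marker; a hedged sketch (the ./test_logs/ directory layout is hypothetical):

      # Sketch: split downloaded console logs by the 'revalidate FID' marker.
      for log in ./test_logs/*console*; do
          if grep -q 'revalidate FID' "$log"; then
              echo "$log: revalidate FID error present"
          else
              echo "$log: revalidate FID error absent"
          fi
      done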

People

    • Assignee: WC Triage
    • Reporter: James Nunez (Inactive)
    • Votes: 0
    • Watchers: 2
