LU-12066

recovery-small test 26b fails with “Client was not evicted by ost rc=1”

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.13.0, Lustre 2.10.7, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.3
    • failover test session
    • 3

    Description

      recovery-small test_26b fails with “Client was not evicted by ost rc=1”. We see this issue only in failover testing.

      In the suite_log of a recent failure (https://testing.whamcloud.com/test_sets/e21fbf5e-4500-11e9-9720-52540065bddc), we see an error at the beginning of the test:

      == recovery-small test 26b: evict dead exports ======================================================= 09:09:58 (1552381798)
      CMD: trevis-42vm12 lctl get_param -n timeout
      trevis-42vm1: error: invalid path '/mnt/lustre': Input/output error
      Starting client: trevis-42vm1.trevis.whamcloud.com:  -o user_xattr,flock trevis-42vm11:trevis-42vm12:/lustre /mnt/lustre2
      CMD: trevis-42vm1.trevis.whamcloud.com mkdir -p /mnt/lustre2
      CMD: trevis-42vm1.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-42vm11:trevis-42vm12:/lustre /mnt/lustre2
      CMD: trevis-42vm12 lctl get_param -n mdt.lustre-MDT0000.num_exports
      CMD: trevis-42vm5 lctl get_param -n obdfilter.lustre-OST0000.num_exports
      starting with 4 OST and 12 MDS exports
      …
      CMD: trevis-42vm5 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
      Update not seen after 60s: wanted '3' got '4'
       recovery-small test_26b: @@@@@@ FAIL: Client was not evicted by ost rc=1 
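
      The “Update not seen after 60s” message above comes from polling num_exports until it reaches the expected value. The loop below is a minimal sketch of that kind of poll, not the test framework's own implementation: the function name poll_for_value and its arguments are hypothetical, and only the message format is taken from the log.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration, not the Lustre
# test-framework code): run a command repeatedly until its output equals
# the wanted value, or give up once the timeout has elapsed.
poll_for_value() {
	local cmd=$1            # command whose stdout is compared
	local wanted=$2         # value we expect to see
	local timeout=$3        # maximum number of seconds to wait
	local interval=${4:-1}  # seconds between attempts
	local waited=0
	local got

	while true; do
		got=$(eval "$cmd" 2>/dev/null)
		if [ "$got" = "$wanted" ]; then
			echo "Updated after ${waited}s: wanted '$wanted' got '$got'"
			return 0
		fi
		if [ "$waited" -ge "$timeout" ]; then
			echo "Update not seen after ${waited}s: wanted '$wanted' got '$got'"
			return 1
		fi
		sleep "$interval"
		waited=$((waited + interval))
	done
}
```

      Run on the OSS node, a call such as poll_for_value "lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2" 3 60 would reproduce the check in the log: it succeeds only if the export count drops to 3 within 60 seconds.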
      

      On client 1 (vm1), we see an error in the console logs:

      ======================================================= 09:09:58 \(1552381798\)
      [261204.673102] Lustre: DEBUG MARKER: == recovery-small test 26b: evict dead exports ======================================================= 09:09:58 (1552381798)
      [261205.856521] Lustre: Evicted from MGS (at 10.9.3.160@tcp) after server handle changed from 0x2ef44212d062fce8 to 0x2ef44212d0630260
      [261210.854799] LustreError: 10265:0:(file.c:3644:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
      …
      

      There’s nothing obviously wrong looking at the rest of the console logs.

      Logs for more failures are at:
      2.10.7 RC1 - https://testing.whamcloud.com/test_sets/88e90b9e-429d-11e9-a256-52540065bddc
      2.10.6.79 - https://testing.whamcloud.com/test_sets/2d788f9c-41a2-11e9-b98a-52540065bddc
      2.10.6.79 - https://testing.whamcloud.com/test_sets/1b0bebc4-4143-11e9-92fe-52540065bddc
      2.10.6.63 - https://testing.whamcloud.com/test_sets/58d6004e-3cce-11e9-8e92-52540065bddc
      2.10.6.62 - https://testing.whamcloud.com/test_sets/1b8704d2-39da-11e9-8f69-52540065bddc

      We see a similar failure with master failover testing, but there the client console logs contain neither the ll_inode_revalidate_fini() error nor the ‘invalid path’ error. In some cases, recovery-small test 26a fails before the 26b failure.
      2.12.51.105 - https://testing.whamcloud.com/test_sets/3bae48de-4456-11e9-9720-52540065bddc
      2.12.51.98 - https://testing.whamcloud.com/test_sets/f7eacb0a-41fc-11e9-a256-52540065bddc
      2.12.51.98 - https://testing.whamcloud.com/test_sets/bfd286aa-4160-11e9-8e92-52540065bddc
      2.12.51.97 - https://testing.whamcloud.com/test_sets/4e759c4a-3f3a-11e9-a256-52540065bddc
      2.12.51.79 - https://testing.whamcloud.com/test_sets/d624519c-3f1f-11e9-92fe-52540065bddc
      2.12.51.97 - https://testing.whamcloud.com/test_sets/6cda3f0e-3e88-11e9-92fe-52540065bddc

          Activity

            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56356/
            Subject: LU-12066 tests: activate OSTs in recovery-small/26b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c30b69fa0038153ca26f4e493c29021b65d7d5c6

            arshad512 Arshad Hussain added a comment - - edited

            +1 on Master.

            https://testing.whamcloud.com/test_sessions/1fcfb425-039b-4dca-bd0f-fdc9a4fd89c0

            starting with 7 OST and 18 MDS exports
            [...]
            Waiting 20s for '6'
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            CMD: trevis-93vm3 lctl get_param -n *.lustre-OST0000.num_exports | cut -d' ' -f2
            Update not seen after 40s: want '6' got '4'
             recovery-small test_26b: @@@@@@ FAIL: Client was not evicted by ost  

             


            adilger Andreas Dilger added a comment -

            I think the problem with this test is caused by idle OST disconnect. This started failing back in 2018-06 after the LU-7236 idle disconnect patch landed, and has continued to fail since then.

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56356
            Subject: LU-12066 tests: activate OSTs in recovery-small/26b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cced00bee691a69bf41682418c098ddf3fcba786

            gerrit Gerrit Updater added a comment -

            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46934
            Subject: LU-12066 test: cleanup staled exports
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 45f498b586917a7f75d8e995a8532865876ae696

            pjones Peter Jones added a comment -

            Hongchao

            Could you please advise?

            Thanks

            Peter


            jamesanunez James Nunez (Inactive) added a comment -

            Although this ticket is for failover test group failures, there is a recent interop failure for a full test session that has a similar failure at https://testing.whamcloud.com/test_sets/85c1940b-6a24-4b96-9e02-5e5e976474bb for 2.13.57.36 clients and 2.12.6 servers.

            hornc Chris Horn added a comment -

            +1 on master

            https://testing.whamcloud.com/test_sessions/b2965d82-a459-4188-a035-72180920afb6

            People

              hongchao.zhang Hongchao Zhang
              jamesanunez James Nunez (Inactive)