LU-4571: sanity test_17n: create remote dir error 0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.6.0
    • Fix Version: Lustre 2.6.0
    • Labels: None
    • Severity: 3
    • Rank: 12491

    Description

      This issue was created by maloo for Di Wang <di.wang@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/9794753a-8b8b-11e3-b099-52540035b04c.

      The sub-test test_17n failed with the following error:

      create remote dir error 0

      Info required for matching: sanity 17n
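
      The "error 0" in the failure message is presumably the loop index from the remote-directory creation loop in sanity.sh test_17n. A rough sketch of that check, reconstructed from the error message rather than quoted from the script:

      # sketch only: the exact loop in sanity.sh test_17n may differ;
      # "lfs mkdir -i 1" asks MDT0001 to create the remote directory
      for ((i = 0; i < 10; i++)); do
              $LFS mkdir -i 1 $DIR/$tdir/remote_dir_${i} ||
                      error "create remote dir error $i"
      done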

    Activity

            adilger Andreas Dilger added a comment -

            I wonder if this should be taken further, with LWP connections always allowed during recovery, or possibly MDT connections always allowed during recovery? That is something for discussion, not a reason to block the current patch, though.
            di.wang Di Wang added a comment - http://review.whamcloud.com/9106
            di.wang Di Wang added a comment -

            It seems there are two problems.

            1. The OSP should not be evicted across a restart, since its connection is written to disk and should therefore be recoverable after the restart. It seems that sanity test_17m uses -f to umount mds1, which might be why the connection is not honored across the restart, causing this eviction. I am not sure why we use stop -f here; it may need to be changed to a plain stop (see the sketch after the excerpt below).

            test_17m() {
            
            .....
                    echo "stop and checking mds${mds_index}: $cmd"
                    # e2fsck should not return error
                    stop mds${mds_index} -f
                    do_facet mds${mds_index} $cmd || rc=$?
            
            .....
            }
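
            A minimal sketch of that suggested change (dropping -f so mds1 is unmounted cleanly); whether a plain stop is the right fix here is still open:

                    # sketch: stop the MDS cleanly instead of forcibly, so the OSP
                    # connection state written to disk is honored across the restart
                    stop mds${mds_index}
                    do_facet mds${mds_index} $cmd || rc=$?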
            

            2. The LWP will likely be evicted during a restart, because it is a lightweight connection and we do not store any of its connection state on disk. The MDT therefore does not recognize a reconnection after the restart, so any such reconnection (if it happens) will be evicted here. Unfortunately this causes errors (EIO): if a thread happens to be waiting on this import (mostly LOD doing an OST sequence lookup), it will get an error.
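
            For example, the eviction shows up on the MDS console (see the mds2 log excerpt quoted elsewhere in this ticket); one quick way to check for it from the test framework might be:

            # grep the MDS console ring buffer for the LWP eviction message
            do_facet mds2 "dmesg | grep 'lwp-MDT.*This client was evicted'"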


            adilger Andreas Dilger added a comment -

            I think this is exactly the same bug as LU-4420, which is also caused by a previous test restarting the MDS, and the recovery bleeds over into the next test.
            adilger Andreas Dilger added a comment (edited) -

            Di, it is better to keep bug fixes in separate patches instead of merging them into the split-directory feature patch, because that makes it easier to backport this fix to b2_4 and b2_5 (where the bug also exists).

            Also, I think that adding wait_osp_import_state() in the test script will just hide the bug during testing, while the same problem can still be hit in production. The failure is caused by sanity.sh test_17m stopping and restarting the MDS, with recovery not completing before the next test starts, as can be seen in the mds2 console logs:

            23:44:10:Lustre: DEBUG MARKER: == sanity test 17n: run e2fsck against master/slave MDT which contains remote dir == 23:43:54 (1391240634)
            23:44:10:LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            23:44:10:LustreError: 167-0: lustre-MDT0000-osp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            23:44:10:Lustre: lustre-MDT0000-osp-MDT0002: Connection restored to lustre-MDT0000 (at 10.10.17.60@tcp)
            23:44:10:Lustre: Skipped 2 previous similar messages
            23:44:31:LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            

            Note that the "create remote dir error" failure happens in test_17n before it stops the MDTs to be checked, so it must be a result of test_17m restarting the MDS.
            I think the wait for recovery completion should probably be done in the kernel, otherwise this could be hit by anyone using DNE when one of the MDS nodes fails.

            di.wang Di Wang added a comment -

            It seems the MDT needs to wait for the OSP connection to become FULL before going further, i.e. we need to add a wait_osp_import_state. I will add the fix in http://review.whamcloud.com/#/c/7445/.

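            A minimal sketch of what such a wait_osp_import_state helper might look like, loosely modeled on the existing wait_import_state helpers in test-framework.sh; the lctl parameter path for the OSP import state is an assumption here, and the actual patch in the review above may differ:

            # hypothetical helper; do_facet, $LCTL and error come from test-framework.sh
            wait_osp_import_state() {
                    local facet=$1          # e.g. mds2
                    local param=$2          # assumed path, e.g. "osp.*-osp-MDT0001.*_server_uuid"
                    local expected=$3       # e.g. FULL
                    local maxtime=${4:-300}
                    local i=0
                    local state

                    while [ $i -lt $maxtime ]; do
                            # the import state is reported after the target uuid
                            state=$(do_facet $facet "$LCTL get_param -n $param" |
                                    awk '{print $2}' | sort -u)
                            [ "$state" = "$expected" ] && return 0
                            sleep 1
                            i=$((i + 1))
                    done
                    error "$facet: OSP import state '$state', expected '$expected'"
            }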

    People

      Assignee: WC Triage
      Reporter: Maloo
      Votes: 0
      Watchers: 5
