LU-4571: sanity test_17n: create remote dir error 0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.6.0
    • Labels: None
    • Severity: 3
    • Rank: 12491

    Description

      This issue was created by maloo for Di Wang <di.wang@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/9794753a-8b8b-11e3-b099-52540035b04c.

      The sub-test test_17n failed with the following error:

      create remote dir error 0

      Info required for matching: sanity 17n

          Activity

            [LU-4571] sanity test_17n: create remote dir error 0
            di.wang Di Wang added a comment -

            discussion on Skype

            [2/5/14, 9:19:26 AM] wangdi: if we just evict the LWP, it might cause some in-flight RPCs to fail
            [2/5/14, 9:19:40 AM] Johann Lombardi: yup
            [2/5/14, 9:19:41 AM] wangdi: which then causes application errors during recovery
            [2/5/14, 9:19:44 AM] Johann Lombardi: yes
            [2/5/14, 9:19:54 AM] Johann Lombardi: that's what the quota code expects
            [2/5/14, 9:20:03 AM] wangdi: which then causes some test failures,
            [2/5/14, 9:20:13 AM] Johann Lombardi: all the states associated with the previous "connection" should go away
            [2/5/14, 9:20:47 AM] Johann Lombardi: could you please remind me how you use LWP?
            [2/5/14, 9:20:59 AM] Johann Lombardi: for seq allocation mostly, right?
            [2/5/14, 9:21:15 AM] wangdi: no, mostly for fld(seq) lookup
            [2/5/14, 9:21:35 AM] wangdi: seq allocation is rare actually
            [2/5/14, 9:21:38 AM] Johann Lombardi: ah, then the lack of recovery should not be a problem
            [2/5/14, 9:21:45 AM] Johann Lombardi: you can just resend the request yourself
            [2/5/14, 9:22:06 AM] wangdi: that would complicate things
            [2/5/14, 9:22:24 AM] wangdi: I am thinking some easy way. :)
            [2/5/14, 9:22:38 AM] Johann Lombardi: well, the purpose of lwp is to be lightweight
            [2/5/14, 9:22:52 AM] Johann Lombardi: as such, it has no entry in last_rcvd
            [2/5/14, 9:22:58 AM] Johann Lombardi: and gets evicted upon restart
            [2/5/14, 9:23:07 AM] wangdi: I understand this.
            [2/5/14, 9:23:14 AM] Johann Lombardi: it is like this by design
            [2/5/14, 9:23:54 AM] wangdi: quota need this evict thing? or just design
            [2/5/14, 9:24:01 AM] wangdi: if there are no replay
            [2/5/14, 9:25:06 AM] wangdi: why do we need this eviction.
            [2/5/14, 9:25:24 AM] Johann Lombardi: quota needs all outstanding requests associated with previous connection to fail, yes
            [2/5/14, 9:25:31 AM] wangdi: I see
            [2/5/14, 9:25:36 AM] wangdi: thanks
            [2/5/14, 9:25:46 AM] Johann Lombardi: well, the eviction is a side effect of not being in the last_rcvd file
            [2/5/14, 9:26:15 AM] Johann Lombardi: quota updates are not synchronous to disk and not replayed
            [2/5/14, 9:26:28 AM] wangdi: if the eviction can choose not to fail those read-only in-flight RPCs
            [2/5/14, 9:26:32 AM] wangdi: that would be best
            [2/5/14, 9:26:35 AM] Johann Lombardi: when the mdt restarts, it might have a different view of allocation
            [2/5/14, 9:26:49 AM] Johann Lombardi: so we should restart from scratch each time
            [2/5/14, 9:27:00 AM] Johann Lombardi: and forget about previous requests that did not complete
            [2/5/14, 9:27:09 AM] Johann Lombardi: because they are not relevant any more
            [2/5/14, 9:27:22 AM] Johann Lombardi: i guess we could change the behavior a bit
            [2/5/14, 9:27:42 AM] Johann Lombardi: and not fail requests which are in the delayed list
            [2/5/14, 9:28:16 AM] Johann Lombardi: quota could then set the no_delay flag (if not done already) to bypass this behavior
            [2/5/14, 9:28:47 AM] Johann Lombardi: however, i guess it will require patching ptlrpc not to fail those requests on eviction, which will happen anyway since we have no entry in the last_rcvd file
            [2/5/14, 9:28:50 AM] Johann Lombardi: what do you think?
            [2/5/14, 9:29:31 AM] wangdi: the purpose of eviction is for the "replay" req, not for delay req in this case.
            [2/5/14, 9:29:55 AM] wangdi: so probably changing ptlrpc makes sense, but that would complicate the protocol
            [2/5/14, 9:29:57 AM] wangdi: I hate that
            [2/5/14, 9:30:00 AM] Johann Lombardi: well, requests from the delayed list will be failed too
            [2/5/14, 9:30:08 AM] Johann Lombardi: in the case of eviction
            [2/5/14, 9:30:24 AM] wangdi: yes, but it should not for lightweight connection
            [2/5/14, 9:30:54 AM] Johann Lombardi: yes, so you could change this behavior and set rq_no_delay on quota requests (again, if not done already)
            [2/5/14, 9:31:12 AM] Johann Lombardi: yours will be handled
            [2/5/14, 9:32:15 AM] wangdi: yes, probably
            [2/5/14, 9:32:27 AM] wangdi: need think a bit. thanks for the discussion.
            [2/5/14, 9:32:27 AM] Johann Lombardi: i don't see any other way to do that
            [2/5/14, 9:32:44 AM] Johann Lombardi: except adding lwp to last_rcvd
            [2/5/14, 9:32:52 AM] Johann Lombardi: which would defeat the original purpose
            [2/5/14, 9:33:03 AM] Johann Lombardi: np
            [2/5/14, 9:33:40 AM] wangdi: the other way is to resend the fld rpc as you said, but that would complicate things
            [2/5/14, 9:34:17 AM] Johann Lombardi: it might, indeed
            [2/5/14, 10:32:30 AM] wangdi: setting no_delay will cause quota RPCs to fail on any unhealthy connection, not just restart.
            [2/5/14, 10:32:35 AM] wangdi: is it ok for quota?
            [2/5/14, 10:48:13 AM] wangdi: hmm, we already set no_delay flag for quota request
            [2/5/14, 10:48:27 AM] wangdi: so I guess this eviction is unnecessary anymore?
            [2/5/14, 10:48:49 AM] wangdi: I will disable part of recovery-small test then
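
            On the last point: disabling part of a suite in the Lustre test framework is normally done through the ALWAYS_EXCEPT list at the top of the script. A minimal sketch of what skipping the affected subtest might look like (test 106 per the recovery-small.sh reference below; the comment text is illustrative, the actual change is whatever landed with the patch):

            # recovery-small.sh, near the top of the file
            # LU-4571: LWP connections are stateless by design and are
            # evicted on server restart, so skip the eviction check.
            ALWAYS_EXCEPT="$RECOVERY_SMALL_EXCEPT 106"
            build_test_filter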
            

            johann Johann Lombardi (Inactive) added a comment -

            "Could you please look at LU-4571? I do not understand why LWP needs to be evicted after restart (recovery-small.sh 106), which basically makes the fix for LU-4571 fail on this test. Please post here if you remember something. Thanks."

            The purpose of LWP was to have a stateless connection (no entry in last_rcvd file, no lock replay, ...).
            What exactly is the problem?


            adilger Andreas Dilger added a comment -

            With the normal MDT->OST connections, if the UUID is the same (which it should be, since it is using the MDT target name), the server will allow it to reconnect to the same export. Should that also be the case for LWP connections?

            di.wang Di Wang added a comment -

            IIRC, the LWP connection is always allowed during recovery for the purpose of fld lookup etc., but in this case it will evict the old import. I am not sure whether we should do this for OSP, since it still needs to go through the whole recovery process. Need to think a bit.


            adilger Andreas Dilger added a comment -

            I wonder if it should be taken further to always allow LWP connections during recovery, or possibly to always allow MDT connections during recovery? That is something for discussion, though, not a reason to block the current patch.

            di.wang Di Wang added a comment - http://review.whamcloud.com/9106
            di.wang Di Wang added a comment -

            It seems there are two problems.

            1. The OSP should not be evicted during restart, since the connection is written to disk and should be recoverable across a restart. It seems that sanity test_17m uses -f to unmount mds1, which might be why the connection is not honored across the restart, causing this eviction. I am not sure why we are using stop -f here, but we may need to change it to a plain stop (see the sketch after the snippet below).

            test_17m() {
            
            .....
                    echo "stop and checking mds${mds_index}: $cmd"
                    # e2fsck should not return error
                    stop mds${mds_index} -f
                    do_facet mds${mds_index} $cmd || rc=$?
            
            .....
            }
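
            If the forced unmount is indeed the culprit, the change suggested above is a one-line edit. A sketch (whether test_17m still passes with a clean unmount is exactly what needs verifying):

            # use a clean unmount so the MDT shuts down gracefully and
            # its connection state survives the restart
            -        stop mds${mds_index} -f
            +        stop mds${mds_index}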
            

            2. The LWP will likely be evicted during restart, because it is a lightweight connection and we do not store any connection state on disk. So the MDT does not recognize any reconnection after a restart, and all reconnections (if they happen) will be evicted here. Unfortunately, this will cause some errors (EIO): if a thread happens to be waiting on this import (mostly LOD looking up an OST sequence), it will get an error.


            adilger Andreas Dilger added a comment -

            I think this is exactly the same bug as LU-4420, which is also caused by a previous test restarting the MDS, and the recovery bleeds over into the next test.

            adilger Andreas Dilger added a comment - - edited

            Di, it is better to keep fixes for bugs in separate patches, instead of merging them into the split directory feature patch, because that makes it easier to backport the fix to b2_4 and b2_5 (where the bug also exists).

            Also, I think that adding wait_osp_import_state() in the test script will just hide the bug during testing, but the same problem can be hit in production. This is caused by sanity.sh test_17m stopping and restarting the MDS, but recovery has not completed before the next test starts, as can be seen in the mds2 console logs:

            23:44:10:Lustre: DEBUG MARKER: == sanity test 17n: run e2fsck against master/slave MDT which contains remote dir == 23:43:54 (1391240634)
            23:44:10:LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            23:44:10:LustreError: 167-0: lustre-MDT0000-osp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            23:44:10:Lustre: lustre-MDT0000-osp-MDT0002: Connection restored to lustre-MDT0000 (at 10.10.17.60@tcp)
            23:44:10:Lustre: Skipped 2 previous similar messages
            23:44:31:LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            

            Note that the "create remote dir error" failure happens in test_17n before it stops the MDTs to be checked, so it must be a result of test_17m restarting the MDS.
            I think the wait for recovery completion should probably be done in the kernel, otherwise this could be hit by anyone using DNE when one of the MDS nodes fails.
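
            Until there is a kernel-side wait, the script-side equivalent is to block until the restarted MDS reports recovery complete before the test returns. A minimal sketch using the existing test-framework.sh helpers (the placement inside test_17m and the use of wait_recovery_complete() here are assumptions):

            # after restarting the MDS in test_17m, wait for recovery to
            # finish so evictions do not bleed into the next test
            start mds${mds_index} $(mdsdevname ${mds_index}) $MDS_MOUNT_OPTS ||
                    error "start mds${mds_index} failed"
            wait_recovery_complete mds${mds_index} ||
                    error "recovery on mds${mds_index} did not complete"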

            di.wang Di Wang added a comment -

            It seems the MDT needs to wait for the OSP connection to be FULL before going further, i.e. we need to add a wait_osp_import_state. I will add the fix in http://review.whamcloud.com/#/c/7445/.
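
            For reference, such a helper would presumably follow the pattern of the existing wait_import_state()/wait_osc_import_state() helpers in test-framework.sh. A rough sketch (the proc pattern and polling details are illustrative; the actual implementation is whatever lands in the review above):

            # wait until every OSP import on an MDS facet reports the
            # expected state (e.g. FULL) -- sketch only
            wait_osp_import_state() {
                    local facet=$1      # e.g. mds2
                    local expected=$2   # e.g. FULL
                    local i

                    for ((i = 0; i < $(max_recovery_time); i++)); do
                            # pass once no OSP import is in another state
                            if ! do_facet $facet "$LCTL get_param -n osp.*.state" |
                                    grep "current_state:" | grep -qv "$expected"; then
                                    return 0
                            fi
                            sleep 1
                    done
                    error "OSP imports on $facet did not reach $expected"
            }

            Usage after restarting MDT0000 would then be something like: wait_osp_import_state mds2 FULL.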


            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 5
