
LU-184: Test failure on test suite insanity, subtest test_0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.1.0
    • None
    • None
    • 3
    • 5074

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b573b3c0-5c15-11e0-a272-52540025f9af.

      The sub-test test_0 failed with the following error:

      post-failover df: 1


        Activity


          Niu, yes, I agree, though I can't find how 1.8 keeps the orphans; can you show me, please?

          tappro Mikhail Pershin added a comment

          As far as I can see, the problem is that we have to preserve client data on failover umount for future recovery; however, the orphans were cleared while closing files on umount, and such an inconsistency causes open replay errors.

          For 1.8, we don't have this problem, because the final close on a failover umount doesn't clear the orphan in 1.8.

          My proposal is that we should keep the orphans on failover umount, just like 1.8 does. That requires some code changes in the close/orphan handling path, and in the post-recovery orphan cleanup as well. Tappro/Andreas/Wangdi/Oleg: does that sound OK? If you all agree with me, I will try to make a patch to fix it in this way. Thank you.

          niu Niu Yawei (Inactive) added a comment
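          A minimal, self-contained sketch of the decision being proposed, assuming the close path can tell a failover umount apart via obd_fail (which server_stop_servers() sets); this is an illustration only, not the actual MDT close code, and keep_orphan_on_close() is an invented name.

          #include <stdio.h>
          #include <stdbool.h>

          /* Decide whether the final close performed during umount may destroy the
           * orphan object. The idea mirrors the last_rcvd client handling: on a
           * failover umount (obd_fail set, export not failed) the orphan is kept so
           * that open replay can still find the object after the MDS restarts. */
          static bool keep_orphan_on_close(bool obd_fail, bool exp_failed)
          {
                  return obd_fail && !exp_failed;
          }

          int main(void)
          {
                  printf("failover umount: keep orphan = %d\n",
                         keep_orphan_on_close(true, false));
                  printf("ordinary cleanup: keep orphan = %d\n",
                         keep_orphan_on_close(false, false));
                  return 0;
          }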

          obd_fail is set to 1 by server_stop_servers() just before the class_manual_cleanup() call. That is why we have clients in last_rcvd but no orphans.

          tappro Mikhail Pershin added a comment

          Btw, from this condition:
          if (obd->obd_replayable && (!obd->obd_fail || exp->exp_failed))

          the ordinary umount should always cause client removal, it seems; obd_fail is set only if 'lctl cleanup' was used, not during umount, or did I miss something? That said, there shouldn't be recovery on the server after umount/mount, but we saw that. Is this expected behavior?

          tappro Mikhail Pershin added a comment
          tappro Mikhail Pershin added a comment - edited

          I am trying to remember the 22299 details, and I find that there we tried to keep the clients in last_rcvd:
          @@ -3054,7 +3054,7 @@ static int filter_disconnect(struct obd_export *exp)

                  rc = server_disconnect_export(exp);

          -       if (exp->exp_obd->obd_replayable)
          +       if (obd->obd_replayable && (!obd->obd_fail || exp->exp_failed))
                          filter_client_del(exp);

          I think this is the same problem: should umount be gentle or not? Without rdonly, umount also removed last_rcvd completely because of the client disconnections, but with the change above the clients will be preserved if there is failover. That is why we have the problem: the clients are kept during umount but the orphans are not.

          Previously we either had all clients removed, so there was no problem with orphans, or, with rdonly, nothing was actually removed.

          So now we either need to preserve the orphans when there is failover (as we do for the clients), or we shouldn't preserve anything and shouldn't use the umount/mount pair for failover without setting rdonly mode before that.

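          As a reference for the discussion above, a self-contained model of that condition (not the actual obdfilter code; the struct and names here are invented for illustration): the client record stays in last_rcvd on a failover umount, which is exactly the case where the orphans are currently lost.

          #include <stdio.h>
          #include <stdbool.h>

          struct export_state {
                  bool obd_replayable;
                  bool obd_fail;       /* set by server_stop_servers() before cleanup */
                  bool exp_failed;     /* export already failed/evicted */
          };

          /* models the decision from the hunk above: when does
           * filter_client_del() remove the client from last_rcvd? */
          static bool client_removed_from_last_rcvd(const struct export_state *s)
          {
                  return s->obd_replayable && (!s->obd_fail || s->exp_failed);
          }

          int main(void)
          {
                  struct export_state plain    = { .obd_replayable = true, .obd_fail = false, .exp_failed = false };
                  struct export_state failover = { .obd_replayable = true, .obd_fail = true,  .exp_failed = false };

                  printf("plain umount:    client removed = %d\n",
                         client_removed_from_last_rcvd(&plain));
                  printf("failover umount: client removed = %d (client kept, but orphans are not)\n",
                         client_removed_from_last_rcvd(&failover));
                  return 0;
          }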
          di.wang Di Wang (Inactive) added a comment - edited

          In the 2.x code, when unlinking the orphan inode, only the name entry is deleted; the real inode and OI entry are kept until the final close, i.e. until the last reference on the lu object is dropped. So we are probably safe for a crash, since replay can still find the object by FID (through the OI). But on umount the MDT will close the file (without notifying the client), and then it will be deleted from the OI and from disk (triggered by lu_object_put->lu_object_free, different from 1.8), so we might be in trouble on replay.

          It seems we did put the inode in the orphan list, but never tried to look it up during open replay. Niu, maybe you should check for the object under the orphan directory before deleting the object from the OI and ldiskfs? Hmm, we might end up with an un-deleted orphan if the client crashes or is evicted, but that might be no worse than 1.8? Sorry, I was wrong here: mdd_recovery_complete does actually delete the orphan object. Tappro probably knows more and can give a better suggestion. Please correct me if I am wrong.

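          A self-contained sketch of the 2.x lifecycle described above, written as a simplified reference-count model rather than real Lustre code; the struct and function names are invented for illustration.

          #include <stdio.h>
          #include <stdbool.h>

          struct obj_model {
                  int  refs;          /* in-memory references (open handles, etc.) */
                  bool name_entry;    /* directory name entry on disk */
                  bool oi_entry;      /* OI (FID-to-inode) mapping on disk */
          };

          /* unlink of a still-open file: only the name entry goes away */
          static void model_unlink(struct obj_model *o)
          {
                  o->name_entry = false;
          }

          /* dropping the last reference frees the object; in the behaviour
           * described above this is where the OI entry and inode are removed
           * from disk (the lu_object_put -> lu_object_free step) */
          static void model_put(struct obj_model *o)
          {
                  if (--o->refs == 0)
                          o->oi_entry = false;
          }

          int main(void)
          {
                  struct obj_model o = { .refs = 1, .name_entry = true, .oi_entry = true };

                  model_unlink(&o);   /* client unlinks the open file */
                  printf("after unlink: OI entry %s, replay by FID still possible\n",
                         o.oi_entry ? "present" : "gone");

                  model_put(&o);      /* forced close on MDT umount drops the last ref */
                  printf("after forced close: OI entry %s, open replay cannot find the object\n",
                         o.oi_entry ? "present" : "gone");
                  return 0;
          }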

          Andreas, I don't see a defect there; we just close all opened files upon umount, and that is what causes the orphan removal. mdt_export_cleanup() does that by walking through the list of mfds. 1.8 does the same, as far as I can see.

          But there is one difference from the old behavior: now we don't set rdonly mode. That mode previously prevented the orphan removal, but after 22299 it was dropped. Personally I don't see a big problem there; umounting a working server is not something that is supposed to be done. But our tests may need modification, given that rdonly is not the default behavior during umount.

          tappro Mikhail Pershin added a comment

          The open-unlink files should always be linked from the PENDING directory, so that MDS unmount/crash doesn't drop the last reference to the inode that would otherwise only be held in memory. Is there something broken in the 2.x code related to linking orphan inodes into PENDING? I know there have been some changes in this area in 1.8 in the last year or so, and it is possible that there is a race in how the inode nlink count is checked when a file should be added to PENDING.

          Also, I added Oleg to this bug, because I hope he may recall a problem related to file close that was causing the client to not send the "real" close to the MDS even after the application had closed the file and exited. This is a separate defect that shouldn't be causing this test to fail, but it is something that should be fixed anyway.

          adilger Andreas Dilger added a comment
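          A self-contained sketch of the PENDING-directory behaviour Andreas describes, as a simplified model rather than actual MDS code; the struct and function names are invented for illustration.

          #include <stdio.h>
          #include <stdbool.h>

          struct inode_model {
                  int  nlink;         /* on-disk link count */
                  int  open_count;    /* open handles still held */
                  bool in_pending;    /* linked from the PENDING directory */
          };

          /* unlink of a file that is still open: instead of letting the last
           * on-disk reference disappear (leaving the inode held only in memory),
           * the inode is linked from PENDING so it survives an MDS umount or
           * crash; orphan cleanup after recovery removes the PENDING link later */
          static void model_unlink_open(struct inode_model *ino)
          {
                  ino->nlink--;       /* the name entry is removed */
                  if (ino->nlink == 0 && ino->open_count > 0) {
                          ino->nlink = 1;
                          ino->in_pending = true;
                  }
          }

          int main(void)
          {
                  struct inode_model ino = { .nlink = 1, .open_count = 1, .in_pending = false };

                  model_unlink_open(&ino);
                  printf("after unlink of an open file: nlink=%d, in PENDING=%d\n",
                         ino.nlink, ino.in_pending);
                  return 0;
          }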

          I reproduced this issue on Toro. After analyzing the log, I think I see the bug:

          At the beginning of insanity test_0, the MDS is failed by umount, and the open files on the MDS are closed during the umount. An orphan file is deleted upon its last close; however, the clients haven't issued their closes yet, so all the open requests are still retained on the clients. After the MDS restarts and enters recovery, the clients reconnect and replay the open requests, and some of the open replays fail to find the orphan objects since they were deleted during the MDS umount.

          I think we should probably keep the orphans while closing files on umount; they'll be cleared after recovery or on file close anyway, so we needn't worry about an orphan leak. Tappro, what's your opinion?

          niu Niu Yawei (Inactive) added a comment
          sarah Sarah Liu added a comment -

          In my config file, the debug size is set to 32; I will enlarge it and rerun the tests.


          Hi, Sarah

          I downloaded the logs and found that each client log is only about 3 MB and the server log is about 20 MB, so the messages from the runracer test are missing from the logs. What is the debug log size for these tests? Could you enlarge the debug size (for instance, to 50 MB for both client and server) and then reproduce it again? Thanks.

          niu Niu Yawei (Inactive) added a comment

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 4
