  Lustre / LU-184

Test failure on test suite insanity, subtest test_0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Lustre 2.1.0

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b573b3c0-5c15-11e0-a272-52540025f9af.

      The sub-test test_0 failed with the following error:

      post-failover df: 1

      Attachments

        Activity


          niu Niu Yawei (Inactive) added a comment -

          As far as I can see, the problem is that we have to preserve client data on a failover umount for future recovery; however, the orphans are cleared while closing files on umount, and this inconsistency causes open replay errors.

          1.8 doesn't have this problem, because the final close on a failover umount doesn't clear orphans in 1.8.

          My proposal is to keep orphans on a failover umount, just as 1.8 does. This requires some code changes in the close/orphan handling path, as well as in the post-recovery orphan cleanup. Tappro/Andreas/Wangdi/Oleg: does that sound OK? If you all agree, I will try to make a patch that fixes it this way. Thank you.
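          For illustration, below is a minimal, self-contained C model of the proposed behaviour - this is not Lustre code; the struct and helper names are hypothetical, and the failover_umount flag stands in for obd_fail being set during a failover umount.

              #include <stdbool.h>
              #include <stdio.h>

              /* Hypothetical, simplified stand-in for an open-unlinked (orphan) file. */
              struct file_object {
                      int  nlink;        /* name entries left                 */
                      int  open_count;   /* opens not yet closed by clients   */
                      bool on_disk;      /* inode + OI entry still present    */
              };

              /* Hypothetical stand-in for destroying the orphan inode and OI entry. */
              static void destroy_orphan(struct file_object *obj)
              {
                      obj->on_disk = false;
              }

              /* Final close performed by the server while unmounting. */
              static void server_final_close(struct file_object *obj, bool failover_umount)
              {
                      obj->open_count--;
                      if (obj->nlink == 0 && obj->open_count == 0) {
                              if (failover_umount)
                                      return;   /* proposed: keep the orphan for open replay;
                                                 * post-recovery cleanup removes it later */
                              destroy_orphan(obj);
                      }
              }

              int main(void)
              {
                      struct file_object a = { .nlink = 0, .open_count = 1, .on_disk = true };
                      struct file_object b = { .nlink = 0, .open_count = 1, .on_disk = true };

                      server_final_close(&a, false);   /* normal final close             */
                      server_final_close(&b, true);    /* final close on failover umount */
                      printf("normal close, orphan still on disk:    %d\n", a.on_disk);
                      printf("failover umount, orphan still on disk: %d\n", b.on_disk);
                      return 0;
              }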

          tappro Mikhail Pershin added a comment -

          obd_fail is set to 1 by server_stop_servers() just before the class_manual_cleanup() call. That is why we have clients in last_rcvd but no orphans.

          tappro Mikhail Pershin added a comment -

          Btw, from this condition:

              if (obd->obd_replayable && (!obd->obd_fail || exp->exp_failed))

          it seems an ordinary umount should always cause client removal; obd_fail is set only if 'lctl cleanup' was used, not during umount, or did I miss something? That would mean there shouldn't be any recovery on the server after an umount/mount. But we saw recovery - is this the expected behavior?
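          As a reference point, here is a tiny self-contained C model of the condition quoted above (the field names mirror the Lustre ones, but this is not the actual disconnect code); per the comment above, obd_fail is set by server_stop_servers() just before class_manual_cleanup(), which is the case where client records survive in last_rcvd.

              #include <stdbool.h>
              #include <stdio.h>

              /* Simplified stand-ins for the relevant obd_device/obd_export fields. */
              struct obd_device { bool obd_replayable; bool obd_fail; };
              struct obd_export { bool exp_failed; };

              /* True when the disconnect path would delete the client record from
               * last_rcvd (i.e. call filter_client_del()). */
              static bool client_record_removed(const struct obd_device *obd,
                                                const struct obd_export *exp)
              {
                      return obd->obd_replayable && (!obd->obd_fail || exp->exp_failed);
              }

              int main(void)
              {
                      struct obd_export exp      = { .exp_failed = false };
                      struct obd_device plain    = { .obd_replayable = true, .obd_fail = false };
                      struct obd_device failover = { .obd_replayable = true, .obd_fail = true  };

                      /* obd_fail clear: ordinary cleanup, the client record is removed.  */
                      printf("obd_fail=0: removed=%d\n", client_record_removed(&plain, &exp));
                      /* obd_fail set (failover umount): the record is kept for recovery. */
                      printf("obd_fail=1: removed=%d\n", client_record_removed(&failover, &exp));
                      return 0;
              }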
          tappro Mikhail Pershin added a comment - edited

          I am trying to remember the 22299 details, and I find that there we tried to keep clients in last_rcvd:

              @@ -3054,7 +3054,7 @@ static int filter_disconnect(struct obd_export *exp)

                       rc = server_disconnect_export(exp);

              -        if (exp->exp_obd->obd_replayable)
              +        if (obd->obd_replayable && (!obd->obd_fail || exp->exp_failed))
                               filter_client_del(exp);

          I think this is the same problem - should umount be gentle or not? Without rdonly, umount also removed the last_rcvd entries completely because of the client disconnections. With the change above, clients are preserved if there is a failover. That is why we have the problem - the clients are kept during umount but the orphans are not.

          Previously we either had all clients removed - no problem with orphans - or, with rdonly, nothing was actually removed.

          So now we either need to preserve orphans when there is a failover (as we do for clients), or we shouldn't preserve anything and shouldn't use an umount/mount pair for failover without setting rdonly mode first.
          di.wang Di Wang (Inactive) added a comment - edited

          In the 2.x code, when an orphan inode is unlinked, only the name entry is deleted; the real inode and its OI entry are kept until the final close, i.e. until the last reference on the lu object is dropped. So we are probably safe across a crash, since replay can still find the object by FID (through the OI). But on umount the MDT closes the file (without notifying the client), and the object is then deleted from the OI and from disk (triggered by lu_object_put->lu_object_free, different from 1.8), so we might be in trouble on replay.

          It seems we do put the inode in the orphan list, but never try to look it up during open replay. Niu, maybe you should check the object against the orphan list before deleting it from the OI and ldiskfs? Hmm, we might end up with an un-deleted orphan if the client crashes or is evicted, but that might not be worse than 1.8? Sorry, I was wrong here - mdd_recovery_complete does actually delete the orphan object. Tappro probably knows more and can give a better suggestion. Please correct me if I am wrong.
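          To make the lifecycle above concrete, here is a simplified, self-contained C model of the 2.x open-unlink behaviour described in this comment - it is not the real lu_object code, just an illustration of the ordering.

              #include <stdbool.h>
              #include <stdio.h>

              /* Toy model of an open-unlinked object on the MDT. */
              struct lu_object_model {
                      int  refcount;   /* references held by open file handles etc.   */
                      bool has_name;   /* directory (name) entry still present        */
                      bool in_oi;      /* OI entry present, so FID lookup still works */
                      bool on_disk;    /* ldiskfs inode still present                 */
              };

              /* Unlink removes only the name entry; inode and OI entry stay. */
              static void model_unlink(struct lu_object_model *o)
              {
                      o->has_name = false;
              }

              /* Dropping the last reference on a nameless object models
               * lu_object_put() -> lu_object_free(): the FID becomes unresolvable. */
              static void model_put(struct lu_object_model *o)
              {
                      if (--o->refcount == 0 && !o->has_name) {
                              o->in_oi = false;
                              o->on_disk = false;
                      }
              }

              int main(void)
              {
                      struct lu_object_model o = { .refcount = 1, .has_name = true,
                                                   .in_oi = true, .on_disk = true };

                      model_unlink(&o);
                      printf("after unlink, FID resolvable: %d\n", o.in_oi);

                      /* The MDT closing the file on umount drops the last reference
                       * without the client ever having sent its close. */
                      model_put(&o);
                      printf("after server-side close, FID resolvable: %d\n", o.in_oi);
                      return 0;
              }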

          tappro Mikhail Pershin added a comment -

          Andreas, I don't see a defect there; we just close all opened files upon umount, and that causes the orphan removal. mdt_export_cleanup() does that by walking through the list of mfds. 1.8 does the same, as far as I can see.

          But there is one difference from the old days - now we don't set rdonly mode. That mode used to prevent orphan removal, but after 22299 it was dropped. Personally I don't see a big problem here - a server umount is not something that should be done to a working server. But our tests may need modification, given that rdonly is no longer the default behavior during umount.

          adilger Andreas Dilger added a comment -

          The open-unlinked files should always be linked from the PENDING directory, so that an MDS unmount/crash doesn't drop the last reference to the inode, which would otherwise only be held in memory. Is there something broken in the 2.x code related to linking orphan inodes into PENDING? I know there have been some changes in this area in 1.8 in the last year or so, and it is possible that there is a race in how the inode nlink count is checked when a file should be added to PENDING.

          Also, I added Oleg to this bug, because I hope he may recall a problem related to file close that was causing the client not to send the "real" close to the MDS even after the file was closed by the application and it exited. This is a separate defect that shouldn't be causing this test to fail, but it is something that should be fixed anyway.
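          A rough, self-contained C model of the PENDING behaviour Andreas describes (not the real MDS code; the helper name is hypothetical): when the last name of a still-open file is unlinked, the inode is linked into PENDING so the last reference lives on disk rather than only in memory.

              #include <stdbool.h>
              #include <stdio.h>

              struct inode_model {
                      int  nlink;        /* on-disk link count                */
                      int  open_count;   /* opens still held by clients       */
                      bool in_pending;   /* linked from the PENDING directory */
              };

              /* Hypothetical helper: unlink of the file's last name. */
              static void unlink_last_name(struct inode_model *ino)
              {
                      ino->nlink--;
                      if (ino->nlink == 0 && ino->open_count > 0) {
                              /* keep a disk reference by linking into PENDING */
                              ino->nlink = 1;
                              ino->in_pending = true;
                      }
              }

              int main(void)
              {
                      struct inode_model ino = { .nlink = 1, .open_count = 1,
                                                 .in_pending = false };

                      unlink_last_name(&ino);
                      /* After an MDS unmount or crash the inode is still reachable via
                       * PENDING, so replay or orphan cleanup can deal with it.         */
                      printf("nlink=%d in_pending=%d\n", ino.nlink, ino.in_pending);
                      return 0;
              }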

          niu Niu Yawei (Inactive) added a comment -

          I reproduced this issue on Toro. After analyzing the log, I think I see the bug:

          At the beginning of insanity test_0, the MDS is failed by umount, and the files open on the MDS are closed during the umount. An orphan file is deleted on its last close; however, the clients haven't issued their closes yet, so all the open requests are still retained on the clients. After the MDS restarts and enters recovery, the clients reconnect and replay the open requests, and some of the open replays fail to find the orphan objects because those were deleted during the MDS umount.

          I think we should probably keep the orphans while closing files on umount; they will be cleared after recovery or on file close anyway, so we needn't worry about orphan leaks. Tappro, what's your opinion?
          sarah Sarah Liu added a comment -

          In my configuration file, the debug size is set to 32; I will enlarge it and rerun the tests.

          niu Niu Yawei (Inactive) added a comment -

          Hi Sarah,

          I downloaded the logs and found that each client log is only about 3 MB and the server log is about 20 MB, so the messages from the runracer test are missing from the logs. What is the debug log size for these tests? Could you enlarge the debug size (for instance, to 50 MB for both client and server) and then reproduce it again? Thanks.

          tappro Mikhail Pershin added a comment -

          This is obviously a bug; from what I see we have many version mismatches like the following:

          00000004:00000002:0.0:1302324530.070202:0:10190:0:(mdt_handler.c:1607:mdt_reint_opcode()) @@@ reint opt = 6 req@ffff8106384fc000 x1365585644912170/t0(4295521982) o-1->917221e7-5188-2444-10d7-07b3e1abdbd0@NET_0x50000c0a80416_UUID:0/0 lens 528/0 e 0 to 0 dl 1302324555 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
          00000004:00000002:0.0:1302324530.070213:0:10190:0:(mdt_open.c:1194:mdt_reint_open()) I am going to open [0x200000400:0x1:0x0]/(14->[0x200000402:0x66c8:0x0]) cr_flag=02102 mode=0200100644 msg_flag=0x4
          <skip>
          00000004:00000002:0.0:1302324530.070256:0:10190:0:(mdt_reint.c:114:mdt_obj_version_get()) FID [0x200000400:0x1:0x0] version is 0x100000002
          00000004:00000002:0.0:1302324530.070260:0:10190:0:(mdt_reint.c:114:mdt_obj_version_get()) FID [0x200000401:0x4693:0x0] version is 0x10008b4de
          00000004:00000002:0.0:1302324530.070261:0:10190:0:(mdt_reint.c:143:mdt_version_check()) Version mismatch 0x1 != 0x10008b4de

          So we got an open|create replay which is already committed (as seen on the client and by the transno). There are also two version checks - the first is for the parent (it passed), the second is for the child (it failed).

          Note the expected version is 0x1 - this is the special version for non-existent files, meaning this file didn't exist before this open, which is right - it was just a new create. But now we have a file with some version, and the check fails. This by itself is not an error - it can also happen if the file was created and then used. The real problem is why this code is executed during replay at all; it shouldn't be, and that is wrong.

          Let's look at mdt_reint_open:

              if (req_is_replay(req) ||
                  (req->rq_export->exp_libclient && create_flags & MDS_OPEN_HAS_EA)) {
                      /* This is a replay request or from liblustre with ea. */
                      result = mdt_open_by_fid(info, ldlm_rep);
                      if (result != -ENOENT) {
                              <skip>
                              GOTO(out, result);
                      }
              }

          So, if we have a replay, the first thing we try is to open the file by FID. If the file exists, we don't need to re-create it; a plain open is enough. Only if we get -ENOENT do we pass through this check and try to re-create the file, and only in that case are versions checked.

          After that we check the parent version - it is fine - and then look up the child and check its version, expecting it to be 1 (ENOENT_VERSION); but instead we find the child by name and it exists. Therefore we have two options there:

          1) FID lookup doesn't work, so we can't find the child by FID but can by name; or
          2) this is a different file with the same name but a different FID.

          Look at the debug above:
          "I am going to open [0x200000400:0x1:0x0]/(14->[0x200000402:0x66c8:0x0])", so the child FID is [0x200000402:0x66c8:0x0],
          but the version is checked for FID [0x200000401:0x4693:0x0].
          That is why it fails: we have another file with the same name.

          To me that looks like a bug in the close code or the orphan code. We have open replays for files that no longer exist; I also see other messages in the log saying the same, e.g. "OPEN & CREAT not in open replay." - meaning we are doing an open without create but the file doesn't exist. That can only happen if files were deleted but not closed. That is why the replays stay in the queue.

          I'd propose checking the close() paths to make sure closes can't be lost. Another check should be done on the orphan code - maybe the file wasn't closed yet but another client destroyed it. Usually that should produce an orphan, which must be re-opened during replay. Maybe that orphan wasn't created.

          I tend to think this is a close() issue, not an orphan issue, because we have a lot of tests for orphan handling in replay-single and I doubt there is a bug there, although the possibility still exists.
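          Condensing the replay path discussed above into a self-contained C sketch (not the real mdt_reint_open()/mdt_version_check() code; the lookup helpers and the returned error code are stand-ins): a replayed open first tries the FID, and only an -ENOENT result falls into the re-create path, where the child's pre-operation version must equal ENOENT_VERSION.

              #include <errno.h>
              #include <stdint.h>
              #include <stdio.h>

              #define ENOENT_VERSION 1ULL   /* "file did not exist" marker version */

              /* Stand-in lookups: the orphan is gone, but a same-named child exists. */
              static int lookup_by_fid(uint64_t fid)
              {
                      (void)fid;
                      return -ENOENT;
              }

              static uint64_t child_version_by_name(void)
              {
                      return 0x10008b4deULL;   /* version seen in the debug log above */
              }

              static int replay_open_create(uint64_t child_fid)
              {
                      int rc = lookup_by_fid(child_fid);
                      if (rc != -ENOENT)
                              return rc;   /* object still exists: just open it by FID */

                      /* Re-create path: the replayed create expects no child under this
                       * name, i.e. a pre-operation version equal to ENOENT_VERSION.    */
                      uint64_t pre_version = child_version_by_name();
                      if (pre_version != ENOENT_VERSION) {
                              fprintf(stderr, "Version mismatch 0x%llx != 0x%llx\n",
                                      (unsigned long long)ENOENT_VERSION,
                                      (unsigned long long)pre_version);
                              return -EOVERFLOW;   /* stand-in for the replay failure */
                      }
                      return 0;
              }

              int main(void)
              {
                      return replay_open_create(0x66c8) ? 1 : 0;
              }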


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 4
