[LU-924] Test failure on test suite recovery-small, subtest test_105 Created: 14/Dec/11  Updated: 29/Jun/12  Resolved: 04/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.2.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-1582 replay-ost-single.sh test_8b: client ... Resolved
Severity: 3
Rank (Obsolete): 4788

 Description   

This issue was created by maloo for Chris Gearing <chris@whamcloud.com>


This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/467f4a76-25fd-11e1-ae7f-5254004bbbd3.

The sub-test test_105 failed with the following error:

post-failover df: 1

Info required for matching: recovery-small 105



 Comments   
Comment by Peter Jones [ 14/Dec/11 ]

Jinshan, can you look into this one?

Comment by Jinshan Xiong (Inactive) [ 14/Dec/11 ]

The problem is clear: the client data was not written into persistent storage before failing over OSTs.

From the log, client UUID 8f9b83bd-22e0-c555-20a1-fe22e2547e58 was the old one, and in the test case we remounted the client with the new UUID 7419f6fe-3688-8bc3-9560-7367776851cd. However, the OST didn't update its storage in time, so when the OST came back up, it still waited for the old client.

Comment by Chris Gearing (Inactive) [ 15/Dec/11 ]

Hi Jinshan,

LU-924 references a problem which I could read as being because I've used logical volumes for autotest.

Can you elaborate, please? Autotest relies heavily on LVM, so I'm somewhat concerned.
[19:25:35] Jinshan Xiong: Not entirely, there were a lot of failures about that. The root cause is that we used to wait for the transaction to be finished before returning to the client
[19:25:55] Jinshan Xiong: it worked well with hard drives
[19:26:23] Jinshan Xiong: but for lvm, it seems to take longer for data to be flushed
[19:26:23] Chris Gearing: A lot of failures about what? You'll have to forgive me, I have no historical knowledge so I need it spelled out!
[19:26:41 | Edited 19:27:01] Chris Gearing: But don't we handshake the data onto the physical media?
[19:26:48] Jinshan Xiong: I don't remember the ticket number
[19:26:58] Jinshan Xiong: but Niu had some fixes about it
[19:27:09] Jinshan Xiong: I will check with him later
[19:27:19] Chris Gearing: Should this be allocated to Niu?
[19:27:22] Jinshan Xiong: On 12/15/11, at 11:26 AM, Chris Gearing wrote:
> But don't we handshake the data onto the physical media?
unfortunately
[19:27:23] Jinshan Xiong: no
[19:27:54] Jinshan Xiong: On 12/15/11, at 11:27 AM, Chris Gearing wrote:
> Should this be allocated to Niu?
Sure, this is the same kind of issue.
[19:28:48] Jinshan Xiong: hmm I shouldn't have faulted lvm
[19:29:00] Jinshan Xiong: this is definitely a lustre issue
[19:29:33] Jinshan Xiong: we should write critical data in O_SYNC mode
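
For readers unfamiliar with the idea in the last chat line, here is a minimal userspace sketch of a synchronous write. It only illustrates general POSIX `O_SYNC` semantics; it is not Lustre server code, and the function name `write_durably` and the demo file name are made up for illustration.

```python
import os
import tempfile


def write_durably(path: str, payload: bytes) -> int:
    """Write payload, asking the OS to push it to storage before returning."""
    # O_SYNC requests that each write completes to the device before returning.
    # getattr() keeps the sketch importable on platforms that lack the flag.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        written = os.write(fd, payload)
        # Explicit fsync as a fallback where O_SYNC is unavailable.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "client_record.demo")  # hypothetical file name
        print(write_durably(demo, b"client-uuid"))
```

The trade-off raised later in the ticket applies here too: a synchronous write per connecting client is durable but can become expensive when many clients connect at once.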

Comment by Niu Yawei (Inactive) [ 16/Dec/11 ]

> The problem is clear: the client data was not written into persistent storage before failing over OSTs.
>
> From the log, client UUID 8f9b83bd-22e0-c555-20a1-fe22e2547e58 was the old one, and in the test case we remounted the client with the new UUID 7419f6fe-3688-8bc3-9560-7367776851cd. However, the OST didn't update its storage in time, so when the OST came back up, it still waited for the old client.

We always sync write client data to disk, so I don't see why it wasn't updated.

The test failed because the client that invokes the test was evicted. This client isn't the remounted one, am I right?

Comment by Chris Gearing (Inactive) [ 16/Dec/11 ]

I'm afraid I don't know the answer to the question about which client was evicted and which was remounted.

Perhaps Jinshan has more knowledge of the test.

Comment by Niu Yawei (Inactive) [ 19/Dec/11 ]

After looking more closely at the code, I realized that we don't sync-write client data to disk; we just set exp_need_sync = 1 to tell the following operation to sync the client data, which is probably meant to avoid a flood of sync writes when many clients are connecting.

The log also shows that the df client (the one invoking the test) was evicted by ost-0000: ost-0000 didn't find the old export on disk, so it regarded the df client as a new client. The df was therefore not resent after recovery, which resulted in the test failure at the end.

I think we'd better make sure that the client data of the df client is synced to disk (otherwise it could be evicted as a new client during recovery). There are probably two ways to achieve this:

  • sync all OSTs on ost1 before 'fail ost1';
  • create at least one object on each OST of ost1 before 'fail ost1';

Jinshan, what's your opinion?
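
The failure mode and the first proposed fix can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Lustre code: `ToyOST`, its methods, and the `df-client` name are hypothetical, and the model reduces the deferred sync described above to "records stay in memory until `sync()` is called".

```python
from dataclasses import dataclass, field


@dataclass
class ToyOST:
    """Toy stand-in for an OST; not Lustre code."""
    memory: set = field(default_factory=set)  # client records held in RAM
    disk: set = field(default_factory=set)    # client records that survive failover

    def connect(self, client_uuid: str) -> None:
        # Deferred persistence: the record is only noted in memory,
        # loosely mirroring the exp_need_sync = 1 behaviour described above.
        self.memory.add(client_uuid)

    def sync(self) -> None:
        # Flush every in-memory client record to persistent storage.
        self.disk |= self.memory

    def failover(self) -> None:
        # A restart loses everything that was never flushed.
        self.memory = set(self.disk)

    def recover(self, client_uuid: str) -> str:
        # Known clients take part in recovery; unknown ones are treated as new.
        return "recovered" if client_uuid in self.memory else "evicted"


def run(sync_before_fail: bool) -> str:
    ost = ToyOST()
    ost.connect("df-client")
    if sync_before_fail:
        ost.sync()
    ost.failover()
    return ost.recover("df-client")


if __name__ == "__main__":
    print(run(sync_before_fail=False))  # evicted
    print(run(sync_before_fail=True))   # recovered
```

In this model the only difference between the failing and the passing run is whether the flush happens before the failover, which is the essence of the "sync before 'fail ost1'" option.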

Comment by Niu Yawei (Inactive) [ 19/Dec/11 ]

patch for master: http://review.whamcloud.com/1888

Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ]

Though we have talked about this over Skype, I'm writing it here for the record.

Sync seems fine to me.

Sorry for the delayed response.

Comment by Peter Jones [ 04/Jan/12 ]

Landed for 2.2

Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » i686,server,el5,ofa #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » i686,client,el5,ofa #402
LU-924 test: sync client data before reboot server (Revision 17a69cf25ed0991e04d85c259f4294dc59734e1e)

Result = SUCCESS
Oleg Drokin : 17a69cf25ed0991e04d85c259f4294dc59734e1e
Files :

  • lustre/tests/test-framework.sh
Generated at Sat Feb 10 01:11:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.