[LU-2194] Test failure on test suite recovery-small, subtest test_19b Created: 16/Oct/12  Updated: 03/Apr/16  Resolved: 07/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0, Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: LB

Severity: 3
Rank (Obsolete): 5236

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/a7b63102-1737-11e2-afe1-52540035b04c.

The sub-test test_19b failed with the following error:

no eviction

Info required for matching: recovery-small 19b



 Comments   
Comment by Nathaniel Clark [ 21/Nov/12 ]

http://review.whamcloud.com/4652

Comment by Nathaniel Clark [ 03/Dec/12 ]

https://maloo.whamcloud.com/test_sets/d9395c3c-3d83-11e2-9127-52540035b04c

Comment by Hongchao Zhang [ 11/Dec/12 ]

Same error as the link above: https://maloo.whamcloud.com/test_sets/d9395c3c-3d83-11e2-9127-52540035b04c

https://maloo.whamcloud.com/test_sets/3e897880-432c-11e2-b57a-52540035b04c

Comment by Hongchao Zhang [ 11/Dec/12 ]

The cause is that the client is evicted by the OST as a side effect of the previous subtest, test_19a.
In test_19a, fail_loc OBD_FAIL_LDLM_CANCEL_NET is set on the MDT and the OSTs. The test detects an eviction (which came from the MDT),
assumes it has passed, and continues to test_19b, which then encounters the eviction from the OST.
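
The sequence described above can be illustrated with a toy model. This is only a sketch and not code from the Lustre test framework: the names `Target`, `first_eviction`, `drain_evictions`, and `run_19a_then_19b` are hypothetical, evictions are modeled as plain flags, and draining all targets is shown as one possible remedy, not necessarily what the patch does.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical stand-in for an MDT or OST that may evict the client."""
    name: str
    eviction_pending: bool = False


def first_eviction(targets):
    """Consume and return only the first pending eviction (the naive check)."""
    for t in targets:
        if t.eviction_pending:
            t.eviction_pending = False
            return t.name
    return None


def drain_evictions(targets):
    """Consume every pending eviction so none leaks into the next subtest."""
    seen = [t.name for t in targets if t.eviction_pending]
    for t in targets:
        t.eviction_pending = False
    return seen


def run_19a_then_19b(check):
    """Return the evictions still pending when test_19b would start."""
    targets = [Target("MDT0000"), Target("OST0000")]
    # test_19a: the fail_loc is set on the MDT and the OSTs, so both evict.
    for t in targets:
        t.eviction_pending = True
    check(targets)  # test_19a stops looking after this call
    # Anything still pending here is a stale eviction seen by test_19b.
    return [t.name for t in targets if t.eviction_pending]


print(run_19a_then_19b(first_eviction))   # ['OST0000']
print(run_19a_then_19b(drain_evictions))  # []
```

With the naive check, the OST eviction survives into test_19b; when every target is drained, nothing is left over.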

Comment by Peter Jones [ 17/Dec/12 ]

Landed for 2.4

Comment by Peng Tao [ 24/Dec/12 ]

The same error as in comments 2 and 3 above is seen again with the patch from comment 1 (commit cc980df563ef86847aae1e0a3f0a5b17589e6297) applied.

https://maloo.whamcloud.com/test_sets/070528a6-4cea-11e2-bf7d-52540035b04c

https://maloo.whamcloud.com/test_sets/befc32f8-4c25-11e2-875d-52540035b04c

Comment by Keith Mannthey (Inactive) [ 04/Jan/13 ]

I don't know if this is exactly the same error as originally reported, but it is the same test and the same error Peng Tao is seeing.

Error: 'test_19b failed with 1'

https://maloo.whamcloud.com/test_sessions/4d2113e8-471a-11e2-9537-52540035b04c

In my example, the LNet service thread on the MDS gets tied up:
"LNet: Service thread pid 7757 completed after 150.32s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources)."
There is a stack dump for a thread hung for 40 seconds, plus plenty of other errors in the dmesg log about LNet service thread 7757.

Comment by Keith Mannthey (Inactive) [ 04/Jan/13 ]

It seems test_19b is still having issues.

Comment by Hongchao Zhang [ 05/Jan/13 ]

same error:
https://maloo.whamcloud.com/test_sets/befc32f8-4c25-11e2-875d-52540035b04c

Comment by nasf (Inactive) [ 16/Jan/13 ]

Another failure:

https://maloo.whamcloud.com/test_sets/ac8aa71c-6034-11e2-84d4-52540035b04c

Comment by Keith Mannthey (Inactive) [ 17/Jan/13 ]

Could the environment somehow not be set up before the tests start?

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

another instance:
https://maloo.whamcloud.com/test_sets/119c11ae-6233-11e2-b20c-52540035b04c

Comment by Nathaniel Clark [ 21/Jan/13 ]

http://review.whamcloud.com/5141

Comment by nasf (Inactive) [ 22/Jan/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/da6627c0-64b1-11e2-a1aa-52540035b04c

Comment by Sebastien Buisson (Inactive) [ 29/Jan/13 ]

Yet another failure instance:
https://maloo.whamcloud.com/test_sets/1120911e-69a9-11e2-954a-52540035b04c

Comment by Henri Doreau (Inactive) [ 30/Jan/13 ]

Another one:
https://maloo.whamcloud.com/test_sets/145ca582-6ab4-11e2-bea0-52540035b04c

Comment by Bob Glossman (Inactive) [ 30/Jan/13 ]

Another instance:
https://maloo.whamcloud.com/test_sets/1fb3c870-6ae1-11e2-bea0-52540035b04c

Comment by Peter Jones [ 18/Feb/13 ]

Has this recurred since the extra debugging landed?

Comment by Nathaniel Clark [ 19/Feb/13 ]

Found one after the debugging patch landed:
https://maloo.whamcloud.com/test_sets/e3c10bd4-7878-11e2-9928-52540035b04c

Comment by Nathaniel Clark [ 19/Feb/13 ]

Client error for multiop:

Lustre: DEBUG MARKER: multiop /mnt/lustre/f.recovery-small.19b Ow
LustreError: 11-0: lustre-OST0000-osc-ffff88007c0b8000: Communicating with 10.10.17.16@tcp, operation ldlm_enqueue failed with -107.
LustreError: 23456:0:(ldlm_request.c:1267:ldlm_cli_cancel_req()) Got rc -5 from cancel RPC: canceling anyway
LustreError: 23456:0:(ldlm_request.c:1904:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -5
LustreError: 23456:0:(ldlm_request.c:1904:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-small test_19b: @@@@@@ FAIL: failed to run multiop: 5
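
For reading the return codes in the client log above, a small decoder can help. This is a sketch under stated assumptions: `ERRNO_NAMES`, `RC_PATTERN`, and `decode_return_codes` are hypothetical names, the table holds only the Linux errno values for the two codes that appear here, and the regular expression is fitted to these three message shapes only.

```python
import re

# Linux errno values for the two codes that appear in the log above.
ERRNO_NAMES = {5: "EIO", 107: "ENOTCONN"}

# Matches the three message shapes seen in the client log.
RC_PATTERN = re.compile(r"(?:failed with|Got rc|ldlm_cli_cancel_list:) (-\d+)")


def decode_return_codes(log_text):
    """Return (code, errno name) pairs in order of appearance."""
    found = []
    for match in RC_PATTERN.finditer(log_text):
        code = int(match.group(1))
        found.append((code, ERRNO_NAMES.get(-code, "unknown")))
    return found


sample = """\
LustreError: 11-0: lustre-OST0000-osc-ffff88007c0b8000: Communicating with 10.10.17.16@tcp, operation ldlm_enqueue failed with -107.
LustreError: 23456:0:(ldlm_request.c:1267:ldlm_cli_cancel_req()) Got rc -5 from cancel RPC: canceling anyway
LustreError: 23456:0:(ldlm_request.c:1904:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -5
"""

for code, name in decode_return_codes(sample):
    print(code, name)
```

Read this way, the enqueue fails with ENOTCONN and the subsequent cancel RPCs fail with EIO, which matches the "failed to run multiop: 5" marker.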

Comment by Bob Glossman (Inactive) [ 25/Feb/13 ]

another instance:
https://maloo.whamcloud.com/test_sets/48d621c0-7271-11e2-9b41-52540035b04c

Comment by Zhenyu Xu [ 25/Feb/13 ]

another hit at https://maloo.whamcloud.com/test_sets/87356fc4-7f6f-11e2-afad-52540035b04c

The OST console log shows:

07:12:30:Lustre: DEBUG MARKER: == recovery-small test 19b: test expired_lock_main on ost (2867) == 07:12:23 (1361805143)
07:12:30:LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 153s: evicting client at 10.10.17.14@tcp ns: filter-ffff88006b003000 lock: ffff8800679ac000/0xf574bdda33b2fbc1 lrc: 3/0,0 mode: PR/PR res: 36/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.17.14@tcp remote: 0x2e77606da0e4cfde expref: 4 pid: 29406 timeout: 4307850023 lvb_type: 1
07:12:30:LNet: Service thread pid 29405 completed after 150.98s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
07:12:30:Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-small test_19b: @@@@@@ FAIL: failed to run multiop: 5

So the OST evicts the client.
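
To spot this kind of eviction across many console logs, the message can be matched mechanically. The following is a minimal sketch, assuming the message keeps the shape shown above: `EVICTION` and `find_evictions` are hypothetical names, and the sample line is shortened after the namespace field.

```python
import re

# Fitted to the waiting_locks_callback() message shown in the OST console log.
EVICTION = re.compile(
    r"lock callback timer expired after (?P<seconds>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+ns: (?P<namespace>\S+)"
)


def find_evictions(log_text):
    """Return one dict per eviction message found in the log text."""
    return [
        {
            "seconds": int(m["seconds"]),
            "nid": m["nid"],
            "namespace": m["namespace"],
        }
        for m in EVICTION.finditer(log_text)
    ]


# Sample line from the log above, shortened after the namespace field.
line = (
    "07:12:30:LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) "
    "### lock callback timer expired after 153s: evicting client at "
    "10.10.17.14@tcp ns: filter-ffff88006b003000 lock: ..."
)

print(find_evictions(line))
```

For the line above, this reports a 153-second timer, the client NID 10.10.17.14@tcp, and the namespace filter-ffff88006b003000.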

Comment by Bob Glossman (Inactive) [ 05/Mar/13 ]

another instance:
https://maloo.whamcloud.com/test_sets/eeae8204-85d6-11e2-9f8d-52540035b04c

Comment by Peter Jones [ 08/Mar/13 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 11/Mar/13 ]

the patch is tracked at http://review.whamcloud.com/#change,5679

Comment by Peter Jones [ 18/Mar/13 ]

Landed for 2.4

Comment by Andreas Dilger [ 01/Oct/14 ]

recovery-small test_19b is still being skipped due to this bug.

Comment by Gerrit Updater [ 06/Feb/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/13671
Subject: LU-2194 test: remove test_19b from except list
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7531f2ac0495f021d79427bf9dfab41a2d269db6

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13671/
Subject: LU-2194 test: remove test_19b from except list
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 87e12095044c905078008efe11b3476374088c3e
