[LU-6573] multiple tests: client evicted, Input/output error Created: 06/May/15  Updated: 01/Jul/16  Resolved: 22/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-6604 sanity-sec test_23: @@@@@@ FAIL: acl ... Resolved
is duplicated by LU-6623 sanity-sec test_21:mapped_trusted_noa... Resolved
Related
is related to LU-6441 OST problems following router node cr... Resolved
is related to LU-5407 Failover failure on test suite replay... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f9a7140c-f404-11e4-9769-5254006e85c2.

The sub-test test_10b failed with the following error:

dd /mnt/lustre2
10:05:09:Lustre: DEBUG MARKER: == sanityn test 10b: write of file with sub-page size on multiple mounts == 10:04:38 (1430906678)
10:05:09:LustreError: 11-0: lustre-OST0002-osc-ffff88007b4d1400: operation ldlm_enqueue to node 10.1.5.213@tcp failed: rc = -107
10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection to lustre-OST0002 (at 10.1.5.213@tcp) was lost; in progress operations using this service will wait for recovery to complete
10:05:09:LustreError: 167-0: lustre-OST0002-osc-ffff88007b4d1400: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection restored to lustre-OST0002 (at 10.1.5.213@tcp)
10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_10b: @@@@@@ FAIL: dd \/mnt\/lustre2 
10:05:09:Lustre: DEBUG MARKER: sanityn test_10b: @@@@@@ FAIL: dd /mnt/lustre2
10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2015-05-06/lustre-reviews-el6_6-x86_64--review-dne-part-1--2_8_1__32016__-69851820759220-072240/sanityn.test_10b.debug_log.$(hostname -s).1430906678.log;
10:05:09:         dmesg > /logdir/test_logs/2015-05-06/lustre-reviews-el
10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 11: execution of file opened for write should return error ====== 10:04:40 \(1430906680\)



 Comments   
Comment by Andreas Dilger [ 06/May/15 ]

Looks like the OST evicted the client before test_10b started:

Lustre: DEBUG MARKER: == sanityn test 9: append of file with sub-page size on multiple mounts == 10:04:29 (1430906669)
Lustre: lustre-OST0002: haven't heard from client 40fe07c8-ede6-5a1d-89d2-dbf51d83c46a (at 10.1.5.215@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp ffff8800690a9800, cur 1430906674 expire 1430906644 last 1430906626
Lustre: DEBUG MARKER: == sanityn test 10a: write of file with sub-page size on multiple mounts == 10:04:34 (1430906674)
Lustre: DEBUG MARKER: == sanityn test 10b: write of file with sub-page size on multiple mounts == 10:04:38 (1430906678)
Lustre: DEBUG MARKER: sanityn test_10b: @@@@@@ FAIL: dd /mnt/lustre2

This might be due to slowness of the VM host or any number of other issues. It seems to be failing randomly across a wide variety of tests.
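For reference, the numbers in that console message line up with the ping/eviction arithmetic. A minimal sketch (the constant relationships are recalled from the Lustre tree, so treat them as assumptions rather than verified code):

#include <stdio.h>

/*
 * Assumed relationships, roughly as defined in the Lustre headers:
 *   PING_INTERVAL      = obd_timeout / 4    -> 20 / 4 = 5s under autotest
 *   PING_EVICT_TIMEOUT = PING_INTERVAL * 6  -> 30s
 */
int main(void)
{
        long obd_timeout = 20;                    /* autotest default */
        long ping_interval = obd_timeout / 4;     /* 5s */
        long evict_timeout = ping_interval * 6;   /* 30s */
        long cur = 1430906674, last = 1430906626; /* from the log above */
        long expire = cur - evict_timeout;

        printf("expire=%ld (log says 1430906644), silent %lds (log says 48)\n",
               expire, cur - last);
        return last < expire; /* non-zero exit: client would be evicted */
}

The computed expire (1430906644) and silence window (48 seconds) match the console message exactly, so the evictor itself is doing the expected arithmetic.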

Comment by Andreas Dilger [ 07/May/15 ]

I wonder if this failure and LU-6558 are fallout from landing http://review.whamcloud.com/9286 "LU-3266 test: regression tests for nrs policies", which enables NRS in sanity.sh test_77 but doesn't appear to disable it afterward.

Comment by Niu Yawei (Inactive) [ 08/May/15 ]

I don't see why lots of replay-single test_58c failures were marked as associated with this ticket:

https://testing.hpdd.intel.com/test_sets/03f97924-f11a-11e4-9e14-5254006e85c2
https://testing.hpdd.intel.com/test_sets/75a610ae-f143-11e4-bb65-5254006e85c2
https://testing.hpdd.intel.com/test_sets/0d145610-f2d1-11e4-aad2-5254006e85c2
https://testing.hpdd.intel.com/test_sets/08da0d34-f442-11e4-a594-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ff57f6e2-f476-11e4-ac9e-5254006e85c2

I believe they were caused by LU-4913 and should have been fixed by LU-5420.

Comment by Andreas Dilger [ 09/May/15 ]

Seen on the server:

Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0001_UUID(at 192.168.5.153@o2ib), waiting for 3 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:59
Lustre: Skipped 25 previous similar messages

so it seems that the LWP connection is being blocked during recovery, which probably isn't right, since admitting it wouldn't affect recovery.
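To make the shape of the problem concrete, here is a hedged sketch of the connect-time gate being described; the lightweight-connection flag and the field names are assumptions drawn from memory of the Lustre source, not from this ticket:

/*
 * Sketch: during recovery a target only admits exports it already knew
 * about, so a brand-new connection is denied.  Lustre marks LWP
 * connections with an OBD_CONNECT_LIGHTWEIGHT-style flag and they hold
 * no recovery state to replay, so one plausible fix shape is to let
 * them bypass the gate.
 */
#define CONNECT_LIGHTWEIGHT (1ULL << 0) /* stand-in for the real flag value */

struct conn_req {
        unsigned long long ocd_connect_flags;
        int known_client;       /* export existed before the failure */
};

static int deny_during_recovery(const struct conn_req *req, int in_recovery)
{
        if (!in_recovery || req->known_client)
                return 0;       /* admit */
        if (req->ocd_connect_flags & CONNECT_LIGHTWEIGHT)
                return 0;       /* LWP: admitting it can't affect recovery */
        return 1;               /* deny: yields "Denying connection for new client ..." */
}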

Comment by Niu Yawei (Inactive) [ 11/May/15 ]

The denied LWP connection problem could be caused by commit 34a9bf694a584437e4bf7a037dea32a6271e93d3:

Author: Mikhail Pershin <mike.pershin@intel.com>
Date:   Tue Dec 3 17:30:15 2013 +0400

    LU-4214 target: LWP failover should create new export

    Upon reconnection from new NID LWP client need to create new
    export and destroy old one like that is done for MDS connection
Comment by Niu Yawei (Inactive) [ 11/May/15 ]

Looks like something went wrong in the eviction mechanism.

Take this failure https://testing.hpdd.intel.com/test_sets/50e05d66-f589-11e4-91fd-5254006e85c2 as an example:

00000100:00100000:0.0:1431048749.096032:0:24985:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:e6747c7f-1ec5-b75c-2bcb-cd4c06ffb96a+8:9457:x1500561708378636:12345-10.1.4.112@tcp:400

The MDT received a PING from client 10.1.4.112 at time 1431048749, and the PING was processed.

00000100:00080000:0.0:1431048756.180583:0:25594:0:(service.c:1098:ptlrpc_update_export_timer()) lustre-MDT0000: Think about evicting 10.1.4.112@tcp from 1431048725

However, at time 1431048756 the MDT thinks the last request from 10.1.4.112 arrived at time 1431048725, which is much older than 1431048749.

00000100:00080000:1.0:1431048773.266900:0:24981:0:(pinger.c:637:ping_evictor_main()) evicting all exports of obd lustre-MDT0000 older than 1431048743
00000100:02000400:1.0:1431048773.266911:0:24981:0:(pinger.c:663:ping_evictor_main()) lustre-MDT0000: haven't heard from client 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef (at 10.1.4.112@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp ffff880072a0b800, cur 1431048773 expire 1431048743 last 1431048725

Finally, at time 1431048773, client 10.1.4.112 is evicted.

It seems the last request time for 10.1.4.112 was somehow not updated properly.
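For illustration, the bookkeeping involved looks roughly like this (names loosely mirror ptlrpc_update_export_timer() and ping_evictor_main(); this is an assumption-laden sketch, not the real code). Note that 1431048773 - 30 = 1431048743, matching the evictor's deadline above:

#include <time.h>

struct export_stub {
        time_t exp_last_request_time;   /* refreshed per handled request */
};

/* called for every request the target handles, PINGs included */
static void update_export_timer(struct export_stub *exp)
{
        exp->exp_last_request_time = time(NULL);
}

/* evictor pass: anything silent longer than the 30s window goes */
static int should_evict(const struct export_stub *exp, time_t now)
{
        return exp->exp_last_request_time < now - 30;
}

If the refresh is missed, or applied to a different export than the evictor later inspects, the evictor sees a stale "last" and fires even though PINGs kept arriving.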

Comment by Niu Yawei (Inactive) [ 12/May/15 ]

Oops, e6747c7f-1ec5-b75c-2bcb-cd4c06ffb96a and 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef are different mounts from the same NID: exports are tracked per client UUID, so the PING handled above kept a different export alive than the one that was evicted.

Comment by Gerrit Updater [ 12/May/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14777
Subject: LU-6573 ptlrpc: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eb866c9274821fb6a818d6ec42c2b486b324d0cc

Comment by Niu Yawei (Inactive) [ 12/May/15 ]

I suspect the client isn't sending PINGs as expected, which eventually causes the server to evict it. The debug patch http://review.whamcloud.com/#/c/14777/ adds more verbose messages around the client's PING handling; hopefully we can reproduce the problem with the patch applied.
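For context, the client-side mechanism under suspicion is roughly the following loop; this is an illustrative sketch of a pinger thread, not pinger.c itself (obd_ping() here is a hypothetical send helper, and the import list is assumed):

#include <unistd.h>

struct import;                          /* one per server connection */
extern int obd_ping(struct import *imp);/* assumed: sends one OBD_PING */
extern struct import *imports[];
extern int nimports;

/* ping every import each PING_INTERVAL so the server's last-request
 * time keeps moving; if this stalls or skips an import, the server's
 * ping evictor fires after the 30s window */
static void pinger_main(long ping_interval)
{
        for (;;) {
                for (int i = 0; i < nimports; i++)
                        obd_ping(imports[i]);
                sleep(ping_interval);   /* ~5s with obd_timeout=20 */
        }
}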

Comment by Bob Glossman (Inactive) [ 14/May/15 ]

I think this is another instance, seen in master:
https://testing.hpdd.intel.com/test_sets/5265d50a-fa23-11e4-a7e2-5254006e85c2

Comment by Niu Yawei (Inactive) [ 18/May/15 ]

I saw in some failures that the ping interval was 1 second (the default obd_timeout for autotest is 20 seconds, so the default ping interval should be 5 seconds, since the interval is derived as obd_timeout/4). I think sanity test_224c could be the culprit: it changed obd_timeout to 5 seconds during testing (giving a 1-second ping interval), and while it did change it back to the default value, it didn't wait for the change to take effect. I fixed this in: http://review.whamcloud.com/14844

In most ZFS test failures the ping interval is the default 5 seconds; I'm wondering if we should use a longer obd_timeout/ping_interval for ZFS testing?

Comment by Gerrit Updater [ 18/May/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14844
Subject: LU-6573 test: fix sanity test_224c
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 942195be07082d82fca7dee225e9ddb0fe311290

Comment by Gerrit Updater [ 21/May/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14844/
Subject: LU-6573 test: fix sanity test_224c
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 551dc77e6ecd590d35fe7759124b98961642e831

Comment by Peter Jones [ 22/May/15 ]

Landed for 2.8

Comment by Andreas Dilger [ 26/May/15 ]

This was introduced by patch http://review.whamcloud.com/14399 "LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds".
