[LU-6573] multiple tests: client evicted, Input/output error
| Created: | 06/May/15 | Updated: | 01/Jul/16 | Resolved: | 22/May/15 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f9a7140c-f404-11e4-9769-5254006e85c2.

The sub-test test_10b failed with the following error: dd /mnt/lustre2

10:05:09:Lustre: DEBUG MARKER: == sanityn test 10b: write of file with sub-page size on multiple mounts == 10:04:38 (1430906678)
10:05:09:LustreError: 11-0: lustre-OST0002-osc-ffff88007b4d1400: operation ldlm_enqueue to node 10.1.5.213@tcp failed: rc = -107
10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection to lustre-OST0002 (at 10.1.5.213@tcp) was lost; in progress operations using this service will wait for recovery to complete
10:05:09:LustreError: 167-0: lustre-OST0002-osc-ffff88007b4d1400: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection restored to lustre-OST0002 (at 10.1.5.213@tcp)
10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_10b: @@@@@@ FAIL: dd \/mnt\/lustre2
10:05:09:Lustre: DEBUG MARKER: sanityn test_10b: @@@@@@ FAIL: dd /mnt/lustre2
10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2015-05-06/lustre-reviews-el6_6-x86_64--review-dne-part-1--2_8_1__32016__-69851820759220-072240/sanityn.test_10b.debug_log.$(hostname -s).1430906678.log;
10:05:09: dmesg > /logdir/test_logs/2015-05-06/lustre-reviews-el
10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 11: execution of file opened for write should return error ====== 10:04:40 \(1430906680\)
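For context, the failing step is essentially a sub-page write through the second of two mounts of the same filesystem. A minimal C sketch of the operation (paths, file name, and buffer size are illustrative, not what the test script uses verbatim); the write fails with EIO once the client behind that mount has been evicted:

```c
/* Minimal sketch of what sanityn test_10b exercises, assuming two mounts
 * of the same Lustre filesystem at /mnt/lustre and /mnt/lustre2.  A write
 * smaller than PAGE_SIZE is issued through each mount in turn; once the
 * client behind one mount has been evicted, its write fails with EIO. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int subpage_write(const char *path)
{
        char buf[300];                  /* well under one page */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return -1;
        memset(buf, 'x', sizeof(buf));
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");        /* EIO after an eviction */
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        return subpage_write("/mnt/lustre/f10b") < 0 ||
               subpage_write("/mnt/lustre2/f10b") < 0;
}
```
|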
| Comments |
| Comment by Andreas Dilger [ 06/May/15 ] |
|
Looks like the OST evicted the client before test_10b started:

Lustre: DEBUG MARKER: == sanityn test 9: append of file with sub-page size on multiple mounts == 10:04:29 (1430906669)
Lustre: lustre-OST0002: haven't heard from client 40fe07c8-ede6-5a1d-89d2-dbf51d83c46a (at 10.1.5.215@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp ffff8800690a9800, cur 1430906674 expire 1430906644 last 1430906626
Lustre: DEBUG MARKER: == sanityn test 10a: write of file with sub-page size on multiple mounts == 10:04:34 (1430906674)
Lustre: DEBUG MARKER: == sanityn test 10b: write of file with sub-page size on multiple mounts == 10:04:38 (1430906678)
Lustre: DEBUG MARKER: sanityn test_10b: @@@@@@ FAIL: dd /mnt/lustre2

This might be due to slowness of the VM host or any number of other issues. It is failing randomly in a wide variety of different tests.
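For reference, the eviction window here is derived from obd_timeout: clients ping every obd_timeout/4 seconds (PING_INTERVAL) and the ping evictor drops exports that have been silent for six ping intervals (PING_EVICT_TIMEOUT). A minimal sketch of that check; the constants mirror the Lustre source, but the helpers themselves are illustrative:

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative helpers only; the constants mirror PING_INTERVAL
 * (obd_timeout / 4, at least 1) and PING_EVICT_TIMEOUT (six ping
 * intervals) from the Lustre source, but this is not the real
 * ping_evictor_main() code. */
static inline int ping_interval(int obd_timeout)
{
        return obd_timeout / 4 > 1 ? obd_timeout / 4 : 1;
}

static inline bool export_expired(time_t last_request, int obd_timeout,
                                  time_t now)
{
        return now - last_request > 6 * ping_interval(obd_timeout);
}
```

With autotest's obd_timeout of 20 this gives a 5-second ping interval and a 30-second eviction window, consistent with cur 1430906674 vs. expire 1430906644 in the log above.
|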
| Comment by Andreas Dilger [ 07/May/15 ] |
|
I wonder if this failure and |
| Comment by Niu Yawei (Inactive) [ 08/May/15 ] |
|
I don't see why lots of replay-single test_58c failures were set as associated: https://testing.hpdd.intel.com/test_sets/03f97924-f11a-11e4-9e14-5254006e85c2

I think they should be caused by |
| Comment by Andreas Dilger [ 09/May/15 ] |
|
Seen on the server:

Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0001_UUID(at 192.168.5.153@o2ib), waiting for 3 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:59
Lustre: Skipped 25 previous similar messages

So it seems that the LWP connection is being blocked during recovery, which probably isn't right since it won't affect recovery.
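For reference, the denial comes from the recovery gate on the target: while recovery is in progress, only clients the target knew before the restart are admitted, so an LWP client reconnecting from a new NID presents as a "new" client and is refused. A hedged sketch of that gate; names and the return code are illustrative, not the real target_handle_connect() logic:

```c
#include <errno.h>
#include <stdbool.h>

/* Hedged sketch of the recovery gate behind the message above: while a
 * target is recovering, only previously-known clients are admitted, so
 * a reconnection that presents as a new export is refused until
 * recovery completes. */
struct recovery_state {
        bool in_recovery;
        int  known_clients;
        int  recovered;
};

static int admit_connection(const struct recovery_state *t, bool was_known)
{
        if (t->in_recovery && !was_known)
                return -EBUSY;  /* "Denying connection for new client" */
        return 0;
}
```
|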
| Comment by Niu Yawei (Inactive) [ 11/May/15 ] |
|
The denying-LWP-connection problem could be caused by commit 34a9bf694a584437e4bf7a037dea32a6271e93d3:

    Author: Mikhail Pershin <mike.pershin@intel.com>
    Date:   Tue Dec 3 17:30:15 2013 +0400

    LU-4214 target: LWP failover should create new export

    Upon reconnection from new NID LWP client need to create new
    export and destroy old one like that is done for MDS connection
|
| Comment by Niu Yawei (Inactive) [ 11/May/15 ] |
|
Looks like something went wrong in the eviction mechanism. Take this failure https://testing.hpdd.intel.com/test_sets/50e05d66-f589-11e4-91fd-5254006e85c2 as an example:

00000100:00100000:0.0:1431048749.096032:0:24985:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:e6747c7f-1ec5-b75c-2bcb-cd4c06ffb96a+8:9457:x1500561708378636:12345-10.1.4.112@tcp:400

The MDT received a PING from client 10.1.4.112 at time 1431048749, and the PING was processed.

00000100:00080000:0.0:1431048756.180583:0:25594:0:(service.c:1098:ptlrpc_update_export_timer()) lustre-MDT0000: Think about evicting 10.1.4.112@tcp from 1431048725

However, at time 1431048756 the MDT thinks the last request from 10.1.4.112 arrived at time 1431048725, which is much older than 1431048749.

00000100:00080000:1.0:1431048773.266900:0:24981:0:(pinger.c:637:ping_evictor_main()) evicting all exports of obd lustre-MDT0000 older than 1431048743
00000100:02000400:1.0:1431048773.266911:0:24981:0:(pinger.c:663:ping_evictor_main()) lustre-MDT0000: haven't heard from client 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef (at 10.1.4.112@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp ffff880072a0b800, cur 1431048773 expire 1431048743 last 1431048725

Finally, at time 1431048773, the client 10.1.4.112 is evicted. It seems the last request time for 10.1.4.112 was somehow not updated properly.
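What would normally prevent this: every handled request should advance the export's last-request stamp (the job of ptlrpc_update_export_timer() in the trace), and the evictor compares that stamp against the eviction window. A minimal sketch of the expected bookkeeping, with illustrative names:

```c
#include <time.h>

/* Illustrative sketch of the bookkeeping that appears to have failed:
 * each handled request should advance the export's last-request stamp,
 * which the evictor later compares against the eviction window.  Here
 * the PING handled at 1431048749 never moved the stamp past 1431048725. */
struct export_stub {
        time_t exp_last_request_time;
};

static void update_export_timer(struct export_stub *exp, time_t now)
{
        if (now > exp->exp_last_request_time)
                exp->exp_last_request_time = now;       /* expected update */
}
```
|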
| Comment by Niu Yawei (Inactive) [ 12/May/15 ] |
|
Oops, e6747c7f-1ec5-b75c-2bcb-cd4c06ffb96a and 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef are different mounts from the same NID.
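In other words, exports are keyed by client UUID rather than NID, so two mounts on one node produce two exports with independent last-request times; the PING handled at 1431048749 belonged to a different export than the one evicted. A tiny illustrative sketch of that relationship:

```c
#include <time.h>

/* Illustrative only: exports are keyed by client UUID, not NID, so two
 * mounts on one node yield two exports whose last-request times age
 * independently under the ping evictor. */
struct export_key {
        const char *client_uuid;   /* e.g. "8beae9f3-..." */
        const char *nid;           /* e.g. "10.1.4.112@tcp", shared by both */
        time_t      last_request;  /* tracked per UUID */
};
```
|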
| Comment by Gerrit Updater [ 12/May/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14777 |
| Comment by Niu Yawei (Inactive) [ 12/May/15 ] |
|
I suspect the client isn't sending PINGs as expected, which causes the server to evict the client in the end. The debug patch http://review.whamcloud.com/#/c/14777/ adds more verbose messages around client PING handling; hopefully we can reproduce the problem with the patch applied.
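For reference, a minimal sketch of what the client side is expected to do; this is not the real ptlrpc pinger, and send_ping_to_all_imports() is a hypothetical stand-in for walking the import list and issuing OBD_PING RPCs. If this loop stalls or skips an import, the server-side window above expires:

```c
#include <unistd.h>

/* Minimal sketch, not the real ptlrpc pinger: the client is expected to
 * issue one PING per import every PING_INTERVAL (obd_timeout / 4, at
 * least 1) seconds. */
static void send_ping_to_all_imports(void)
{
        /* hypothetical: would iterate the import list and send OBD_PING */
}

static void pinger_main_loop(int obd_timeout)
{
        int interval = obd_timeout / 4 > 1 ? obd_timeout / 4 : 1;

        for (;;) {
                send_ping_to_all_imports();
                sleep((unsigned int)interval);
        }
}
```
|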
| Comment by Bob Glossman (Inactive) [ 14/May/15 ] |
|
I think this is another instance, seen in master: |
| Comment by Niu Yawei (Inactive) [ 18/May/15 ] |
|
In some failures I saw that the ping interval is 1 second (the default obd_timeout for autotest is 20 seconds, so the default ping interval should be 5 seconds). I think sanity test_224c could be the culprit: it changed obd_timeout to 5 seconds during testing, and although it did change it back to the default afterwards, it didn't wait for the change to take effect. I fixed this in: http://review.whamcloud.com/14844

In most ZFS test failures the ping interval is the default 5 seconds; I'm wondering if we should use a longer obd_timeout/ping interval for ZFS testing?
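The 1-second interval follows directly from the interval arithmetic, assuming PING_INTERVAL = max(obd_timeout / 4, 1) as in the Lustre source: an obd_timeout of 5 gives max(5/4, 1) = 1 second. A small worked example:

```c
#include <stdio.h>

/* Worked example of the interval arithmetic above, assuming
 * PING_INTERVAL = max(obd_timeout / 4, 1) and an eviction window of six
 * ping intervals, as in the Lustre source. */
int main(void)
{
        int timeouts[] = { 5, 20, 100 };  /* test_224c, autotest, stock default */

        for (int i = 0; i < 3; i++) {
                int ping = timeouts[i] / 4 > 1 ? timeouts[i] / 4 : 1;

                printf("obd_timeout=%3d -> ping every %ds, evict after %ds\n",
                       timeouts[i], ping, 6 * ping);
        }
        return 0;
}
```
|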
| Comment by Gerrit Updater [ 18/May/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14844 |
| Comment by Gerrit Updater [ 21/May/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14844/ |
| Comment by Peter Jones [ 22/May/15 ] |
|
Landed for 2.8 |
| Comment by Andreas Dilger [ 26/May/15 ] |
|
This was introduced by patch http://review.whamcloud.com/14399 " |