[LU-71] metabench failures Created: 09/Feb/11  Updated: 19/Mar/11  Resolved: 19/Mar/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 20581
Rank (Obsolete): 4247

 Description   

Originally the Bugzilla bug covered two separate issues, as I understand it, but it now mostly revolves around a hash-collision issue.
FanYong has a patch in the bug that needs to be tested at Hyperion.
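
For context, the general shape of the problem in a hedged sketch (a toy 31-based name hash, for illustration only, not the hash Lustre actually uses for directory entries): two different file names can map to the same hash value, so readdir paging by hash has to tolerate more than one entry per hash.

#include <stdio.h>
#include <stdint.h>

/* Toy name hash, deliberately simple so a collision is easy to show.
 * It stands in only conceptually for the real directory-entry hash. */
static uint64_t toy_name_hash(const char *name)
{
        uint64_t h = 0;

        while (*name)
                h = h * 31 + (unsigned char)*name++;
        return h;
}

int main(void)
{
        /* "Aa" and "BB" collide under any 31-based hash; with a real
         * 64-bit hash collisions are rare but become likely once a
         * directory holds enough entries, which is why the readdir
         * code must cope with several entries sharing one hash value. */
        printf("Aa -> %llu\n", (unsigned long long)toy_name_hash("Aa"));
        printf("BB -> %llu\n", (unsigned long long)toy_name_hash("BB"));
        return 0;
}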



 Comments   
Comment by nasf (Inactive) [ 01/Mar/11 ]

The new patch needs to be verified on Hyperion before being landed on lustre-2.1:

http://review.whamcloud.com/#change,281

Comment by Cliff White (Inactive) [ 07/Mar/11 ]

Patch is under test on Hyperion - hyperion-sanity results:
Full System MIB

000: Table of 824 tasks with up to 4097 system calls
000: date                      tasks  xfer limit  call limit  time  write MB/s  read MB/s
000: ------------------------  -----  ----------  ----------  ----  ----------  ---------
000: Fri Mar 4 14:23:24 2011     824       1024k        4096   300     4904.61    4463.70
000: Fri Mar 4 23:04:08 2011     824       1024k        4096   300     4794.67    4370.48
000: Sat Mar 5 07:45:25 2011     824       1024k        4096   300     4774.72    4364.73
000: Sat Mar 5 16:26:57 2011     824       1024k        4096   300     4677.37    4291.81
AVERAGE                                                                4787.84    4372.68

000: Table of 824 tasks with up to 65 system calls
000: date                      tasks  xfer limit  call limit  time  write MB/s  read MB/s
000: ------------------------  -----  ----------  ----------  ----  ----------  ---------
000: Fri Mar 4 14:24:01 2011     824       1024k          64   300     3221.95   13247.59
000: Fri Mar 4 23:04:49 2011     824       1024k          64   300     2837.28   10886.22
000: Sat Mar 5 07:46:01 2011     824       1024k          64   300     3392.38   13772.67
000: Sat Mar 5 16:27:33 2011     824       1024k          64   300     3175.49   13043.68
AVERAGE                                                                3156.78   12737.54
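
(As a sanity check, the AVERAGE rows are plain arithmetic means of the four runs above; a quick throwaway check along these lines reproduces the figures to within rounding of the last digit.)

#include <stdio.h>

/* Recompute the AVERAGE rows of the two MIB tables above as plain
 * arithmetic means of the four per-run write/read MB/s values. */
int main(void)
{
        double w4096[] = { 4904.61, 4794.67, 4774.72, 4677.37 };
        double r4096[] = { 4463.70, 4370.48, 4364.73, 4291.81 };
        double w64[]   = { 3221.95, 2837.28, 3392.38, 3175.49 };
        double r64[]   = { 13247.59, 10886.22, 13772.67, 13043.68 };
        double sw = 0, sr = 0, sw2 = 0, sr2 = 0;
        int i;

        for (i = 0; i < 4; i++) {
                sw  += w4096[i];
                sr  += r4096[i];
                sw2 += w64[i];
                sr2 += r64[i];
        }
        printf("4096-call case: write %.2f read %.2f\n", sw / 4, sr / 4);
        printf("  64-call case: write %.2f read %.2f\n", sw2 / 4, sr2 / 4);
        return 0;
}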

IOR (all values in MiB/sec)

file-per-process
Write AVERAGE 3198.88
Read AVERAGE 2643.08

file-per-process independent
Write AVERAGE 3242.98
Read AVERAGE 2647.37

single-shared-file
Write AVERAGE 1979.96
Read AVERAGE 4094.98

single-shared-file independent
Write AVERAGE 1982.10
Read AVERAGE 4213.00

These results are comparable to previous 2.1 runs.
One metabench test failed due to a single client eviction:

Mar 6 00:00:13 ehyperion571 mrshd[19987]: root@ehyperion0 as root: cmd='rdistd -S'
Mar 6 00:03:09 ehyperion571 logger: fixperms: done running at 03/06/11 00:03:09 ...
Mar 6 00:16:55 ehyperion571 LustreError: 20162:0:(dir.c:316:ll_get_dir_page()) dir page locate: [0x2000007d6:0xaf0:0x0] at 13476436484223492: rc -5
Mar 6 00:18:36 ehyperion571 LustreError: 11-0: an error occurred while communicating with 192.168.117.1@o2ib. The ldlm_enqueue operation failed with -107
Mar 6 00:18:36 ehyperion571 Lustre: lustre-MDT0000-mdc-ffff810210198400: Connection to service lustre-MDT0000 via nid 192.168.117.1@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 6 00:18:36 ehyperion571 LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
Mar 6 00:18:36 ehyperion571 LustreError: 20131:0:(mdc_locks.c:711:mdc_enqueue()) ldlm_cli_enqueue: -4
Mar 6 00:18:36 ehyperion571 LustreError: 20164:0:(ldlm_resource.c:746:ldlm_resource_complain()) Namespace lustre-MDT0000-mdc-ffff810210198400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Mar 6 00:18:36 ehyperion571 LustreError: 20164:0:(ldlm_resource.c:752:ldlm_resource_complain()) Resource: ffff8101f6c9f0c0 (8589936606/39266/0/0) (rc: 1)
Mar 6 00:18:36 ehyperion571 Lustre: lustre-MDT0000-mdc-ffff810210198400: Connection restored to service lustre-MDT0000 using nid 192.168.117.1@o2ib.

Comment by Cliff White (Inactive) [ 08/Mar/11 ]

Okay, the second pass of the test has also failed metabench with the same failure; this may be a cause for concern.

metabench -w /p/l_wham/white215/hyperion.14374/metabench -k -c 16384 -C -z

2097152 785.6744 2669.24 152 0.0464 3272.76

[03/07/2011 22:28:48] Leaving time_file_creation with proc_id = 823
[03/07/2011 22:28:48] Entering par_create_multidir to create 2549 files in 1 dirs
[03/07/2011 22:30:29] FATAL error on process 823
Proc 823: Unable to stat file [/p/l_wham/white215/hyperion.14374/metabench/TIME_CREATE_823.003/cspFCaRlAs4Dx] [144131814699691100/8881629303163026]: Interrupted system call
srun: error: hyperion571: task 823: Exited with exit code 255
srun: First task exited 1800s ago
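
("Interrupted system call" is simply strerror(EINTR): once the client is evicted, its in-flight metadata operations are failed back to userspace. Below is a hedged sketch of the usual stat()/errno idiom that yields this kind of message; it is not metabench's actual source.)

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

/* Illustrative only: how a failed stat() turns into an
 * "Unable to stat file [...]: Interrupted system call" style message.
 * Metabench's real code is not reproduced here, just the pattern. */
static int check_file(const char *path)
{
        struct stat st;

        if (stat(path, &st) < 0) {
                /* On an evicted client the in-progress operation fails;
                 * when errno is EINTR, strerror() yields
                 * "Interrupted system call". */
                fprintf(stderr, "Unable to stat file [%s]: %s\n",
                        path, strerror(errno));
                return -1;
        }
        return 0;
}

int main(void)
{
        /* Hypothetical path, for demonstration. */
        return check_file("/tmp/does-not-exist") ? 1 : 0;
}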

Client errors:
Mar 7 22:28:49 ehyperion571 LustreError: 1303:0:(dir.c:316:ll_get_dir_page()) dir page locate: [0x2000efff0:0x1918:0x0] at 8879820204541995: rc -5
Mar 7 22:30:29 ehyperion571 LustreError: 11-0: an error occurred while communicating with 192.168.117.1@o2ib. The ldlm_enqueue operation failed with -107
Mar 7 22:30:29 ehyperion571 Lustre: lustre-MDT0000-mdc-ffff810210198400: Connection to service lustre-MDT0000 via nid 192.168.117.1@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 7 22:30:29 ehyperion571 LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
Mar 7 22:30:29 ehyperion571 LustreError: 1270:0:(mdc_locks.c:711:mdc_enqueue()) ldlm_cli_enqueue: -4
Mar 7 22:30:29 ehyperion571 LustreError: 1306:0:(ldlm_resource.c:746:ldlm_resource_complain()) Namespace lustre-MDT0000-mdc-ffff810210198400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Mar 7 22:30:29 ehyperion571 LustreError: 1306:0:(ldlm_resource.c:746:ldlm_resource_complain()) Skipped 1 previous similar message
Mar 7 22:30:29 ehyperion571 LustreError: 1306:0:(ldlm_resource.c:752:ldlm_resource_complain()) Resource: ffff81014b484a80 (8590920616/109475/0/0) (rc: 1)
Mar 7 22:30:29 ehyperion571 LustreError: 1306:0:(ldlm_resource.c:752:ldlm_resource_complain()) Skipped 1 previous similar message
Mar 7 22:30:29 ehyperion571 Lustre: lustre-MDT0000-mdc-ffff810210198400: Connection restored to service lustre-MDT0000 using nid 192.168.117.1@o2ib.
Mar 7 23:00:13 ehyperion571 mrshd[1323]: root@ehyperion0 as root: cmd='rdistd -S'
Mar 7 23:00:32 ehyperion571 logger: fixperms: done running at 03/07/11 23:00:32 ...
Mar 8 00:00:13 ehyperion571 mrshd[1487]: root@ehyperion0 as root: cmd='rdistd -S'

MDS hyperion720
Mar 7 22:30:29 ehyperion720 LustreError: 0:0:(ldlm_lockd.c:348:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.115.142@o2ib ns: mdt-ffff81012b222000 lock: ffff81017773e480/0x697433fb7bc82718 lrc: 3/0,0 mode: PR/PR res: 8590920616/109475 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0xf00e0d9dee2b9e00 expref: 10 pid: 10361 timeout: 4586658929
Mar 7 22:30:29 ehyperion720 LustreError: 10349:0:(mdt_handler.c:2806:mdt_recovery()) operation 101 on unconnected MDS from 12345-192.168.115.142@o2ib
Mar 7 22:30:29 ehyperion720 LustreError: 10349:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-107) req@ffff81005747ec00 x1362388536975602/t0(0) o-1-><?>@<?>:0/0 lens 552/0 e 0 to 0 dl 1299565959 ref 1 fl Interpret:/ffffffff/ffffffff rc -107/-1
Mar 7 22:30:29 ehyperion720 Lustre: 10349:0:(ldlm_lib.c:871:target_handle_connect()) lustre-MDT0000: connection from 8a9522b4-1c87-2ee8-a571-ded5c0d66ba3@192.168.115.142@o2ib t4922514102 exp 0000000000000000 cur 1299565829 last 0
Mar 7 22:30:29 ehyperion720 Lustre: 10349:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-MDT0000->NET_0x50000c0a8738e_UUID netid 50000: select flavor null

No errors on any OSTs.

Comment by nasf (Inactive) [ 08/Mar/11 ]

Thanks, Cliff. From the test result we can say that the patch works as we expected. The failure case you attached is from a test without my patch, right?

> Client errors:
> Mar 7 22:28:49 ehyperion571 LustreError: 1303:0:(dir.c:316:ll_get_dir_page()) dir page locate: [0x2000efff0:0x1918:0x0] at 8879820204541995: rc -5

It corresponds to the original Lustre code without my patch.

Comment by Cliff White (Inactive) [ 08/Mar/11 ]

No, the failure case I attached is with your patch.
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com

Comment by Cliff White (Inactive) [ 08/Mar/11 ]

I took the RPMs from build 347
lustre-2.0.59-2.6.18_194.17.1.el5_lustre.g94d9119_gd5e659b.x86_64.rpm
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com

Comment by nasf (Inactive) [ 08/Mar/11 ]

Very strange. According to the error message, the line corresponding to "(dir.c:316:ll_get_dir_page()) dir page locate:" is this:

page = ll_dir_page_locate(dir, &lhash, &start, &end);
if (IS_ERR(page)) {
        CERROR("dir page locate: "DFID" at "LPU64": rc %ld\n",
               PFID(ll_inode2fid(dir)), lhash, PTR_ERR(page));
        GOTO(out_unlock, page);
}

This code section is at line 316 in the original master without the patch; in the patched master, it is at line 340. I have checked the source code for the build 347 that you used.

http://build.whamcloud.com/job/reviews-centos5/347/
lustre-source-2.0.59-2.6.18_194.17.1.el5_lustre.g94d9119_gd5e659b.x86_64.rpm

So would you please log in to ehyperion571 and check the kernel version to confirm. Thanks!
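
(For what it's worth, confirming the running kernel on the client comes down to reading the utsname release string, i.e. what uname -r prints, or querying the installed lustre kernel RPM with rpm -q; a trivial C equivalent, for illustration only:)

#include <stdio.h>
#include <sys/utsname.h>

/* Print the running kernel release, equivalent to running `uname -r`. */
int main(void)
{
        struct utsname u;

        if (uname(&u) != 0) {
                perror("uname");
                return 1;
        }
        printf("%s\n", u.release);
        return 0;
}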

Comment by Cliff White (Inactive) [ 10/Mar/11 ]

There was a mistake in the kernel build.
Re-tested with the correct kernel; metabench passed 14 runs.

NERSC Time: 4735.13
NERSC Time: 4868.54
NERSC Time: 4920.77
NERSC Time: 4940.35
NERSC Time: 4934.38
NERSC Time: 4934.63
NERSC Time: 5030.45
NERSC Time: 5154.06
NERSC Time: 5052.45
NERSC Time: 5125.95
NERSC Time: 5095.29
NERSC Time: 5115.10
NERSC Time: 5138.36
NERSC Time: 5171.40

Comment by nasf (Inactive) [ 10/Mar/11 ]

Thanks, Cliff. This bug has blocked us for a long time, so this is really helpful.

Comment by nasf (Inactive) [ 19/Mar/11 ]

The patch has been merged into the lustre-2.1 candidate.
