[LU-6663] DNE2 directories have very bad performance Created: 29/May/15  Updated: 15/Jun/15  Resolved: 11/Jun/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Di Wang
Resolution: Not a Bug Votes: 0
Labels: dne2
Environment:

The problem only happens with the contents of a DNE2 directory


Issue Links:
Related
is related to LU-6724 Downgrading from 2.8 with DNE2 patche... Resolved
is related to LU-3534 async update cross-MDTs Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

With my DNE2 setup I attempted to start some jobs on our test cluster, and the jobs got stuck for hours attempting to run. So I did some testing to see what was breaking. A simple md5sum on a file shows the problem very easily. For normal directories, md5sum on a file comes back very fast:

[root@ninja06 johndoe]# date;md5sum ior;date
Fri May 29 10:08:24 EDT 2015
4ba1b26f0a4b71dccb237d3fd25f3b67 ior
Fri May 29 10:08:24 EDT 2015

But for DNE2 directories I saw this:

[root@ninja06 jsimmons]# date;md5sum simul;date
Fri May 29 10:08:38 EDT 2015
9fef8669fb0e6669ac646d69062521d3 simul
Fri May 29 10:09:59 EDT 2015

This is not an issue for stats, so it only appears when accessing data from a file. An ls -al on the file simul comes back very fast. I did this test without the work from LU-3534 and the problem still exists.



 Comments   
Comment by Di Wang [ 29/May/15 ]

Hmm, I tried this on my local test node. Unfortunately, I cannot reproduce it with 4 MDTs and a stripe_count of 4. James, could you please tell me more about your system: how many MDTs, and what is the stripe_count of the striped directory? And what are the sizes of the simul and ior files? Thanks.

[root@testnode tests]# date; md5sum /mnt/lustre/test2/dd; date
Wed May 27 19:56:37 PDT 2015
d8b61b2c0025919d5321461045c8226f  /mnt/lustre/test2/dd
Wed May 27 19:56:38 PDT 2015
[root@testnode tests]# date; md5sum /mnt/lustre/test1/dd; date
Wed May 27 19:56:44 PDT 2015
d8b61b2c0025919d5321461045c8226f  /mnt/lustre/test1/dd
Wed May 27 19:56:45 PDT 2015
[root@testnode tests]# ../utils/lfs getdirstripe /mnt/lustre/test1
/mnt/lustre/test1
lmv_stripe_count: 4 lmv_stripe_offset: 0
mdtidx		 FID[seq:oid:ver]
     0		 [0x200000400:0x2:0x0]		
     1		 [0x240000403:0x2:0x0]		
     2		 [0x280000403:0x2:0x0]		
     3		 [0x2c0000401:0x2:0x0]		
[root@testnode tests]# ../utils/lfs getdirstripe /mnt/lustre/test2
/mnt/lustre/test2
lmv_stripe_count: 0 lmv_stripe_offset: 0
[root@testnode tests]# ll /mnt/lustre/test2/
total 512004
-rw-r--r-- 1 root root 524288000 May 27 19:56 dd
[root@testnode tests]# ll /mnt/lustre/test1/
total 512004
-rw-r--r-- 1 root root 524288000 May 27 19:56 dd
[root@testnode tests]# 
Comment by Andreas Dilger [ 29/May/15 ]

James, can you run an strace to see where md5sum is spending its time? Maybe also collect +rpctrace debug logs if the strace isn't clear. Also, what back-end filesystem are you using for the MDTs and OSTs?
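
For reference, a minimal sketch of collecting both traces; the file path and output locations are placeholders:

strace -tt -T -o /tmp/md5sum.strace md5sum ./simul    # per-syscall timings

lctl set_param debug=+rpctrace                        # add RPC tracing on the client
lctl clear
md5sum ./simul
lctl dk > /tmp/md5sum-rpctrace.log                    # dump the kernel debug log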

Comment by Di Wang [ 29/May/15 ]

Hmm, I also checked the RPC counts after running md5sum.

There are 1000 1k files in this striped directory
(umount/mount before the test to clear the cache, and also echo 0 > /proc/*..../stats)
The RPC stats after "md5sum *":

[root@testnode test1]# cat /proc/fs/lustre/mdc/lustre-MDT000*-mdc-*/stats
snapshot_time             1432783774.817424 secs.usecs
req_waittime              759 samples [usec] 33 311 81604 9966800
req_active                759 samples [reqs] 1 1 759 759
mds_close                 251 samples [usec] 33 311 18595 1595875
obd_ping                  4 samples [usec] 80 170 527 74209
snapshot_time             1432783774.817449 secs.usecs
req_waittime              750 samples [usec] 34 279 81081 10035485
req_active                750 samples [reqs] 1 1 750 750
mds_close                 249 samples [usec] 34 180 18120 1492930
obd_ping                  3 samples [usec] 59 169 363 50267
snapshot_time             1432783774.817466 secs.usecs
req_waittime              751 samples [usec] 33 294 81646 10083800
req_active                751 samples [reqs] 1 1 751 751
mds_close                 249 samples [usec] 33 153 18560 1536386
obd_ping                  2 samples [usec] 109 128 237 28265
snapshot_time             1432783774.817483 secs.usecs
req_waittime              755 samples [usec] 32 369 81239 10030651
req_active                755 samples [reqs] 1 1 755 755
mds_close                 251 samples [usec] 32 199 18483 1542053
obd_ping                  2 samples [usec] 175 189 364 66346
[root@testnode test1]# cat /proc/fs/lustre/osc/lustre-OST0001-osc-ffff8801e8e49400/stats 
snapshot_time             1432784238.566254 secs.usecs
req_waittime              1012 samples [usec] 36 424 96854 10257148
req_active                1012 samples [reqs] 1 1 1012 1012
read_bytes                499 samples [bytes] 896 896 447104 400605184
ost_read                  499 samples [usec] 42 253 48718 5248004
obd_ping                  14 samples [usec] 54 160 1551 186601
[root@testnode test1]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff8801e8e49400/stats 
snapshot_time             1432784253.55197 secs.usecs
req_waittime              1022 samples [usec] 22 288 96836 10097582
req_active                1022 samples [reqs] 1 1 1022 1022
read_bytes                501 samples [bytes] 896 896 448896 402210816
ost_read                  501 samples [usec] 41 288 47596 4964502
obd_ping                  20 samples [usec] 22 177 2295 305467

Here is the result for the non-striped directory:

[root@testnode test2]# cat /proc/fs/lustre/mdc/lustre-MDT000*-mdc-*/stats
snapshot_time             1432783875.323394 secs.usecs
req_waittime              3006 samples [usec] 33 483 341808 43798954
req_active                3006 samples [reqs] 1 1 3006 3006
mds_close                 1000 samples [usec] 33 270 87747 8322307
mds_readpage              1 samples [usec] 483 483 483 233289
obd_ping                  4 samples [usec] 64 118 372 36198
snapshot_time             1432783875.323418 secs.usecs
req_waittime              5 samples [usec] 29 113 386 34914
req_active                5 samples [reqs] 1 1 5 5
obd_ping                  5 samples [usec] 29 113 386 34914
snapshot_time             1432783875.323433 secs.usecs
req_waittime              4 samples [usec] 84 115 379 36499
req_active                4 samples [reqs] 1 1 4 4
obd_ping                  4 samples [usec] 84 115 379 36499
snapshot_time             1432783875.323449 secs.usecs
req_waittime              4 samples [usec] 22 198 503 83013
req_active                4 samples [reqs] 1 1 4 4
obd_ping                  4 samples [usec] 22 198 503 83013
[root@testnode test2]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88019a235000/stats
snapshot_time             1432784328.527276 secs.usecs
req_waittime              1011 samples [usec] 30 349 91417 9311277
req_active                1011 samples [reqs] 1 1 1011 1011
read_bytes                500 samples [bytes] 896 896 448000 401408000
ost_read                  500 samples [usec] 44 349 44299 4465957
ost_connect               1 samples [usec] 109 109 109 11881
obd_ping                  10 samples [usec] 30 209 1155 159765
[root@testnode test2]# cat /proc/fs/lustre/osc/lustre-OST0001-osc-ffff88019a235000/stats
snapshot_time             1432784335.389316 secs.usecs
req_waittime              1012 samples [usec] 29 284 91225 9297035
req_active                1012 samples [reqs] 1 1 1012 1012
read_bytes                500 samples [bytes] 896 896 448000 401408000
ost_read                  500 samples [usec] 42 284 44763 4518767
ost_connect               1 samples [usec] 110 110 110 12100
obd_ping                  11 samples [usec] 29 248 1178 163494

Nothing looks unusual. James, do the ior and simul files have the same data stripe layout in the striped and non-striped directories? Also, if you can repeat the problem on the test node, could you please collect the debug log (-1 level) on the client side for me? Thanks!
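
A minimal sketch of one way to capture that log, assuming enough debug buffer space; the paths and buffer size are illustrative:

lctl set_param debug=-1           # enable all debug flags
lctl set_param debug_mb=512       # enlarge the debug buffer so the run is not overwritten
lctl clear                        # start from an empty buffer
md5sum /path/to/striped_dir/file
lctl dk > /tmp/client-debug.log   # dump the kernel debug log to a file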

Comment by James A Simmons [ 29/May/15 ]

Andreas, I get the following strace of md5sum ./simul:

open("simul", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=6862991, ...}) = 0
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fccf6689000
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0004\16@\0\0\0\0\0"..., 4194304) = 4194304
read(3, "\0\0\2# \22%\17\0\0\10\353\21\t\0\0\2#(\22d\37\0\0\10\354;\t\0\0\2#"..., 4194304) = 2668687
read(3, "", 4194304) = 0
close(3) = 0
munmap(0x7fccf6689000, 4194304) = 0

It's the first read that is taking so long.

BTW, the second time you run this command no lag happens, so client-side caching hides the problem.
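
For what it's worth, a sketch of forcing an uncached re-run on the client without a full umount/mount; the commands are illustrative:

echo 3 > /proc/sys/vm/drop_caches                  # drop page/dentry/inode caches
lctl set_param ldlm.namespaces.*.lru_size=clear    # drop cached Lustre DLM locks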

Comment by James A Simmons [ 29/May/15 ]

Everything in the test system is running the same Lustre version, which is master plus the DNE patches and a few other patches. Without those extra patches the problem still exists. The back-end servers are running ldiskfs. There are 16 MDS servers, each with one MDT. I have 4 OSS servers with a total of 56 OSTs. I created myself a home directory using lfs setdirstripe -c 16, which contains the simul binary that I run md5sum on. I have a few other binaries in there as well for testing purposes. Here are the RPC stats:

[root@ninja06 ~]# cat /proc/fs/lustre/mdc/sultan-MDT0000-mdc-ffff8808041f1800/stats;md5sum /lustre/sultan/stf008/scratch/jsimmons/simul;cat /proc/fs/lustre/mdc/sultan-MDT0000-mdc-ffff8808041f1800/stats
snapshot_time 1432921773.276703 secs.usecs
req_waittime 595 samples [usec] 55 16742 115819 318800537
req_active 595 samples [reqs] 1 3 821 1279
mds_getattr 1 samples [usec] 241 241 241 58081
mds_close 4 samples [usec] 143 4111 4704 17023942
mds_readpage 4 samples [usec] 833 926 3558 3170490
mds_connect 1 samples [usec] 16742 16742 16742 280294564
mds_getstatus 1 samples [usec] 280 280 280 78400
mds_statfs 1 samples [usec] 347 347 347 120409
ldlm_cancel 275 samples [usec] 63 386 35964 5427452
obd_ping 2 samples [usec] 255 363 618 196794
fld_query 16 samples [usec] 55 211 2082 304252
9fef8669fb0e6669ac646d69062521d3 /lustre/sultan/stf008/scratch/jsimmons/simul
snapshot_time 1432921837.998247 secs.usecs
req_waittime 597 samples [usec] 55 16742 116370 318952518
req_active 597 samples [reqs] 1 3 823 1281
mds_getattr 1 samples [usec] 241 241 241 58081
mds_close 4 samples [usec] 143 4111 4704 17023942
mds_readpage 4 samples [usec] 833 926 3558 3170490
mds_connect 1 samples [usec] 16742 16742 16742 280294564
mds_getstatus 1 samples [usec] 280 280 280 78400
mds_statfs 1 samples [usec] 347 347 347 120409
ldlm_cancel 276 samples [usec] 63 386 36230 5498208
obd_ping 2 samples [usec] 255 363 618 196794
fld_query 16 samples [usec] 55 211 2082 304252

Comment by James A Simmons [ 29/May/15 ]

I uploaded full debug logs from the client to ftp.whamcloud.com//uploads/LU-6663/LU-6663-log.

Comment by Di Wang [ 29/May/15 ]

Thanks James. I just glanced at the debug log a bit; it seems to me all of the time cost is in the CLIO stack. I do not understand why this only happens under a striped directory. James, do you have the debug logs for a non-striped directory? I wonder if this is related to the recent CLIO simplification changes. I will check the log carefully.

Comment by James A Simmons [ 29/May/15 ]

Yes, I just uploaded LU-6663-log-nonstripe to the same directory on ftp.whamcloud.com

Comment by Di Wang [ 29/May/15 ]

Just checked the debug log; here is what happens.

It is the file read that costs most of the time:

00000080:00200000:0.0:1432922231.938967:0:14593:0:(file.c:1088:ll_file_io_generic()) file: simul, type: 0 ppos: 0, count: 4194304
...
00000080:00200000:0.0:1432922285.038890:0:14593:0:(file.c:1196:ll_file_io_generic()) iot: 0, result: 4194304

So it takes about 54 seconds to complete this 4MB read.

And inside this read, it sent out 7 read RPCs; let's look at the first 4 of them (the other 3 are read-ahead RPCs). The RPC size seems fine.

00000008:00000002:5.0:1432922231.992289:0:21013:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8810520764e0. now 1r/0w in flight  req@ffff881052076380 x1502512486180268/t0(0) o3->sultan-OST0035-osc-ffff8807f7983c00@10.37.248.70@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1

00000008:00000002:2.0:1432922231.998420:0:21015:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8807f24e9de0. now 1r/0w in flight  req@ffff8807f24e9c80 x1502512486180276/t0(0) o3->sultan-OST0036-osc-ffff8807f7983c00@10.37.248.71@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1

00000008:00000002:2.0:1432922232.004918:0:21015:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8807f24e9ae0. now 1r/0w in flight  req@ffff8807f24e9980 x1502512486180284/t0(0) o3->sultan-OST0037-osc-ffff8807f7983c00@10.37.248.72@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1

00000008:00000002:6.0:1432922232.011237:0:21017:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8810553fade0. now 1r/0w in flight  req@ffff8810553fac80 x1502512486180292/t0(0) o3->sultan-OST0000-osc-ffff8807f7983c00@10.37.248.69@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1

It seems only the 1st one gets a reply in time:

00000100:00000040:4.0:1432922231.994268:0:21016:0:(lustre_net.h:2439:ptlrpc_rqphase_move()) @@@ move req "Bulk" -> "Interpret"  req@ffff881052076380 x1502512486180268/t0(0) o3->sultan-OST0035-osc-ffff8807f7983c00@10.37.248.70@o2ib1:6/4 lens 488/400 e 0 to 0 dl 1432922245 ref 2 fl Bulk:RM/0/0 rc 1048576/1048576

The other 3 RPCs all get an early reply and then only get the pages after around 27 seconds. Here is one example:

00000100:00001000:5.0:1432922239.998446:0:21013:0:(client.c:468:ptlrpc_at_recv_early_reply()) @@@ Early reply #1, new deadline in 56s (50s)  req@ffff8807f24e9c80 x1502512486180276/t0(0) o3->sultan-OST0036-osc-ffff8807f7983c00@10.37.248.71@o2ib1:6/4 lens 488/432 e 1 to 0 dl 1432922295 ref 2 fl Rpc:/0/ffffffff rc 0/-1

00000100:00000040:4.0:1432922258.003100:0:21013:0:(lustre_net.h:2439:ptlrpc_rqphase_move()) @@@ move req "Bulk" -> "Interpret"  req@ffff8807f24e9c80 x1502512486180276/t0(0) o3->sultan-OST0036-osc-ffff8807f7983c00@10.37.248.71@o2ib1:6/4 lens 488/400 e 1 to 0 dl 1432922295 ref 2 fl Bulk:RM/0/0 rc 1048576/1048576

I guess something is happening on either the OST side or the network.

James, is there anything special about these nodes: 10.37.248.71, 10.37.248.72, 10.37.248.69? Thanks

Comment by Di Wang [ 29/May/15 ]

Interesting. I checked the debug log for the non-striped directory, and the RPC sizes and stats are quite similar to the striped directory case, but every RPC gets a reply immediately:

00000008:00000002:1.0:1432929764.067930:0:21017:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff880871aef1e0. now 1r/0w in flight  req@ffff880871aef080 x1502512486987692/t0(0) o3->sultan-OST001a-osc-ffff880804b97800@10.37.248.71@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1

....

00000100:00000040:0.0:1432929764.082631:0:21011:0:(lustre_net.h:2439:ptlrpc_rqphase_move()) @@@ move req "Bulk" -> "Interpret"  req@ffff880871aef080 x1502512486987708/t0(0) o3->sultan-OST001c-osc-ffff880804b97800@10.37.248.69@o2ib1:6/4 lens 488/400 e 0 to 0 dl 1432929778 ref 2 fl Bulk:RM/0/0 rc 1048576/1048576

James, is this issue easily reproduced on your test node? Could you please send me both the client and OSS debug logs? Thanks

Comment by Andreas Dilger [ 29/May/15 ]

Di, James, one thing to consider is whether the FID for the OST object is causing some problem. That would be something unique to DNE: the OST objects for files created on MDT000x, x != 0, use FID-on-OST, while objects created on MDT0000 use old-style IDIF FIDs.

That said, it is strange to see this only on James' system, but possibly it relates to the number of MDTs * OSTs causing the FLDB to be very large or something?

Comment by Di Wang [ 29/May/15 ]

Hmm, that is true. For the striped directory case, the FIDs on the OST objects are real FIDs:

res: [0x138000040b:0xe2:0x0].0
res: [0x1340000404:0xe2:0x0].0
.......

For the non-striped directory case, the FIDs are IDIFs:

[0x2:0x0:0x0].0
.....

But there are FLD lookups in the debug log. It might be interesting to see what happens on the OST for real FIDs.

James: an easy way to verify this would be to create a remote directory on MDTn (n != 0), then create a file under that remote directory, which will allocate regular FIDs for the file, and see if the problem can be reproduced (a sketch of the steps is below). It would also be helpful if you could get an OST debug log for me. Thanks!
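
A hedged sketch of that check; the MDT index and paths are only examples, chosen to mirror the test_mdt4 directory that appears later in this ticket:

lfs setdirstripe -i 4 /lustre/sultan/stf008/scratch/jsimmons/test_mdt4    # remote dir on MDT0004
cp ./simul /lustre/sultan/stf008/scratch/jsimmons/test_mdt4/
md5sum /lustre/sultan/stf008/scratch/jsimmons/test_mdt4/simul             # does the slow first read reproduce?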

Comment by Di Wang [ 29/May/15 ]

Andreas: I just checked the debug log; it seems only the first read RPC sent to each OST is slow. So maybe you are right: it is because locating the real FID object on the OST is slow. Hmm.

Comment by Andreas Dilger [ 29/May/15 ]

Di, maybe something about loading the O/{seq} object directories is doing too much work?

Comment by Di Wang [ 29/May/15 ]

Andreas: hmm, loading a new sequence does need extra work, which includes about 33 directory lookups and 32 xattr_set calls. But I assume the OST is not busy during the test (James, please confirm?), so that should not cost about 28 seconds. Let's see what the debug log says.

Comment by James A Simmons [ 30/May/15 ]

The only activity on the OSTs is the retrieval of the simul executable's raw data to calculate the md5sum. So it is a quiet system at the time.

Comment by Di Wang [ 01/Jun/15 ]

James: Any news on the OST side debug log? Thanks!

Comment by James A Simmons [ 01/Jun/15 ]

I just finished the DNE1 directory test. I created an empty file in the DNE1 directory using MDS 4 and then remounted Lustre on the client. Then I attempted an md5sum on the new empty file, which after 3 minutes had not finished, so I stopped the md5sum command and did a log dump. I pushed the log to ftp.whamcloud.com/uploads/LU-6663/LU-6663-dne1.log. Now I'm moving on to getting the OSS logs.

Also, as a side note, I found that unmounting the Lustre client did not finish either. I had to reboot the node.

Comment by James A Simmons [ 01/Jun/15 ]

I'm seeing this error on the OSS.

[root@sultan-oss1 ~]# dmesg
[246246.570179] LustreError: 23969:0:(events.c:447:server_bulk_callback()) event type 5, status -38, desc ffff880f43554000
[246289.576691] LNet: 23977:0:(o2iblnd_cb.c:410:kiblnd_handle_rx()) PUT_NACK from 10.37.202.61@o2ib1
[246289.585754] LNet: 23977:0:(o2iblnd_cb.c:410:kiblnd_handle_rx()) Skipped 10 previous similar messages
[246289.605663] LustreError: 23977:0:(events.c:447:server_bulk_callback()) event type 5, status -38, desc ffff880f4a522000

I uploaded the OSS logs to ftp.whamcloud.com/uploads/LU-6663/LU-6663-OSS.log

Comment by Di Wang [ 01/Jun/15 ]

Hmm, I found this in the debug log

00010000:00000001:10.0:1433178825.894370:0:28309:0:(ldlm_lib.c:2683:target_bulk_timeout()) Process entered
00010000:00000001:10.0:1433178825.894375:0:28309:0:(ldlm_lib.c:2687:target_bulk_timeout()) Process leaving (rc=1 : 1 : 1)
00010000:00020000:10.0:1433178825.894381:0:28309:0:(ldlm_lib.c:2771:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff880ff94283c0 x1502435549823124/t0(0) o3->b9cf5051-0ff9-6cf9-cd67-9364a2516176@30@gni1:86/0 lens 488/432 e 0 to 0 dl 1433178836 ref 1 fl Interpret:/2/0 rc 0/0

Is there anything special about the connection between sultan-OST0034 and 30@gni1?

Comment by Di Wang [ 02/Jun/15 ]

Unfortunately, apart from the bulk IO timeout failures, I cannot see any other problem in the debug log. So it is quite possible this is caused by some problem in LNET. Hmm, the DNE2 patches do not touch LNET code, except for some changes in this patch:

http://review.whamcloud.com/#/c/12525

So it may be related to that change. Unfortunately, there is no corresponding client debug log for when the bulk IO timeout happens on the OST side; LU-6663-dne1.log is from too early.

James: if this is repeatable in your environment, could you please get the client-side log from when the timeout happens on the server side?

Comment by James A Simmons [ 02/Jun/15 ]

The LU-6663-dne1.log is the client side log for when the OSS times out.

Comment by Di Wang [ 02/Jun/15 ]

LU-6663-dne1.log seems to be from too early. According to the debug log, when the timeout happens on the OSS:

00000020:02000400:10.0:1433178825.928974:0:28309:0:(tgt_handler.c:1834:tgt_brw_read()) sultan-OST0034: Bulk IO read error with b9cf5051-0ff9-6cf9-cd67-9364a2516176 (at 30@gni1), client will retry: rc -110
00000020:00000001:10.0:1433178825.947041:0:28309:0:(tgt_handler.c:1851:tgt_brw_read()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00010000:00000080:10.0:1433178825.947043:0:28309:0:(ldlm_lib.c:2427:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff880ff94283c0 x1502435549823124/t0(0) o3->b9cf5051-0ff9-6cf9-cd67-9364a2516176@30@gni1:86/0 lens 488/432 e 0 to 0 dl 1433178836 ref 1 fl Interpret:/2/ffffffff rc -110/-1

And I cannot find this 1502435549823 req on the client side. Also, if the time on each node was synchronized before the test, the timestamps in LU-6663-dne1.log seem much earlier than those in LU-6663-OSS.log?

Comment by James A Simmons [ 02/Jun/15 ]

Yes, I see what you mean. I did the same test twice: I collected logs from the client first, then from the OSS. I guess you want me to collect data from both at the same time.

Comment by Di Wang [ 02/Jun/15 ]

Yes, please. I want to see if this is related to the change in http://review.whamcloud.com/#/c/12525

Comment by Doug Oucharek (Inactive) [ 02/Jun/15 ]

James: Is this system running GNI? I'm wondering if there is something in patch http://review.whamcloud.com/#/c/12525 which is not compatible with GNI operation. Is GNI using RDMA reads now for bulk transfers (IB only uses RDMA writes)?

Comment by James A Simmons [ 03/Jun/15 ]

The test client and server are both InfiniBand. Originally I tested on both GNI and o2ib, but when I saw problems I moved to testing only InfiniBand to make sure it wasn't a GNI issue. All the logs I have posted here are from InfiniBand-only systems.

Comment by Di Wang [ 03/Jun/15 ]

James: but the request which caused the OST timeout seems to come from a GNI interface:

00000020:02000400:10.0:1433178825.928974:0:28309:0:(tgt_handler.c:1834:tgt_brw_read()) sultan-OST0034: Bulk IO read error with b9cf5051-0ff9-6cf9-cd67-9364a2516176 (at 30@gni1), client will retry: rc -110
00000020:00000001:10.0:1433178825.947041:0:28309:0:(tgt_handler.c:1851:tgt_brw_read()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00010000:00000080:10.0:1433178825.947043:0:28309:0:(ldlm_lib.c:2427:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff880ff94283c0 x1502435549823124/t0(0) o3->b9cf5051-0ff9-6cf9-cd67-9364a2516176@30@gni1:86/0 lens 488/432 e 0 to 0 dl 1433178836 ref 1 fl Interpret:/2/ffffffff rc -110/-1

The nid is 30@gni1.

Comment by James A Simmons [ 03/Jun/15 ]

Ah yes, one of the service nodes failed to unmount. This time we killed off the node so a clean run was done, and the same problem still exists. I uploaded both the client and OSS logs to ftp.whamcloud.com/uploads/LU-6663/[client|oss]-dump-june-3.log. No GNI noise in the logs this time.

Comment by James A Simmons [ 03/Jun/15 ]

Your ftp site is down. The OSS log transfer is not complete.

Comment by Di Wang [ 03/Jun/15 ]

I think it works again. Hmm, I see two files there:

-rw-r--r-- 1 nobody ftp  34M Jun  3 09:58 client-dump-june-3.log
-rw-r--r-- 1 nobody ftp 485M Jun  3 10:16 oss-dump-june-3.log

Let me download them to see what is there. You can keep uploading the OSS log. Thanks.

Comment by James A Simmons [ 03/Jun/15 ]

That looks about right for the logs. A few MB might be missing from the OSS log but that should be okay.

Comment by Di Wang [ 03/Jun/15 ]

James: what are the IP addresses of the OSS and the client (where you collected the debug log)? And what is the stripe layout of the file being read? Could you please post the stripe information here? Thanks.

Comment by James A Simmons [ 03/Jun/15 ]

Client : 10.37.248.124@o2ib1
OSS : 10.37.248.69@o2ib1

As you can tell, the file is using MDS 4:

[root@ninja06 ~]# lfs getstripe /lustre/sultan/stf008/scratch/jsimmons/test_mdt4/#test-dir.0/file.1
/lustre/sultan/stf008/scratch/jsimmons/test_mdt4/#test-dir.0/file.1
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 53
obdidx objid objid group
53 354 0x162 0x134000040e
54 354 0x162 0x138000040a
55 354 0x162 0x13c000040a
0 354 0x162 0x600000407

Comment by Di Wang [ 04/Jun/15 ]

According to the debug log, the file_read thread seems to be waiting to get some pages.

The debug log starts at 1433350220.121884, and the file_read thread does not show up until 1433350251.100439:

00000020:00000001:2.0:1433350251.100439:0:22229:0:(cl_io.c:866:cl_page_list_add()) Process entered
...
00000080:00000001:2.0:1433350251.128153:0:22229:0:(file.c:1353:ll_file_read()) Process leaving (rc=4194304 : 4194304 : 400000)

And before the read thread shows up (from 1433350220.121884 to 1433350251.100439), it seems the ptlrpc threads are reading pages from different stripes: it issued 4 x 1M RPCs to OST0000, 8 x 1M RPCs to OST0036, 8 x 1M RPCs to OST0037 and 7 x 1M RPCs to OST0035. So probably read-ahead went badly and the read thread did not get the page it needed, so it waited for a long time. Unfortunately, there is not enough log to find out the real reason.

I also checked the debug log on the OSS side; it seems the longest RPC handling time is 43902us, which is still far less than 1 second:

00000100:00100000:9.0:1433350292.704085:0:24865:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io02_011:18eebf31-d253-c2e2-a426-9fd09c02ffc4+9:9454:x1502877014836248:12345-10.37.248.130@o2ib1:3 Request procesed in 43821us (43902us total) trans 0 rc 1048576/1048576

So the OSS seems fine this time.

James: could you please run the test without the trace debug level on the client side, so we can get more log to see what happens there? Anyway, I still cannot figure out why this only happens under a striped directory.

lctl set_param debug=-1
lctl set_param debug=-trace

Comment by James A Simmons [ 04/Jun/15 ]

I placed a new log file, client-dump-june-3-no-trace.log, which has trace disabled, in the usual spot.

Comment by Di Wang [ 04/Jun/15 ]

James: thanks. But the read RPCs are missing from the debug log, or was it grabbed too late? I did not see any read RPCs in the debug log.
Did you set the debug level with:

lctl set_param debug=all-trace

The file file.1 seems to be on MDT0007; is this the same file as the MDT4 one (stripe_count = 4)? And is its size more than 7G?

And there seems to be a dead router in your system; I am not sure how it will impact performance:

00000800:00000100:3.0:1433379860.351507:0:2845:0:(o2iblnd_cb.c:2108:kiblnd_peer_connect_failed()) Deleting messages for 10.37.202.59@o2ib1: connection failed

00000400:00000200:2.0:1433379885.351487:0:20110:0:(router.c:1018:lnet_ping_router_locked()) rtr 10.37.202.59@o2ib1 50: deadline 4742051025 ping_notsent 1 alive 0 alive_count 1 lp_ping_timestamp 4742001025

Comment by James A Simmons [ 04/Jun/15 ]

That is not right: file.1 is an empty file in a DNE1 directory. Yes, its OST stripe count is 4. I have had that dead router for ages :-/ Yes, the debug level is set as you asked, and I ran the debug collection right after I started the md5sum on file.1.

Comment by James A Simmons [ 04/Jun/15 ]

Okay, I have no idea where that data came from, but file.1 is supposed to be empty.

Comment by Di Wang [ 04/Jun/15 ]

Hmm, interesting. Is the data still there, or is it just that the size of file.1 is 7G? I definitely see a lot of file reads:

00000080:00200000:0.0:1433379889.075253:0:2403:0:(file.c:1088:ll_file_io_generic()) file: file.1, type: 0 ppos: 7876902912, count: 4194304

And also on the OST side, I saw there is real data for the object:

00000001:00001000:13.0:1433350227.207891:0:3039:0:(osd_io.c:830:osd_ldiskfs_map_ext_inode_pages()) inode 4800: map 256 pages from 37376
00000001:00001000:13.0:1433350227.207893:0:3039:0:(osd_io.c:774:osd_ldiskfs_map_nblocks()) blocks 37376-37631 requested for inode 4800

Inode 4800 should be the object (0 354 0x162 0x600000407) on OST0000. Hmm.

Comment by Di Wang [ 04/Jun/15 ]

James: could you please mount OST0000 as ldiskfs to see what is under /mnt/ost0/O/600000407, check whether there is a file whose inode number is 4800, and stat that file?

find /mnt/ost0/O/600000407 | xargs ls -i  | grep 4800

stat that file.

I want to know if the OI mapping is correct for FIDs on the OST. Thanks.
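
A sketch of those steps, assuming /mnt/ost0 as a scratch mount point and using the OST0000 device name reported elsewhere in this ticket; a read-only mount is the safe option:

mount -t ldiskfs -o ro /dev/mapper/sultan-ddn-l0 /mnt/ost0
find /mnt/ost0/O/600000407 | xargs ls -i | grep 4800    # locate the entry with inode 4800
stat <matching-path>                                    # then stat that entry
umount /mnt/ost0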

Comment by James A Simmons [ 04/Jun/15 ]

[root@sultan-oss1 d2]# pwd
/mnt/O/600000407/d2
[root@sultan-oss1 d2]# stat 162
File: `162'
Size: 4125097984 Blocks: 8056840 IO Block: 4096 regular file
Device: fd07h/64775d Inode: 4800 Links: 1
Access: (0666/rw-rw-rw) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 1969-12-31 19:00:00.000000000 -0500
Modify: 2015-05-29 21:23:35.000000000 -0400
Change: 2015-05-29 21:23:35.000000000 -0400

Comment by Di Wang [ 04/Jun/15 ]

James: Thanks. I will check whether anything in the DNE2 patches could fill an empty object on the OST; very strange. In the meantime, could you please check if this is repeatable in your environment? I.e. create an empty file in a remote directory, then do md5sum to see if the empty object gets filled. Is there anything else you did to file.1?

Comment by Di Wang [ 04/Jun/15 ]

James: is this a newly formatted system, or was it upgraded, perhaps from 2.4? BTW: is the data under file.1 garbage, or does it look like data from another file?

Comment by James A Simmons [ 04/Jun/15 ]

It is a 2.5-formatted file system. file.1 looks like garbage. Now, I could have done something to file.1 myself :-/ I'm going to try our Cray system tomorrow with a vanilla 2.7.54 to see if the problem still exists.

Comment by James A Simmons [ 05/Jun/15 ]

Can you list the remaining patches for DNE2 and the order they need to be applied in? Thanks.

Comment by James A Simmons [ 05/Jun/15 ]

This morning I updated to the latest vanilla master and ended up in a state where I could not mount the file system. So I tried migrating back to Lustre 2.5, and when I attempted to mount the file system I got these errors:

[ 1025.018232] Lustre: Lustre: Build Version: 2.5.4--CHANGED-2.6.32-431.29.2.el6.atlas.x86_64
[ 1062.767104] LDISKFS-fs (dm-7): recovery complete
[ 1062.791599] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts:
[ 1063.389770] LustreError: 24686:0:(ofd_fs.c:594:ofd_server_data_init()) sultan-OST0000: unsupported read-only filesystem feature(s) 2
[ 1063.412582] LustreError: 24686:0:(obd_config.c:572:class_setup()) setup sultan-OST0000 failed (-22)
[ 1063.421818] LustreError: 24686:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.37.248.67@o2ib1: cfg command failed: rc = -22
[ 1063.433944] Lustre: cmd=cf003 0:sultan-OST0000 1:dev 2:0 3:f
[ 1063.440506] LustreError: 15b-f: MGC10.37.248.67@o2ib1: The configuration from log 'sultan-OST0000'failed from the MGS (-22). Make sure this client and the MGS are running compatible versions of Lustre.
[ 1063.458690] LustreError: 15c-8: MGC10.37.248.67@o2ib1: The configuration from log 'sultan-OST0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[ 1063.482398] LustreError: 24600:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server sultan-OST0000: -22
[ 1063.493822] LustreError: 24600:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
[ 1063.503768] LustreError: 24600:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) sultan-MDT0000-lwp-OST0000: Can't end config log sultan-client.
[ 1063.516947] LustreError: 24600:0:(obd_mount_server.c:1422:server_put_super()) sultan-OST0000: failed to disconnect lwp. (rc=-2)
[ 1063.528574] LustreError: 24600:0:(obd_config.c:619:class_cleanup()) Device 3 not setup
[ 1063.539611] Lustre: server umount sultan-OST0000 complete
[ 1063.545093] LustreError: 24600:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount /dev/mapper/sultan-ddn-l0 (-22)
[ 1070.949382] LDISKFS-fs (dm-6): recovery complete
[ 1070.956045] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. quota=on. Opts:
[ 1071.472962] LustreError: 24982:0:(ofd_fs.c:594:ofd_server_data_init()) sultan-OST0004: unsupported read-only filesystem feature(s) 2
[ 1071.495949] LustreError: 24982:0:(obd_config.c:572:class_setup()) setup sultan-OST0004 failed (-22)
[ 1071.505140] LustreError: 24982:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.37.248.67@o2ib1: cfg command failed: rc = -22

Something borked my file system. I'm going to have to reformat. We really need to increase the test scope to make sure going from 2.5 to 2.8 works, as well as the reverse. This includes a 2.5 setup with remote directories.

Comment by Di Wang [ 05/Jun/15 ]

James, I just updated the patches. Here is the list based on master; you can apply these patches in this order (see the sketch after the list):

http://review.whamcloud.com/#/c/14679/
http://review.whamcloud.com/#/c/12825/36
http://review.whamcloud.com/#/c/14883/5
http://review.whamcloud.com/#/c/12282/47
http://review.whamcloud.com/#/c/12450/45
http://review.whamcloud.com/#/c/13785/24
http://review.whamcloud.com/#/c/13786/25
http://review.whamcloud.com/#/c/15161/1
http://review.whamcloud.com/#/c/15162/1
http://review.whamcloud.com/#/c/15163/1

The last one (15163) will be the final DNE2 patch; there are a few more for fixing racer issues, but they are probably not needed for the current test.
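
For reference, a hedged sketch of pulling one of these changes from Gerrit and applying it on top of master. The fetch URL and project path are assumptions, and the refs/changes layout follows the usual Gerrit convention (last two digits of the change number / change number / patch-set number):

# example for the first change in the list (14679); repeat for each change, in the order above
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/79/14679/<patchset> && \
    git cherry-pick FETCH_HEAD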

Sure, I will add this upgrade test (2.5 with remote dir, upgraded to 2.8) to conf-sanity 32c. And I will try the downgrade manually.

Comment by James A Simmons [ 08/Jun/15 ]

I reformatted the file system and the problems went away, so it was file system corruption that caused this. I know how this happened. Originally my file system was built under Lustre 2.5 and I created a bunch of DNE1 directories for testing. When I booted into Lustre pre-2.8 I created a new DNE2 scratch workspace and moved the contents of my old scratch space there. This included the batch of DNE1 directories I had created. This totally hosed the file system. I bet the reverse is true as well, i.e. placing a DNE2 directory under a DNE1 directory will cause mayhem too.

Comment by Di Wang [ 08/Jun/15 ]

James: Thanks. I will add this process to conf-sanity.sh 32c and check.

Comment by James A Simmons [ 08/Jun/15 ]

Before we declare victory: I moved my testing to our Cray system and the problem is still there. This time, now that the file system corruption is gone, we can get a better idea of what is going on. First I tried with and without the patch from LU-5385 and got the same result: the Cray node hangs attempting md5sum simul. No errors client side, but I do see errors on the OSS.

[ 5064.214349] LNet: 21859:0:(o2iblnd_cb.c:411:kiblnd_handle_rx()) PUT_NACK from 10.37.202.60@o2ib1
[ 5064.223501] LustreError: 21859:0:(events.c:447:server_bulk_callback()) event type 5, status -38, desc ffff88103430e000
[ 5164.196241] LustreError: 22947:0:(ldlm_lib.c:3056:target_bulk_io()) @@@ timeout on bulk READ after 100+0s req@ffff8810407fd080 x1503431681379072/t0(0) o3->dc99049c-a5df-abb6-3de7-890b86319cdf@30@gni1:281/0 lens 608/432 e 4 to 0 dl 1433784541 ref 1 fl Interpret:/0/0 rc 0/0
[ 5164.220805] Lustre: sultan-OST0000: Bulk IO read error with dc99049c-a5df-abb6-3de7-890b86319cdf (at 30@gni1), client will retry: rc -110
[ 5175.564732] Lustre: sultan-OST0000: Client dc99049c-a5df-abb6-3de7-890b86319cdf (at 30@gni1) reconnecting
[ 5175.574451] Lustre: Skipped 1 previous similar message
[ 5175.579726] Lustre: sultan-OST0000: Connection restored to dc99049c-a5df-abb6-3de7-890b86319cdf (at 30@gni1)
[ 5175.589709] Lustre: Skipped 272 previous similar messages
[ 5175.595765] LNet: 21867:0:(o2iblnd_cb.c:411:kiblnd_handle_rx()) PUT_NACK from 10.37.202.61@o2ib1
[ 5175.605076] LustreError: 21867:0:(events.c:447:server_bulk_callback()) event type 5, status -38, desc ffff8810340cc000

I pushed logs to ftp.whamcloud.com/uploads/LU-6663/june-8-[client|oss].log for debugging. Perf shows it spinning on a spin lock on the OSS side. One last note of interest: the DNE2 directory doesn't have this problem, only ordinary directories do.
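
For context, a hedged sketch of the kind of profiling that shows such spinning (illustrative, not the exact commands used here):

perf top -g                      # live view on the OSS; heavy spin-lock time shows up near the top
perf record -a -g -- sleep 30    # or capture a 30-second system-wide profile
perf report                      # then inspect where the CPU time goes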

Comment by Di Wang [ 08/Jun/15 ]

I checked the debug log, it seems this slows still happens between o2ib and gni.

00000400:00000010:4.0:1433777204.396986:0:5507:0:(lib-lnet.h:240:lnet_md_free()) kfreed 'md': 128 at ffff8803ce5c81c0 (tot 381648128).
00000400:00000010:4.0:1433777204.396988:0:5507:0:(lib-lnet.h:273:lnet_msg_free()) kfreed 'msg': 352 at ffff8803cf242a00 (tot 381647776).
00000800:00004000:4.0:1433777204.396989:0:5507:0:(gnilnd_cb.c:3445:kgnilnd_check_fma_send_cq()) conn->gnc_tx_in_use refcount 0
00000800:00004000:4.0:1433777204.396990:0:5507:0:(gnilnd_cb.c:3446:kgnilnd_check_fma_send_cq()) conn ffff8803ea1b3800->27@gni1-- (3)
00000800:00000040:4.0:1433777204.396992:0:5507:0:(gnilnd_cb.c:3328:kgnilnd_check_fma_send_cq()) SMSG send CQ 0 not ready (data 0x140002253101005) processed 1
00000800:00000040:4.0:1433777204.396993:0:5507:0:(gnilnd_cb.c:3481:kgnilnd_check_fma_rcv_cq()) SMSG RX CQ 0 empty data 0x0 processed 0
00000800:00000040:4.0:1433777204.396994:0:5507:0:(gnilnd_cb.c:3170:kgnilnd_check_rdma_cq()) SEND RDMA CQ 0 empty processed 0
00000800:00000040:4.0:1433777204.396995:0:5507:0:(gnilnd_cb.c:3328:kgnilnd_check_fma_send_cq()) SMSG send CQ 0 not ready (data 0xffffffffa03ce1f9) processed 0
00000800:00000040:4.0:1433777204.396996:0:5507:0:(gnilnd_cb.c:3481:kgnilnd_check_fma_rcv_cq()) SMSG RX CQ 0 empty data 0xffffffffa040d230 processed 0
00000800:00000040:4.0:1433777204.396997:0:5507:0:(gnilnd_cb.c:3170:kgnilnd_check_rdma_cq()) SEND RDMA CQ 0 empty processed 0
00000800:00000040:4.0:1433777204.396999:0:5507:0:(gnilnd_cb.c:5080:kgnilnd_scheduler()) scheduling: found_work 0 busy_loops 75
00000800:00000040:4.0:1433777204.397000:0:5506:0:(gnilnd_cb.c:5084:kgnilnd_scheduler()) awake after schedule
00000800:00000040:4.0:1433777204.397001:0:5506:0:(gnilnd_cb.c:3328:kgnilnd_check_fma_send_cq()) SMSG send CQ 0 not ready (data 0x0) processed 0
00000800:00000040:4.0:1433777204.397002:0:5506:0:(gnilnd_cb.c:3481:kgnilnd_check_fma_rcv_cq()) SMSG RX CQ 0 empty data 0x0 processed 0
00000800:00000040:4.0:1433777204.397003:0:5506:0:(gnilnd_cb.c:3170:kgnilnd_check_rdma_cq()) SEND RDMA CQ 0 empty processed 0
00000800:00000040:4.0:1433777204.397004:0:5506:0:(gnilnd_cb.c:5080:kgnilnd_scheduler()) scheduling: found_work 0 busy_loops 76
00000100:00000400:3.0:1433777204.401048:0:5521:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433777092/real 1433777092]  req@ffff8803e9ad99c0 x1503423118714204/t0(0) o3->sultan-OST0000-osc-ffff8803f36ce800@10.37.248.69@o2ib1:6/4 lens 608/432 e 4 to 1 dl 1433777204 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
00000100:00000200:3.0:1433777204.401055:0:5521:0:(events.c:97:reply_in_callback()) @@@ type 6, status 0  req@ffff8803e9ad99c0 x1503423118714204/t0(0) o3->sultan-OST0000-osc-ffff8803f36ce800@10.37.248.69@o2ib1:6/4 lens 608/432 e 4 to 1 dl 1433777204 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
00000100:00000200:3.0:1433777204.401060:0:5521:0:(events.c:118:reply_in_callback()) @@@ unlink  req@ffff8803e9ad99c0 x1503423118714204/t0(0) o3->sultan-OST0000-osc-ffff8803f36ce800@10.37.248.69@o2ib1:6/4 lens 608/432 e 4 to 1 dl 1433777204 ref 2 fl Rpc:X/2/ffffffff rc 0/-1

So the client is using a gni interface, and OST0000 is using an o2ib interface.

And it also seems there are network partitions on the connection to OST0000, though I do not know whether the real cause is the network or something else:

00000100:00080000:5.0:1433777204.401753:0:5520:0:(import.c:1170:ptlrpc_connect_interpret()) reconnected to sultan-OST0000_UUID@10.37.248.69@o2ib1 after partition
00000100:00080000:5.0:1433777204.401754:0:5520:0:(import.c:1188:ptlrpc_connect_interpret()) ffff8803ced85000 sultan-OST0000_UUID: changing import state from CONNECTING to RECOVER
00000100:00080000:5.0:1433777204.401756:0:5520:0:(import.c:1488:ptlrpc_import_recovery_state_machine()) ffff8803ced85000 sultan-OST0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:5.0:1433777204.401764:0:5520:0:(import.c:1494:ptlrpc_import_recovery_state_machine()) sultan-OST0000-osc-ffff8803f36ce800: Connection restored to 10.37.248.69@o2ib1 (at 10.37.248.69@o2ib1)

Comment by James A Simmons [ 08/Jun/15 ]

If the DNE2 directory gives no problem, then it is an issue with small packets.

Comment by James A Simmons [ 10/Jun/15 ]

I tracked down what is causing the bug. Setting map_on_demand=256 for the ko2iblnd driver triggers it, and for some reason it can only be triggered via a Cray router. Peter, should I open another ticket that properly defines the problem and close this ticket?
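
For reference, a hedged sketch of how that tunable is typically set and checked; the modprobe config file name is an assumption:

# /etc/modprobe.d/ko2iblnd.conf
options ko2iblnd map_on_demand=256

# verify the value on a running node
cat /sys/module/ko2iblnd/parameters/map_on_demand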

Comment by Jian Yu [ 10/Jun/15 ]

Yes, James. Peter suggested to close this one and open a new ticket for tracking the issue you found. Thank you.

Comment by Peter Jones [ 11/Jun/15 ]

Yes please!

Comment by Andreas Dilger [ 12/Jun/15 ]

James, Di,
did you ever open a bug on the 2.5->2.8 upgrade/downgrade problem from:
https://jira.hpdd.intel.com/browse/LU-6663?focusedCommentId=117610

[ 1062.791599] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts:
[ 1063.389770] LustreError: 24686:0:(ofd_fs.c:594:ofd_server_data_init()) sultan-OST0000: unsupported read-only filesystem feature(s) 2

The unknown read-only feature appears to be OBD_ROCOMPAT_IDX_IN_IDIF. That feature shouldn't automatically be enabled, and needs active participation from the administrator:

static ssize_t
ldiskfs_osd_index_in_idif_seq_write(struct file *file, const char *buffer,
                                    size_t count, loff_t *off)
{
                LCONSOLE_WARN("%s: OST-index in IDIF has been enabled, "
                              "it cannot be reverted back.\n", osd_name(dev));
                return -EPERM;
 

James, did you set the index_in_idif feature in /proc, or is there a bug here that needs to be filed? Looking at the code it doesn't appear that this flag could have been set automatically.
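
For what it's worth, a hedged way to check whether that flag was ever enabled by hand, assuming the tunable is exposed under osd-ldiskfs with the name implied by the handler above:

lctl get_param osd-ldiskfs.*.index_in_idif    # on the OSS; a non-zero value would mean it was enabled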

Comment by Di Wang [ 12/Jun/15 ]

Andreas: I already added a 2.5 to 2.8 DNE upgrade test to the DNE patch series. I will try these downgrade steps locally to see if it is easy to reproduce.

Comment by Di Wang [ 14/Jun/15 ]

http://review.whamcloud.com/#/c/15275/1 DNE upgrade test cases
