[LU-6663] DNE2 directories have very bad performance Created: 29/May/15 Updated: 15/Jun/15 Resolved: 11/Jun/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Di Wang |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | dne2 |
| Environment: |
The problem only happens when accessing the contents of a DNE2 directory |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
With my DNE2 setup I attempted to start some jobs on our test cluster, and the jobs got stuck for hours trying to run. So I did some testing to see what was breaking. A simple md5sum on files showed the problem very easily. For normal directories, md5sum on a file came back very fast:

[root@ninja06 johndoe]# date;md5sum ior;date

But for DNE2 directories I saw this:

[root@ninja06 jsimmons]# date;md5sum simul;date

This is not an issue for stats, so it only appears when accessing data from a file. An ls -al on the file simul comes back very fast. I did this test without the work from |
| Comments |
| Comment by Di Wang [ 29/May/15 ] |
|
Hmm, I tried this on my local test node. Unfortunately, I cannot reproduce it with 4 MDTs and a stripe_count of 4. James, could you please tell me more about your system: how many MDTs, and what is the stripe_count of the striped directory? And what is the size of the simul and ior files? Thanks.

[root@testnode tests]# date; md5sum /mnt/lustre/test2/dd; date
Wed May 27 19:56:37 PDT 2015
d8b61b2c0025919d5321461045c8226f /mnt/lustre/test2/dd
Wed May 27 19:56:38 PDT 2015
[root@testnode tests]# date; md5sum /mnt/lustre/test1/dd; date
Wed May 27 19:56:44 PDT 2015
d8b61b2c0025919d5321461045c8226f /mnt/lustre/test1/dd
Wed May 27 19:56:45 PDT 2015
[root@testnode tests]# ../utils/lfs getdirstripe /mnt/lustre/test1
/mnt/lustre/test1
lmv_stripe_count: 4 lmv_stripe_offset: 0
mdtidx FID[seq:oid:ver]
0 [0x200000400:0x2:0x0]
1 [0x240000403:0x2:0x0]
2 [0x280000403:0x2:0x0]
3 [0x2c0000401:0x2:0x0]
[root@testnode tests]# ../utils/lfs getdirstripe /mnt/lustre/test2
/mnt/lustre/test2
lmv_stripe_count: 0 lmv_stripe_offset: 0
[root@testnode tests]# ll /mnt/lustre/test2/
total 512004
-rw-r--r-- 1 root root 524288000 May 27 19:56 dd
[root@testnode tests]# ll /mnt/lustre/test1/
total 512004
-rw-r--r-- 1 root root 524288000 May 27 19:56 dd
[root@testnode tests]#
|
| Comment by Andreas Dilger [ 29/May/15 ] |
|
James, can you run strace to see where md5sum is spending its time? Maybe also collect +rpctrace debugging logs if strace isn't clear. Also, what back-end filesystem are you using for the MDTs and OSTs? |
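A minimal sketch of collecting both traces at once, assuming the standard lctl debug interface (the file path is illustrative):

    lctl clear                               # drop the existing kernel debug buffer
    lctl set_param debug=+rpctrace           # add RPC tracing to the current debug mask
    strace -tt -T -o md5sum.strace md5sum /lustre/sultan/jsimmons/simul
    lctl dk > /tmp/client-rpctrace.log       # dump the debug buffer after the run

The -tt/-T strace flags timestamp each syscall and show how long it took, which should make the slow read() stand out.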
| Comment by Di Wang [ 29/May/15 ] |
|
Hmm, I also checked the RPC counts after md5sum of 1000 1k files in this striped directory:

[root@testnode test1]# cat /proc/fs/lustre/mdc/lustre-MDT000*-mdc-*/stats
snapshot_time   1432783774.817424 secs.usecs
req_waittime    759 samples [usec] 33 311 81604 9966800
req_active      759 samples [reqs] 1 1 759 759
mds_close       251 samples [usec] 33 311 18595 1595875
obd_ping        4 samples [usec] 80 170 527 74209
snapshot_time   1432783774.817449 secs.usecs
req_waittime    750 samples [usec] 34 279 81081 10035485
req_active      750 samples [reqs] 1 1 750 750
mds_close       249 samples [usec] 34 180 18120 1492930
obd_ping        3 samples [usec] 59 169 363 50267
snapshot_time   1432783774.817466 secs.usecs
req_waittime    751 samples [usec] 33 294 81646 10083800
req_active      751 samples [reqs] 1 1 751 751
mds_close       249 samples [usec] 33 153 18560 1536386
obd_ping        2 samples [usec] 109 128 237 28265
snapshot_time   1432783774.817483 secs.usecs
req_waittime    755 samples [usec] 32 369 81239 10030651
req_active      755 samples [reqs] 1 1 755 755
mds_close       251 samples [usec] 32 199 18483 1542053
obd_ping        2 samples [usec] 175 189 364 66346
[root@testnode test1]# cat /proc/fs/lustre/osc/lustre-OST0001-osc-ffff8801e8e49400/stats
snapshot_time   1432784238.566254 secs.usecs
req_waittime    1012 samples [usec] 36 424 96854 10257148
req_active      1012 samples [reqs] 1 1 1012 1012
read_bytes      499 samples [bytes] 896 896 447104 400605184
ost_read        499 samples [usec] 42 253 48718 5248004
obd_ping        14 samples [usec] 54 160 1551 186601
[root@testnode test1]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff8801e8e49400/stats
snapshot_time   1432784253.55197 secs.usecs
req_waittime    1022 samples [usec] 22 288 96836 10097582
req_active      1022 samples [reqs] 1 1 1022 1022
read_bytes      501 samples [bytes] 896 896 448896 402210816
ost_read        501 samples [usec] 41 288 47596 4964502
obd_ping        20 samples [usec] 22 177 2295 305467

Here is the result for the non-striped directory:

[root@testnode test2]# cat /proc/fs/lustre/mdc/lustre-MDT000*-mdc-*/stats
snapshot_time   1432783875.323394 secs.usecs
req_waittime    3006 samples [usec] 33 483 341808 43798954
req_active      3006 samples [reqs] 1 1 3006 3006
mds_close       1000 samples [usec] 33 270 87747 8322307
mds_readpage    1 samples [usec] 483 483 483 233289
obd_ping        4 samples [usec] 64 118 372 36198
snapshot_time   1432783875.323418 secs.usecs
req_waittime    5 samples [usec] 29 113 386 34914
req_active      5 samples [reqs] 1 1 5 5
obd_ping        5 samples [usec] 29 113 386 34914
snapshot_time   1432783875.323433 secs.usecs
req_waittime    4 samples [usec] 84 115 379 36499
req_active      4 samples [reqs] 1 1 4 4
obd_ping        4 samples [usec] 84 115 379 36499
snapshot_time   1432783875.323449 secs.usecs
req_waittime    4 samples [usec] 22 198 503 83013
req_active      4 samples [reqs] 1 1 4 4
obd_ping        4 samples [usec] 22 198 503 83013
[root@testnode test2]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88019a235000/stats
snapshot_time   1432784328.527276 secs.usecs
req_waittime    1011 samples [usec] 30 349 91417 9311277
req_active      1011 samples [reqs] 1 1 1011 1011
read_bytes      500 samples [bytes] 896 896 448000 401408000
ost_read        500 samples [usec] 44 349 44299 4465957
ost_connect     1 samples [usec] 109 109 109 11881
obd_ping        10 samples [usec] 30 209 1155 159765
[root@testnode test2]# cat /proc/fs/lustre/osc/lustre-OST0001-osc-ffff88019a235000/stats
snapshot_time   1432784335.389316 secs.usecs
req_waittime    1012 samples [usec] 29 284 91225 9297035
req_active      1012 samples [reqs] 1 1 1012 1012
read_bytes      500 samples [bytes] 896 896 448000 401408000
ost_read        500 samples [usec] 42 284 44763 4518767
ost_connect     1 samples [usec] 110 110 110 12100
obd_ping        11 samples [usec] 29 248 1178 163494

Nothing unusual there, it seems. James, do the files ior and simul have the same data stripes in the striped directory and the non-striped directory? Also, if you can repeat the problem on the test node, could you please collect the debug log (-1 level) on the client side for me? Thanks! |
| Comment by James A Simmons [ 29/May/15 ] |
|
Andreas, I get the following strace of md5sum ./simul:

open("simul", O_RDONLY) = 3
) = 0

It's the first read that is taking so long. BTW, the second time you run this command no lag happens, so client-side caching hides the problem. |
| Comment by James A Simmons [ 29/May/15 ] |
|
Everything in the test system is running the same Lustre version, which is master + DNE patches and a few other patches as well. With no patches the problem still exists. The back-end servers are running ldiskfs. There are 16 MDS servers, each with one MDT. I have 4 OSS servers with a total of 56 OSTs. I created myself a home directory using lfs setdirstripe -c 16, which contains the simul binary that I run md5sum on. I have a few other binaries as well for testing purposes. Here are the RPC stats:

[root@ninja06 ~]# cat /proc/fs/lustre/mdc/sultan-MDT0000-mdc-ffff8808041f1800/stats;md5sum /lustre/sultan/s |
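For reference, a reproduction along the lines described above might look like this (the directory path is illustrative; drop_caches is only there because, as noted in the strace comment above, client-side caching hides the problem on a second run):

    lfs setdirstripe -c 16 /lustre/sultan/jsimmons     # striped directory across 16 MDTs
    cp simul /lustre/sultan/jsimmons/
    echo 3 > /proc/sys/vm/drop_caches                  # drop cached pages on the client
    time md5sum /lustre/sultan/jsimmons/simul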
| Comment by James A Simmons [ 29/May/15 ] |
|
I uploaded full debug logs from the client to ftp.whamcloud.com//uploads/ |
| Comment by Di Wang [ 29/May/15 ] |
|
Thanks James, I just glanced at the debug log a bit; it seems to me that all of the time cost is in the CLIO stack. I do not understand why this only happens under a striped directory. James, do you have the debug logs for a non-striped directory? I wonder if this is related to the recent CLIO simplification changes. I will check the log carefully. |
| Comment by James A Simmons [ 29/May/15 ] |
|
Yes, I just uploaded |
| Comment by Di Wang [ 29/May/15 ] |
|
Just checked the debug log; here is what happens. It is the file read that costs most of the time:

00000080:00200000:0.0:1432922231.938967:0:14593:0:(file.c:1088:ll_file_io_generic()) file: simul, type: 0 ppos: 0, count: 4194304
...
00000080:00200000:0.0:1432922285.038890:0:14593:0:(file.c:1196:ll_file_io_generic()) iot: 0, result: 4194304

So it takes about 54 seconds to read this 4MB file. Inside this read, it sent out 7 read RPCs; let's look at the first 4 of them (the other 3 are read-ahead RPCs). The RPC size seems good:

00000008:00000002:5.0:1432922231.992289:0:21013:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8810520764e0. now 1r/0w in flight req@ffff881052076380 x1502512486180268/t0(0) o3->sultan-OST0035-osc-ffff8807f7983c00@10.37.248.70@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000008:00000002:2.0:1432922231.998420:0:21015:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8807f24e9de0. now 1r/0w in flight req@ffff8807f24e9c80 x1502512486180276/t0(0) o3->sultan-OST0036-osc-ffff8807f7983c00@10.37.248.71@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000008:00000002:2.0:1432922232.004918:0:21015:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8807f24e9ae0. now 1r/0w in flight req@ffff8807f24e9980 x1502512486180284/t0(0) o3->sultan-OST0037-osc-ffff8807f7983c00@10.37.248.72@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000008:00000002:6.0:1432922232.011237:0:21017:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff8810553fade0. now 1r/0w in flight req@ffff8810553fac80 x1502512486180292/t0(0) o3->sultan-OST0000-osc-ffff8807f7983c00@10.37.248.69@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1

It seems only the 1st one got its reply in time:

00000100:00000040:4.0:1432922231.994268:0:21016:0:(lustre_net.h:2439:ptlrpc_rqphase_move()) @@@ move req "Bulk" -> "Interpret" req@ffff881052076380 x1502512486180268/t0(0) o3->sultan-OST0035-osc-ffff8807f7983c00@10.37.248.70@o2ib1:6/4 lens 488/400 e 0 to 0 dl 1432922245 ref 2 fl Bulk:RM/0/0 rc 1048576/1048576

The other 3 RPCs all got an early reply and then got the pages after around 27 seconds. Here is one example:

00000100:00001000:5.0:1432922239.998446:0:21013:0:(client.c:468:ptlrpc_at_recv_early_reply()) @@@ Early reply #1, new deadline in 56s (50s) req@ffff8807f24e9c80 x1502512486180276/t0(0) o3->sultan-OST0036-osc-ffff8807f7983c00@10.37.248.71@o2ib1:6/4 lens 488/432 e 1 to 0 dl 1432922295 ref 2 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000040:4.0:1432922258.003100:0:21013:0:(lustre_net.h:2439:ptlrpc_rqphase_move()) @@@ move req "Bulk" -> "Interpret" req@ffff8807f24e9c80 x1502512486180276/t0(0) o3->sultan-OST0036-osc-ffff8807f7983c00@10.37.248.71@o2ib1:6/4 lens 488/400 e 1 to 0 dl 1432922295 ref 2 fl Bulk:RM/0/0 rc 1048576/1048576

I guess either something happened on the OST side or on the network. James, is there anything special about these nodes: 10.37.248.71, 10.37.248.72, 10.37.248.69? Thanks. |
| Comment by Di Wang [ 29/May/15 ] |
|
Interesting, I checked the debug log for the non-striped directory. The RPC size and stats are quite similar to the striped-dir case, but every RPC gets its reply immediately:

00000008:00000002:1.0:1432929764.067930:0:21017:0:(osc_request.c:1829:osc_build_rpc()) @@@ 256 pages, aa ffff880871aef1e0. now 1r/0w in flight req@ffff880871aef080 x1502512486987692/t0(0) o3->sultan-OST001a-osc-ffff880804b97800@10.37.248.71@o2ib1:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
....
00000100:00000040:0.0:1432929764.082631:0:21011:0:(lustre_net.h:2439:ptlrpc_rqphase_move()) @@@ move req "Bulk" -> "Interpret" req@ffff880871aef080 x1502512486987708/t0(0) o3->sultan-OST001c-osc-ffff880804b97800@10.37.248.69@o2ib1:6/4 lens 488/400 e 0 to 0 dl 1432929778 ref 2 fl Bulk:RM/0/0 rc 1048576/1048576

James, is this issue easily reproduced on your test node? Could you please send me both the client and OSS debug logs? Thanks. |
| Comment by Andreas Dilger [ 29/May/15 ] |
|
Di, James, one thing to consider is whether the FID for the OST object is potentially causing some problem. That would be something unique to DNE: the OST objects for files created on MDT000x, x != 0, use FID-on-OST, while objects created on MDT0000 use old-style IDIF FIDs. That said, it is strange to see this only on James' system, but possibly it relates to the number of MDTs * OSTs causing the FLDB to be very large, or something like that? |
| Comment by Di Wang [ 29/May/15 ] |
|
Hmm, that is true. For the striped directory case, the FIDs of the OST objects are real FIDs:

res: [0x138000040b:0xe2:0x0].0
res: [0x1340000404:0xe2:0x0].0
.......

For the non-striped directory case, the FIDs are IDIFs:

[0x2:0x0:0x0].0
.....

But there are FLD lookups in the debug log. It might be interesting to see what happens on the OST for a real FID. James: an easy way to verify this would be to create a remote directory on MDTn (n != 0), then create the file under the remote directory, which will allocate regular FIDs for the file, and see if the problem can be reproduced. And if you can get an OST debug log for me, that would be helpful. Thanks! |
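A sketch of that verification (mount point and MDT index are illustrative):

    lfs mkdir -i 4 /lustre/sultan/remote_dir        # remote directory on MDT0004 (n != 0), no dir striping
    cp simul /lustre/sultan/remote_dir/
    lfs getstripe /lustre/sultan/remote_dir/simul   # note the OST objects allocated for the file
    md5sum /lustre/sultan/remote_dir/simul          # the first (uncached) read is the interesting one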
| Comment by Di Wang [ 29/May/15 ] |
|
Andreas: I just checked the debug log; it seems only the first read RPC sent to each OST is slow. So maybe you are right, and it is because locating the real-FID object on the OST is slow. Hmm. |
| Comment by Andreas Dilger [ 29/May/15 ] |
|
Di, maybe something about loading the O/{seq} object directories is doing too much work? |
| Comment by Di Wang [ 29/May/15 ] |
|
Andreas: hmm, loading a new sequence does need extra work, which includes about 33 directory lookups and 32 xattr_set calls. But I assume the OST is not busy during the test (James, please confirm?), so it should not cost about 28 seconds. Let's see what the debug logs say. |
| Comment by James A Simmons [ 30/May/15 ] |
|
The only activity on the OSTs is the retrieval of the simul executable's raw data to calculate the md5sum. So it is a quiet system at the time. |
| Comment by Di Wang [ 01/Jun/15 ] |
|
James: Any news on the OST side debug log? Thanks! |
| Comment by James A Simmons [ 01/Jun/15 ] |
|
I just finished the DNE1 directory test. I created an empty file in the DNE1 directory using MDS 4 and then remounted Lustre on the client. Then I attempted an md5sum on the new empty file, which after 3 minutes had still not finished, so I stopped the md5sum command and did a log dump. I pushed the log to ftp.whamcloud.com/uploads/ Also, as a side note, I found that unmounting the Lustre client did not finish either. I had to reboot the node. |
| Comment by James A Simmons [ 01/Jun/15 ] |
|
I'm seeing this error on the OSS.

[root@sultan-oss1 ~]# dmesg

I uploaded the OSS logs to ftp.whamcloud.com/uploads/ |
| Comment by Di Wang [ 01/Jun/15 ] |
|
Hmm, I found this in the debug log:

00010000:00000001:10.0:1433178825.894370:0:28309:0:(ldlm_lib.c:2683:target_bulk_timeout()) Process entered
00010000:00000001:10.0:1433178825.894375:0:28309:0:(ldlm_lib.c:2687:target_bulk_timeout()) Process leaving (rc=1 : 1 : 1)
00010000:00020000:10.0:1433178825.894381:0:28309:0:(ldlm_lib.c:2771:target_bulk_io()) @@@ timeout on bulk READ after 100+0s req@ffff880ff94283c0 x1502435549823124/t0(0) o3->b9cf5051-0ff9-6cf9-cd67-9364a2516176@30@gni1:86/0 lens 488/432 e 0 to 0 dl 1433178836 ref 1 fl Interpret:/2/0 rc 0/0

Is there anything special about the connection between sultan-OST0034 and 30@gni1? |
| Comment by Di Wang [ 02/Jun/15 ] |
|
Unfortunately, apart from the bulk IO timeout failures, I cannot see any other problem in the debug log. So it is quite possible this is caused by some problem in LNET. Hmm, the DNE2 patches did not touch the LNET code apart from this patch: http://review.whamcloud.com/#/c/12525 So it may be related to this change. Unfortunately, there is no corresponding client debug log from when the bulk IO timeout happened on the OST side. James: if this is repeatable in your environment, could you please get the client-side log at the time the timeout happens on the server side? |
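One way to get matching logs from both ends, assuming the node clocks are in sync (buffer size and file names are illustrative):

    # run on the client and on the OSS serving the file's OSTs, before starting md5sum
    lctl set_param debug=-1              # full debug mask
    lctl set_param debug_mb=1024         # enlarge the debug buffer so nothing wraps
    lctl clear
    # ... run the md5sum test, wait for the bulk timeout to show up on the OSS ...
    lctl dk > /tmp/$(hostname)-debug.log # dump on both nodes at roughly the same time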
| Comment by James A Simmons [ 02/Jun/15 ] |
|
The |
| Comment by Di Wang [ 02/Jun/15 ] |
|
00000020:02000400:10.0:1433178825.928974:0:28309:0:(tgt_handler.c:1834:tgt_brw_read()) sultan-OST0034: Bulk IO read error with b9cf5051-0ff9-6cf9-cd67-9364a2516176 (at 30@gni1), client will retry: rc -110
00000020:00000001:10.0:1433178825.947041:0:28309:0:(tgt_handler.c:1851:tgt_brw_read()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00010000:00000080:10.0:1433178825.947043:0:28309:0:(ldlm_lib.c:2427:target_committed_to_req()) @@@ not sending last_committed update (0/1) req@ffff880ff94283c0 x1502435549823124/t0(0) o3->b9cf5051-0ff9-6cf9-cd67-9364a2516176@30@gni1:86/0 lens 488/432 e 0 to 0 dl 1433178836 ref 1 fl Interpret:/2/ffffffff rc -110/-1

And I cannot find this 1502435549823 req on the client side. Also, if the time on each node was synchronized before the test, then it seems the timestamps in |
| Comment by James A Simmons [ 02/Jun/15 ] |
|
Yes I see what you mean. I did the same test twice. Collected logs from the client first then the OSS. I guess you want me to collect data from both at the same time. |
| Comment by Di Wang [ 02/Jun/15 ] |
|
Yes, please. I want to see if this is related to the change in http://review.whamcloud.com/#/c/12525 |
| Comment by Doug Oucharek (Inactive) [ 02/Jun/15 ] |
|
James: Is this system running GNI? I'm wondering if there is something in patch http://review.whamcloud.com/#/c/12525 which is not compatible with GNI operation. Is GNI using rdma reads now for bulk transfers (IB only uses rdma writes)? |
| Comment by James A Simmons [ 03/Jun/15 ] |
|
The test client and server are both InfiniBand. Originally I tested on both GNI and o2ib, but when I saw problems I moved testing to InfiniBand only to make sure it wasn't a GNI issue. All the logs I have posted here are from InfiniBand-only systems. |
| Comment by Di Wang [ 03/Jun/15 ] |
|
James: but the request which caused the OST timeout seems to come from the GNI interface:

00000020:02000400:10.0:1433178825.928974:0:28309:0:(tgt_handler.c:1834:tgt_brw_read()) sultan-OST0034: Bulk IO read error with b9cf5051-0ff9-6cf9-cd67-9364a2516176 (at 30@gni1), client will retry: rc -110
00000020:00000001:10.0:1433178825.947041:0:28309:0:(tgt_handler.c:1851:tgt_brw_read()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00010000:00000080:10.0:1433178825.947043:0:28309:0:(ldlm_lib.c:2427:target_committed_to_req()) @@@ not sending last_committed update (0/1) req@ffff880ff94283c0 x1502435549823124/t0(0) o3->b9cf5051-0ff9-6cf9-cd67-9364a2516176@30@gni1:86/0 lens 488/432 e 0 to 0 dl 1433178836 ref 1 fl Interpret:/2/ffffffff rc -110/-1

The nid is 30@gni1. |
| Comment by James A Simmons [ 03/Jun/15 ] |
|
Ah yes, one of the service nodes failed to unmount. This time we killed off that node so a clean run was done, and the same problem still exists. I uploaded both the client and OSS logs to ftp.whamcloud.com/uploads/ |
| Comment by James A Simmons [ 03/Jun/15 ] |
|
Your ftp site is down. The OSS log transfer is not complete. |
| Comment by Di Wang [ 03/Jun/15 ] |
|
I think it works again. Hmm, I saw two files there:

-rw-r--r-- 1 nobody ftp 34M Jun 3 09:58 client-dump-june-3.log
-rw-r--r-- 1 nobody ftp 485M Jun 3 10:16 oss-dump-june-3.log

Let me download them to see what is there. You can keep uploading the OSS log. Thanks. |
| Comment by James A Simmons [ 03/Jun/15 ] |
|
That looks about right for the logs. A few MB might be missing from the OSS log but that should be okay. |
| Comment by Di Wang [ 03/Jun/15 ] |
|
James: what are the OSS and client (where you collected the debug log) IP addresses? And what is the stripe layout of the file being read? Could you please post the stripe information here? Thanks. |
| Comment by James A Simmons [ 03/Jun/15 ] |
|
Client: 10.37.248.124@o2ib1

As you can tell, the file is using MDS 4.

[root@ninja06 ~]# lfs getstripe /lustre/sultan/stf008/scratch/jsimmons/test_mdt4/#test-dir.0/file.1 |
| Comment by Di Wang [ 04/Jun/15 ] |
|
According to the debug log, the file_read thread seems to be waiting there to get some pages. The debug log starts at 1433350220.121884, and the file_read thread does not show up until 1433350251.100439:

00000020:00000001:2.0:1433350251.100439:0:22229:0:(cl_io.c:866:cl_page_list_add()) Process entered
...
00000080:00000001:2.0:1433350251.128153:0:22229:0:(file.c:1353:ll_file_read()) Process leaving (rc=4194304 : 4194304 : 400000)

Before the read thread shows up (from 1433350220.121884 to 1433350251.100439), it seems the ptlrpc threads are reading pages from different stripes: they issued 4 x 1M RPCs to OST0000, 8 x 1M RPCs to OST0036, 8 x 1M RPCs to OST0037 and 7 x 1M RPCs to OST0035. So it is probably read-ahead that went badly. I also checked the debug log on the OSS side; it seems the longest RPC handling time is 43902us, which is still far less than 1 second:

00000100:00100000:9.0:1433350292.704085:0:24865:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io02_011:18eebf31-d253-c2e2-a426-9fd09c02ffc4+9:9454:x1502877014836248:12345-10.37.248.130@o2ib1:3 Request procesed in 43821us (43902us total) trans 0 rc 1048576/1048576

So the OSS seems fine this time. James: could you please run the test without the trace debug level on the client side, so we can get more log to see what happens there? Anyway, I still cannot figure out why this only happens under a striped directory. |
| Comment by James A Simmons [ 04/Jun/15 ] |
|
I placed a new log file, client-dump-june-3-no-trace.log, which has trace disabled, in the usual spot. |
| Comment by Di Wang [ 04/Jun/15 ] |
|
James: Thanks. But the read RPCs are missing from the debug log; or was it grabbed too late? I did not see any read RPC in the debug log.

lctl set_param debug=all-trace

The file.1 seems to be on MDT0007; is this the same file as the MDT4 one (stripe_count = 4)? And is its size more than 7G? Also, there seems to be a dead router in your system, and I am not sure how it will impact the performance:

00000800:00000100:3.0:1433379860.351507:0:2845:0:(o2iblnd_cb.c:2108:kiblnd_peer_connect_failed()) Deleting messages for 10.37.202.59@o2ib1: connection failed
00000400:00000200:2.0:1433379885.351487:0:20110:0:(router.c:1018:lnet_ping_router_locked()) rtr 10.37.202.59@o2ib1 50: deadline 4742051025 ping_notsent 1 alive 0 alive_count 1 lp_ping_timestamp 4742001025 |
| Comment by James A Simmons [ 04/Jun/15 ] |
|
That is not right. file.1 is an empty file in a DNE1 directory. Yes, its OST stripe count is 4. I have had that dead router for ages :-/ Yes, the debug level is set as you asked, and I ran the debug collection right after I started the md5sum on file.1. |
| Comment by James A Simmons [ 04/Jun/15 ] |
|
Okay, I have no idea where that data came from, but file.1 is supposed to be empty. |
| Comment by Di Wang [ 04/Jun/15 ] |
|
Hmm, interesting. Is the data still there? Or is it just that the size of file.1 is 7G? I definitely see a lot of file reads:

00000080:00200000:0.0:1433379889.075253:0:2403:0:(file.c:1088:ll_file_io_generic()) file: file.1, type: 0 ppos: 7876902912, count: 4194304

And also on the OST side, I saw there is real data for the object:

00000001:00001000:13.0:1433350227.207891:0:3039:0:(osd_io.c:830:osd_ldiskfs_map_ext_inode_pages()) inode 4800: map 256 pages from 37376
00000001:00001000:13.0:1433350227.207893:0:3039:0:(osd_io.c:774:osd_ldiskfs_map_nblocks()) blocks 37376-37631 requested for inode 4800

inode 4800 should be the object (0 354 0x162 0x600000407) on OST0000. Hmm. |
| Comment by Di Wang [ 04/Jun/15 ] |
|
James: could you please mount OST0000 as ldiskfs and look under /mnt/ost0/O/600000407 to see if there is a file whose inode number is 4800, and stat that file?

find /mnt/ost0/O/600000407 | xargs ls -i | grep 4800

Then stat that file. I want to know if the OI mapping is correct for FID-on-OST. Thanks. |
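A sketch of that check on the OSS (the backing device, mount point, and matched object path are illustrative placeholders):

    mount -t ldiskfs -o ro /dev/mapper/ost0000 /mnt/ost0   # read-only is enough for inspection
    find /mnt/ost0/O/600000407 | xargs ls -i | grep 4800
    stat /mnt/ost0/O/600000407/<path-from-the-grep-above>  # size/blocks show whether the object really has data
    umount /mnt/ost0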
| Comment by James A Simmons [ 04/Jun/15 ] |
|
[root@sultan-oss1 d2]# pwd |
| Comment by Di Wang [ 04/Jun/15 ] |
|
James: Thanks. I will check whether anything in the DNE2 patches could fill an empty object on the OST; very strange. In the meantime, could you please check if this is repeatable in your environment, i.e. create an empty file in a remote directory, then run md5sum to see if the empty object gets filled? Is there anything else you did to file.1? |
| Comment by Di Wang [ 04/Jun/15 ] |
|
James: is this a newly formatted system, or upgraded? From 2.4? BTW: is the data under file.1 garbage, or does it look like data from another file? |
| Comment by James A Simmons [ 04/Jun/15 ] |
|
It is a 2.5-formatted file system. file.1 looks like garbage. Now, I could have done something to file.1 :-/ I'm going to try our Cray system tomorrow with a vanilla 2.7.54 to see if the problem still exists. |
| Comment by James A Simmons [ 05/Jun/15 ] |
|
Can you list what the remaining patches are for DNE2 and what order they need to be applied in. Thanks. |
| Comment by James A Simmons [ 05/Jun/15 ] |
|
This morning I updated to the latest vanilla master and ended up in a state where I could not mount the file system. So I tried migrating back to Lustre 2.5, and when I attempted to mount the file system I got these errors:

[ 1025.018232] Lustre: Lustre: Build Version: 2.5.4--CHANGED-2.6.32-431.29.2.el6.atlas.x86_64

Something borked my file system. I'm going to have to reformat. We really need to increase the test scope to make sure going from 2.5 to 2.8 works, as well as the reverse. This includes a setup of 2.5 with remote directories. |
| Comment by Di Wang [ 05/Jun/15 ] |
|
James, I just updated the patch. Here is the list based on master; you can apply these patches in this order. http://review.whamcloud.com/#/c/14679/ will be the last one of the DNE2 patches; there are a few more for fixing racer, but they are probably not needed for the current test. Sure, I will add this upgrade test (2.5 with remote dir upgraded to 2.8) to conf-sanity 32c. And I will try the downgrade manually. |
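For reference, one way to pull a change like that from Gerrit onto a master checkout (assuming the project is fs/lustre-release; the exact refs/changes path and patch-set number come from the change's Download box):

    git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/79/14679/1
    git cherry-pick FETCH_HEAD

Repeating that per change, in the listed order, rebuilds the DNE2 series on top of master.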
| Comment by James A Simmons [ 08/Jun/15 ] |
|
I reformatted the file system and the problems went away, so it was file system corruption that caused this. I know how this happened |
| Comment by Di Wang [ 08/Jun/15 ] |
|
James: Thanks. I will add this process to conf-sanity.sh 32c and check. |
| Comment by James A Simmons [ 08/Jun/15 ] |
|
Before we declare victory: I moved my testing to our Cray system and the problem is still there. This time, now that the file system corruption is gone, we can get a better idea of what is going on. First I tried with the patch from

[ 5064.214349] LNet: 21859:0:(o2iblnd_cb.c:411:kiblnd_handle_rx()) PUT_NACK from 10.37.202.60@o2ib1

I pushed logs to ftp.whamcloud.com/uploads/ |
| Comment by Di Wang [ 08/Jun/15 ] |
|
I checked the debug log; it seems this slowness still happens between o2ib and gni:

00000400:00000010:4.0:1433777204.396986:0:5507:0:(lib-lnet.h:240:lnet_md_free()) kfreed 'md': 128 at ffff8803ce5c81c0 (tot 381648128).
00000400:00000010:4.0:1433777204.396988:0:5507:0:(lib-lnet.h:273:lnet_msg_free()) kfreed 'msg': 352 at ffff8803cf242a00 (tot 381647776).
00000800:00004000:4.0:1433777204.396989:0:5507:0:(gnilnd_cb.c:3445:kgnilnd_check_fma_send_cq()) conn->gnc_tx_in_use refcount 0
00000800:00004000:4.0:1433777204.396990:0:5507:0:(gnilnd_cb.c:3446:kgnilnd_check_fma_send_cq()) conn ffff8803ea1b3800->27@gni1-- (3)
00000800:00000040:4.0:1433777204.396992:0:5507:0:(gnilnd_cb.c:3328:kgnilnd_check_fma_send_cq()) SMSG send CQ 0 not ready (data 0x140002253101005) processed 1
00000800:00000040:4.0:1433777204.396993:0:5507:0:(gnilnd_cb.c:3481:kgnilnd_check_fma_rcv_cq()) SMSG RX CQ 0 empty data 0x0 processed 0
00000800:00000040:4.0:1433777204.396994:0:5507:0:(gnilnd_cb.c:3170:kgnilnd_check_rdma_cq()) SEND RDMA CQ 0 empty processed 0
00000800:00000040:4.0:1433777204.396995:0:5507:0:(gnilnd_cb.c:3328:kgnilnd_check_fma_send_cq()) SMSG send CQ 0 not ready (data 0xffffffffa03ce1f9) processed 0
00000800:00000040:4.0:1433777204.396996:0:5507:0:(gnilnd_cb.c:3481:kgnilnd_check_fma_rcv_cq()) SMSG RX CQ 0 empty data 0xffffffffa040d230 processed 0
00000800:00000040:4.0:1433777204.396997:0:5507:0:(gnilnd_cb.c:3170:kgnilnd_check_rdma_cq()) SEND RDMA CQ 0 empty processed 0
00000800:00000040:4.0:1433777204.396999:0:5507:0:(gnilnd_cb.c:5080:kgnilnd_scheduler()) scheduling: found_work 0 busy_loops 75
00000800:00000040:4.0:1433777204.397000:0:5506:0:(gnilnd_cb.c:5084:kgnilnd_scheduler()) awake after schedule
00000800:00000040:4.0:1433777204.397001:0:5506:0:(gnilnd_cb.c:3328:kgnilnd_check_fma_send_cq()) SMSG send CQ 0 not ready (data 0x0) processed 0
00000800:00000040:4.0:1433777204.397002:0:5506:0:(gnilnd_cb.c:3481:kgnilnd_check_fma_rcv_cq()) SMSG RX CQ 0 empty data 0x0 processed 0
00000800:00000040:4.0:1433777204.397003:0:5506:0:(gnilnd_cb.c:3170:kgnilnd_check_rdma_cq()) SEND RDMA CQ 0 empty processed 0
00000800:00000040:4.0:1433777204.397004:0:5506:0:(gnilnd_cb.c:5080:kgnilnd_scheduler()) scheduling: found_work 0 busy_loops 76
00000100:00000400:3.0:1433777204.401048:0:5521:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433777092/real 1433777092] req@ffff8803e9ad99c0 x1503423118714204/t0(0) o3->sultan-OST0000-osc-ffff8803f36ce800@10.37.248.69@o2ib1:6/4 lens 608/432 e 4 to 1 dl 1433777204 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
00000100:00000200:3.0:1433777204.401055:0:5521:0:(events.c:97:reply_in_callback()) @@@ type 6, status 0 req@ffff8803e9ad99c0 x1503423118714204/t0(0) o3->sultan-OST0000-osc-ffff8803f36ce800@10.37.248.69@o2ib1:6/4 lens 608/432 e 4 to 1 dl 1433777204 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
00000100:00000200:3.0:1433777204.401060:0:5521:0:(events.c:118:reply_in_callback()) @@@ unlink req@ffff8803e9ad99c0 x1503423118714204/t0(0) o3->sultan-OST0000-osc-ffff8803f36ce800@10.37.248.69@o2ib1:6/4 lens 608/432 e 4 to 1 dl 1433777204 ref 2 fl Rpc:X/2/ffffffff rc 0/-1

So the client is using the gni interface, and OST0000 is using the o2ib interface. It also seems there are network partitions on the connection to OST0000, though I do not know whether the real cause is the network or something else:

00000100:00080000:5.0:1433777204.401753:0:5520:0:(import.c:1170:ptlrpc_connect_interpret()) reconnected to sultan-OST0000_UUID@10.37.248.69@o2ib1 after partition
00000100:00080000:5.0:1433777204.401754:0:5520:0:(import.c:1188:ptlrpc_connect_interpret()) ffff8803ced85000 sultan-OST0000_UUID: changing import state from CONNECTING to RECOVER
00000100:00080000:5.0:1433777204.401756:0:5520:0:(import.c:1488:ptlrpc_import_recovery_state_machine()) ffff8803ced85000 sultan-OST0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:5.0:1433777204.401764:0:5520:0:(import.c:1494:ptlrpc_import_recovery_state_machine()) sultan-OST0000-osc-ffff8803f36ce800: Connection restored to 10.37.248.69@o2ib1 (at 10.37.248.69@o2ib1) |
| Comment by James A Simmons [ 08/Jun/15 ] |
|
If the DNE2 directory gives no problem, then it is an issue with small packets. |
| Comment by James A Simmons [ 10/Jun/15 ] |
|
I tracked down what is causing the bug. Setting map_on_demand=256 for the ko2iblnd driver triggers this bug, which for some reason only shows up with a Cray router. Peter, should I open another ticket that properly defines the problem and close this ticket? |
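For anyone hitting the same thing, the trigger described above is the ko2iblnd module parameter; a typical way it gets set and checked looks like this (the config file name is illustrative):

    # /etc/modprobe.d/ko2iblnd.conf
    options ko2iblnd map_on_demand=256

    # verify the value the running module picked up
    cat /sys/module/ko2iblnd/parameters/map_on_demand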
| Comment by Jian Yu [ 10/Jun/15 ] |
|
Yes, James. Peter suggested to close this one and open a new ticket for tracking the issue you found. Thank you. |
| Comment by Peter Jones [ 11/Jun/15 ] |
|
Yes please! |
| Comment by Andreas Dilger [ 12/Jun/15 ] |
|
James, Di,

[ 1062.791599] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts:
[ 1063.389770] LustreError: 24686:0:(ofd_fs.c:594:ofd_server_data_init()) sultan-OST0000: unsupported read-only filesystem feature(s) 2

The unknown read-only feature appears to be OBD_ROCOMPAT_IDX_IN_IDIF. That feature shouldn't automatically be enabled, and needs active participation from the administrator:

static ssize_t ldiskfs_osd_index_in_idif_seq_write(struct file *file,
		const char *buffer, size_t count, loff_t *off)
{
	LCONSOLE_WARN("%s: OST-index in IDIF has been enabled, "
		      "it cannot be reverted back.\n", osd_name(dev));
	return -EPERM;

James, did you set the index_in_idif feature in /proc, or is there a bug here that needs to be filed? Looking at the code, it doesn't appear that this flag could have been set automatically. |
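If it helps narrow this down, the manual enabling path that handler implies would presumably look something like the following on the OSS; the exact parameter path under osd-ldiskfs is an assumption based on the function name above, and per the warning it is a one-way switch:

    lctl set_param osd-ldiskfs.sultan-OST0000.index_in_idif=1   # assumed parameter path; enabling cannot be reverted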
| Comment by Di Wang [ 12/Jun/15 ] |
|
Andreas: I have already added a 2.5-to-2.8 DNE upgrade test to the DNE patch series. I will try these downgrade steps locally to see if it is easy to reproduce. |
| Comment by Di Wang [ 14/Jun/15 ] |
|
http://review.whamcloud.com/#/c/15275/1 DNE upgrade test cases |