Details
- Type: Task
- Resolution: Not a Bug
- Priority: Major
- Labels: None
- Affects Version/s: Lustre 2.4.0
- 11227
Description
We are working on getting our Lustre servers upgraded from 1.8.7 to Lustre 2.4.1 with data in place.
Attachments
Issue Links
- is related to: LU-4761 console quota message spew (Resolved)
Activity
Minh, can we escalate this? After our work last night it seems that this is highly reproducible. Also, I don't see it as an LNET routing issue, since the other Lustre file system we tested on these clients performed as expected.
An update: we can reproduce the failure from a single host attached via Ethernet using the TCP LNET driver. This eliminates the routers as a direct cause.
Here are Mahidhar's notes on what he observed.
The IOR runs are triggering a very consistent set of errors (I did a few runs and the same set of things happened on multiple nodes). Note that the node recovers automatically after about 10-15 minutes; during that time we can't kill the IOR process, even as root. It cleans up when the node recovers.

#1 As soon as the run starts, if a node has problems (this is pretty quick, within a few seconds of the job start):

Mar 12 00:08:07 gcn-2-42 kernel: LustreError: 4870:0:(osc_cache.c:1555:osc_enter_cache()) monkey-OST006c-osc-ffff881022873400: grant { dirty: 2097152/33554432 dirty_pages: 512/8519680 dropped: 0 avail: 0, reserved: 0, flight: 2 } lru {in list: 0, left: 513, waiters: 0 }try to reserve 4096.
Mar 12 00:08:07 gcn-2-42 kernel: LustreError: 4870:0:(osc_cache.c:1160:osc_extent_tree_dump0()) Dump object ffff8807cfdb0de8 extents at osc_enter_cache:1556, mppr: 256.
Mar 12 00:08:07 gcn-2-42 kernel: LustreError: 4870:0:(osc_cache.c:1165:osc_extent_tree_dump0()) extent ffff8807cfde4870@{[0 -> 255/255], [2|0|+|rpc|wiY|ffff8807cfdb0de8], [1048576|256|+|-|ffff8807cfd72eb8|256|ffff88102f3df540]} in tree 1.
Mar 12 00:08:07 gcn-2-42 kernel: LustreError: 4870:0:(osc_cache.c:1165:osc_extent_tree_dump0()) extent ffff8807cfde4918@{[256 -> 511/511], [2|0|+|rpc|wiY|ffff8807cfdb0de8], [1048576|256|+|-|ffff8807cfd72eb8|256|ffff88102f3de080]} in tree 2.

#2 This then triggers timeouts after about 8-10 minutes:

Mar 12 00:16:30 gcn-2-42 kernel: Lustre: 3387:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394607487/real 1394607487] req@ffff88101f863400 x1462348057527252/t0(0) o4->monkey-OST006c-osc-ffff881022873400@172.25.33.240@tcp:6/4 lens 488/448 e 5 to 1 dl 1394608590 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Mar 12 00:16:30 gcn-2-42 kernel: Lustre: monkey-OST006c-osc-ffff881022873400: Connection to monkey-OST006c (at 172.25.33.240@tcp) was lost; in progress operations using this service will wait for recovery to complete
Mar 12 00:16:30 gcn-2-42 kernel: LustreError: 11-0: monkey-OST006c-osc-ffff881022873400: Communicating with 172.25.33.240@tcp, operation ost_connect failed with -16.
Mar 12 00:16:55 gcn-2-42 kernel: Lustre: monkey-OST006c-osc-ffff881022873400: Connection restored to monkey-OST006c (at 172.25.33.240@tcp)
Mar 12 00:27:27 gcn-2-42 kernel: Lustre: 3392:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394608615/real 1394608615] req@ffff88101f863000 x1462348057562536/t0(0) o4->monkey-OST006c-osc-ffff881022873400@172.25.33.240@tcp:6/4 lens 488/448 e 0 to 1 dl 1394609247 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
... goes on for some time ...

#3 Finally the node recovers and I see the following error:

Mar 12 00:39:02 gcn-2-42 kernel: LustreError: 4870:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 512 of inode ffff8807d03ecab8 failed -5

-Mahidhar
The clients are either direct attached via Ethernet, or behind LNET routers bridging between Ethernet and InfiniBand. The client host shown in the example is on Ethernet, and uses the following lnet.conf:
options lnet networks=tcp(bond0)
The compute nodes on InfiniBand use:
options lnet networks="o2ib(ib1)" routes="tcp 1 10.7.104.54@o2ib; tcp 2 10.7.104.[54-101]@o2ib" dead_router_check_interval=60 live_router_check_interval=60
The preferred router is listed first with hop count 1, followed by the full group of routers with hop count 2; LNET prefers routes with the lower hop count.
The routers use:
options lnet networks="tcp(bond0),o2ib(ib1)" forwarding="enabled"
The Lustre servers use:
options lnet networks="tcp(bond0)" routes="o2ib 192.168.230.[63-78]@tcp 192.168.230.[95-126]@tcp" dead_router_check_interval=60 live_router_check_interval=60
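The routes= strings above all follow the same pattern: a remote network, an optional hop count, and one or more gateway NIDs that must sit on a network the node is locally attached to. As a quick illustration of that constraint, here is a hypothetical sanity-check function (not part of this ticket, pure string processing) that flags a gateway whose network is not local:

```shell
# Hypothetical helper: verify that every gateway in an lnet routes= string
# is on a locally configured network, since LNET can only reach gateways
# through a local network interface.
check_lnet_routes() {
    local_nets="$1"    # networks this node is on, e.g. "o2ib"
    routes="$2"        # e.g. "tcp 1 10.7.104.54@o2ib; tcp 2 10.7.104.[54-101]@o2ib"
    set -f             # NID ranges like [54-101] must not glob-expand
    echo "$routes" | tr ';' '\n' | while read -r net hop gws; do
        [ -n "$net" ] || continue
        for gw in $gws; do
            gw_net="${gw##*@}"   # network part of the NID, after the @
            case " $local_nets " in
                *" $gw_net "*) echo "ok: $net via $gw (hop $hop)" ;;
                *)             echo "BAD: gateway $gw is on $gw_net, which is not local" ;;
            esac
        done
    done
    set +f
}
```

For example, run against the compute-node configuration above: check_lnet_routes "o2ib" "tcp 1 10.7.104.54@o2ib; tcp 2 10.7.104.[54-101]@o2ib".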
Do we know the network topology? Are there any routers between the client and the servers? Can we get the LNET configuration for the clients, routers, and servers?
The LNET failure also reproduces between the MDS and an OSS.
This was spotted by seeing a message about a slow reply:
[root@monkey-mds-10-3 ~]# dmesg
Lustre: 11080:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394570829/real 1394570829] req@ffff880638534000 x1462310351923796/t0(0) o8->monkey-OST0065-osc@172.25.32.111@tcp:28/4 lens 400/544 e 0 to 1 dl 1394570885 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 11080:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 34 previous similar messages
Lustre: MGS: Client 3d8c91ee-addb-8c84-31d6-137714d50e7e (at 172.25.32.111@tcp) reconnecting
Lustre: Skipped 3 previous similar messages
[root@monkey-mds-10-3 ~]# lctl ping 172.25.32.111@tcp
failed to ping 172.25.32.111@tcp: Input/output error
[root@monkey-mds-10-3 ~]# ping -c 4 172.25.32.111
PING 172.25.32.111 (172.25.32.111) 56(84) bytes of data.
64 bytes from 172.25.32.111: icmp_seq=1 ttl=64 time=0.223 ms
64 bytes from 172.25.32.111: icmp_seq=2 ttl=64 time=0.124 ms
64 bytes from 172.25.32.111: icmp_seq=3 ttl=64 time=0.146 ms
64 bytes from 172.25.32.111: icmp_seq=4 ttl=64 time=0.151 ms
--- 172.25.32.111 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.124/0.161/0.223/0.037 ms
[root@monkey-mds-10-3 ~]#
It also shows up as packets stuck in the TCP send queue (the Send-Q column below). All of these servers use the TCP LNET driver.
[root@monkey-mds-10-3 ~]# !181
netstat -at | grep ^tcp | grep -v '^tcp 0 0 '
tcp 0 1984 money-mds-10-3-10g.lus:1023 money-oss-2-1-10g.lustr:988 ESTABLISHED
[root@monkey-mds-10-3 ~]# netstat -at | grep ^tcp | grep -v '^tcp 0 0 '
tcp 0 2344 money-mds-10-3-10g.lus:1023 money-oss-2-1-10g.lustr:988 ESTABLISHED
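A Send-Q that stays nonzero across repeated samples, as above, means the kernel has accepted bytes that the peer never acknowledges. A hypothetical one-line filter (names are ours, not from the ticket) that pulls out exactly those sockets from netstat -at-style output:

```shell
# Hypothetical helper: print TCP sockets with a nonzero Send-Q (column 3
# of `netstat -at` output: Proto Recv-Q Send-Q Local Foreign State).
flag_stuck_sendq() {
    awk '$1 == "tcp" && $3 > 0 { print "queued:", $3, "bytes on", $4, "->", $5 }'
}
```

On a live host one would run something like netstat -at | flag_stuck_sendq, sampling a few times to see whether the queue drains.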
On one of our file systems, we're seeing two problems. One is repeated messages about not being able to clean up orphans. Note that this is after a shutdown of all clients, a reset of the Lustre servers including a lustre_rmmod, and a mount of the MDT with the abort_recov option.
Mar 11 13:13:22 monkey-mds-10-3 kernel: LustreError: 11286:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) monkey-OST0025-osc: cannot cleanup orphans: rc = -11
Mar 11 13:13:22 monkey-mds-10-3 kernel: LustreError: 11286:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) Skipped 3 previous similar messages
Mar 11 13:15:43 monkey-mds-10-3 kernel: LustreError: 11234:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) monkey-OST000b-osc: cannot cleanup orphans: rc = -11
Mar 11 13:15:43 monkey-mds-10-3 kernel: LustreError: 11234:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) Skipped 2 previous similar messages
The second issue is a hanging connection between a client and an OSS, that includes LNET pings failing.
[ux455215@oasis-dm4 temp_project]$ cd turbulence/tars/
[ux455215@oasis-dm4 tars]$ ls
DD0000_1024ppm.tar DD0074_1024ppm.tar DD0098_1024ppm.tar ts_1024ppmDD0110.tar ts_1024ppmDD0134.tar ts_1024ppmDD0154.tar ts_1024ppmDD0178.tar ts_1024ppmDD0202.tar ts_1024ppmDD0226.tar
DD0060_1024ppm.tar DD0078_1024ppm.tar DD0102_1024ppm.tar ts_1024ppmDD0114.tar ts_1024ppmDD0138.tar ts_1024ppmDD0158.tar ts_1024ppmDD0182.tar ts_1024ppmDD0206.tar
DD0064_1024ppm.tar DD0082_1024ppm.tar DD0106_1024ppm.tar ts_1024ppmDD0118.tar ts_1024ppmDD0142.tar ts_1024ppmDD0162.tar ts_1024ppmDD0186.tar ts_1024ppmDD0210.tar
DD0068_1024ppm.tar DD0086_1024ppm.tar dsets-to-grab.txt ts_1024ppmDD0122.tar ts_1024ppmDD0146.tar ts_1024ppmDD0166.tar ts_1024ppmDD0190.tar ts_1024ppmDD0214.tar
DD0070_1024ppm.tar DD0090_1024ppm.tar dsets-to-untar.txt ts_1024ppmDD0126.tar ts_1024ppmDD0148.tar ts_1024ppmDD0170.tar ts_1024ppmDD0194.tar ts_1024ppmDD0218.tar
DD0072_1024ppm.tar DD0094_1024ppm.tar dsets-to-untar.txt~ ts_1024ppmDD0130.tar ts_1024ppmDD0150.tar ts_1024ppmDD0174.tar ts_1024ppmDD0198.tar ts_1024ppmDD0222.tar
[ux455215@oasis-dm4 tars]$ ls -l
...hangs...
[root@oasis-dm4 ~]# lctl ping 172.25.33.103@tcp
failed to ping 172.25.33.103@tcp: Input/output error
[root@oasis-dm4 ~]# ping -c 4 172.25.33.103
PING 172.25.33.103 (172.25.33.103) 56(84) bytes of data.
64 bytes from 172.25.33.103: icmp_seq=1 ttl=63 time=0.109 ms
64 bytes from 172.25.33.103: icmp_seq=2 ttl=63 time=0.159 ms
64 bytes from 172.25.33.103: icmp_seq=3 ttl=63 time=0.155 ms
64 bytes from 172.25.33.103: icmp_seq=4 ttl=63 time=0.158 ms

[root@monkey-oss-16-3 ~]# lctl ping 192.168.230.54@tcp
failed to ping 192.168.230.54@tcp: Input/output error
[root@monkey-oss-16-3 ~]# ping -c 4 192.168.230.54
PING 192.168.230.54 (192.168.230.54) 56(84) bytes of data.
64 bytes from 192.168.230.54: icmp_seq=1 ttl=63 time=0.153 ms
64 bytes from 192.168.230.54: icmp_seq=2 ttl=63 time=0.175 ms
64 bytes from 192.168.230.54: icmp_seq=3 ttl=63 time=0.175 ms
64 bytes from 192.168.230.54: icmp_seq=4 ttl=63 time=0.175 ms
dmesg from client:
Lustre: Mounted monkey-client
Lustre: 24401:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394569229/real 1394569229] req@ffff8806189e8c00 x1462311819674200/t0(0) o8->monkey-OST0078-osc-ffff880c3428b800@172.25.33.103@tcp:28/4 lens 400/544 e 0 to 1 dl 1394569234 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 24401:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394569254/real 1394569254] req@ffff8806354eb400 x1462311819674824/t0(0) o8->monkey-OST0078-osc-ffff880c3428b800@172.25.33.103@tcp:28/4 lens 400/544 e 0 to 1 dl 1394569264 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 24401:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Lustre: 24401:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394569279/real 1394569279] req@ffff88063566ac00 x1462311819675332/t0(0) o8->monkey-OST0078-osc-ffff880c3428b800@172.25.33.103@tcp:28/4 lens 400/544 e 0 to 1 dl 1394569294 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 24401:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 11 previous similar messages

[root@oasis-dm4 ~]# lctl dl | grep 0078
123 UP osc monkey-OST0078-osc-ffff880c3428b800 d81c416b-007d-7139-df5b-77b081b31fb4 5
[root@oasis-dm4 ~]# lctl dl -t | grep 0078
123 UP osc monkey-OST0078-osc-ffff880c3428b800 d81c416b-007d-7139-df5b-77b081b31fb4 5 172.25.33.103@tcp
dmesg on OSS:
LDISKFS-fs (sdf): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: 28444:0:(ofd_dev.c:221:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
Lustre: monkey-OST0078: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Lustre: monkey-OST0078: Will be in recovery for at least 2:30, or until 1 client reconnects
Lustre: monkey-OST0078: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Lustre: monkey-OST0078: deleting orphan objects from 0x0:12305142 to 0x0:12305281
/var/log/messages on OSS (172.25.32.253@tcp is the MDS):
Mar 11 13:05:11 monkey-oss-16-3 kernel: LustreError: 137-5: monkey-OST0078_UUID: not available for connect from 172.25.32.253@tcp (no target)
Mar 11 13:05:11 monkey-oss-16-3 kernel: LustreError: Skipped 2 previous similar messages
Mar 11 13:07:25 monkey-oss-16-3 kernel: LustreError: 137-5: monkey-OST0018_UUID: not available for connect from 172.25.32.253@tcp (no target)
Mar 11 13:07:25 monkey-oss-16-3 kernel: LustreError: Skipped 4 previous similar messages
According to Minh, the root cause was that half of the TCP channel bond used jumbo frames (MTU=9000) while the other half used MTU=1500. That caused random packet loss for large I/Os.
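A mismatch like this is easy to catch by comparing the MTU of every member of a bond. Here is a hypothetical check (the function name and the sysfs-root parameter are ours; the parameter exists so the function can be exercised against a mock tree):

```shell
# Hypothetical check for the root cause above: all members of a bond must
# agree on MTU, or frames larger than the smaller MTU are dropped at random.
# $1 = sysfs network root (normally /sys/class/net), $2 = bond device name.
check_bond_mtu() {
    root="$1"; bond="$2"
    first=""; rc=0
    for dev in $(cat "$root/$bond/bonding/slaves"); do
        mtu=$(cat "$root/$dev/mtu")
        echo "$dev mtu=$mtu"
        if [ -z "$first" ]; then
            first="$mtu"
        elif [ "$mtu" != "$first" ]; then
            echo "MISMATCH: $dev mtu=$mtu expected=$first"
            rc=1
        fi
    done
    return $rc
}
# On a live host: check_bond_mtu /sys/class/net bond0
```

Running this on every server and router with a bonded interface would have flagged the 9000/1500 split directly.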