Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
Client is running 2.6.27.45-lustre-1.8.3.ddn3.3. Connectivity is 10GigE
-
3
-
10103
Description
A customer reports that a client loses access to Lustre when the node comes under memory pressure from an errant application.
Lustre starts reporting -113 (No route to host) errors for certain NIDs in the filesystem even though the TCP/IP network is functional. The Lustre errors persist after the memory pressure is relieved. I am currently collecting logs.
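For reference, the -113 status is a negated Linux errno value. The following is a minimal sketch, assuming a Linux host (errno numbering is platform-specific), of how such a status can be decoded; `describe_status` is a hypothetical helper name and is not part of Lustre.

```python
import errno
import os


def describe_status(status: int) -> str:
    """Map a kernel-style status (e.g. -113 or 113) to its errno name and text.

    Assumption: the status uses the errno numbering of the host running this
    script, which only matches the client if both are Linux.
    """
    code = abs(status)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The ticket reports "status 113" in the log and "-113" in the description.
    print(describe_status(-113))
```

On Linux, errno 113 is EHOSTUNREACH, which matches the "No route to host" text in the description.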
From the customer report:
LNet is reporting no-route-to-host for a significant number of OSSs / MDSs (client log attached).
Mar 29 09:23:27 cgp-bigmem kernel: [589295.826095] LustreError: 4980:0:(events.c:66:request_out_callback()) @@@ type 4, status 113 req@ffff881d2e995400 x1363985318437337/t0 o8>lus03-OST0000_UUID@172.17.128.130@tcp:28/4 lens 368/584 e 0 to 1 dl 1301387122 ref 2 fl Rpc:N/0/0 rc 0/0
However, from user space on the client, all of those nodes respond to ping:
cgp-bigmem:/var/log# ping 172.17.128.130
PING 172.17.128.130 (172.17.128.130) 56(84) bytes of data.
64 bytes from 172.17.128.130: icmp_seq=1 ttl=62 time=0.102 ms
64 bytes from 172.17.128.130: icmp_seq=2 ttl=62 time=0.091 ms
64 bytes from 172.17.128.130: icmp_seq=3 ttl=62 time=0.091 ms
64 bytes from 172.17.128.130: icmp_seq=4 ttl=62 time=0.090 ms
However, an LNet ping hangs:
cgp-bigmem:~# lctl ping 172.17.128.130@tcp
From another client, the ping works as expected:
farm2-head1:# lctl ping 172.17.128.130@tcp
12345-0@lo
12345-172.17.128.130@tcp
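The comparison above (ICMP ping succeeds, `lctl ping` hangs) could be scripted so that a hung LNet ping does not block the shell. This is only a sketch: `probe` and `compare_reachability` are hypothetical helper names, the 10 second timeout is an arbitrary assumption, and it assumes `ping` and `lctl` are on the PATH of the client.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the result as 'ok', 'failed', 'hung', or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"  # the binary is not installed on this host
    return "ok" if result.returncode == 0 else "failed"


def compare_reachability(ip, nid, timeout_s=10.0):
    """Probe the same server over ICMP and over LNet."""
    return {
        "icmp": probe(["ping", "-c", "1", ip], timeout_s),
        "lnet": probe(["lctl", "ping", nid], timeout_s),
    }
```

Calling `compare_reachability("172.17.128.130", "172.17.128.130@tcp")` on the affected client would be expected to show the ICMP probe succeeding while the LNet probe is reported as hung, if the behaviour matches the report.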
cgp-bigmem:~# lfs check servers | grep -v active
error: check 'lus01-OST0007-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST0008-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST0009-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST000a-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST000b-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST000c-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST000d-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus01-OST000e-osc-ffff88205bd52000' Resource temporarily unavailable
error: check 'lus02-MDT0000-mdc-ffff8880735ea000' Resource temporarily unavailable
error: check 'lus03-OST0000-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0001-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0002-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0003-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0004-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0005-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0006-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0007-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0008-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0009-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST000a-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST000b-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST000c-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST0019-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus03-OST001a-osc-ffff8840730a1400' Resource temporarily unavailable
error: check 'lus05-OST0010-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0012-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0014-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0016-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0018-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST001a-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST001c-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST000f-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0011-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0013-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0015-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0017-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST0019-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST001b-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus05-OST001d-osc-ffff886070dab800' Resource temporarily unavailable
error: check 'lus04-OST0001-osc-ffff88806e9d8c00' Resource temporarily unavailable
error: check 'lus04-OST0003-osc-ffff88806e9d8c00' Resource temporarily unavailable
error: check 'lus04-OST0005-osc-ffff88806e9d8c00' Resource temporarily unavailable
error: check 'lus04-OST0007-osc-ffff88806e9d8c00' Resource temporarily unavailable
error: check 'lus04-OST0009-osc-ffff88806e9d8c00' Resource temporarily unavailable
error: check 'lus04-OST000b-osc-ffff88806e9d8c00' Resource temporarily unavailable
error: check 'lus04-OST000d-osc-ffff88806e9d8c00' Resource temporarily unavailable
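A long `lfs check servers` listing like the one above is easier to read when grouped by filesystem. The sketch below assumes the line format shown in this ticket (`error: check '<fsname>-<target>-...'`); the regular expression and the `summarize` helper are illustrative and are not part of the Lustre tools.

```python
import re
from collections import defaultdict

# Assumed format, taken from the output in this ticket:
#   error: check 'lus01-OST0007-osc-ffff88205bd52000' Resource temporarily unavailable
LINE_RE = re.compile(
    r"^error: check '(?P<fs>[^-']+)-(?P<target>(?:OST|MDT)[0-9a-fA-F]{4})-"
)


def summarize(lines):
    """Group unavailable targets by filesystem name, preserving input order."""
    by_fs = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed format are skipped
            by_fs[match.group("fs")].append(match.group("target"))
    return dict(by_fs)


if __name__ == "__main__":
    import sys

    # Example: lfs check servers 2>&1 | grep -v active | python3 summarize.py
    for fs, targets in summarize(sys.stdin).items():
        print(f"{fs}: {len(targets)} unavailable ({', '.join(targets)})")
```

Grouping this way makes it quicker to see whether the unavailable targets are concentrated on particular filesystems or servers.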
Attachments
Activity
Reporter | Original: Ashley Pittman [ apittman ] | New: Shuichi Ihara [ ihara ] |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Reopened [ 4 ] | New: Resolved [ 5 ] |
Resolution | Original: Fixed [ 1 ] | |
Status | Original: Resolved [ 5 ] | New: Reopened [ 4 ] |
Comment |
[ To Sebastien Piechurski (I didn't see your comment here, but I did in my mail notification): try bz 21776 attachment 29521 first, which is a port for 1.8.x. ] |
Comment |
[ I tried to apply the b1_8 patch on a 1.8.5 tree. It broke compilation with the following error: CC [M] /usr/src/lustre-1.8.5/lustre/ptlrpc/niobuf.o /usr/src/lustre-1.8.5/lustre/ptlrpc/niobuf.c: In function ‘ptl_send_rpc’: /usr/src/lustre-1.8.5/lustre/ptlrpc/niobuf.c:534: error: label ‘out’ used but not defined From what I understood, this is because I was missing patch in bz 21776 attachment 29316, but this one does not apply as it relies on the presence of libcfs/include/libcfs/libcfs_prim.h which does not exist in b1_8. However, this patch is marked as landed on 1.8.6+, so I don't understand how I could apply |
Attachment | New: kern.log.gz [ 10253 ] |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Reopened [ 4 ] | New: Resolved [ 5 ] |
Resolution | Original: Duplicate [ 3 ] | |
Status | Original: Resolved [ 5 ] | New: Reopened [ 4 ] |
Fix Version/s | New: Lustre 1.8.6 [ 10022 ] | |
Resolution | New: Duplicate [ 3 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Assignee | Original: Robert Read [ rread ] | New: Zhenyu Xu [ bobijam ] |