Details
-
Bug
-
Resolution: Cannot Reproduce
-
Critical
-
None
-
Lustre 2.2.0
-
None
-
MDS HW
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s
MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------
OSS HW
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s
OST LSI 7900
----------------------------------------------------------------------------------------------------
Router nodes
-------------------
12 router nodes - IB 40Gb/s
Clients
---------
Cray XE6 - Lustre 1.8.6
1 MDS + 1 failover
12 OSS - 6 OST per OSS
-
3
-
3997
Description
Dear support, during normal usage of the file system we noticed a major drop in I/O performance (e.g. from 5Gb/s to 0.9Gb/s), and the logs contain many messages like these:
May 31 10:17:39 weisshorn06 kernel: Lustre: scratch-OST001d: Client scratch-MDT0000-mdtlov_UUID (at 148.187.7.101@o2ib2) reconnecting
May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(filter.c:2699:filter_connect_internal()) scratch-OST001d: Received MDS connection for group 0
May 31 10:17:39 weisshorn06 kernel: Lustre: scratch-OST001d: received MDS connection from 148.187.7.101@o2ib2
May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880f8b248800 - ffff880edff5b800
May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(llog_net.c:168:llog_receptor_accept()) Skipped 1 previous similar message
May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(filter.c:2555:filter_llog_connect()) scratch-OST001d: Recovery from log 0x140001f/0x0:b3e57b20
May 31 10:21:32 weisshorn01 kernel: LustreError: 7733:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x5ef sub-object on OST idx 49/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9518:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2560 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 4004:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x225b sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 3986:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x1257 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7357:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x76b0 sub-object on OST idx 49/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9539:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x879 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7324:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0xf3ad sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 3988:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2d00 sub-object on OST idx 49/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7746:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x1220 sub-object on OST idx 49/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9640:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x230f sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7328:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x5b4 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9476:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2591 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 4062:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x28f3 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9654:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2251 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7737:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x251c sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7347:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2bd9 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7337:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2560 sub-object on OST idx 49/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7725:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x23d8 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 3908:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x8d2 sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9484:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x57b sub-object on OST idx 12/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 7403:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0xc7e sub-object on OST idx 49/38: rc = -5
May 31 10:21:33 weisshorn01 kernel: LustreError: 9689:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x16c9b sub-object on OST idx 12/38: rc = -5
May 31 10:21:49 weisshorn08 kernel: Lustre: scratch-OST0017: Client scratch-MDT0000-mdtlov_UUID (at 148.187.7.101@o2ib2) reconnecting
May 31 10:21:49 weisshorn08 kernel: Lustre: Skipped 1 previous similar message
May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2699:filter_connect_internal()) scratch-OST0017: Received MDS connection for group 0
May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2699:filter_connect_internal()) Skipped 1 previous similar message
May 31 10:21:49 weisshorn08 kernel: Lustre: scratch-OST0017: received MDS connection from 148.187.7.101@o2ib2
May 31 10:21:49 weisshorn08 kernel: Lustre: Skipped 1 previous similar message
May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880fc7844800 - ffff880fc68b5000
May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(llog_net.c:168:llog_receptor_accept()) Skipped 3 previous similar messages
May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2555:filter_llog_connect()) scratch-OST0017: Recovery from log 0x1400019/0x0:b3e57b1a
May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2555:filter_llog_connect()) Skipped 1 previous similar message
May 31 10:21:49 weisshorn04 kernel: Lustre: scratch-OST0025: Client scratch-MDT0000-mdtlov_UUID (at 148.187.7.101@o2ib2) reconnecting
May 31 10:21:49 weisshorn04 kernel: Lustre: 7284:0:(filter.c:2699:filter_connect_internal()) scratch-OST0025: Received MDS connection for group 0
May 31 10:21:49 weisshorn04 kernel: Lustre: scratch-OST0025: received MDS connection from 148.187.7.101@o2ib2
May 31 10:21:49 weisshorn04 kernel: Lustre: 7284:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff8802d1d4c800 - ffff880fbe6d3800
May 31 10:21:49 weisshorn04 kernel: Lustre: 7284:0:(llog_net.c:168:llog_receptor_accept()) Skipped 1 previous similar message
May 31 10:21:49 weisshorn04 kernel: Lustre: 7603:0:(filter.c:2555:filter_llog_connect()) scratch-OST0025: Recovery from log 0x1400027/0x0:b3e57b28
We are running an IOR benchmark (writing an aggregate file of 512Gb). The first run was fine, but now both LTOP monitoring and the job output show that performance has dropped.
A few users have also reported similar behavior when running hdf5 jobs: the first runs were fine, but the following ones were really slow.
The CPU load on the machine is really low.
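To see at a glance which OSTs the object-creation failures hit, the lov_update_create_set() lines quoted above can be tallied with a short script. This is only a rough sketch: the regular expression is derived from the message format shown in this ticket, and the function names (tally_create_errors, describe_rc) are made up for illustration, not part of any Lustre tool. It also shows that rc = -5 corresponds to errno 5 (EIO).

```python
import errno
import re
from collections import Counter

# Pattern derived from the lov_update_create_set() messages quoted in this
# ticket; other Lustre versions may word the message differently.
CREATE_ERR = re.compile(
    r"error creating fid (?P<fid>0x[0-9a-f]+) sub-object on OST idx "
    r"(?P<idx>\d+)/(?P<count>\d+): rc = (?P<rc>-?\d+)"
)


def tally_create_errors(lines):
    """Count creation failures per (OST index, return code) pair."""
    counts = Counter()
    for line in lines:
        match = CREATE_ERR.search(line)
        if match:
            counts[(int(match.group("idx")), int(match.group("rc")))] += 1
    return counts


def describe_rc(rc):
    """Translate a negative kernel return code into its errno name."""
    return errno.errorcode.get(-rc, "unknown")


# Lines copied from the log excerpt above (one of them is not an error line).
sample = [
    "May 31 10:17:39 weisshorn06 kernel: Lustre: scratch-OST001d: received MDS connection from 148.187.7.101@o2ib2",
    "May 31 10:21:32 weisshorn01 kernel: LustreError: 7733:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x5ef sub-object on OST idx 49/38: rc = -5",
    "May 31 10:21:33 weisshorn01 kernel: LustreError: 9518:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2560 sub-object on OST idx 12/38: rc = -5",
    "May 31 10:21:33 weisshorn01 kernel: LustreError: 4004:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x225b sub-object on OST idx 12/38: rc = -5",
]

if __name__ == "__main__":
    for (idx, rc), n in sorted(tally_create_errors(sample).items()):
        print(f"OST idx {idx}: {n} failure(s), rc = {rc} ({describe_rc(rc)})")
```

To run it over a full log instead of the sample, pass an open file object (for example the syslog file from the MDS) to tally_create_errors().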