Details
Type: Bug
Resolution: Incomplete
Priority: Critical
Affects Version: Lustre 2.2.0
----------------------------------------------------------------------------------------------------
## MDS HW ##
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40 Gb/s
---
MDT: LSI 5480 Pikes Peak with SLC SSDs
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## OSS HW ##
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40 Gb/s
---
OSTs ---> LSI 7900 SATA Disks
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## Router nodes ##
----------------------------------------------------------------------------------------------------
12 Cray XE6 Service nodes as router nodes - IB 40Gb/s
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## Clients ##
----------------------------------------------------------------------------------------------------
~ 1500 Cray XE6 nodes - Lustre 1.8.6
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## LUSTRE Config ##
----------------------------------------------------------------------------------------------------
1 MDS + 1 failover (MDT on SSD array)
12 OSSs - 6 OSTs per OSS (72 OSTs)
Lustre Servers ---> 2.2.51.0
Lustre Clients ---> 1.8.6 (~1500 nodes) / 2.2.51.0 (~20 nodes)
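For reference, the OST count and running versions listed above can be read back with standard Lustre tools; a minimal sketch, assuming the filesystem is mounted at /scratch on a client (mount point inferred from the job output below):
# On a client: list the OSTs of the mounted filesystem (expected: 72)
lfs osts /scratch
# On a client or server: report the running Lustre version
lctl get_param version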
Description
During Lustre testing yesterday we observed the following behaviour:
- Halted 4 nodes on a blade
- I/O-intensive jobs such as IOR with MPIIO start:
Jul 17 16:38:15 nid00475 aprun.x[13684]: apid=1177710, Starting, user=20859, batch_id=377847, cmd_line="/usr/bin/aprun.x -n 256 src/C/IOR -a MPIIO -B -b 4096m -t 4096K -k -r -w -e -g -s 1 -i 2 -F -C -o /scratch/weisshorn/fverzell/test5/IORtest-377847 ", num_nodes=64, node_list=64-65,126-129,190-191,702-705,766-769,830-833,894-897,958-961,1022-1025,1086-1089,1150-1153,1214-1217,1278-1281,1294-1295,1342-1345,1406-1409,1470-1473,1534-1535
- A few minutes later Lustre starts acting up
Lustre server log:
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.73@o2ib2 (0): c: 0, oc: 1, rc: 5
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.78@o2ib2 (0): c: 0, oc: 3, rc: 4
Jul 17 16:39:59 weisshorn13 kernel: LNet: 3394:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 148.187.7.81@o2ib2
Jul 17 16:39:59 weisshorn05 kernel: LNetError: 4875:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
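When these RDMA timeouts appear, basic LNet reachability of the affected peers can be spot-checked from the servers; a minimal sketch using one of the NIDs from the messages above:
# On a Lustre server: show the local NIDs, then ping the peer that timed out
lctl list_nids
lctl ping 148.187.7.73@o2ib2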
- Notice "IO Bulk write error" for nid833 which part of job mentioned above followed by "inactive thread " then dumptrace:
Jul 17 16:40:05 weisshorn14 kernel: LustreError: 7929:0:(ldlm_lib.c:2717:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880f07f8b400 x1407748581382025/t0(0) o4->412fabdd-3b3a-df4b-bdc6-264145113d70@833@gni:0/0 lens 448/416 e 0 to 0 dl 1342536406 ref 1 fl Interpret:/0/0 rc 0/0
Jul 17 16:40:05 weisshorn14 kernel: Lustre: scratch-OST003f: Bulk IO write error with 412fabdd-3b3a-df4b-bdc6-264145113d70 (at 833@gni), client will retry: rc -110
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/1), not sending early reply
Jul 17 16:43:19 weisshorn13 kernel: req@ffff880ddeb81400 x1407748579298657/t0(0) o4->a34c7ab8-980f-db22-6596-e1db30724c4d@12@gni:0/0 lens 448/416 e 1 to 0 dl 1342536204 ref 2 fl Interpret:/0/0 rc 0/0
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) Skipped 19 previous similar messages
Jul 17 16:43:20 weisshorn13 kernel: LNet: Service thread pid 8102 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 17 16:43:20 weisshorn13 kernel: Pid: 8102, comm: ll_ost_io_153
Jul 17 16:43:20 weisshorn13 kernel:
Jul 17 16:43:20 weisshorn13 kernel: Call Trace:
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107bf8c>] ? lock_timer_base+0x3c/0x70
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff814edc52>] schedule_timeout+0x192/0x2e0
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a65c1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05df0ad>] target_bulk_io+0x38d/0x8b0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b4c792>] ost_brw_write+0x1172/0x1380 [ost]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a527b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05d64a0>] ? target_bulk_timeout+0x0/0x80 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b507c4>] ost_handle+0x2764/0x39e0 [ost]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612c83>] ? ptlrpc_update_export_timer+0x1c3/0x360 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06183c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a64ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03b0ef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06193cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jul 17 16:43:20 weisshorn13 kernel:
- Nid 833 is then evicted:
Jul 17 16:47:44 weisshorn14 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 568s: evicting client at 833@gni ns: filter-scratch-OST003f_UUID lock: ffff8808aa9fc480/0x9cc518d034bea3c8 lrc: 3/0,0 mode: PW/PW res: 22186722/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x629daae2e6cf7351 expref: 5 pid: 5811 timeout 4299703474
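To help triage the 568s lock callback expiry, the timeout settings in effect on the OSS can be dumped with lctl; a minimal sketch (parameter names as used by Lustre 2.x servers):
# On the OSS: base RPC timeout and adaptive-timeout bounds
lctl get_param timeout at_min at_max at_extra
# Adaptive timeout estimates for the OST I/O service
lctl get_param ost.OSS.ost_io.timeouts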
On the SMW console log:
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: 9196:0:(client.c:1492:ptlrpc_expire_one_request()) @@@ Request x1407748581382025 sent from scratch-OST003f-osc-ffff88041e142400 to NID 148.187.7.114@o2ib2 471s ago has timed out (471s prior to deadline).
[2012-07-17 16:48:07][c7-0c1s0n1] req@ffff8801bea18800 x1407748581382025/t0 o4->scratch-OST003f_UUID@148.187.7.114@o2ib2:6/4 lens 448/608 e 0 to 1 dl 1342536485 ref 2 fl Rpc:/0/0 rc 0/0
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection to service scratch-OST003f via nid 148.187.7.114@o2ib2 was lost; in progress operations using this service will wait for recovery to complete.
[2012-07-17 16:48:07][c7-0c1s0n1]LustreError: 167-0: This client was evicted by scratch-OST003f; in progress operations using this service will fail.
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Server scratch-OST003f_UUID version (2.2.51.0) is much newer than client version (1.8.6)
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Skipped 72 previous similar messages
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection restored to service scratch-OST003f using nid 148.187.7.114@o2ib2.
The job is stalled and is finally killed because the CPU time limit is exceeded:
slurmd[rosa12]: *** JOB 377847 CANCELLED AT 17:03:36 DUE TO TIME LIMIT ***
aprun.x: Apid 1177710: Caught signal Terminated, sending to application
Attached is the log file of the Cray XE machine for the relevant time range.
Hi Fabio,
Can you please mount the MDT device as type "ldiskfs", then check which file(s) consumed such a large amount of space?
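A minimal sketch of that check, assuming the MDT is first taken out of service; the device path and mount point are placeholders to be replaced with the actual ones:
# On the MDS, with the MDT stopped (not mounted as type lustre):
mount -t ldiskfs -o ro /dev/<mdt_device> /mnt/mdt
# Find the largest space consumers on the MDT
du -skx /mnt/mdt/* | sort -n | tail
umount /mnt/mdt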