Details
Type: Bug
Resolution: Incomplete
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.2.0
Labels: None
----------------------------------------------------------------------------------------------------
## MDS HW ##
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s
---
MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## OSS HW ##
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s
---
OSTs ---> LSI 7900 SATA Disks
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## Router nodes ##
----------------------------------------------------------------------------------------------------
12 Cray XE6 Service nodes as router nodes - IB 40Gb/s
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## Clients ##
----------------------------------------------------------------------------------------------------
~ 1500 Cray XE6 nodes - Lustre 1.8.6
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## LUSTRE Config ##
----------------------------------------------------------------------------------------------------
1 MDS + 1 failover (MDT on SSD array)
12 OSSs - 6 OSTs per OSS (72 OSTs)
Lustre Servers ---> 2.2.51.0
Lustre Clients ---> 1.8.6 (~1500 nodes) / 2.2.51.0 (~20 nodes)
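For completeness, the running server and client builds can be re-confirmed directly on the nodes (a sketch; lctl get_param version is the standard check on 2.x, and 1.8 clients expose the same value under /proc):
# On a Lustre 2.x server or client: report the running Lustre version
lctl get_param version
# On the 1.8.6 compute-node clients, the equivalent proc file
cat /proc/fs/lustre/version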
Description
During Lustre testing yesterday we observed the following behaviour:
- 4 nodes on a blade are halted
- IO-intensive jobs such as IOR or MPIIO start (the aprun log and a simplified form of the IOR command are shown below)
Jul 17 16:38:15 nid00475 aprun.x[13684]: apid=1177710, Starting, user=20859, batch_id=377847, cmd_line="/usr/bin/aprun.x -n 256 src/C/IOR -a MPIIO -B -b 4096m -t 4096K -k -r -w -e -g -s 1 -i 2 -F -C -o /scratch/weisshorn/fverzell/test5/IORtest-377847 ", num_nodes=64, node_list=64-65,126-129,190-191,702-705,766-769,830-833,894-897,958-961,1022-1025,1086-1089,1150-1153,1214-1217,1278-1281,1294-1295,1342-1345,1406-1409,1470-1473,1534-1535
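For reference, the same IOR workload can be approximated outside the batch system (a sketch only: the mpirun launcher, the IOR binary path and the <testdir> placeholder are assumptions; the flags are taken verbatim from the aprun command above):
# 256 MPI ranks, MPIIO API, 4 GiB block / 4 MiB transfer size, write then read,
# file-per-process (-F), task reordering on read-back (-C), 2 iterations
mpirun -np 256 ./IOR -a MPIIO -B -b 4096m -t 4096K -k -r -w -e -g -s 1 -i 2 -F -C -o /scratch/weisshorn/<testdir>/IORtest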
- Then, a few minutes later, Lustre starts acting up.
Lustre server log:
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.73@o2ib2 (0): c: 0, oc: 1, rc: 5
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.78@o2ib2 (0): c: 0, oc: 3, rc: 4
Jul 17 16:39:59 weisshorn13 kernel: LNet: 3394:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 148.187.7.81@o2ib2
Jul 17 16:39:59 weisshorn05 kernel: LNetError: 4875:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
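When these o2iblnd RDMA timeouts start appearing, a quick first check from the affected OSS is peer reachability and the relevant timeout settings (a sketch; the NID is taken from the messages above and the parameter locations assume stock ko2iblnd/ptlrpc naming):
# Is the peer that timed out still reachable over LNet?
lctl ping 148.187.7.73@o2ib2
# Global obd timeout and adaptive timeout bounds on this server
lctl get_param timeout at_min at_max
# LND-level timeout currently loaded for the IB network driver
cat /sys/module/ko2iblnd/parameters/timeout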
- Note the "Bulk IO write error" for nid 833, which is part of the job mentioned above, followed by an "inactive thread" warning and a dump of the stack trace:
Jul 17 16:40:05 weisshorn14 kernel: LustreError: 7929:0:(ldlm_lib.c:2717:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880f07f8b400 x1407748581382025/t0(0) o4->412fabdd-3b3a-df4b-bdc6-264145113d70@833@gni:0/0 lens 448/416 e 0 to 0 dl 1342536406 ref 1 fl Interpret:/0/0 rc 0/0
Jul 17 16:40:05 weisshorn14 kernel: Lustre: scratch-OST003f: Bulk IO write error with 412fabdd-3b3a-df4b-bdc6-264145113d70 (at 833@gni), client will retry: rc -110
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/1), not sending early reply
Jul 17 16:43:19 weisshorn13 kernel: req@ffff880ddeb81400 x1407748579298657/t0(0) o4->a34c7ab8-980f-db22-6596-e1db30724c4d@12@gni:0/0 lens 448/416 e 1 to 0 dl 1342536204 ref 2 fl Interpret:/0/0 rc 0/0
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) Skipped 19 previous similar messages
Jul 17 16:43:20 weisshorn13 kernel: LNet: Service thread pid 8102 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 17 16:43:20 weisshorn13 kernel: Pid: 8102, comm: ll_ost_io_153
Jul 17 16:43:20 weisshorn13 kernel:
Jul 17 16:43:20 weisshorn13 kernel: Call Trace:
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107bf8c>] ? lock_timer_base+0x3c/0x70
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff814edc52>] schedule_timeout+0x192/0x2e0
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a65c1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05df0ad>] target_bulk_io+0x38d/0x8b0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b4c792>] ost_brw_write+0x1172/0x1380 [ost]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a527b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05d64a0>] ? target_bulk_timeout+0x0/0x80 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b507c4>] ost_handle+0x2764/0x39e0 [ost]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612c83>] ? ptlrpc_update_export_timer+0x1c3/0x360 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06183c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a64ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03b0ef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06193cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jul 17 16:43:20 weisshorn13 kernel:
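The 600 s watchdog above fires while ll_ost_io_153 is still parked in target_bulk_io() waiting for a bulk transfer that already timed out at the LNet level. The deadlines involved can be read back on the OSS as follows (a sketch; these are the standard Lustre 2.x tunables, and the watchdog interval follows the adaptive timeout ceiling):
# Adaptive timeout ceiling (default 600 s) that the service thread watchdog follows
lctl get_param at_max
# Base obd timeout used when computing bulk IO deadlines
lctl get_param timeout
# Current adaptive timeout estimates for the ost_io service on this OSS
lctl get_param ost.OSS.ost_io.timeouts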
- Nid 833 is then evicted:
Jul 17 16:47:44 weisshorn14 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 568s: evicting client at 833@gni ns: filter-scratch-OST003f_UUID lock: ffff8808aa9fc480/0x9cc518d034bea3c8 lrc: 3/0,0 mode: PW/PW res: 22186722/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x629daae2e6cf7351 expref: 5 pid: 5811 timeout 4299703474
On the SMW console log:
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: 9196:0:(client.c:1492:ptlrpc_expire_one_request()) @@@ Request x1407748581382025 sent from scratch-OST003f-osc-ffff88041e142400 to NID 148.187.7.114@o2ib2 471s ago has timed out (471s prior to deadline).
[2012-07-17 16:48:07][c7-0c1s0n1] req@ffff8801bea18800 x1407748581382025/t0 o4->scratch-OST003f_UUID@148.187.7.114@o2ib2:6/4 lens 448/608 e 0 to 1 dl 1342536485 ref 2 fl Rpc:/0/0 rc 0/0
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection to service scratch-OST003f via nid 148.187.7.114@o2ib2 was lost; in progress operations using this service will wait for recovery to complete.
[2012-07-17 16:48:07][c7-0c1s0n1]LustreError: 167-0: This client was evicted by scratch-OST003f; in progress operations using this service will fail.
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Server scratch-OST003f_UUID version (2.2.51.0) is much newer than client version (1.8.6)
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Skipped 72 previous similar messages
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection restored to service scratch-OST003f using nid 148.187.7.114@o2ib2.
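On the client side, the eviction followed by the immediate reconnect can be cross-checked against the import state of the OSC named in the log (a sketch; on the 1.8.6 compute nodes the same history is available under /proc/fs/lustre/osc/*/state):
# Connection state history for the OSC that was evicted
lctl get_param osc.scratch-OST003f-osc-*.state
# On 2.x clients, full import details including the last connect/evict events
lctl get_param osc.scratch-OST003f-osc-*.import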
The job stalls and is finally killed because the CPU time limit is exceeded:
slurmd[rosa12]: *** JOB 377847 CANCELLED AT 17:03:36 DUE TO TIME LIMIT ***
aprun.x: Apid 1177710: Caught signal Terminated, sending to application
Attached is the log file from the Cray XE machine for the specific time range.
I've added Fanyong to the CC list. I think that if the MDT size is growing faster than you expected, it's very likely because our OI files grow forever (LU-1512). It's a design defect of IAM; Fanyong has already worked out a patch, but I don't know whether it can be applied to an existing filesystem. Fanyong, could you comment on this?
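To check whether the OI files on this MDT are indeed growing without bound (the LU-1512 symptom mentioned above), their sizes can be inspected directly on the ldiskfs backend, for example from the failover MDS while the target is not mounted as Lustre (a sketch; /dev/<mdt_device> is a placeholder and oi.16.* is the OI naming used by 2.x ldiskfs MDTs):
# Without mounting: list the OI files from the ldiskfs root
debugfs -c -R 'ls -l /' /dev/<mdt_device> | grep 'oi.16'
# Or mount read-only as ldiskfs and check sizes directly
mount -t ldiskfs -o ro /dev/<mdt_device> /mnt/mdt
ls -lh /mnt/mdt/oi.16.*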