  Lustre / LU-1642

Clients get disconnected and reconnected during heavy IO immediately after the halt of a blade.


Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0
    • Components: None
    • Severity: 3
    • 4006

    Description

      During Lustre testing yesterday we observed the following behaviour:

      • Halted 4 nodes on a blade.
      • IO-intensive jobs, such as IOR or MPIIO runs, start:

      Jul 17 16:38:15 nid00475 aprun.x[13684]: apid=1177710, Starting, user=20859, batch_id=377847, cmd_line="/usr/bin/aprun.x -n 256 src/C/IOR -a MPIIO -B -b 4096m -t 4096K -k -r -w -e -g -s 1 -i 2 -F -C -o /scratch/weisshorn/fverzell/test5/IORtest-377847 ", num_nodes=64, node_list=64-65,126-129,190-191,702-705,766-769,830-833,894-897,958-961,1022-1025,1086-1089,1150-1153,1214-1217,1278-1281,1294-1295,1342-1345,1406-1409,1470-1473,1534-1535
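
      For reference, the failing workload can be resubmitted with a batch script along the following lines. This is only a sketch: the aprun.x/IOR command line is copied verbatim from the apid 1177710 log entry above, while the scheduler directives (job name, node count, wall time) are illustrative placeholders that will differ per site.

        #!/bin/bash
        # Sketch of resubmitting the failing IOR run on the Cray XE under SLURM/ALPS.
        # The aprun.x/IOR arguments are taken from the aprun log above; the #SBATCH
        # directives are assumptions and must be adapted to the local configuration.
        #SBATCH --job-name=IORtest
        #SBATCH --nodes=64
        #SBATCH --time=00:30:00

        /usr/bin/aprun.x -n 256 src/C/IOR -a MPIIO -B -b 4096m -t 4096K \
            -k -r -w -e -g -s 1 -i 2 -F -C \
            -o /scratch/weisshorn/fverzell/test5/IORtest-377847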

      • Then, a few minutes later, Lustre starts acting up.

      Lustre server log:

      Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
      Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.73@o2ib2 (0): c: 0, oc: 1, rc: 5

      Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
      Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.78@o2ib2 (0): c: 0, oc: 3, rc: 4

      Jul 17 16:39:59 weisshorn13 kernel: LNet: 3394:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 148.187.7.81@o2ib2
      Jul 17 16:39:59 weisshorn05 kernel: LNetError: 4875:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
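
      The "Timed out tx"/"Timed out RDMA" messages indicate that the o2ib LNet layer could not complete transmits to those peers within its timeout. A quick way to probe this from an OSS is sketched below; the NID is taken from the log above, and the parameter paths assume a stock ko2iblnd module, so they may differ by release.

        # Sketch: probing LNet connectivity after "Timed out RDMA" errors (run on an OSS).
        lctl ping 148.187.7.73@o2ib2                  # LNet-level round trip to the suspect peer
        cat /proc/sys/lustre/timeout                  # obd_timeout on this node
        cat /sys/module/ko2iblnd/parameters/timeout   # o2iblnd transmit timeout, in seconds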

      • Notice the "Bulk IO write error" for nid833, which is part of the job mentioned above, followed by an "inactive thread" message and a stack trace dump:

      Jul 17 16:40:05 weisshorn14 kernel: LustreError: 7929:0:(ldlm_lib.c:2717:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880f07f8b400 x1407748581382025/t0(0) o4->412fabdd-3b3a-df4b-bdc6-264145113d70@833@gni:0/0 lens 448/416 e 0 to 0 dl 1342536406 ref 1 fl Interpret:/0/0 rc 0/0
      Jul 17 16:40:05 weisshorn14 kernel: Lustre: scratch-OST003f: Bulk IO write error with 412fabdd-3b3a-df4b-bdc6-264145113d70 (at 833@gni), client will retry: rc -110

      Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/1), not sending early reply
      Jul 17 16:43:19 weisshorn13 kernel: req@ffff880ddeb81400 x1407748579298657/t0(0) o4->a34c7ab8-980f-db22-6596-e1db30724c4d@12@gni:0/0 lens 448/416 e 1 to 0 dl 1342536204 ref 2 fl Interpret:/0/0 rc 0/0
      Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) Skipped 19 previous similar messages

      Jul 17 16:43:20 weisshorn13 kernel: LNet: Service thread pid 8102 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jul 17 16:43:20 weisshorn13 kernel: Pid: 8102, comm: ll_ost_io_153
      Jul 17 16:43:20 weisshorn13 kernel:
      Jul 17 16:43:20 weisshorn13 kernel: Call Trace:
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107bf8c>] ? lock_timer_base+0x3c/0x70
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff814edc52>] schedule_timeout+0x192/0x2e0
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a65c1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05df0ad>] target_bulk_io+0x38d/0x8b0 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b4c792>] ost_brw_write+0x1172/0x1380 [ost]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a527b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05d64a0>] ? target_bulk_timeout+0x0/0x80 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b507c4>] ost_handle+0x2764/0x39e0 [ost]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612c83>] ? ptlrpc_update_export_timer+0x1c3/0x360 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06183c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a64ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03b0ef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06193cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
      Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Jul 17 16:43:20 weisshorn13 kernel:
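
      The stack trace shows the ost_io thread parked in target_bulk_io(), i.e. still waiting on the bulk transfer from the halted client when the 600 s service-thread watchdog fires. The settings that bound this wait can be inspected as sketched below; parameter names assume a 2.x server and may vary slightly between versions.

        # Sketch: inspecting the timeouts and OST IO thread state behind the
        # "Service thread ... was inactive for 600.00s" message (run on the OSS).
        lctl get_param timeout at_min at_max          # base and adaptive timeout bounds
        lctl get_param ost.OSS.ost_io.threads_started # number of ll_ost_io threads running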

      • nid833 is then evicted:

      Jul 17 16:47:44 weisshorn14 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 568s: evicting client at 833@gni ns: filter-scratch-OST003f_UUID lock: ffff8808aa9fc480/0x9cc518d034bea3c8 lrc: 3/0,0 mode: PW/PW res: 22186722/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x629daae2e6cf7351 expref: 5 pid: 5811 timeout 4299703474
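
      The evicted peer is reported only by its LNet NID (833@gni). To correlate it with the physical blade that was halted, the NID can be mapped back to a Cray cname, for example via the attached nid2CrayMapping.txt or xtprocadmin on the SMW/boot node; the grep pattern below is a guess at the mapping file's layout.

        # Sketch: mapping the evicted client 833@gni back to its Cray node/blade.
        grep -w 'nid00833' nid2CrayMapping.txt        # attached mapping file; column layout assumed
        xtprocadmin | grep -w '833'                   # on the SMW/boot node, lists nid-to-cname mappings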

      On SMW console log:

      [2012-07-17 16:48:07][c7-0c1s0n1]Lustre: 9196:0:(client.c:1492:ptlrpc_expire_one_request()) @@@ Request x1407748581382025 sent from scratch-OST003f-osc-ffff88041e142400 to NID 148.187.7.114@o2ib2 471s ago has timed out (471s prior to deadline).
      [2012-07-17 16:48:07][c7-0c1s0n1] req@ffff8801bea18800 x1407748581382025/t0 o4->scratch-OST003f_UUID@148.187.7.114@o2ib2:6/4 lens 448/608 e 0 to 1 dl 1342536485 ref 2 fl Rpc:/0/0 rc 0/0
      [2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection to service scratch-OST003f via nid 148.187.7.114@o2ib2 was lost; in progress operations using this service will wait for recovery to complete.
      [2012-07-17 16:48:07][c7-0c1s0n1]LustreError: 167-0: This client was evicted by scratch-OST003f; in progress operations using this service will fail.
      [2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Server scratch-OST003f_UUID version (2.2.51.0) is much newer than client version (1.8.6)
      [2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Skipped 72 previous similar messages
      [2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection restored to service scratch-OST003f using nid 148.187.7.114@o2ib2.
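
      The console log also confirms the interop combination in play (2.2.51.0 servers against 1.8.6 clients). If needed, the running versions on both ends can be double-checked as below; the /proc file is present on both 1.8 and 2.x nodes.

        # Sketch: confirming the Lustre versions on a client and a server.
        cat /proc/fs/lustre/version
        lctl get_param version    # equivalent on 2.x nodes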

      The job stalls and is finally killed because it exceeds its CPU time limit:

      slurmd[rosa12]: *** JOB 377847 CANCELLED AT 17:03:36 DUE TO TIME LIMIT ***
      aprun.x: Apid 1177710: Caught signal Terminated, sending to application

      Attached are the log files from the Cray XE machine covering the relevant time range.

      Attachments

        1. weisshorn02.tar.gz
          0.6 kB
        2. weiss02.tar.gz
          2 kB
        3. smw.log
          129 kB
        4. sdb.log
          574 kB
        5. nid2CrayMapping.txt
          89 kB

        Activity

          People

            green Oleg Drokin
            fverzell Fabio Verzelloni
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: