Lustre / LU-3998

ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade)

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • 3
    • 10697

    Description

      At the SDSC test cluster, here is what we did:

      All clients and servers were running 1.8.9-wc1, with 1 MDS and 4 OSSes. Each OSS had only 1 OST.

      1. Upgraded all servers to 2.4.1 (actually b2_4 build 40)
      2. Booted the servers and mounted the MDT and OSTs... no issue
      3. Reformatted 3 additional OSTs on each OSS and mounted them
      4. Mounted the 1.8.9-wc1 clients, but they could not access the FS

      The stack dump below appears on all OSSes.

      LNet: Service thread pid 3541 completed after 430.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 5 active RPCs
      Lustre: Skipped 18 previous similar messages
      LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880410b31400 x1446546855247995/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379965478 ref 1 fl Interpret:/0/0 rc 0/0
      LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 32 previous similar messages
      Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
      Lustre: Skipped 31 previous similar messages
      Lustre: rhino-OST0001: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) reconnecting
      Lustre: Skipped 158 previous similar messages
      Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 23 previous similar messages
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880182d8a800 x1446546855249432/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966105 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 63 previous similar messages
      Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
      Lustre: Skipped 64 previous similar messages
      Lustre: rhino-OST0005: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 160 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 2 active RPCs
      Lustre: Skipped 26 previous similar messages
      LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88027a8b8000 x1446547676281180/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966444 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 57 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 57 previous similar messages
      Lustre: rhino-OST000d: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 181 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 26 previous similar messages
      LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88033304dc00 x1446547676282609/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966879 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 44 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 44 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 164 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 28 previous similar messages
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff8804451a4400 x1446547676284352/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379967635 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 76 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 76 previous similar messages
      Lustre: rhino-OST0015: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 191 previous similar messages
      LNet: Service thread pid 3577 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 3577, comm: ll_ost_io01_083
      
      
      Call Trace:
       [<ffffffff81080fec>] ? lock_timer_base+0x3c/0x70
       [<ffffffff8150ef72>] schedule_timeout+0x192/0x2e0
       [<ffffffff81081100>] ? process_timeout+0x0/0x10
       [<ffffffffa03666d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
       [<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
       [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81063330>] ? default_wake_function+0x0/0x20
       [<ffffffffa063f9b8>] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
       [<ffffffffa0cbc364>] ost_brw_write+0x1034/0x15d0 [ost]
       [<ffffffffa06111a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa0cc242b>] ost_handle+0x3ecb/0x48e0 [ost]
       [<ffffffffa03720f4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
       [<ffffffffa06613c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa0377e05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
       [<ffffffffa0658729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
       [<ffffffffa066275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1379967822.3577
      Attached is the lustre-log.
      

      We will upgrade to the 2.4.1 GA release if this issue is fixed in the patches between build #40 and the GA version.

      Attachments

        Activity

          [LU-3998] ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade)
          pjones Peter Jones added a comment -

          SDSC have long ago moved to more current versions, so I think that this is no longer a concern.


          ashehata Amir Shehata (Inactive) added a comment -

          According to the stack trace:

           [<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]

          This references the following loop in target_bulk_io():

          do {
                  long timeoutl = req->rq_deadline - cfs_time_current_sec();
                  cfs_duration_t timeout = timeoutl <= 0 ?
                                           CFS_TICK : cfs_time_seconds(timeoutl);

                  *lwi = LWI_TIMEOUT_INTERVAL(timeout, cfs_time_seconds(1),
                                              target_bulk_timeout, desc);
                  rc = l_wait_event(desc->bd_waitq,                /* <--- */
                                    !ptlrpc_server_bulk_active(desc) ||
                                    exp->exp_failed ||
                                    exp->exp_abort_active_req, lwi);
                  LASSERT(rc == 0 || rc == -ETIMEDOUT);
                  /* Wait again if we changed deadline. */
          } while ((rc == -ETIMEDOUT) &&
                   (req->rq_deadline > cfs_time_current_sec()));

          After talking with Andreas: the l_wait_event() waits either for the bulk to be delivered to the client or for the client to be evicted. If, as in this case, the bulk is never delivered and the client is not evicted, then under the adaptive timeout algorithm the timeout can grow to a maximum of 900s. So it's quite possible for a thread to appear hung when it is really just waiting on the timeout.
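
          To make the timing concrete, here is a minimal toy simulation of that loop (not Lustre code; the 10s tick and the assumption that the bulk never completes and the client is never evicted are illustrative). It shows how a thread can sit in repeated timed waits well past the 200s watchdog while its deadline, which adaptive timeouts can push out toward the ~900s mentioned above, has not yet expired:

          /* Toy illustration only -- not Lustre code. */
          #include <stdio.h>
          #include <stdbool.h>

          #define MAX_WAIT       900   /* the ~900s maximum mentioned above */
          #define WATCHDOG_LIMIT 200   /* inactivity limit that dumps the stack */

          /* Stand-ins for the real wakeup conditions; in this ticket the bulk
           * never completes and the client is never evicted. */
          static bool bulk_completed(void) { return false; }
          static bool client_evicted(void) { return false; }

          int main(void)
          {
                  long now = 0;
                  long deadline = MAX_WAIT; /* deadline already pushed out */
                  long tick = 10;           /* pretend each timed wait lasts 10s */

                  while (now < deadline && !bulk_completed() && !client_evicted()) {
                          now += tick;      /* another timed wait times out */
                          if (now == WATCHDOG_LIMIT)
                                  printf("%lds: watchdog fires and dumps the stack, "
                                         "but the thread is only waiting\n", now);
                  }
                  printf("%lds: wait loop finally gives up (deadline reached)\n", now);
                  return 0;
          }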

          Moreover, according to Andreas, the server doesn't perform any send retries.

          As far as I understand the setup, it appears that one-way communication is happening: the client is able to send to the server, but the server is not able to send to the client. This can be caused by a misconfigured router. If the routes are ordered such that the "bad router" is first on the server side, it will continuously get picked when sending to the client; on the client side the "good router" might be listed first, so it will always be picked, and thus the client is able to send to the server. So we have asymmetric communication there.

          I believe this explains both aspects of the problem:
          1. The client is not able to mount the FS.
          2. The message that dumps the server stack on the server side after >200s of inactivity.
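
          As a toy illustration of that theory (not LNet code; the router names and the selection logic below are made up), each side simply uses the first router in its own configured order, so the two directions can take different paths:

          /* Toy model only -- the NIDs and data structures are hypothetical. */
          #include <stdio.h>
          #include <stdbool.h>

          struct router {
                  const char *nid;
                  bool        forwards;  /* does this node actually route traffic? */
          };

          /* The server happens to list the broken router first,
           * while the client lists the good one first. */
          static const struct router server_routes[] = { { "badrtr@o2ib",  false },
                                                         { "goodrtr@o2ib", true  } };
          static const struct router client_routes[] = { { "goodrtr@o2ib", true  },
                                                         { "badrtr@o2ib",  false } };

          /* Both routers answer pings, so both look "up"; each peer therefore
           * just keeps using the first entry in its own list. */
          static const struct router *pick(const struct router *routes)
          {
                  return &routes[0];
          }

          int main(void)
          {
                  const struct router *c2s = pick(client_routes);
                  const struct router *s2c = pick(server_routes);

                  printf("client -> server via %s: %s\n", c2s->nid,
                         c2s->forwards ? "delivered" : "dropped");
                  printf("server -> client via %s: %s\n", s2c->nid,
                         s2c->forwards ? "delivered" : "dropped");
                  return 0;
          }

          With that ordering the client-to-server direction works while server-to-client replies are silently dropped, which would match the repeated bulk GET timeouts in the logs above.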


          adilger Andreas Dilger added a comment -

          It would be useful to know whether this thread became "unstuck" after this stack was printed. The stack dump itself is not surprising if there was a failed router (it is just for debugging, as printed at the top of the stack trace):

          The thread might be hung, or it might only be slow and will resume later.

          mdiep Minh Diep added a comment -

          Yes, Amir, I think your assessment is correct. The issue here is why the thread was hung. Shouldn't it have timed out and evicted/resent...?

          ashehata Amir Shehata (Inactive) added a comment - - edited

          The way the current code works, even if a node that is not configured as a router is being used as a router, the clients and servers that list it in their routes will think it is up. Clients and servers can ping the routers at regular intervals, but the ping assumes that if the node is up then the router is up; it does not take into consideration whether the routing service is actually on. So servers and clients will continue using that node, but that node will drop all messages not destined for it.

          I wonder if the server thread is hung or simply waiting for a response (ACK) to the message it just sent. From the attached stack trace it appears that it is waiting on a timeout. I suspect that after the server sends a message it waits for a response, but since the message is dropped at the "bad router", no response is received and the server eventually times out.

          Am I looking at the right place? Or is there another area where it indicates that the server's thread is hung?

          mdiep Minh Diep added a comment -

          Amir,

          After investigating, it turned out that one of the routers was configured as a client. What this means (IMHO) is that the client requests a reconnect, the server sends the reply to the 'bad' router, and the packet gets dropped. What shouldn't be happening is the server's thread hanging.

          pjones Peter Jones added a comment -

          Amir

          Could you please comment on this one?

          Thanks

          Peter

          green Oleg Drokin added a comment -

          So writes take hundreds of seconds? I wonder why?


          People

            ashehata Amir Shehata (Inactive)
            mdiep Minh Diep
            Votes: 0
            Watchers: 8
