Lustre / LU-3998

ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade)


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • Severity: 3
    • 10697

    Description

      At the SDSC test cluster, here is what we did:

      All clients and servers were running 1.8.9-wc1, with 1 MDS and 4 OSSs. Each OSS had only 1 OST.

      1. Upgraded all servers to 2.4.1 (actually b2_4 build 40)
      2. Rebooted the servers and mounted the MDT and OSTs... no issue
      3. Reformatted 3 additional OSTs on each OSS and mounted them
      4. Mounted the 1.8.9-wc1 clients, but they could not access the filesystem

      The stack dump below appears on all OSSs.

      LNet: Service thread pid 3541 completed after 430.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 5 active RPCs
      Lustre: Skipped 18 previous similar messages
      LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880410b31400 x1446546855247995/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379965478 ref 1 fl Interpret:/0/0 rc 0/0
      LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 32 previous similar messages
      Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
      Lustre: Skipped 31 previous similar messages
      Lustre: rhino-OST0001: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) reconnecting
      Lustre: Skipped 158 previous similar messages
      Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 23 previous similar messages
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880182d8a800 x1446546855249432/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966105 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 63 previous similar messages
      Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
      Lustre: Skipped 64 previous similar messages
      Lustre: rhino-OST0005: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 160 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 2 active RPCs
      Lustre: Skipped 26 previous similar messages
      LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88027a8b8000 x1446547676281180/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966444 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 57 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 57 previous similar messages
      Lustre: rhino-OST000d: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 181 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 26 previous similar messages
      LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88033304dc00 x1446547676282609/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966879 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 44 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 44 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 164 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 28 previous similar messages
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff8804451a4400 x1446547676284352/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379967635 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 76 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 76 previous similar messages
      Lustre: rhino-OST0015: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 191 previous similar messages
      LNet: Service thread pid 3577 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 3577, comm: ll_ost_io01_083
      
      
      Call Trace:
       [<ffffffff81080fec>] ? lock_timer_base+0x3c/0x70
       [<ffffffff8150ef72>] schedule_timeout+0x192/0x2e0
       [<ffffffff81081100>] ? process_timeout+0x0/0x10
       [<ffffffffa03666d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
       [<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
       [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81063330>] ? default_wake_function+0x0/0x20
       [<ffffffffa063f9b8>] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
       [<ffffffffa0cbc364>] ost_brw_write+0x1034/0x15d0 [ost]
       [<ffffffffa06111a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa0cc242b>] ost_handle+0x3ecb/0x48e0 [ost]
       [<ffffffffa03720f4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
       [<ffffffffa06613c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa0377e05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
       [<ffffffffa0658729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
       [<ffffffffa066275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1379967822.3577
      Attached is the lustre-log.
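
      For reference, the watchdog message and the stack trace above show the ll_ost_io thread parked in cfs_waitq_timedwait() inside target_bulk_io(), waiting for the client to complete the bulk GET. When the 1.8.9 client keeps reconnecting instead of finishing the transfer, that wait eventually ends and the write is failed back to the client with rc -110, which corresponds to -ETIMEDOUT on Linux. The userspace sketch below only illustrates this wait-with-timeout pattern; it is not the Lustre code, and the names bulk_desc and wait_for_bulk are made up for the example.

      /*
       * Minimal userspace sketch of the wait-with-timeout pattern seen in the
       * stack trace (target_bulk_io -> cfs_waitq_timedwait). Not Lustre code;
       * bulk_desc and wait_for_bulk are hypothetical names.
       */
      #include <errno.h>
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      struct bulk_desc {
              pthread_mutex_t lock;
              pthread_cond_t  done_cv;
              bool            done;   /* set once the client finishes the bulk GET */
      };

      /* Wait up to timeout_sec for the bulk transfer to complete. */
      static int wait_for_bulk(struct bulk_desc *desc, int timeout_sec)
      {
              struct timespec deadline;
              int rc = 0;

              clock_gettime(CLOCK_REALTIME, &deadline);
              deadline.tv_sec += timeout_sec;

              pthread_mutex_lock(&desc->lock);
              while (!desc->done && rc == 0)
                      rc = pthread_cond_timedwait(&desc->done_cv, &desc->lock,
                                                  &deadline);
              pthread_mutex_unlock(&desc->lock);

              /* On Linux, -ETIMEDOUT is -110, matching "client will retry: rc -110". */
              return desc->done ? 0 : -ETIMEDOUT;
      }

      int main(void)
      {
              struct bulk_desc desc = {
                      .lock    = PTHREAD_MUTEX_INITIALIZER,
                      .done_cv = PTHREAD_COND_INITIALIZER,
                      .done    = false,
              };

              /* Nothing ever marks the bulk done, mimicking a client that is
               * stuck reconnecting, so the service thread times out. */
              int rc = wait_for_bulk(&desc, 2);
              printf("bulk write finished with rc %d\n", rc);
              return 0;
      }

      Built with cc -pthread, this prints rc -110 after the short wait, the same error the OSS threads report while the client keeps reconnecting.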
      

      We will upgrade to the 2.4.1 GA if this issue is fixed in the patches between build #40 and the GA release.

      Attachments

          People

            Assignee: Amir Shehata (Inactive)
            Reporter: Minh Diep
            Votes: 0
            Watchers: 8
