Lustre / LU-1274

Client threads block for some time before being evicted and can never reconnect afterward


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.2
    • Affects Version/s: Lustre 2.1.0
    • None
    • Environment: RHEL 6.1 clients with Bull's Advanced Edition suite.
      Client version: lustre-modules-2.1.0-2.6.32_131.0.15.el6.x86_64_Bull.2.204.el6
      Server version: lustre-modules-2.1.0-2.6.32_131.17.1.bl6.Bull.27.0.x86_64_Bull.2.207.bl6
    • 3
    • 4602

    Description

      We have a new customer hitting something very similar to LU-816.
      The difference is they are running Lustre 2.1.

      The sequence of the problem is as follows (times are taken from the logs):
      1/ The client reports a process blocked for more than 120 seconds with the following trace (no kernel or Lustre messages in the preceding 40 minutes):
      (thread blocked for 2 minutes reported at 13:43:27, so blocked since 13:41:27)

      INFO: task ls:12058 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ls D 0000000000000008 0 12058 11471 0x00000084
      ffff8805ee2dfa58 0000000000000086 ffff8805ee2dfa68 ffff880bbc28e600
      ffff8805ee2dfd08 ffff880616207800 ffff8805ee2dfad8 ffff880620230a38
      ffff880604529038 ffff8805ee2dffd8 000000000000f598 ffff880604529038
      Call Trace:
      [<ffffffff814dc84e>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814dc6eb>] mutex_lock+0x2b/0x50
      [<ffffffffa068c7b8>] cl_lock_mutex_get+0x78/0xd0 [obdclass]
      [<ffffffffa068eb4a>] cl_lock_hold_mutex+0xca/0x710 [obdclass]
      [<ffffffffa09dff73>] ? lov_io_init_raid0+0x4b3/0x920 [lov]
      [<ffffffffa069060e>] cl_lock_request+0x5e/0x1f0 [obdclass]
      [<ffffffffa09d6bf9>] ? lov_io_init+0x99/0x120 [lov]
      [<ffffffffa0a73ab0>] cl_glimpse_lock+0x180/0x390 [lustre]
      [<ffffffffa0a74274>] cl_glimpse_size+0x184/0x190 [lustre]
      [<ffffffffa0a2a28f>] ll_inode_revalidate_it+0x6f/0x1f0 [lustre]
      [<ffffffff81181804>] ? do_path_lookup+0x94/0xa0
      [<ffffffff8117efc5>] ? putname+0x45/0x50
      [<ffffffffa0a2a459>] ll_getattr_it+0x49/0x170 [lustre]
      [<ffffffffa0a2a5b7>] ll_getattr+0x37/0x40 [lustre]
      [<ffffffff81177c11>] vfs_getattr+0x51/0x80
      [<ffffffffa0a1572e>] ? ll_ddelete+0x5e/0x300 [lustre]
      [<ffffffff81177ca0>] vfs_fstatat+0x60/0x80
      [<ffffffff81177d2e>] vfs_lstat+0x1e/0x20
      [<ffffffff81177d54>] sys_newlstat+0x24/0x50
      [<ffffffff810d1b62>] ? audit_syscall_entry+0x272/0x2a0
      [<ffffffff8100c635>] ? math_state_restore+0x45/0x60
      [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
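
      For reference, here is a minimal, hedged sketch (illustrative names, not kernel or Lustre code) of the hung-task check that produces the message above: khungtaskd reports any task that has stayed in uninterruptible sleep ("D" state, e.g. sleeping inside mutex_lock() as the "ls" trace shows) for longer than /proc/sys/kernel/hung_task_timeout_secs, 120 seconds by default, which matches the report at 13:43:27 for a task blocked since 13:41:27.

      #include <stdio.h>

      struct task_sketch {
              const char *comm;          /* command name, e.g. "ls"         */
              int         pid;
              char        state;         /* 'D' = uninterruptible sleep     */
              long        blocked_secs;  /* time spent in that state        */
      };

      static void hung_task_check(const struct task_sketch *t, long timeout_secs)
      {
              if (t->state == 'D' && t->blocked_secs > timeout_secs)
                      printf("INFO: task %s:%d blocked for more than %ld seconds.\n",
                             t->comm, t->pid, timeout_secs);
      }

      int main(void)
      {
              struct task_sketch ls = { "ls", 12058, 'D', 126 };

              hung_task_check(&ls, 120);  /* default hung_task_timeout_secs */
              return 0;
      }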

      2/ A thread on the server side is blocked and is reported 200 seconds later with the following trace:
      (thread reported blocked for 200s at 13:45:42, so blocked since 13:42:22)

      Lustre: Service thread pid 24618 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 24618, comm: ll_ost_425

      Call Trace:
      [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
      [<ffffffff8147eab0>] schedule_timeout+0x190/0x2d0
      [<ffffffff81065900>] ? process_timeout+0x0/0x10
      [<ffffffffa0595731>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa0747453>] ptlrpc_set_wait+0x323/0x730 [ptlrpc]
      [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
      [<ffffffffa0750e84>] ? lustre_msg_set_status+0x94/0x110 [ptlrpc]
      [<ffffffffa0747916>] ptlrpc_queue_wait+0xb6/0x290 [ptlrpc]
      [<ffffffffa02a85ec>] ? lprocfs_counter_add+0x12c/0x170 [lvfs]
      [<ffffffffa072cf3e>] ldlm_server_glimpse_ast+0x10e/0x3d0 [ptlrpc]
      [<ffffffffa081130d>] filter_intent_policy+0x35d/0x710 [obdfilter]
      [<ffffffffa05a4805>] ? cfs_hash_bd_lookup_intent+0xe5/0x130 [libcfs]
      [<ffffffffa070bb8a>] ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
      [<ffffffffa072a485>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
      [<ffffffffa05a3872>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      [<ffffffffa0732407>] ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
      [<ffffffffa0754240>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [<ffffffffa07330b6>] ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
      [<ffffffffa07330c0>] ? ldlm_server_completion_ast+0x0/0x680 [ptlrpc]
      [<ffffffffa06dc870>] ? ost_blocking_ast+0x0/0xe40 [ost]
      [<ffffffffa072ce30>] ? ldlm_server_glimpse_ast+0x0/0x3d0 [ptlrpc]
      [<ffffffffa06e4c6a>] ost_handle+0x22ba/0x4b90 [ost]
      [<ffffffff8104c792>] ? default_wake_function+0x12/0x20
      [<ffffffff8103b9b9>] ? __wake_up_common+0x59/0x90
      [<ffffffffa0750444>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa0761459>] ptlrpc_main+0xc79/0x19d0 [ptlrpc]
      [<ffffffff810017ac>] ? __switch_to+0x1ac/0x320
      [<ffffffffa07607e0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa07607e0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20

      3/ The client reports another thread, blocked since 13:43:27 (reported at 13:45:27):

      kernel INFO: task ldlm_cb_05:9780 blocked for more than 120 seconds.
      kernel "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel ldlm_cb_05 D 0000000000000006 0 9780 2 0x00000080
      ffff880604557b90 0000000000000046 ffff880604557b58 ffff880604557b54
      ffff880604557b50 ffff88063fc24f00 ffff880655435f80 0000000100610371
      ffff88060ac5e6b8 ffff880604557fd8 000000000000f598 ffff88060ac5e6b8
      Call Trace:
      [<ffffffff814dc84e>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814dc6eb>] mutex_lock+0x2b/0x50
      [<ffffffffa068c7b8>] cl_lock_mutex_get+0x78/0xd0 [obdclass]
      [<ffffffffa097d91e>] osc_ldlm_glimpse_ast+0x7e/0x160 [osc]
      [<ffffffffa076e304>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa074d6bb>] ldlm_callback_handler+0x142b/0x21c0 [ptlrpc]
      [<ffffffffa07709ac>] ? lustre_msg_get_transno+0x7c/0xe0 [ptlrpc]
      [<ffffffffa076e304>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa077ef59>] ptlrpc_main+0xbb9/0x1990 [ptlrpc]
      [<ffffffffa077e3a0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
      [<ffffffff8100c1ca>] child_rip+0xa/0x20
      [<ffffffffa077e3a0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
      [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

      4/ The two threads on the client stay blocked (and are reported every 2 minutes) until the server finally evicts the client for the first time, at 13:49:38:

      LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 155s: evicting client at 172.17.1.230@o2ib ns: filter-scratch-OST001c_UUID lock: ffff8800774e0d80/0xaa844a441355c6d7 lrc: 3/0,0 mode: PW/PW res: 3545740/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x49baa5ebdb4811c0 expref: 15 pid: 5451 timeout 4476555308
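
      As a hedged illustration of step 4 (names are made up, not the actual ldlm_lockd.c code): the server keeps a deadline for every lock with an outstanding callback, and a periodic timer evicts the client owning any lock whose deadline has passed, which is what the "lock callback timer expired after 155s" message above reflects.

      #include <stdio.h>
      #include <time.h>

      struct waiting_lock {
              time_t               wl_deadline;  /* client must have replied by then */
              struct waiting_lock *wl_next;
      };

      /* In Lustre, the equivalent scan runs from the server's waiting-locks timer. */
      static void check_waiting_locks(struct waiting_lock *list, time_t now)
      {
              for (struct waiting_lock *wl = list; wl != NULL; wl = wl->wl_next)
                      if (now > wl->wl_deadline)
                              printf("callback deadline passed %ld s ago: evicting client\n",
                                     (long)(now - wl->wl_deadline));
      }

      int main(void)
      {
              time_t now = time(NULL);
              /* one lock whose callback deadline passed 155 s ago, as in the log */
              struct waiting_lock stale = { now - 155, NULL };

              check_waiting_locks(&stale, now);
              return 0;
      }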

      5/ The client-side "ls" thread unblocks (no more watchdog reports) with the message:

      1333111826 2012 Mar 30 13:50:26 willowb11 kern err kernel LustreError: 11982:0:(cl_io.c:1700:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 14 remaining pages
      1333111826 2012 Mar 30 13:50:26 willowb11 kern err kernel LustreError: 11982:0:(cl_io.c:965:cl_io_cancel()) Canceling ongoing page trasmission

      But the ldlm_cb_02 thread is still blocked at that point.

      6/ The server evicts the client a second time, at 13:53:34:

      Lustre: scratch-OST001c: haven't heard from client e33a9e90-14c3-9556-290a-e2f09abad36a (at 172.17.1.230@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8805ebfa1400, cur 1333112014 expire 1333111864 last 1333111787
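
      The timestamps in the step 6 message are consistent with a simple idle-export check; the sketch below (illustrative field names, not the actual Lustre eviction code) just redoes the arithmetic with the values from the log: cur - last = 1333112014 - 1333111787 = 227 seconds, and cur is past expire, so the export is considered dead.

      #include <stdbool.h>
      #include <stdio.h>

      static bool export_is_dead(long cur, long expire, long last)
      {
              if (cur > expire) {
                      printf("haven't heard from client in %ld seconds; evicting\n",
                             cur - last);
                      return true;
              }
              return false;
      }

      int main(void)
      {
              /* cur, expire and last as printed in the eviction message above */
              return export_is_dead(1333112014, 1333111864, 1333111787) ? 0 : 1;
      }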

      7/ At that point, no more threads are blocked, but the client can no longer reconnect to the OST and gets these messages:

      LustreError: 167-0: This client was evicted by scratch-OST0023; in progress operations using this service will fail.
      LustreError: 2829:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805f7714000 x1397861693879877/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2830:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e483d400 x1397861693879881/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2830:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e4beec00 x1397861693879901/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2830:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      LustreError: 11-0: an error occurred while communicating with 172.17.0.7@o2ib. The ldlm_enqueue operation failed with -107
      Lustre: scratch-OST001c-osc-ffff880620978000: Connection to service scratch-OST001c via nid 172.17.0.7@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      LustreError: 167-0: This client was evicted by scratch-OST001c; in progress operations using this service will fail.
      LustreError: 2831:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e5af9800 x1397861693879966/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2831:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
      Lustre: 9014:0:(cl_lock.c:2025:cl_lock_page_out()) Writing 1 pages error: -108
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805da30f400 x1397861693880082/t0(0) o-1->scratch-OST001c_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 17 previous similar messages
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805d9f09400 x1397861693880324/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805ec787c00 x1397861693881645/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 14 previous similar messages
      LustreError: 2833:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8806228ff000 x1397861693882624/t0(0) o-1->scratch-OST001c_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2833:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 13 previous similar messages
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e4a01c00 x1397861693883727/t0(0) o-1->scratch-OST001c_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 14 previous similar messages
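
      The repeated IMP_INVALID lines above correspond to the client-side pattern sketched below (a hedged sketch with illustrative names, not the actual client.c:ptlrpc_import_delay_req() code): while an import is flagged invalid after an eviction, every new request to that target is failed immediately instead of being sent, and only a successful reconnect clears the flag, which never happens in this incident.

      #include <errno.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct import_sketch {            /* stand-in for a client-side import       */
              bool imp_invalid;         /* set on eviction, cleared on reconnect   */
      };

      /* Returns 0 if a request may be sent, or a negative errno to fail it at once. */
      static int delay_or_fail_request(const struct import_sketch *imp)
      {
              if (imp->imp_invalid)
                      return -EIO;      /* illustrative errno; the logs show -107/-108 */
              return 0;
      }

      int main(void)
      {
              struct import_sketch osc = { .imp_invalid = true };  /* after eviction */

              printf("request rc = %d\n", delay_or_fail_request(&osc));
              return 0;
      }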

      Attached are the full logs of the client and of the server hosting the affected OST, plus a stack trace of all threads on the client once it is in the final state (unable to reconnect).

      The only workaround found is to reboot the client. This is critical, as the problem can occur several times an hour and hits the cluster's login nodes, impacting many users at once.

People

    Assignee: Jinshan Xiong (Inactive) (jay)
    Reporter: Sebastien Piechurski (spiechurski)