Lustre / LU-952

Hung thread with HIGH OSS load

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 1.8.6
    • None
    • Lustre 1.8.5 with OFED 1.5.3 and kernel 2.6.18-238.12.1.el5, at NASA Ames
    • 4
    • 4769

    Description

      We started getting the following error, with the OSS reaching a high load and the filesystem becoming unusable.

      Dec 26 11:52:16 service102 kernel: Lustre: Service thread pid 10832 was inactive for 506.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Dec 26 11:52:16 service102 kernel: Lustre: Skipped 1 previous similar message
      Dec 26 11:52:16 service102 kernel: Pid: 10832, comm: ll_ost_323
      Dec 26 11:52:16 service102 kernel:
      Dec 26 11:52:16 service102 kernel: Call Trace:
      Dec 26 11:52:16 service102 kernel: [<ffffffff887d0d1d>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
      Dec 26 11:52:16 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
      Dec 26 11:52:16 service102 kernel: [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
      Dec 26 11:52:18 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c16a72>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
      Dec 26 11:52:18 service102 kernel: [<ffffffff8002cc0e>] mntput_no_expire+0x19/0x88
      Dec 26 11:52:18 service102 kernel: [<ffffffff887fca00>] push_ctxt+0x370/0x380 [lvfs]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c31a08>] filter_client_add+0x508/0xc30 [obdfilter]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c30de7>] filter_export_stats_init+0x117/0x650 [obdfilter]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c32665>] filter_connect+0x535/0x8c0 [obdfilter]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88936107>] lustre_msg_add_op_flags+0x47/0x120 [ptlrpc]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88bf6500>] ost_handle+0x0/0x55b0 [ost]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88900976>] target_handle_connect+0x21c6/0x2e80 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88931038>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88bf6daf>] ost_handle+0x8af/0x55b0 [ost]
      Dec 26 11:52:19 service102 kernel: [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
      Dec 26 11:52:19 service102 kernel: [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Dec 26 11:52:20 service102 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Dec 26 11:52:20 service102 kernel: [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Dec 26 11:52:20 service102 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Dec 26 11:52:20 service102 kernel:
      Dec 26 11:52:20 service102 kernel: Pid: 11020, comm: ll_ost_511
      Dec 26 11:52:20 service102 kernel:
      Dec 26 11:52:20 service102 kernel: Call Trace:
      Dec 26 11:52:20 service102 kernel: [<ffffffff887d0d1d>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff801632b8>] list_add+0xc/0xe
      Dec 26 11:52:20 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
      Dec 26 11:52:20 service102 kernel: [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
      Dec 26 11:52:20 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88c16a72>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff8002cc0e>] mntput_no_expire+0x19/0x88
      Dec 26 11:52:20 service102 kernel: [<ffffffff887fca00>] push_ctxt+0x370/0x380 [lvfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88c31a08>] filter_client_add+0x508/0xc30 [obdfilter]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88c30de7>] filter_export_stats_init+0x117/0x650 [obdfilter]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88c32665>] filter_connect+0x535/0x8c0 [obdfilter]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88936107>] lustre_msg_add_op_flags+0x47/0x120 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88bf6500>] ost_handle+0x0/0x55b0 [ost]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88900976>] target_handle_connect+0x21c6/0x2e80 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88931038>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88bf6daf>] ost_handle+0x8af/0x55b0 [ost]
      Dec 26 11:52:21 service102 kernel: [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
      Dec 26 11:52:21 service102 kernel: [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Dec 26 11:52:21 service102 kernel: [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Dec 26 11:52:21 service102 kernel:
      Dec 26 11:52:21 service102 kernel: Pid: 10674, comm: ll_ost_165
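
Both complete traces above include filter_connect, filter_client_add, fsfilt_ldiskfs_start, jbd2_journal_start and start_this_handle. When going through the attached messages files, a small script can help tally how many dumped service threads have a given function in their stack. The following is a minimal sketch and not part of the original report: the names `summarize_stacks`, `count_threads_in` and `SAMPLE` are illustrative assumptions, and the regular expressions assume the syslog format shown above.

```python
import re

# Matches the "Pid: N, comm: NAME" header that precedes each dumped stack.
PID_RE = re.compile(r"kernel: Pid: (\d+), comm: (\S+)")
# Matches one frame, e.g. "[<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (\S+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Short excerpt in the same format as the log above (assumed input shape).
SAMPLE = """\
Dec 26 11:52:16 service102 kernel: Pid: 10832, comm: ll_ost_323
Dec 26 11:52:16 service102 kernel: Call Trace:
Dec 26 11:52:16 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
Dec 26 11:52:18 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
Dec 26 11:52:20 service102 kernel: Pid: 11020, comm: ll_ost_511
Dec 26 11:52:20 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
"""


def summarize_stacks(lines):
    """Return {(pid, comm): [function names]} for each dumped stack trace."""
    stacks = {}
    current = None
    for line in lines:
        header = PID_RE.search(line)
        if header:
            # A new "Pid:" header starts a new stack.
            current = (int(header.group(1)), header.group(2))
            stacks[current] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            stacks[current].append(frame.group(1))
    return stacks


def count_threads_in(stacks, func):
    """Count the dumped threads whose stack contains the given function."""
    return sum(1 for frames in stacks.values() if func in frames)


if __name__ == "__main__":
    stacks = summarize_stacks(SAMPLE.splitlines())
    for (pid, comm), frames in stacks.items():
        print(pid, comm, len(frames), "frames")
    print("threads in start_this_handle:",
          count_threads_in(stacks, "start_this_handle"))
```

To run it against a real log, replace `SAMPLE.splitlines()` with the lines of one of the attached messages files. Frames in these kernel traces can include stale stack entries, so the counts are only a rough guide.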

      Attachments

        1. service103.messages
          925 kB
        2. service102.messages
          1.18 MB
        3. messages.service113.gz
          31 kB

            People

              Assignee: Niu Yawei (niu, Inactive)
              Reporter: Mahmoud Hanafi (mhanafi)