  1. Lustre
  2. LU-952

Hung thread with HIGH OSS load

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 1.8.6
    • None
    • Lustre 1.8.5 with OFED 1.5.3 and kernel 2.6.18-238.12.1.el5, at NASA Ames
    • 4
    • 4769

    Description

      We started getting the following error, with the OSS reaching a high load and the filesystem becoming unusable.

      Dec 26 11:52:16 service102 kernel: Lustre: Service thread pid 10832 was inactive for 506.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Dec 26 11:52:16 service102 kernel: Lustre: Skipped 1 previous similar message
      Dec 26 11:52:16 service102 kernel: Pid: 10832, comm: ll_ost_323
      Dec 26 11:52:16 service102 kernel:
      Dec 26 11:52:16 service102 kernel: Call Trace:
      Dec 26 11:52:16 service102 kernel: [<ffffffff887d0d1d>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
      Dec 26 11:52:16 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
      Dec 26 11:52:16 service102 kernel: [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
      Dec 26 11:52:18 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c16a72>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
      Dec 26 11:52:18 service102 kernel: [<ffffffff8002cc0e>] mntput_no_expire+0x19/0x88
      Dec 26 11:52:18 service102 kernel: [<ffffffff887fca00>] push_ctxt+0x370/0x380 [lvfs]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c31a08>] filter_client_add+0x508/0xc30 [obdfilter]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c30de7>] filter_export_stats_init+0x117/0x650 [obdfilter]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88c32665>] filter_connect+0x535/0x8c0 [obdfilter]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88936107>] lustre_msg_add_op_flags+0x47/0x120 [ptlrpc]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88bf6500>] ost_handle+0x0/0x55b0 [ost]
      Dec 26 11:52:18 service102 kernel: [<ffffffff88900976>] target_handle_connect+0x21c6/0x2e80 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88931038>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88bf6daf>] ost_handle+0x8af/0x55b0 [ost]
      Dec 26 11:52:19 service102 kernel: [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Dec 26 11:52:19 service102 kernel: [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
      Dec 26 11:52:19 service102 kernel: [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Dec 26 11:52:20 service102 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Dec 26 11:52:20 service102 kernel: [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Dec 26 11:52:20 service102 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Dec 26 11:52:20 service102 kernel:
      Dec 26 11:52:20 service102 kernel: Pid: 11020, comm: ll_ost_511
      Dec 26 11:52:20 service102 kernel:
      Dec 26 11:52:20 service102 kernel: Call Trace:
      Dec 26 11:52:20 service102 kernel: [<ffffffff887d0d1d>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff801632b8>] list_add+0xc/0xe
      Dec 26 11:52:20 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
      Dec 26 11:52:20 service102 kernel: [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
      Dec 26 11:52:20 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88c16a72>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff8002cc0e>] mntput_no_expire+0x19/0x88
      Dec 26 11:52:20 service102 kernel: [<ffffffff887fca00>] push_ctxt+0x370/0x380 [lvfs]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88c31a08>] filter_client_add+0x508/0xc30 [obdfilter]
      Dec 26 11:52:20 service102 kernel: [<ffffffff88c30de7>] filter_export_stats_init+0x117/0x650 [obdfilter]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88c32665>] filter_connect+0x535/0x8c0 [obdfilter]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88936107>] lustre_msg_add_op_flags+0x47/0x120 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88bf6500>] ost_handle+0x0/0x55b0 [ost]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88900976>] target_handle_connect+0x21c6/0x2e80 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88931038>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88bf6daf>] ost_handle+0x8af/0x55b0 [ost]
      Dec 26 11:52:21 service102 kernel: [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
      Dec 26 11:52:21 service102 kernel: [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Dec 26 11:52:21 service102 kernel: [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Dec 26 11:52:21 service102 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Dec 26 11:52:21 service102 kernel:
      Dec 26 11:52:21 service102 kernel: Pid: 10674, comm: ll_ost_165
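
      Both complete traces above pass through the same frames (filter_connect, filter_client_add, fsfilt_ldiskfs_start, jbd2_journal_start, start_this_handle). To check whether the remaining hung threads in the attached logs share that path, the dumps can be grouped per thread and the frames counted. The following is a minimal sketch, not a Lustre tool: the function names parse_traces and common_frames are hypothetical, and the regular expressions assume the syslog layout shown in this report.

```python
import re
from collections import Counter

# Assumed frame layout, taken from the log above:
#   "... kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]"
# The trailing "[module]" is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)
# Assumed thread header layout: "... kernel: Pid: 10832, comm: ll_ost_323"
PID_RE = re.compile(r"Pid:\s+(\d+),\s+comm:\s+(\S+)")


def parse_traces(lines):
    """Group call-trace frames under the 'Pid: N, comm: NAME' header before them."""
    traces = {}
    current = None
    for line in lines:
        pid_match = PID_RE.search(line)
        if pid_match:
            current = (int(pid_match.group(1)), pid_match.group(2))
            traces[current] = []
            continue
        frame_match = FRAME_RE.search(line)
        if frame_match and current is not None:
            # Store (function, module); module is None for core kernel symbols.
            traces[current].append((frame_match.group(1), frame_match.group(2)))
    return traces


def common_frames(traces):
    """Count in how many distinct traces each function name appears."""
    counts = Counter()
    for frames in traces.values():
        counts.update({func for func, _module in frames})
    return counts
```

      A possible use, assuming the attachment has been saved locally as service102.messages: open the file, pass the file object to parse_traces, and print common_frames(traces).most_common(10). Functions whose count is close to the number of dumped threads are candidates for the shared blocking point.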

      Attachments

        1. messages.service113.gz
          31 kB
          Mahmoud Hanafi
        2. service102.messages
          1.18 MB
          Mahmoud Hanafi
        3. service103.messages
          925 kB
          Mahmoud Hanafi

            People

              niu Niu Yawei (Inactive)
              mhanafi Mahmoud Hanafi
