[LU-10250] replay-single test_74: hang and timed out

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/96539eb0-cadb-11e7-8027-52540065bddc.

      The sub-test test_74 failed with the following error:

      Timeout occurred after 128 mins, last suite running was replay-single, restarting cluster to continue tests
      

      Please provide additional information about the failure here.

      Info required for matching: replay-single 74

      On the client side, there is a stack trace:

      [ 4440.087054] INFO: task touch:18639 blocked for more than 120 seconds.
      [ 4440.089288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4440.091541] touch           D 0000000000000000     0 18639  17432 0x00000080
      [ 4440.093752]  ffff880069033a40 0000000000000086 ffff880036a79fa0 ffff880069033fd8
      [ 4440.096088]  ffff880069033fd8 ffff880069033fd8 ffff880036a79fa0 ffff880066647650
      [ 4440.098281]  7fffffffffffffff ffff880066647648 ffff880036a79fa0 0000000000000000
      [ 4440.100517] Call Trace:
      [ 4440.102304]  [<ffffffff816a9589>] schedule+0x29/0x70
      [ 4440.104209]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
      [ 4440.106187]  [<ffffffffc09e3108>] ? ptlrpc_set_add_new_req+0xd8/0x150 [ptlrpc]
      [ 4440.108288]  [<ffffffffc0bc2bc0>] ? osc_io_ladvise_end+0x50/0x50 [osc]
      [ 4440.110272]  [<ffffffffc0a1324b>] ? ptlrpcd_add_req+0x22b/0x300 [ptlrpc]
      [ 4440.112208]  [<ffffffffc09f5e40>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
      [ 4440.114221]  [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
      [ 4440.116026]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
      [ 4440.117835]  [<ffffffffc0bc2d84>] osc_io_setattr_end+0xc4/0x180 [osc]
      [ 4440.119639]  [<ffffffffc0bc4381>] ? osc_io_setattr_start+0x471/0x6e0 [osc]
      [ 4440.121517]  [<ffffffffc0c15450>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
      [ 4440.123344]  [<ffffffffc08201ed>] cl_io_end+0x5d/0x150 [obdclass]
      [ 4440.125122]  [<ffffffffc0c1552b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
      [ 4440.126825]  [<ffffffffc0c15b75>] lov_io_call.isra.5+0x85/0x140 [lov]
      [ 4440.128586]  [<ffffffffc0c15c66>] lov_io_end+0x36/0xb0 [lov]
      [ 4440.130210]  [<ffffffffc08201ed>] cl_io_end+0x5d/0x150 [obdclass]
      [ 4440.131901]  [<ffffffffc082287f>] cl_io_loop+0x13f/0xc70 [obdclass]
      [ 4440.133544]  [<ffffffffc0cbb4d0>] cl_setattr_ost+0x240/0x3a0 [lustre]
      [ 4440.135229]  [<ffffffffc0c95b05>] ll_setattr_raw+0x1185/0x1290 [lustre]
      [ 4440.136821]  [<ffffffffc0c95c7c>] ll_setattr+0x6c/0xd0 [lustre]
      [ 4440.138380]  [<ffffffff8121ee71>] notify_change+0x2c1/0x420
      [ 4440.139813]  [<ffffffff81233b59>] utimes_common+0xd9/0x1c0
      [ 4440.141276]  [<ffffffff81233d7e>] do_utimes+0x13e/0x180
      [ 4440.142646]  [<ffffffff81233ec4>] SyS_utimensat+0x64/0xb0
      [ 4440.144085]  [<ffffffff816ac5c8>] ? page_fault+0x28/0x30
      [ 4440.145450]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
      [ 4560.146045] INFO: task touch:18639 blocked for more than 120 seconds.
      

      This shows the client was waiting for a setattr to complete. On the OFD side:

      [ 4163.936021] LNet: Service thread pid 15315 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 4163.941814] Pid: 15315, comm: ll_ost00_000
      [ 4163.944340] 
      Call Trace:
      [ 4163.948946]  [<ffffffff816a9589>] schedule+0x29/0x70
      [ 4163.951469]  [<ffffffffc07084d5>] cv_wait_common+0x125/0x150 [spl]
      [ 4163.954051]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
      [ 4163.956663]  [<ffffffffc0708515>] __cv_wait+0x15/0x20 [spl]
      [ 4163.959223]  [<ffffffffc085781f>] txg_wait_synced+0xef/0x140 [zfs]
      [ 4163.961745]  [<ffffffffc080cc65>] dmu_tx_wait+0x275/0x3c0 [zfs]
      [ 4163.964242]  [<ffffffffc080ce41>] dmu_tx_assign+0x91/0x490 [zfs]
      [ 4163.966725]  [<ffffffffc0ecbe1a>] ? tgt_txn_start_cb+0x1da/0x3b0 [ptlrpc]
      [ 4163.969213]  [<ffffffffc1098f37>] osd_trans_start+0xa7/0x3a0 [osd_zfs]
      [ 4163.971617]  [<ffffffffc11c1fab>] ofd_trans_start+0x6b/0xe0 [ofd]
      [ 4163.973952]  [<ffffffffc11c4943>] ofd_attr_set+0x433/0xb00 [ofd]
      [ 4163.976281]  [<ffffffffc11afff3>] ofd_setattr_hdl+0x303/0x950 [ofd]
      [ 4163.978596]  [<ffffffffc0ed7c05>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
      [ 4163.980977]  [<ffffffffc0e7be8e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
      [ 4163.983391]  [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
      [ 4163.985676]  [<ffffffffc0e7f632>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [ 4163.987983]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [ 4163.990182]  [<ffffffff816a9000>] ? __schedule+0x350/0x8b0
      [ 4163.992348]  [<ffffffffc0e7eba0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [ 4163.994528]  [<ffffffff810b099f>] kthread+0xcf/0xe0
      [ 4163.996574]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      [ 4163.998609]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      [ 4164.000651]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      
      [ 4164.004284] LustreError: dumping log to /tmp/lustre-log.1510820580.15315
      

      It's waiting for a sync transaction, which handled the setattr from the client, and it seems the transaction never completes.
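
      For reference, here is a minimal C sketch of the transaction-start pattern visible in the OFD stack above (assumed structure and names, not the actual osd_trans_start()/ofd_attr_set() code): dmu_tx_assign() with TXG_WAIT can fall back into dmu_tx_wait(), which may end up in txg_wait_synced() and block the OST service thread until the open TXG syncs.

      #include <sys/dmu.h>

      /* Sketch of a setattr transaction start as seen in the OFD stack.
       * If the assignment fails with ERESTART (e.g. dirty-data throttling),
       * dmu_tx_assign(TXG_WAIT) retries via dmu_tx_wait(), which can end
       * up blocked in txg_wait_synced(). */
      static int setattr_trans_start(objset_t *os, uint64_t object)
      {
              dmu_tx_t *tx = dmu_tx_create(os);
              int rc;

              /* declare the dnode/bonus update this transaction will make */
              dmu_tx_hold_bonus(tx, object);

              rc = dmu_tx_assign(tx, TXG_WAIT);       /* may block here */
              if (rc != 0) {
                      dmu_tx_abort(tx);
                      return rc;
              }

              /* ... apply the attribute change under this tx, then ... */
              dmu_tx_commit(tx);
              return 0;
      }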

      I guess this is the same issue as LU-4440. Unfortunately, there is no useful description on that ticket.

    Attachments

    Issue Links

    Activity

      bzzz Alex Zhuravlev added a comment - it would be helpful to get a full set of backtraces. dmu_tx_wait() can be waiting for a new TXG while the current TXG is pinned by another process.
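
      To illustrate the scenario Alex describes, here is a minimal two-thread sketch (names invented for illustration, not actual Lustre/ZFS code): one thread assigns a transaction in the currently open TXG and stalls before committing, so that TXG can never quiesce and sync, and another thread blocked in txg_wait_synced() (reached via dmu_tx_wait()) never returns.

      #include <sys/dmu.h>
      #include <sys/txg.h>

      /* Thread A: assigns a tx in the open TXG, then stalls before
       * committing; until dmu_tx_commit() runs, that TXG cannot sync. */
      static void thread_a_pins_txg(objset_t *os, uint64_t object)
      {
              dmu_tx_t *tx = dmu_tx_create(os);

              dmu_tx_hold_bonus(tx, object);
              VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
              /* ... blocks here on a lock or an RPC without committing ... */
              dmu_tx_commit(tx);      /* only now can the pinned TXG sync */
      }

      /* Thread B: like the ll_ost00_000 thread above, it cannot get out
       * of txg_wait_synced() until thread A commits and the TXG syncs. */
      static void thread_b_waits(objset_t *os)
      {
              txg_wait_synced(dmu_objset_pool(os), 0);
      }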

      simmonsja James A Simmons added a comment - This is for our NOAA file system, which is also in production. It is 2.12 LTS using ZFS as the back end.

      adilger Andreas Dilger added a comment - James, could you please file a separate ticket for your issue? It is of course fine to link to this ticket, but a separate ticket allows tracking the problem with your production system better, and it may be that the problem you are seeing is unrelated (being stuck in dmu_tx_wait() is basically the equivalent of being stuck in start_this_handle() for ldiskfs, so there are a hundred ways to get there).

      Also, I wasn't aware that you were running ZFS in production - I thought you had ldiskfs-based filesystems on Spider2?

      simmonsja James A Simmons added a comment - This is a problem hit on real hardware. It took down our production file system.

      adilger Andreas Dilger added a comment - This seems to hit occasionally - an unexplained and unreported hang in ZFS dmu_tx_wait() (see the linked tickets). I don't know if this is an artifact of our VM testing or a problem that also hits on real hardware, but it is something to be aware of and keep an eye open for.

      adilger Andreas Dilger added a comment - The stack traces dumped on the client are because osc_io_setattr_end() is using wait_for_completion() directly. It would be better to use wait_for_completion_interruptible() so that the wait is interruptible in case the user wants to kill the process and get on with their day.

      That doesn't resolve the issue on the server, but the client may as well be fixed at the same time.
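
      A minimal sketch of the client-side change being suggested (the helper name is invented for illustration, not the actual osc_io_setattr_end() code): switch the uninterruptible wait to an interruptible one so a signal can get the task out of the uninterruptible D state.

      #include <linux/completion.h>

      /* Hypothetical helper showing the suggested change on the client. */
      static int wait_for_setattr_reply(struct completion *done)
      {
              /* before: the task is unkillable until the OST reply arrives */
              /* wait_for_completion(done); */

              /* after: returns -ERESTARTSYS if the task catches a signal,
               * letting the user kill the stuck process */
              return wait_for_completion_interruptible(done);
      }

      The caller would then need to handle the interrupted case (for example, detach from the pending request and return the error) rather than assuming the setattr completed.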

    People

      Assignee: WC Triage (wc-triage)
      Reporter: Maloo (maloo)
      Votes: 0
      Watchers: 10
