Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.8.0
    • 3

    Description

      While running SWL, the OSS reports repeated service thread timeouts:

      Nov  5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
      Nov  5 15:23:57 iws9 kernel:
      Nov  5 15:23:57 iws9 kernel: Call Trace:
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Lustre-log dump attached
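
A trace like the one above is easier to read once it is split into symbol, module, and reliability. The following is a minimal sketch, assuming the syslog line format shown in this ticket; `parse_frames` and `FRAME_RE` are illustrative names, not part of any Lustre or ZFS tool. Frames marked with `?` are addresses the kernel found on the stack without confirming that they belong to the call chain, so they are flagged as unreliable.

```python
import re

# Matches one frame such as:
#   [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
#   [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"                 # frame address
    r"(?P<q>\?\s+)?"                      # optional "?" (unreliable frame)
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"  # symbol+offset/size
    r"(?:\s+\[(?P<mod>\w+)\])?"           # optional [module]
)


def parse_frames(lines):
    """Return (symbol, module, reliable) tuples for every frame line.

    Frames without a [module] suffix belong to the core kernel and are
    reported with module "kernel". Non-frame lines are skipped.
    """
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m is None:
            continue
        frames.append((m["sym"], m["mod"] or "kernel", m["q"] is None))
    return frames


if __name__ == "__main__":
    sample = [
        "Nov  5 15:23:57 iws9 kernel: Call Trace:",
        "Nov  5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]",
        "Nov  5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0",
        "Nov  5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]",
    ]
    for sym, mod, reliable in parse_frames(sample):
        print(f"{mod:8} {sym}{'' if reliable else '  (unreliable)'}")
```

Read this way, the reliable frames in the trace show the thread entering through `ofd_precreate_objects` and `osd_declare_object_create`, then sleeping in `zio_wait` under `dbuf_read`, that is, waiting for a ZFS read issued during a ZAP lookup.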

          Activity

            [LU-7404] ZFS OSS - Numerous timeouts - SWL

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Nathaniel,

            We've tried 0.6.5.4 before and it didn't help.

            Only ZFS master includes the patches the upstream ZFS developer mentioned. We tried that on Hyperion yesterday; unfortunately, it didn't help either.

            utopiabound Nathaniel Clark added a comment (edited):

            Given the discussion on zfs#4210, should I push a patch to move to 0.6.5.4?

            jay Jinshan Xiong (Inactive) added a comment (edited):

            I filed a ticket with upstream ZFS at: https://github.com/zfsonlinux/zfs/issues/4210

            jay Jinshan Xiong (Inactive) added a comment:

            I'm dropping the priority of this issue because it is no longer blocking the 2.8 release. I will keep this ticket open until I find the root cause.

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17712/
            Subject: LU-7404 zfs: reset ZFS baseline to 0.6.4.2
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 182b30b7699858c73a990c36c51b70c40858a1fe

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Nathaniel,

            Cliff has verified that the same issue can be seen on current master, where the ZFS baseline is 0.6.5.3. I will investigate whether this is a problem in OSD-ZFS or in the ZFS baseline. We will file an issue upstream if it turns out to be a ZFS problem.

            utopiabound Nathaniel Clark added a comment:

            This hang isn't the same deadlock as zfsonlinux/spl#484 "Disable dynamic taskqs by default to avoid deadlock", which was fixed in 0.6.5.3, is it? If it isn't, has someone opened a bug upstream for this?

            yujian Jian Yu added a comment:

            Here is the patch to reset ZFS baseline to version 0.6.4.2: http://review.whamcloud.com/17712

            adilger Andreas Dilger added a comment:

            This problem has been isolated to the update from ZFS 0.6.4.2 to 0.6.5.2, commit v2_7_61_0-39-ge94d375d8a, patch http://review.whamcloud.com/16399 "LU-7153 build: Update SPL/ZFS to 0.6.5.2".

            One option for debugging would be to bisect the upstream ZFS code to see which ZFS patch introduced this.
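
The bisection suggested in the comment above amounts to a binary search over an ordered commit range with one known-good end (0.6.4.2) and one known-bad end (0.6.5.2). The sketch below only illustrates that idea and is not `git bisect` itself; the commit labels and the `is_bad` predicate are hypothetical stand-ins for building a given ZFS revision and running SWL against it.

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first candidate for which is_bad() is true.

    Assumes candidates[0] is known good, candidates[-1] is known bad,
    and that every candidate after the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid   # regression is at mid or earlier
        else:
            lo = mid   # regression is after mid
    return candidates[hi]


if __name__ == "__main__":
    # Hypothetical ordered commits between the good and bad releases.
    commits = [f"c{i}" for i in range(10)]
    # Hypothetical test result: the timeouts appear from c6 onwards.
    print(first_bad(commits, lambda c: int(c[1:]) >= 6))
```

Each step halves the remaining range, so the number of build-and-test runs grows with the logarithm of the number of commits between the two releases, which matters when every run means a full SWL pass on Hyperion.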

            adilger Andreas Dilger added a comment:

            Current testing with DNE+ZFS on Hyperion has shown that 2.7.56 does not have this timeout problem, while 2.7.62 does have the timeouts. Testing is underway with the 2.7.59 tag to see if the timeout problem is present there as well. The LU-6750 patch was landed as v2_7_56_0-5-g27929cc (i.e. 5 patches past 2.7.56) so it wasn't present in the 2.7.56 testing that passed. The LU-4865 patch was landed as v2_7_59_0-20-g3e43691 (i.e. 20 patches past 2.7.59) so the 2.7.59 testing will give us a good halfway mark without being affected by LU-4865.

            I've pushed http://review.whamcloud.com/17112 to revert the LU-6750 patch in addition to the LU-4865 patch reversion, based on the current tip of master, for the next stage of testing after 2.7.59, depending on those results.

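
The identifiers quoted in the comment above follow the `git describe` layout `<tag>-<commits since tag>-g<abbreviated hash>`. The sketch below shows how they break down, assuming that layout and the `v<major>_<minor>_<patch>_<fix>` tag scheme seen in this ticket; `parse_describe` and `tag_to_version` are illustrative helpers, not part of the Lustre tree.

```python
import re
from typing import NamedTuple, Tuple


class Describe(NamedTuple):
    tag: str      # nearest tag, e.g. "v2_7_56_0"
    commits: int  # number of commits on top of that tag
    abbrev: str   # abbreviated commit hash, without the leading "g"


_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<n>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text: str) -> Describe:
    """Split a `git describe` string into tag, commit count, and hash."""
    m = _DESCRIBE_RE.match(text)
    if m is None:
        raise ValueError(f"not a git-describe string: {text!r}")
    return Describe(m["tag"], int(m["n"]), m["sha"])


def tag_to_version(tag: str) -> Tuple[int, ...]:
    """Convert a tag such as "v2_7_56_0" to (2, 7, 56, 0)."""
    return tuple(int(part) for part in tag.lstrip("v").split("_"))


if __name__ == "__main__":
    for text in ("v2_7_56_0-5-g27929cc", "v2_7_59_0-20-g3e43691"):
        d = parse_describe(text)
        print(tag_to_version(d.tag), d.commits, d.abbrev)
    # prints:
    # (2, 7, 56, 0) 5 27929cc
    # (2, 7, 59, 0) 20 3e43691
```

Because the commit count is relative to the nearest earlier tag, a build made exactly at the 2.7.59 tag cannot contain a change that landed 20 commits after it, which is why that tag works as a test point unaffected by LU-4865.
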
            jgmitter Joseph Gmitter (Inactive) added a comment (edited):

            Issues were seen after reverting the patch for LU-4865.

            People

              jay Jinshan Xiong (Inactive)
              cliffw Cliff White (Inactive)