[LU-7602] Repeated timeouts with ZFS 0.6.5.2 Created: 23/Dec/15  Updated: 23/Dec/15  Resolved: 23/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Jian Yu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Hyperion/SWL -


Attachments: File iws2.stackes.txt.gz     File lustre-log.1450900053.30734.gz    
Issue Links:
Duplicate
duplicates LU-7404 ZFS OSS - Numerous timeouts - SWL Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This bug was created to track activity from http://review.whamcloud.com/17712
("LU-7602 zfs: reset ZFS baseline to 0.6.4.2").

ZFS 0.6.5.2 is known to introduce I/O problems.
A typical timeout is shown below; it differs slightly from the stack traces in the Gerrit ticket.

Dec 23 11:47:33 iws2 kernel: LNet: Service thread pid 30734 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 23 11:47:33 iws2 kernel: Pid: 30734, comm: ll_ost00_000
Dec 23 11:47:33 iws2 kernel:
Dec 23 11:47:33 iws2 kernel: Call Trace:
Dec 23 11:47:33 iws2 kernel: [<ffffffffa06cb330>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffff815395d3>] io_schedule+0x73/0xc0
Dec 23 11:47:33 iws2 kernel: [<ffffffffa05a3eaf>] cv_wait_common+0xaf/0x130 [spl]
Dec 23 11:47:33 iws2 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
Dec 23 11:47:33 iws2 kernel: [<ffffffffa05a3f48>] __cv_wait_io+0x18/0x20 [spl]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa070c29b>] zio_wait+0x10b/0x1e0 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa06638a9>] dbuf_read+0x439/0x850 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa066c168>] dmu_buf_hold+0x68/0x90 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa0661fa8>] ? dbuf_rele_and_unlock+0x268/0x390 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d5e0a>] zap_lockdir+0x5a/0x770 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d797a>] zap_lookup_norm+0x4a/0x190 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d7b53>] zap_lookup+0x33/0x40 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa067bbe6>] dmu_tx_hold_zap+0x146/0x210 [zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa107b3b5>] osd_declare_object_create+0x2d5/0x440 [osd_zfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa11bba24>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa04bc6c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa11c8bdb>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa11ab83e>] ofd_create_hdl+0x56e/0x2640 [ofd]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bbefe0>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa0c294cc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bd0b41>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bcfd00>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Dec 23 11:47:33 iws2 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Dec 23 11:47:33 iws2 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Dec 23 11:47:33 iws2 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Dec 23 11:47:33 iws2 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Dec 23 11:47:33 iws2 kernel:
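The watchdog message above also triggers a Lustre debug log dump, which appears to correspond to the attached lustre-log.1450900053.30734.gz. Roughly the same information can be re-captured manually on the OSS (a sketch only; the PID 30734 comes from the console message above):

# capture the hung thread's state on iws2
cat /proc/30734/stack                          # current kernel stack of the hung service thread
lctl dk /tmp/lustre-debug.txt                  # dump the Lustre kernel debug buffer to a file
dmesg | grep -A 40 'Service thread pid 30734'  # pull the watchdog trace back out of the kernel log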


 Comments   
Comment by Jian Yu [ 23/Dec/15 ]

Hi Cliff,

Patch http://review.whamcloud.com/17712 hit a build failure on the sles11sp2 server build. I created TEI-4369 to disable that build.

In the meantime, since the builds on the other distros passed, could you please verify whether the timeout issue is resolved after resetting the ZFS baseline to 0.6.4.2? Thank you.
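As a quick sanity check, something like the following should confirm which baseline is actually loaded on the OSS before re-running SWL (exact packaging commands will depend on how ZFS was installed):

# confirm which ZFS/SPL modules are loaded; expect 0.6.4.2 after the reset
cat /sys/module/zfs/version /sys/module/spl/version
modinfo zfs | grep -i '^version'    # version of the zfs.ko installed on disk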

Comment by Andreas Dilger [ 23/Dec/15 ]

Cliff, do you have the stack traces for all the threads on the OSS? It seems this ll_ost00_000 thread is waiting for the ZFS TXG to commit, but it would be useful to know what the other threads are doing in the meantime.
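For reference, one rough way to grab all of the thread stacks at once on the OSS (assuming sysrq is enabled on iws2):

# dump every task's state and kernel stack into the kernel log, then save it
echo 1 > /proc/sys/kernel/sysrq     # enable sysrq if it is not already
echo t > /proc/sysrq-trigger        # 't' dumps all task states and stack traces
dmesg > /tmp/stacks.txt             # collect the output to attach to the ticket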

Comment by Andreas Dilger [ 23/Dec/15 ]

Closing this as a duplicate of LU-7404 since that already has more information in it. The stack trace shown here is from the OSS, which is blocked on the OST object precreate. The stack trace shown in the 17712 ticket is the timeout on the MDS caused by waiting for new OST object precreation to complete, which is only a symptom of the actual deadlock problem on the OSS.

Comment by Cliff White (Inactive) [ 23/Dec/15 ]

I dumped the stacks on iws2. It has been a while since the error occurred, but the attached file includes all of the timeout stacks.
