[LU-7409] llog declares a write region that doesn't match the region actually written later for osd_zfs Created: 07/Nov/15 Updated: 16/Apr/20 Resolved: 16/Apr/20 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | zfs |
| Severity: | 3 |
| Description |
|
The typical stack trace is as follows:
Call Trace:
[<ffffffff8156bf13>] ? panic+0xac/0x179
[<ffffffffa0f4e5cc>] ? zio_wait+0x21c/0x3e0 [zfs]
[<ffffffffa0e7ef87>] ? dmu_tx_dirty_buf+0x247/0x3d0 [zfs]
[<ffffffffa0f4e2f3>] ? zio_destroy+0xb3/0x170 [zfs]
[<ffffffffa0e5e55f>] ? dbuf_dirty+0x5f/0x16d0 [zfs]
[<ffffffff8157156b>] ? _spin_unlock+0x2b/0x40
[<ffffffffa0e848ea>] ? dnode_rele+0x5a/0xa0 [zfs]
[<ffffffffa0e61501>] ? dmu_buf_will_dirty+0x91/0x100 [zfs]
[<ffffffffa0e6cc70>] ? dmu_write+0xa0/0x230 [zfs]
[<ffffffffa08444c1>] ? osd_write+0x1d1/0x3a0 [osd_zfs]
[<ffffffffa06b9bdd>] ? dt_record_write+0x3d/0x130 [obdclass]
[<ffffffffa067955a>] ? llog_osd_write_rec+0xd6a/0x1b70 [obdclass]
[<ffffffffa06673f6>] ? llog_write_rec+0xb6/0x270 [obdclass]
[<ffffffffa066c1b8>] ? llog_write+0x298/0x430 [obdclass]
[<ffffffffa066c1cf>] ? llog_write+0x2af/0x430 [obdclass]
[<ffffffffa14780a1>] ? record_marker+0x1c1/0x1e0 [mgs]
[<ffffffffa14779ea>] ? record_start_log+0x38a/0x4a0 [mgs]
[<ffffffffa14787cf>] ? mgs_write_log_lov+0x38f/0x6b0 [mgs]
[<ffffffffa148a5c6>] ? mgs_write_log_mdt+0x326/0x1630 [mgs]
[<ffffffff810c156d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffa148d475>] ? mgs_write_log_target+0xb55/0x1980 [mgs]
[<ffffffff810c156d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffa057cc11>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa1471d89>] ? mgs_target_reg+0xa19/0xe50 [mgs]
[<ffffffffa0940b3f>] ? tgt_request_handle+0x8cf/0x1300 [ptlrpc]
[<ffffffffa08eb85a>] ? ptlrpc_main+0xdaa/0x18b0 [ptlrpc]
[<ffffffffa08eaab0>] ? ptlrpc_main+0x0/0x18b0 [ptlrpc]
[<ffffffff810a728e>] ? kthread+0x9e/0xc0
[<ffffffff8100c38a>] ? child_rip+0xa/0x20
[<ffffffff815714b0>] ? _spin_unlock_irq+0x30/0x40
[<ffffffff8100bb90>] ? restore_args+0x0/0x30
[<ffffffff810a71f0>] ? kthread+0x0/0xc0
[<ffffffff8100c380>] ? child_rip+0x0/0x20
Patch will be submitted shortly |
| Comments |
| Comment by Gerrit Updater [ 07/Nov/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17085 |
| Comment by Alex Zhuravlev [ 08/Nov/15 ] |
|
LLOG can't declare the exact region because the actual size is chosen after dmu_tx_assign(); it's essentially an append. |
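To make the append problem concrete, here is a minimal user-space sketch of one reading of the comment above. It is a toy model, not the ZFS DMU or Lustre OSD API: `toy_log`, `toy_tx`, `toy_declare_write`, `toy_range_declared`, `toy_append` and `toy_demo` are all hypothetical names, and the sizes are arbitrary. The writer declares a range based on the log size it sees at declare time, other writers append in the meantime, and the real write lands past the declared range, which is the kind of mismatch a debug check would flag.

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Toy model only: these types and functions are hypothetical and are
 * NOT the ZFS DMU or Lustre OSD API. */
struct toy_log {
    uint64_t size;      /* current end of the log, in bytes */
};

struct toy_tx {
    uint64_t decl_off;  /* start of the declared write range */
    uint64_t decl_len;  /* length of the declared write range */
};

/* Declare phase: the range is guessed from the size visible right now. */
void toy_declare_write(struct toy_tx *tx, const struct toy_log *log,
                       uint64_t len)
{
    tx->decl_off = log->size;
    tx->decl_len = len;
}

/* Debug-style check: is [off, off + len) inside the declared range? */
bool toy_range_declared(const struct toy_tx *tx, uint64_t off, uint64_t len)
{
    return off >= tx->decl_off &&
           off + len <= tx->decl_off + tx->decl_len;
}

/* Execute phase: an append lands at whatever the end of the log is now. */
uint64_t toy_append(struct toy_log *log, uint64_t len)
{
    uint64_t off = log->size;

    log->size += len;
    return off;
}

/* Returns true if the actual write fell inside the declared range. */
bool toy_demo(unsigned int racing_appends)
{
    struct toy_log log = { .size = 8192 };  /* arbitrary starting size */
    struct toy_tx tx;
    const uint64_t rec_len = 64;            /* arbitrary record size */
    uint64_t off;
    unsigned int i;

    toy_declare_write(&tx, &log, rec_len);

    /* Other threads append records between declare and write. */
    for (i = 0; i < racing_appends; i++)
        toy_append(&log, rec_len);

    off = toy_append(&log, rec_len);
    return toy_range_declared(&tx, off, rec_len);
}
```

In this model `toy_demo(0)` returns true (nothing raced, so the declaration still matches), while `toy_demo(1)` returns false because a single intervening append moves the write past the declared range.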
| Comment by Alex Zhuravlev [ 08/Nov/15 ] |
|
Ricardo developed a patch to allow append declaration, but I have no idea where that patch is. Is your ZFS build configured with debug enabled? |
| Comment by Jinshan Xiong (Inactive) [ 08/Nov/15 ] |
|
Yes, this is with debug enabled, and it failed at file system mount time. I saw there is a discussion about llog append, so this patch is not a final solution; it is simply an attempt to pass the check so that I can do some real testing. |
| Comment by Alex Zhuravlev [ 08/Nov/15 ] |
|
Well, then what's the purpose of the ticket? Looks like a duplicate of LU-2160? |
| Comment by Jinshan Xiong (Inactive) [ 08/Nov/15 ] |
|
It lets me mount my MDS once debug is enabled. |
| Comment by Andreas Dilger [ 08/Nov/15 ] |
|
Alex, I recall that the actual patch to the ZFS declare code is not very complex? How hard would it be to recreate that patch? |
| Comment by Alex Zhuravlev [ 09/Nov/15 ] |
|
I didn't see the patch, but in theory, yes. It should consist of two pieces: my original point was that we can't land this patch. |
| Comment by Andreas Dilger [ 09/Nov/15 ] |
|
Does declaring/holding a large range of the file cause ZFS to serialize IO to that region, or is this just accounting and serialization happens elsewhere? I'm just recalling the case of file creates where the TXG is serialized because (IIRC) you cannot modify a dnode in the same TXG as it is created in. |
| Comment by Alex Zhuravlev [ 09/Nov/15 ] |
|
Declaration is just accounting; the actual serialization happens against specific dbufs at actual write time. |
| Comment by Jinshan Xiong (Inactive) [ 09/Nov/15 ] |
|
This patch works reasonably well on a normal llog because I reserved a large enough buffer as a cushion, but it has problems with the cat log in the unlink case. Right now the cat log is used for unlink, changelog, and HSM, but I think the changes to these logs are predictable at the declare phase? |
| Comment by Alex Zhuravlev [ 09/Nov/15 ] |
|
Nope, we can't really predict this, because transactions can take different amounts of time: while one thread is executing a single transaction, another can produce many more (each adding a new record), and we may have lots of threads like these. If we try to declare large ranges, this can result in huge credits. The credits can be multiplied by N, depending on the pool's configuration, and ZFS reserves memory for all the credits promised, so a transaction that is going to modify a few llogs may want to reserve 1.5 GB of memory. And again, we usually have to run a few hundred threads. IOW, this approach doesn't work at scale, and we need proper support in ZFS's debugging code to understand the append case. As a short-term solution, we don't use debugging. |
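The scaling argument above can be sketched as simple arithmetic. This is an illustration only, assuming the reservation is roughly the product of the declared range per llog, the number of llogs a transaction touches, the pool-dependent multiplier N, and the number of concurrent threads. The function name `toy_reserved_bytes` and its parameters are hypothetical, and it does not reproduce how ZFS actually computes credits.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative arithmetic only: a hypothetical helper, not a ZFS or
 * Lustre function. Every input is a made-up number supplied by the
 * caller; the simple product is an assumption, not ZFS's real formula. */
uint64_t toy_reserved_bytes(uint64_t declared_bytes_per_llog,
                            uint64_t llogs_per_tx,
                            uint64_t pool_multiplier,
                            uint64_t threads)
{
    return declared_bytes_per_llog * llogs_per_tx *
           pool_multiplier * threads;
}
```

With made-up inputs of 1 MiB declared per llog, 4 llogs per transaction, a multiplier of 3 and 256 threads, `toy_reserved_bytes(1048576, 4, 3, 256)` gives 3221225472 bytes (3 GiB), which shows how quickly generous declarations add up even though each individual record is small.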
| Comment by Jinshan Xiong (Inactive) [ 09/Nov/15 ] |
|
Thanks for the explanation, I see it now. It looks like ZFS temporarily reserves ARC buffer space for the data to be written. However, in our append case the reserved space is counted multiple times in the same txg. Is it possible for us to add some code in dmu_tx_try_assign() to count it more accurately? |
| Comment by Alex Zhuravlev [ 09/Nov/15 ] |
|
I think it's better and simpler to modify ZFS instead, especially given that Brian B. wasn't against it. |
| Comment by Alex Zhuravlev [ 16/Apr/20 ] |
|
We don't plan to configure ZFS with debugging enabled, so this shouldn't be an issue. |