patchless server kernel (LU-20)

[LU-684] replace dev_rdonly kernel patch with dm-flakey Created: 14/Sep/11  Updated: 10/Apr/18  Resolved: 15/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.8.0
Fix Version/s: Lustre 2.11.0

Type: Technical task Priority: Critical
Reporter: Alex Zhuravlev Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: cea, llnl
Environment:

RHEL6


Issue Links:
Blocker
is blocked by LU-8729 conf-sanity test_84: FAIL: /dev/mappe... Resolved
Related
is related to LU-4416 support for 3.12 linux kernel Resolved
is related to LU-10893 all conf-sanity tests failed: format ... Resolved
is related to LU-9111 update osd-ldiskfs to not depend on d... Resolved
Story Points: 2
Rank (Obsolete): 5605

 Description   

   #define BIO_RDONLY 31 /* device is readonly */
#define BIO_POOL_IDX(bio) ((bio)->bi_flags >> BIO_POOL_OFFSET)

void bio_free(struct bio *bio, struct bio_set *bs)
{
void *p;

if (bio_has_allocated_vec(bio))
bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));

...
BIO_POOL_IDX()

void bvec_free_bs(struct bio_set *bs, struct bio_vec *bv, unsigned int idx)
{
BIO_BUG_ON(idx >= BIOVEC_NR_POOLS);

with RDONLY set BIO_POOL_IDX() can get larger than BIOVEC_NR_POOLS and we hit the assertion.



 Comments   
Comment by Mikhail Pershin [ 14/Sep/11 ]

small addition from my side. This is not just theoretical issue, I hit kernel bug during OST recovery testing with rhel6 and ext4.

Comment by Mikhail Pershin [ 14/Sep/11 ]

comparing rhel6 rdonly patch with rhel5 one I found that flag BIO_RDONLY is not used in rhel5 patch at all. Moreover it is not checked in lustre/rhel6 code also, so I wonder can it be just removed?

Comment by Andreas Dilger [ 16/Sep/11 ]

There needs to be some mechanism by which the block device can be made to discard writes, so that recovery testing can work.

That said, dev_read_only() is one of the very last remaining kernel patches that is needed, and getting rid of it would be a huge step toward patchless kernels. The only other required patch today is jbd2-jcberr (for the journal commit callback), but LU-433 (http://review.whamcloud.com/#change,983) is ready to remove this patch, and the remaining patches in the RHEL6 kernel series are only for compatibility and/or performance.

The two uses that I'm aware of that need dev_rdonly functionality are:

  • avoiding waiting on the disk for failback of targets. I'm not sure this is a valid usage, and I recall that Fan Yong may have removed this at some point to fix other problems (filesystem never being clean after unmount)
  • creating a write barrier during failover testing.

If this could be handled inside the OSD in some manner it would be ideal, but I'm not sure that is possible if there are dependent operations being done. For example, if "mkdir foo; touch foo/bar" is done then as long as this state is only present in the OSD cache and not on disk, we might be able to drop this patch entirely. It wouldn't actually handle testing of interrupted mid-stream transactions, if we have tests that do this today, as opposed to just a write barrier between to filesystem-level operations.

Comment by Andreas Dilger [ 16/Sep/11 ]

The other issue is that dev_read_only is not usable for ZFS, because the pool could abort if it is missing commits that it thought were flushed to disk. I recall that Ricardo was investigating this, but I don't recall if there was any outcome?

Comment by Alex Zhuravlev [ 16/Sep/11 ]

there is full-feature "read-only" called freeze in DMU, where basically everything is flushed except uber-block.

with ldiskfs we easily can get corrupted image because some blocks are purged from the memory while another are not.

Comment by Andreas Dilger [ 10/Nov/11 ]

I also note that Linux 3.1 has a "dm-flakey" target for device mapper, which supports a "drop_writes" option that could be used in place of our dev_read_only kernel patch. This means we would need to configure DM/LVM for all of our testing environments, but it wouldn't be needed for production use.

Comment by Andreas Dilger [ 31/Jan/12 ]

Change topic to reflect that it is now focussed on removing the dev-readonly patch entirely from the kernel patches, since this is one of the few remaining patches that is required for Lustre servers.

Comment by James A Simmons [ 28/Jan/13 ]

With the work going on to support SLES11 SP2 and FC18 the discussion has merged about going a patch less as possible. Now that RHEL5 is no longer supported server side we can move to using dm-flakey. Looking at the current code it seems the dev_read_only_patch is needed to support the OBD_IOC_SET_READONLY ioctl. The approach I'm thinking of is to have the lustre utils call dm-setup to handle setting the disk to read only. The question is does this approach work for ZFS as well?

Comment by Alex Zhuravlev [ 28/Jan/13 ]

I'd say that ZFS's approach is better for this purpose because it keep the filesystem consisten being read-only. on ldisks we rely on luck basically, any read on RO-enabled filesystem may lead to wrong/stale data.

Comment by James A Simmons [ 28/Jan/13 ]

You are talking about osd-zfs calling spa_freeze kernel side. So for the ldiskfs case we have to make calls into the dm_flakey kernel driver to support this same functionality.

Comment by James A Simmons [ 22/May/13 ]

Now that the major work of 2.4 is done its time to look at this again. I spent some time looking at using fault injection that Linux supports. In order for this to work some extra setup has to be done on the test box. First the kernel needs to be configured with CONFIG_FAIL_MAKE_REQUEST. Currently RHEL6 kernels only build this for debug kernels. So the first setup is too decided do we want to do lustre testing with debug kernels only or enable this feature with standard kernels being built with lbuild. After that we can handle setting fail_make_request two ways. One is mount debugfs and set the values we want or set the fail_make_request vales on the linux kernel command line.
For Lustre the latter choice seems the logical one. Either way the options we need to set is

<interval>,<probability>,<space>,<times>

For our testing the defaults that make sense are interval=0, probability=100, space=0, times=-1. Once we have that setup then its a matter of flipping make_it_fail in struct hd_struct (blkdev->->bd_part).

As you can see coding would not be difficult to do. Its the setup that would require more work.

Comment by Alex Zhuravlev [ 05/Jun/13 ]

I do see how dm_flakey can replay rdonly patch, but this approach has been known to bring issues when we need to read data back. would it be possible to skip writes, but save data for subsequent reads ?

Comment by James A Simmons [ 05/Jun/13 ]

Alex can you point me to reproducer for what you are talking about. I just set up my test system to use the fail_make_request code. One thing I don't see is how are you putting the disk back into read/write mode. I don't see dev_clear_rdonly being used anywhere.

Comment by Alex Zhuravlev [ 06/Jun/13 ]

well, we've already seen few times this with ldiskfs/mballoc which verifies on-disk structure on load and assert if they don't match other structures (already in memory).
dev_clear_rdonly is called in __blkdev_put() at umount time.

Comment by James A Simmons [ 06/Jun/13 ]

After I pieced together the fault injection system I noticed that it disables access to the disk for both reads and writes. Is their a reason why the fail over test set the disk to read access only? If our test case requires blocking writes only then we will have to go with the dm-flaky driver and fix the mballoc issues.

Comment by Alex Zhuravlev [ 07/Jun/13 ]

yes, the reason is to "forget" writes to simulate poweroff when buffered writes are lost.

Comment by James A Simmons [ 07/Jun/13 ]

Would disabling the reads alter the out come of the fail over tests?

Comment by Alex Zhuravlev [ 07/Jun/13 ]

I think so. inbetween turning the device read-only and failover we usually still do some processing, which in turn, may need to read something from disk.

Comment by Peter Jones [ 21/Jun/13 ]

There is a patch in gerrit for this work - http://review.whamcloud.com/#/c/6651/

Comment by Yang Sheng [ 10/Jul/13 ]

I think we can create a kernel module like loop device. And modify mount_lustre.c to support a option like '-o disw', Provide a parameter via ioctl() to control this module to discard write or not. This module just can be used in test and avoid it may hurt production environment. After this work done, we can get rid of dev_readonly patch.

Comment by Andreas Dilger [ 10/Jul/13 ]

That's essentially what dm-flakey does, and has the added benefit of being in the kernel already.

Comment by Hongchao Zhang [ 23/Jul/13 ]

there is a panic during using dm-flakey,

steps:
echo "0 `blockdev --getsize $realdev` linear $realdev 0" | dmsetup create $md_dev
(mount lustre with $md_dev)
echo "0 `blockdev --getsize $realdev` flakey $realdev 0 0 20000 1 drop_writes" | dmsetup load $md_dev
dmsetup suspend $md_dev
dmsetup resume $md_dev

BUG: unable to handle kernel paging request at ffff88015981afff
IP: [<ffffffffa00252dc>] corrupt_bio_data+0x6c/0xb0 [dm_flakey]
PGD 1a86063 PUD 0
Oops: 0002 1 SMP
last sysfs file: /sys/devices/virtual/block/dm-2/dm/suspended
CPU 0
Modules linked in: dm_flakey lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) exportfs mdd(U) mgs(U) lquota(U) lfsck(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ipv6 microcode sg uvcvideo videodev v4l2_compat_ioctl32 i2c_core snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support virtio_balloon shpchp e1000 ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 2544, comm: loop0 Not tainted 2.6.32 #2 Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform
RIP: 0010:[<ffffffffa00252dc>] [<ffffffffa00252dc>] corrupt_bio_data+0x6c/0xb0 [dm_flakey]
RSP: 0018:ffff88007c177dd0 EFLAGS: 00010297
RAX: ffff88005981b000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000ffffffff RSI: ffff88005ee42280 RDI: ffff88007910cc00
RBP: ffff88007c177de0 R08: ffff88007910cc00 R09: 0000000000000000
R10: 0000000000001000 R11: 0000000000000000 R12: ffff88007910cc00
R13: ffff88007ac578b8 R14: ffff88007c0b4c00 R15: ffff88007ac41290
FS: 0000000000000000(0000) GS:ffff88000c400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff88015981afff CR3: 0000000001a85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process loop0 (pid: 2544, threadinfo ffff88007c176000, task ffff88007c3ec040)
Stack:
0000000000000000 ffff880000000000 ffff88007c177e00 ffffffffa0025373
<d> 0000000000001000 ffff880000000000 ffff88007c177e40 ffffffffa00018af
<d> 0000100000001000 ffff88007aeea600 ffff88007aeea714 ffff88007c177e80
Call Trace:
[<ffffffffa0025373>] flakey_end_io+0x53/0x60 [dm_flakey]
[<ffffffffa00018af>] clone_endio+0x5f/0xd0 [dm_mod]
[<ffffffff811b149d>] bio_endio+0x1d/0x40
[<ffffffff8135ebff>] loop_thread+0xdf/0x270
[<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8135eb20>] ? loop_thread+0x0/0x270
[<ffffffff81091d66>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 03 48 0f af c1 48 b9 00 00 00 00 00 88 ff ff 48 c1 e0 0c 48 01 c8 48 01 d0 74 08 8b 56 28 44 39 d2 76 02 c9 c3 8b 4e 30 83 ea 01 <88> 0c 10 4c 8b 4f 20 48 8b 07 8b 4e 28 44 89 ca 83 e2 01 83 fa
RIP [<ffffffffa00252dc>] corrupt_bio_data+0x6c/0xb0 [dm_flakey]
RSP <ffff88007c177dd0>
CR2: ffff88015981afff
--[ end trace a7cecfc161703a56 ]--
Kernel panic - not syncing: Fatal exception
Pid: 2544, comm: loop0 Tainted: G D --------------- 2.6.32 #2
Call Trace:
[<ffffffff814fd57a>] ? panic+0xa0/0x168
[<ffffffff81501714>] ? oops_end+0xe4/0x100
[<ffffffff81043bab>] ? no_context+0xfb/0x260
[<ffffffff81043e35>] ? __bad_area_nosemaphore+0x125/0x1e0
[<ffffffff8104f2bd>] ? check_preempt_curr+0x6d/0x90
[<ffffffff81043f03>] ? bad_area_nosemaphore+0x13/0x20
[<ffffffff81044661>] ? __do_page_fault+0x321/0x480
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff810920e6>] ? autoremove_wake_function+0x16/0x40
[<ffffffff8135d751>] ? transfer_none+0xa1/0xc0
[<ffffffff8135d581>] ? lo_splice_actor+0x81/0xc0
[<ffffffff811a7829>] ? page_cache_pipe_buf_release+0x19/0x30
[<ffffffff810724c7>] ? current_fs_time+0x27/0x30
[<ffffffff8135d500>] ? lo_splice_actor+0x0/0xc0
[<ffffffff815036ce>] ? do_page_fault+0x3e/0xa0
[<ffffffff81500a85>] ? page_fault+0x25/0x30
[<ffffffffa00252dc>] ? corrupt_bio_data+0x6c/0xb0 [dm_flakey]
[<ffffffffa0025373>] ? flakey_end_io+0x53/0x60 [dm_flakey]
[<ffffffffa00018af>] ? clone_endio+0x5f/0xd0 [dm_mod]
[<ffffffff811b149d>] ? bio_endio+0x1d/0x40
[<ffffffff8135ebff>] ? loop_thread+0xdf/0x270
[<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8135eb20>] ? loop_thread+0x0/0x270
[<ffffffff81091d66>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20

is there any other means to load the "flakey" table on demand except "suspend"+"resume"?

Comment by Hongchao Zhang [ 26/Jul/13 ]

status update:

1, the problem in dm-flakey(drivers/md/dm-flakey.c" was found, which "corrupt_bio_data" feature is always applied even it isn't specified in the table.
2, the test-framework is modified to use the dm-flakey, and the replay tests pass locally (it's a little complex for it uses loop device, and these loop device
will need to be cleaned up after test)

the patch still need to some improvement, especially about the creation/deletion of "loop device"(from file) and "dm device"(from normal block device).

Comment by Hongchao Zhang [ 01/Aug/13 ]

the initial patch is tracked at http://review.whamcloud.com/#/c/7200/

Comment by James A Simmons [ 01/Aug/13 ]

So much better than my patch. No kernel side work needed

Comment by Jodi Levi (Inactive) [ 18/Mar/14 ]

Hongchao,
Will you be able to refresh this patch for 2.7?
thank you!

Comment by Hongchao Zhang [ 20/Mar/14 ]

status update:
the patch is in the test and will push it to Gerrit soon.

Comment by Hongchao Zhang [ 28/Apr/14 ]

the patch has been updated

Comment by Nathan Rutman [ 25/Jun/14 ]

The two uses that I'm aware of that need dev_rdonly functionality are:

  • avoiding waiting on the disk for failback of targets. I'm not sure this is a valid usage, and I recall that Fan Yong may have removed this at some point to fix other problems (filesystem never being clean after unmount)
  • creating a write barrier during failover testing.

For the first point, you mean avoiding the delay waiting for service shutdown, or avoiding the delay of having to wait for writes to flush before starting the service again (on the other node)?
Because the "insure writing stops before service restart" seems like a vital requirement, and so I wonder if this functionality can't live just in the test-framework, but needs to be integrated more closely into Lustre itself; whenever a service is shutdown for failover, the service shutdown should really wait for the writes to flush or drop the writes.

Comment by James A Simmons [ 03/Jul/14 ]

Patch is updated at http://review.whamcloud.com/#/c/7200

Comment by James A Simmons [ 07/Jul/14 ]

I'm running locally with the 7200 patch with no problems but I see issues in Maloo. The logs reported lack important information. Can some look into why this patch is failing when it shouldn't be.

Comment by Christopher Morrone [ 06/Jul/16 ]

I know work is under way on this ticket. It would be nice to get a status update here, since the last comment was from two years ago.

Comment by Jian Yu [ 08/Jul/16 ]

Patch http://review.whamcloud.com/7200 passed local testing on master branch. It's pending testing by autotest and review.

Comment by Peter Jones [ 12/Aug/16 ]

I have dropped the priority to Critical rather than Blocker to more accurately reflect the sitaution with this important enhancement

Comment by Peter Jones [ 20/Oct/16 ]

Work is still ongoing for this effort but may not make 2.9

Comment by Jean-Baptiste Riaux (Inactive) [ 06/Mar/17 ]

Can we make sure it lands into 2.10 ?

Comment by James A Simmons [ 14/Mar/17 ]

Jean have you been testing this patch?

Comment by Jian Yu [ 14/Mar/17 ]

Yes, James. However, testing on the patch is blocked by LU-8729 now. The issue is:
https://jira.hpdd.intel.com/browse/LU-8729?focusedCommentId=180716&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-180716

Comment by James A Simmons [ 26/Oct/17 ]

LU-8729 no longer blocks this patch. Its been tested and reviewed.

Comment by James A Simmons [ 05/Dec/17 ]

Is it still planned to land this work for 2.11?

Comment by Gerrit Updater [ 15/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/7200/
Subject: LU-684 tests: replace dev_read_only patch with dm-flakey
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 54b9e3f789358bd9dfb94b77fe33a4faa1e28ab2

Comment by Peter Jones [ 15/Mar/18 ]

Landed for 2.11

Generated at Sat Feb 10 05:41:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.