patchless server kernel
(LU-20)
|
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Technical task | Priority: | Critical |
| Reporter: | Alex Zhuravlev | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | cea, llnl |
| Environment: | RHEL6 |
| Issue Links: | |
| Story Points: | 2 |
| Rank (Obsolete): | 5605 |
| Description |
|
#define BIO_RDONLY 31 /* device is readonly */

void bio_free(struct bio *bio, struct bio_set *bs)
        if (bio_has_allocated_vec(bio))
                ...

void bvec_free_bs(struct bio_set *bs, struct bio_vec *bv, unsigned int idx)

With RDONLY set, BIO_POOL_IDX() can get larger than BIOVEC_NR_POOLS and we hit the assertion. |
| Comments |
| Comment by Mikhail Pershin [ 14/Sep/11 ] |
|
A small addition from my side: this is not just a theoretical issue, I hit this kernel bug during OST recovery testing with RHEL6 and ext4. |
| Comment by Mikhail Pershin [ 14/Sep/11 ] |
|
Comparing the RHEL6 rdonly patch with the RHEL5 one, I found that the BIO_RDONLY flag is not used in the RHEL5 patch at all. Moreover, it is not checked in the lustre/rhel6 code either, so I wonder whether it can just be removed? |
| Comment by Andreas Dilger [ 16/Sep/11 ] |
|
That said, dev_read_only() is one of the very last remaining kernel patches that is needed, and getting rid of it would be a huge step toward patchless kernels. The only other required patch today is jbd2-jcberr (for the journal commit callback). The two uses that I'm aware of that need dev_rdonly functionality are:
If this could be handled inside the OSD in some manner it would be ideal, but I'm not sure that is possible if there are dependent operations being done. For example, if "mkdir foo; touch foo/bar" is done, then as long as this state is only present in the OSD cache and not on disk, we might be able to drop this patch entirely. It wouldn't actually handle testing of interrupted mid-stream transactions, if we have tests that do this today, as opposed to just a write barrier between two filesystem-level operations. |
| Comment by Andreas Dilger [ 16/Sep/11 ] |
|
The other issue is that dev_read_only is not usable for ZFS, because the pool could abort if it is missing commits that it thought were flushed to disk. I recall that Ricardo was investigating this, but I don't recall if there was any outcome? |
| Comment by Alex Zhuravlev [ 16/Sep/11 ] |
|
There is a full-featured "read-only" mode called freeze in the DMU, where basically everything is flushed except the uberblock. With ldiskfs we can easily get a corrupted image because some blocks are purged from memory while others are not. |
| Comment by Andreas Dilger [ 10/Nov/11 ] |
|
I also note that Linux 3.1 has a "dm-flakey" target for device mapper, which supports a "drop_writes" option that could be used in place of our dev_read_only kernel patch. This means we would need to configure DM/LVM for all of our testing environments, but it wouldn't be needed for production use. |
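As a rough sketch of how such a dm-flakey setup might look (the device path /dev/sdb, the target name ost1_flakey, and the 300-second interval are placeholders, not anything from the actual patch or test framework):

    # size of the backing device in 512-byte sectors
    SIZE=$(blockdev --getsz /dev/sdb)
    # up_interval=0, down_interval=300s: the target stays "down" and,
    # with the drop_writes feature, silently discards writes while
    # still serving reads from whatever is already on disk
    dmsetup create ost1_flakey --table "0 $SIZE flakey /dev/sdb 0 0 300 1 drop_writes"
    # the filesystem would then be mounted via /dev/mapper/ost1_flakey

Removing the mapping afterwards (dmsetup remove ost1_flakey) returns the underlying device to normal read/write use.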
| Comment by Andreas Dilger [ 31/Jan/12 ] |
|
Change topic to reflect that it is now focussed on removing the dev-readonly patch entirely from the kernel patches, since this is one of the few remaining patches that is required for Lustre servers. |
| Comment by James A Simmons [ 28/Jan/13 ] |
|
With the work going on to support SLES11 SP2 and FC18, discussion has come up about going as patchless as possible. Now that RHEL5 is no longer supported on the server side, we can move to using dm-flakey. Looking at the current code, it seems the dev_read_only patch is needed to support the OBD_IOC_SET_READONLY ioctl. The approach I'm thinking of is to have the lustre utils call dmsetup to handle setting the disk read-only. The question is: does this approach work for ZFS as well? |
| Comment by Alex Zhuravlev [ 28/Jan/13 ] |
|
I'd say that ZFS's approach is better for this purpose because it keeps the filesystem consistent while read-only. On ldiskfs we basically rely on luck: any read on an RO-enabled filesystem may return wrong/stale data. |
| Comment by James A Simmons [ 28/Jan/13 ] |
|
You are talking about osd-zfs calling spa_freeze() on the kernel side. So for the ldiskfs case we would have to make calls into the dm-flakey kernel driver to support this same functionality. |
| Comment by James A Simmons [ 22/May/13 ] |
|
Now that the major work for 2.4 is done, it's time to look at this again. I spent some time looking at using the fault injection framework that Linux supports. In order for this to work, some extra setup has to be done on the test box. First, the kernel needs to be configured with CONFIG_FAIL_MAKE_REQUEST. Currently RHEL6 kernels only build this for debug kernels, so the first step is to decide whether we want to do Lustre testing with debug kernels only, or enable this feature in the standard kernels built with lbuild.

After that we can handle setting fail_make_request in two ways: either mount debugfs and set the values we want, or set fail_make_request=<interval>,<probability>,<space>,<times> on the Linux kernel command line. For our testing the defaults that make sense are interval=0, probability=100, space=0, times=-1. Once we have that set up, it's a matter of flipping make_it_fail in struct hd_struct (blkdev->bd_part). As you can see, the coding would not be difficult to do; it's the setup that would require more work. |
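A minimal sketch of the debugfs route described above, assuming a kernel built with CONFIG_FAIL_MAKE_REQUEST (sdb is a placeholder for the target device):

    mount -t debugfs none /sys/kernel/debug
    # the defaults suggested above: fail every request, no limits
    echo 0   > /sys/kernel/debug/fail_make_request/interval
    echo 100 > /sys/kernel/debug/fail_make_request/probability
    echo 0   > /sys/kernel/debug/fail_make_request/space
    echo -1  > /sys/kernel/debug/fail_make_request/times
    # flip make_it_fail for the device (the struct hd_struct field is
    # exposed through sysfs); I/O submitted to sdb now fails
    echo 1   > /sys/block/sdb/make-it-fail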
| Comment by Alex Zhuravlev [ 05/Jun/13 ] |
|
I do see how dm-flakey can replace the rdonly patch, but this approach has been known to bring issues when we need to read data back. Would it be possible to skip writes, but save the data for subsequent reads? |
| Comment by James A Simmons [ 05/Jun/13 ] |
|
Alex, can you point me to a reproducer for what you are talking about? I just set up my test system to use the fail_make_request code. One thing I don't see is how you are putting the disk back into read/write mode; I don't see dev_clear_rdonly being used anywhere. |
| Comment by Alex Zhuravlev [ 06/Jun/13 ] |
|
Well, we've already seen this a few times with ldiskfs/mballoc, which verifies on-disk structures on load and asserts if they don't match other structures already in memory. |
| Comment by James A Simmons [ 06/Jun/13 ] |
|
After I pieced together the fault injection setup I noticed that it disables access to the disk for both reads and writes. Is there a reason why the failover tests set the disk to read-only access? If our test case requires blocking writes only, then we will have to go with the dm-flakey driver and fix the mballoc issues. |
| Comment by Alex Zhuravlev [ 07/Jun/13 ] |
|
Yes, the reason is to "forget" writes to simulate a power-off where buffered writes are lost. |
| Comment by James A Simmons [ 07/Jun/13 ] |
|
Would disabling the reads alter the outcome of the failover tests? |
| Comment by Alex Zhuravlev [ 07/Jun/13 ] |
|
I think so. In between turning the device read-only and failing over, we usually still do some processing, which in turn may need to read something from disk. |
| Comment by Peter Jones [ 21/Jun/13 ] |
|
There is a patch in gerrit for this work - http://review.whamcloud.com/#/c/6651/ |
| Comment by Yang Sheng [ 10/Jul/13 ] |
|
I think we can create a kernel module like the loop device, and modify mount_lustre.c to support an option like '-o disw'. A parameter provided via ioctl() would control whether this module discards writes or not. The module would only be used in testing, so it could not hurt a production environment. After this work is done, we can get rid of the dev_readonly patch. |
| Comment by Andreas Dilger [ 10/Jul/13 ] |
|
That's essentially what dm-flakey does, and has the added benefit of being in the kernel already. |
| Comment by Hongchao Zhang [ 23/Jul/13 ] |
|
There is a panic while using dm-flakey; steps:

BUG: unable to handle kernel paging request at ffff88015981afff
Pid: 2544, comm: loop0 Not tainted 2.6.32 #2 Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform

Is there any other means to load the "flakey" table on demand except "suspend"+"resume"? |
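For reference, the "suspend"+"resume" sequence being referred to would look roughly like the following (device name and table parameters are placeholders); the open question is whether the flakey table can be swapped in without this quiesce step:

    SIZE=$(blockdev --getsz /dev/sdb)
    dmsetup suspend lustre_ost1      # quiesce and flush outstanding I/O
    dmsetup reload lustre_ost1 --table "0 $SIZE flakey /dev/sdb 0 0 300 1 drop_writes"
    dmsetup resume lustre_ost1       # activate the newly loaded (flakey) table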
| Comment by Hongchao Zhang [ 26/Jul/13 ] |
|
Status update: 1. A problem in dm-flakey (drivers/md/dm-flakey.c) was found, where the "corrupt_bio_data" feature is always applied even if it isn't specified in the table. The patch still needs some improvement, especially around the creation/deletion of the "loop device" (from a file) and the "dm device" (from a normal block device). |
| Comment by Hongchao Zhang [ 01/Aug/13 ] |
|
the initial patch is tracked at http://review.whamcloud.com/#/c/7200/ |
| Comment by James A Simmons [ 01/Aug/13 ] |
|
So much better than my patch. No kernel-side work needed. |
| Comment by Jodi Levi (Inactive) [ 18/Mar/14 ] |
|
Hongchao, |
| Comment by Hongchao Zhang [ 20/Mar/14 ] |
|
status update: |
| Comment by Hongchao Zhang [ 28/Apr/14 ] |
|
the patch has been updated |
| Comment by Nathan Rutman [ 25/Jun/14 ] |
|
For the first point, you mean avoiding the delay waiting for service shutdown, or avoiding the delay of having to wait for writes to flush before starting the service again (on the other node)? |
| Comment by James A Simmons [ 03/Jul/14 ] |
|
Patch is updated at http://review.whamcloud.com/#/c/7200 |
| Comment by James A Simmons [ 07/Jul/14 ] |
|
I'm running locally with the 7200 patch with no problems, but I see issues in Maloo. The reported logs lack important information. Can someone look into why this patch is failing when it shouldn't be? |
| Comment by Christopher Morrone [ 06/Jul/16 ] |
|
I know work is under way on this ticket. It would be nice to get a status update here, since the last comment was from two years ago. |
| Comment by Jian Yu [ 08/Jul/16 ] |
|
Patch http://review.whamcloud.com/7200 passed local testing on master branch. It's pending testing by autotest and review. |
| Comment by Peter Jones [ 12/Aug/16 ] |
|
I have dropped the priority to Critical rather than Blocker to more accurately reflect the situation with this important enhancement. |
| Comment by Peter Jones [ 20/Oct/16 ] |
|
Work is still ongoing for this effort but may not make 2.9 |
| Comment by Jean-Baptiste Riaux (Inactive) [ 06/Mar/17 ] |
|
Can we make sure it lands in 2.10? |
| Comment by James A Simmons [ 14/Mar/17 ] |
|
Jean have you been testing this patch? |
| Comment by Jian Yu [ 14/Mar/17 ] |
|
Yes, James. However, testing on the patch is blocked by |
| Comment by James A Simmons [ 26/Oct/17 ] |
|
|
| Comment by James A Simmons [ 05/Dec/17 ] |
|
Is it still planned to land this work for 2.11? |
| Comment by Gerrit Updater [ 15/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/7200/ |
| Comment by Peter Jones [ 15/Mar/18 ] |
|
Landed for 2.11 |