[LU-20] patchless server kernel Created: 17/Nov/10  Updated: 06/Sep/22  Resolved: 05/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Improvement Priority: Critical
Reporter: nasf (Inactive) Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: cea, llnl

Attachments: PNG File fio_sdck_block_size_read.png     PNG File fio_sdck_block_size_write.png     PNG File fio_sdck_io_depth_read.png     PNG File fio_sdck_io_depth_write.png     PNG File mdtest_create_8thr.png     PNG File mdtest_remove_8thr.png     PNG File mdtest_stat_8thr.png     PNG File sgpdd_16devs_rsz_read.png     PNG File sgpdd_16devs_rsz_write.png    
Issue Links:
Blocker
is blocking LU-9761 Add ldiskfs support to dkms for patch... Resolved
is blocked by LU-8685 Fix JBD2 issue in EL7 Kernels Resolved
Related
is related to LU-3406 Submit raid5-mmp-unplug-dev patch ups... Resolved
is related to LU-433 remove jbd2-jcberr patch from kernel Resolved
is related to LU-8729 conf-sanity test_84: FAIL: /dev/mappe... Resolved
is related to LU-9339 fix RHEL 7.2 project quota build error Resolved
is related to LU-2442 metadata performance degradation on c... Resolved
is related to LU-9698 osd-ldiskfs: unknown symbol error on ... Resolved
is related to LU-7643 Remove kernel version string from Lus... Resolved
is related to LU-2473 ldiskfs RHEL6.4 support Resolved
is related to LU-9111 update osd-ldiskfs to not depend on d... Resolved
is related to LUDOC-83 Patchless Server Doc Changes Resolved
Sub-Tasks:
Key
Summary
Type
Status
Assignee
LU-2498 set block device IO scheduler to "dea... Technical task Closed Prakash Surya  
LU-684 replace dev_rdonly kernel patch with ... Technical task Resolved Oleg Drokin  
LU-2564 move dynlock from ldiskfs to osd-ldiskfs Technical task Resolved WC Triage  
LU-3406 Submit raid5-mmp-unplug-dev patch ups... Technical task Resolved Bruno Faccini  
LU-3966 Submit quota lock improvement patches... Technical task Resolved Niu Yawei  
Bugzilla ID: 21524
Rank (Obsolete): 4869

 Description   

Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one either needs to be replaced with equivalent functionality that already exists in the kernel, or needs work to get the patch accepted upstream.

Corresponding Bugzilla link:
https://bugzilla.lustre.org/show_bug.cgi?id=21524



 Comments   
Comment by nasf (Inactive) [ 24/Nov/10 ]

Comparing b1_x and b2_x, the latter needs fewer kernel patches, so our evaluation effort for the patchless server will be based on b2_x. This is the list of kernel patches on the current master branch:

====================
2.6-rhel5-kgdb-ga.patch
bio_add_page.patch
blkdev_tunables-2.6-rhel5.patch
blkdev_tunables-2.6-sles11.patch
dev_read_only-2.6.18-vanilla.patch
dev_read_only-2.6.27-vanilla.patch
export-2.6.18-vanilla.patch
export-2.6.27-vanilla.patch
export-show_task-2.6.18-vanilla.patch
export_symbol_numa-2.6-fc5.patch
export_symbols-2.6.12.patch
export-symbols-for-dmu.patch
i_filter_data.patch
iopen-misc-2.6.22-vanilla.patch
jbd2-commit-timer-no-jiffies-rounding.diff
jbd2-jcberr-2.6-rhel5.patch
jbd2-jcberr-2.6-sles11.patch
jbd-jcberr-2.6.18-vanilla.patch
jbd-journal-chksum-2.6.18-vanilla.patch
jbd-slab-race-2.6-rhel5.patch
jbd-stats-2.6-rhel5.patch
lockdep_chains-2.6.18-vanilla.patch
lustre_version.patch
md-avoid-bug_on-when-bmc-overflow.patch
md-rebuild-policy.patch
mpt-fusion-max-sge.patch
prune-icache-use-trylock-rhel5.patch
quota-large-limits-rhel5.patch
quota-support-64-bit-quota-format.patch
raid5-configurable-cachesize-rhel5.patch
raid5-large-io-rhel5.patch
raid5-maxsectors-rhel5.patch
raid5-merge-ios-rhel5.patch
raid5-mmp-unplug-dev.patch
raid5-stats-rhel5.patch
raid5-stripe-by-stripe-handling-rhel5.patch
raid5-zerocopy-rhel5.patch
sd_iostats-2.6.27-vanilla.patch
sd_iostats-2.6-rhel5.patch
small-fixes-about-jbd.patch
vfs_races-2.6.22-vanilla.patch
vfs_races-2.6-rhel5.patch
====================

Comment by nasf (Inactive) [ 24/Nov/10 ]

Based on how they need to be handled, these patches can be divided into five categories:

====================
1. Patches no longer used by anyone; remove them directly.

1.1) export-show_task-2.6.18-vanilla.patch
This patch exports "sched_show_task()", which is no longer called in b2_x, so it can be dropped directly.

1.2) export_symbols-2.6.12.patch
This patch exports "is_subdir()", which is no longer called in b2_x, so it can be dropped directly.

1.3) i_filter_data.patch
This patch introduces a new member, "i_filterdata", into the "inode" structure. Currently, it is only used by SOM. Since SOM does not exist in any released branch yet, we can ignore it.

1.4) jbd-slab-race-2.6-rhel5.patch
No one uses it any longer, remove it.

1.5) lockdep_chains-2.6.18-vanilla.patch
No one uses it any longer, remove it.

2. Accepted/fixed by upstream kernel.

2.1) bio_add_page.patch
It is used on 2.6-sles11 systems and was introduced by the patch for bug 21137. Unfortunately, that patch caused other issues on b1_8 and has been reverted there, so it is not certain whether it will also be reverted from master. In any case, the related fix is included in the latest stable kernel, linux-2.6.36, so we can remove this patch.

2.2) jbd2-commit-timer-no-jiffies-rounding.diff
As commented by Andreas in bug 21524:
"Fixes a bug where transactions were not committed if there was no filesystem activity. This should be in the upstream 2.6.32 kernel. Not needed for DMU."
The latest stable kernel linux-2.6.36 has fixed it with "round_jiffies_up()". So we can remove this patch.

2.3) md-avoid-bug_on-when-bmc-overflow.patch
The latest stable kernel, linux-2.6.36, already contains this patch, so we can remove it.

2.4) quota-large-limits-rhel5.patch
The latest stable kernel, linux-2.6.36, already contains this patch, so we can remove it.

2.5) quota-support-64-bit-quota-format.patch
The latest stable kernel, linux-2.6.36, already contains this patch, so we can remove it.

2.6) small-fixes-about-jbd.patch
The latest stable kernel, linux-2.6.36, has already fixed this, so we can remove this patch.

3. Patches whose functionality can be replaced by an existing interface or mechanism.

3.1) export-2.6.{18,27}-vanilla.patch
This patch exports "jbd2_log_start_commit()", "security_inode_unlink()" and "log_start_commit()". In the latest stable kernel, linux-2.6.36, "jbd2_log_start_commit()" and "log_start_commit()" are already exported by default. As for "security_inode_unlink()", it is used by "filter_vfs_unlink()" when destroying an OST object. That security check sits between the VFS unlink and the lower-layer ldiskfs unlink, and is unnecessary for Lustre, because Lustre does not need such security processing (SELinux/Smack, and so on) on the OSS side. So we can drop that call directly and remove this patch.

3.2) export_symbol_numa-2.6-fc5.patch
This patch exports "node_2_cpu_mask()". But in fact, in current b2_x, we only need "node_to_cpumask()", which can be implemented by "node_to_cpumask_map[]" in the latest stable kernel linux-2.6.36. So we can remove this patch.

3.3) iopen-misc-2.6.22-vanilla.patch vfs_races-2.6.22-vanilla.patch vfs_races-2.6-rhel5.patch
These patches only introduce and export "d_rehash_cond()" and "d_move_locked()"; these two functions are used by ldiskfs to provide the _iopen_ functionality needed for open-by-fid and recovery.
There are several ways to remove the patch. One is as described by Andreas in bug 21524:
"With the introduction of the filesystem "export operations" used by NFS, it should be possible to replace the use of _iopen_ (in particular mds_fid2dentry()) with sb->s_export_op->get_dentry().
This uses the NFS filehandle (fh) data to do a lookup-by-ino+gen in a dcache safe manner. Unfortunately, the default encoding of the fh is hidden in fs/exportfs/expfs.c::export_encode_fh() if it isn't explicitly specified in sb->s_export_op (which it isn't by default in ext3/ext4). That said, the encoding is very simple and it would be possible to replicate it and install a duplicate->encode_fh() method, since it is only used local to ldiskfs."
But I think that is not the simplest way. I prefer another approach:
Introduce a new spinlock, "ldiskfs_dcache_lock", to wrap the kernel spinlock "dcache_lock", then replace "d_rehash_cond()" with "d_rehash()" and "d_move_locked()" with "d_move()". Since "dcache_lock" is not held when "d_rehash()" and "d_move()" are called, the "ldiskfs_dcache_lock" protects the race window. The call sequence looks like:
spin_lock(&ldiskfs_dcache_lock);
spin_lock(&dcache_lock);
/* dcache processing */
spin_unlock(&dcache_lock);
d_rehash() / d_move();
spin_unlock(&ldiskfs_dcache_lock);
We used a similar approach for the patchless client; I think it should work well for the patchless server as well.

3.4) jbd2-jcberr-2.6-{rhel5,sles11}.patch jbd-jcberr-2.6.18-vanilla.patch
As commented by Andreas in bug 21524:
"This allows the jbd transaction commit callbacks to be registered. The ext4 jbd2 code has a different commit callback (one per transaction) that could be used to provide equivalent functionality. This would require modifying the existing ext4 commit callback (used by mballoc when freeing data blocks) to be multiplexed so it will store 2 different callback functions and 2 different lists of callback data. Not needed for DMU."
We can use the "journal_s->j_commit_callback" hook to replace the current journal commit callbacks, so we can remove this patch. But that also means only "jbd2 + ext4-ldiskfs" can be used for the patchless server in the future.

3.5) jbd-journal-chksum-2.6.18-vanilla.patch
This patch enables "journal_checksum" for "jbd"; similar functionality has been implemented in "jbd2". So to enable "journal_checksum" support, we can use the "jbd2 + ext4-ldiskfs" mode instead of "patched-jbd + ext3-ldiskfs". We can simply ignore this patch for the patchless server.

3.6) jbd-stats-2.6-rhel5.patch
This patch supplies more statistical information for "jbd"; similar functionality has been implemented in "jbd2". So to get this information while Lustre runs, we can use the "jbd2 + ext4-ldiskfs" mode instead of "patched-jbd + ext3-ldiskfs". We can simply ignore this patch for the patchless server.

3.7) lustre_version.patch
This patch only defines "LUSTRE_KERNEL_VERSION"; we can define it inside Lustre itself and remove this patch.

3.8) prune-icache-use-trylock-rhel5.patch
This patch was introduced for bug 18399, which has been resolved through other means within Lustre, so we can remove it directly.

4. Performance-related kernel patches that need to be, but have not yet been, accepted by the upstream kernel.

4.1) blkdev_tunables-2.6-{rhel5,sles11}.patch
As comment by Andreas in bug 21524:
"Increases the default maximum device sectors to 1MB. This is important for MD RAID devices because they are not able to grow beyond the default amount, only shrink. This isn't really needed for hardware RAID or DMU, but in theory could be accepted by upstream kernel, and/or the MD RAID devices could be patched to properly handle the case when the component device max_sectors is larger than default."
Unfortunately, the latest stable kernel, linux-2.6.36, does not contain the related fix yet. On the other hand, consider comment #4 for bug 21524:
"I also wonder if the blkdev-tunables-2.6.27-sles11.patch is still required. Ever since 2.6.24/25 there is improved scatter gathering, which sends chained sg lists to the device. Unfortunately, I didn't have the chance yet to test how well that works."
So under patchless-server mode, we are not sure whether performance on MD RAID devices will be affected.

4.2) md-rebuild-policy.patch
This patch optimizes MD device rebuilding. In theory it could be accepted by the upstream kernel. Unfortunately, the latest stable kernel, linux-2.6.36, does not contain the related patch yet.

4.3) mpt-fusion-max-sge.patch
This patch enlarges the FUSION_MAX_SGE range, which affects scatter-gather I/O performance (introduced by the patch for bug 17086). In theory it could be accepted by the upstream kernel. Unfortunately, the latest stable kernel, linux-2.6.36, does not contain the related patch yet.

4.4) raid5-{configurable-cachesize,large-io,maxsectors,merge-ios,mmp-unplug,stats,stripe-by-stripe-handling,zerocopy}-rhel5.patch
This series of raid5-related patches improves Lustre performance and system load on raid5 devices. In theory they could be accepted by the upstream kernel. Unfortunately, the latest stable kernel, linux-2.6.36, does not contain these patches yet.

5. Other optional kernel patches.

5.1) 2.6-rhel5-kgdb-ga.patch
This is an optional kernel patch for kernel-level GDB support. Lustre runs fine without it; nothing needs to be done here.

5.2) dev_read_only-2.6.{18,27}-vanilla.patch
This patch is now used only for testing, to simulate a server crash for ldiskfs by discarding all writes to the filesystem. Lustre runs fine without it, but some recovery/replay related tests cannot be run.
As commented by Andreas in bug 21524:
"For recovery testing we could simulate this by using a special loopback or DM device that also discards writes to the device."
In any case, simulating a server crash that loses write requests is not a small amount of work without a patched kernel.

5.3) export-symbols-for-dmu.patch
This is only used for the DMU in the future. I am not familiar with the DMU, but it is quite different from the current ldiskfs-based backend, and many of the current kernel patches are meaningless for the DMU. We can ignore this patch for the patchless server for now.

5.4) sd_iostats-2.6{.27-vanilla,-rhel5}.patch
As commented by Andreas in bug 21524:
"Purely needed for informational purposes. In principle, James Bottomley would be interested to have such stats in the upstream kernel, but he didn't want the current implementation since it is using /proc instead of /sys. Can be dropped until a need arises to port it again."
====================

Comment by nasf (Inactive) [ 24/Nov/10 ]

As described above, category 4 is the most difficult to resolve for the patchless server, because we cannot guarantee performance without those kernel patches (we need more time and resources to investigate and verify performance under patchless-server mode; maybe some of them are unnecessary because of newly introduced mechanisms). Another issue is that we would have to port Lustre to some specific newer kernel to obtain certain kernel functionality (without the above kernel patches applied), but whether customers can accept that is uncertain: if a customer is willing to upgrade the server kernel to a specific version, they may encounter similar driver (upgrade/recompile) issues as with the patched server before.

So I think it is better to remove those kernel patches step by step as the kernel is upgraded, unless some customers require the patchless server very urgently.

Comment by Robert Read (Inactive) [ 25/Nov/10 ]

Thanks, nice analysis. I think the general expectation is this will be possible only with newer kernels, so we can start this with SLES 11 SP1 and RHEL 6 (2.6.32) and above. Note the RHEL 6 port (22375) has fewer patches already, so it seems a lot of this work has already been done.

Comment by Ashley Pittman (Inactive) [ 25/Nov/10 ]

To be honest, something that only works on HEAD kernel.org kernels, and hence on RHEL kernels 12-18 months down the road, is all I'm expecting; the fact that it will take that long to filter down into distributions in no way undermines the need from our perspective.

Comment by Robert Read (Inactive) [ 25/Nov/10 ]

The main reason for us to support patchless kernels is so users can use their vendor kernels, and not lose their vendor support, in theory. It also simplifies our build and maintenance. Supporting the tip of kernel.org is a secondary, though also desirable, effect. However, as mentioned on the mailing list, supporting new kernel versions will generally still require changes to Lustre, so it still won't necessarily be free.

Comment by Ashley Pittman (Inactive) [ 25/Nov/10 ]

I think we're talking about the same thing. I was assuming that kernel.org would be a step along the road to vendor kernels, but if it's the other way around that's fine also; at the end of the day it's vendor kernels that matter.

I'm not expecting the maintenance burden to go away even with patchless kernels; it's the support, distribution and security-update headache that comes with having to build custom kernels that I'm interested in avoiding.

Comment by Robert Read (Inactive) [ 25/Nov/10 ]

Yes, sounds like we are in violent agreement.

Comment by nasf (Inactive) [ 29/Nov/10 ]

Considering the latest RHEL6 release (linux-2.6.32-71), all of the patches above need the same handling as for kernel linux-2.6.36.

Comment by Build Master (Inactive) [ 28/Jul/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #233
LU-20 remove unnecessary DMU export patch

Oleg Drokin : d4ea36c7373eddb01bcdda32ca8894764f61e1cb
Files :

  • lustre/kernel_patches/series/2.6-rhel5.series
  • lustre/kernel_patches/patches/export-symbols-for-dmu.patch
Comment by Nathan Rutman [ 04/Oct/11 ]

LU-723 includes patches to separate the ldiskfs build from the Lustre build, which should help in the creation of a patchless server.

Comment by Build Master (Inactive) [ 27/Oct/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #317
LU-20 ldiskfs: remove spurious warning message (Revision 6541ea42978e73e0262273f6bd5ce0b71689fdc2)

Result = SUCCESS
Oleg Drokin : 6541ea42978e73e0262273f6bd5ce0b71689fdc2
Files :

  • ldiskfs/kernel_patches/patches/ext4-remove-extents-warning-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
Comment by Andreas Dilger [ 31/Aug/12 ]

Current patch list for RHEL6:

lustre_version.patch
mpt-fusion-max-sge-rhel6.patch
raid5-mmp-unplug-dev-rhel6.patch
vfs_races-2.6.32-rhel6.patch
dev_read_only-2.6.32-rhel6.patch
blkdev_tunables-2.6-rhel6.patch
export-2.6.32-vanilla.patch
jbd2-jcberr-2.6-rhel6.patch
bh_lru_size_config.patch

Patches that are currently kept only to simplify Lustre/kernel upgrade/testing and can be removed at any time:

  • lustre_version
  • vfs_races
  • jbd2-jcberr
  • export-2.6.32-vanilla - only needed for obdfilter and will become optional when it is removed

Patches that need some work to be removed:

  • dev_read_only - needs to be replaced by "dm-flakey" module, only needed for acceptance-small testing
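
For illustration, a minimal sketch of how dm-flakey could stand in for dev_read_only in recovery testing (device name and intervals are placeholders, not the actual LU-684 implementation):

    # build a flakey target over the OST device; pass I/O for 60s, then
    # silently drop writes (reads still succeed), roughly emulating a crashed server
    SECTORS=$(blockdev --getsz /dev/sdb)
    dmsetup create ost0-flakey --table "0 $SECTORS flakey /dev/sdb 0 60 999999 1 drop_writes"
    # ... run the failover/replay test against /dev/mapper/ost0-flakey ...
    dmsetup remove ost0-flakey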

Performance/functionality patches that could/should be submitted upstream:

  • mpt-fusion-max-sge-rhel6
  • raid5-mmp-unplug-dev
  • bh_lru_size_config
Comment by Andreas Dilger [ 20/Oct/12 ]

Note that it is possible to build/run ZFS based servers without any kernel patches using the --without-ldiskfs configure option. This would allow fixing Lustre kernel build issues for the newer kernels before the ldiskfs patch is complete.
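
A minimal sketch of such a patchless, ZFS-only server build (the ZFS source path and the rpm target are examples and depend on the local setup):

    sh ./autogen.sh
    ./configure --without-ldiskfs --with-zfs=/usr/src/zfs   # skip ldiskfs and its patched kernel entirely
    make rpms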

Comment by Bruno Faccini (Inactive) [ 24/May/13 ]

Andreas, nasf,
The current 2.6-rhel6.series list in master contains:

lustre_version.patch
mpt-fusion-max-sge-rhel6.patch
raid5-mmp-unplug-dev-rhel6.patch
dev_read_only-2.6.32-rhel6.patch
blkdev_tunables-2.6-rhel6.patch
export-2.6.32-vanilla.patch
jbd2-jcberr-2.6-rhel6.patch
bh_lru_size_config.patch
replace_dqptr_sem.patch
ipoib-locking-fix.patch

So if I refer to your last review and classification, there are two new patches to take care of, replace_dqptr_sem and ipoib-locking-fix, plus blkdev_tunables, which you did not comment on in your last update. But do we know whether the status of the "old" ones has changed since the end of August 2012?

Comment by Andreas Dilger [ 25/May/13 ]

Bruno,
The replace-dqptr-sem patch is needed for performance only, so it is not a hard requirement. Ideally, this would be pushed upstream by Niu or Lai and not kept in our patch series. Ipoib-locking is already in the upstream kernels, and in RHEL 6.5, so we won't need it for long.

I think it would be worthwhile to try pushing bh_lru_size upstream as a patch that is not changing the default value, but at least making it a kernel tunable. If you could come up with some benchmark that showed real performance improvements (e.g. ext4 + mdtest in a large directory with quotas enabled), then we might get it upstream with the larger LRU size.

Similarly, blkdev_tunables would need to have some real benchmark results to get it upstream.

The raid5-mmp-unplug patch could probably be accepted easily if you first do a code inspection of the MD RAID code to verify that the "sync read" flag (or whatever it is) is still valid with newer kernels, and is not only in the old raid5 patch series we used to have.

Comment by Bruno Faccini (Inactive) [ 25/May/13 ]

Thanks for your help and advice, Andreas. I will try to get something done in the directions you indicated.

Comment by Bruno Faccini (Inactive) [ 31/May/13 ]

Concerning raid5-mmp-unplug, I will work on it as part of LU-3406, which you created specifically for it.
Maybe we should also add an "is related" reference to LU-3406 here?

Comment by Andreas Dilger [ 08/Oct/13 ]

I submitted http://review.whamcloud.com/7881 to remove the old jbd2-jcberr and lustre-version patches from the RHEL6 and SLES11SP2 series files. Those have not been required since Lustre 2.2, and can be removed in Lustre 2.6 since there is no ability to upgrade older Lustre to 2.6+ directly anyway.

Comment by James A Simmons [ 08/Oct/13 ]

Besides those patches, we also have export-2.6.32-vanilla.patch, which is no longer needed since Lustre now uses the OSD layer. There is also bio_add_page.patch, which was for an old SLES11 SP1 problem but is no longer used, so it can go away as well.

Comment by Sebastien Buisson (Inactive) [ 07/Feb/14 ]

Hi,

Working toward Red Hat adoption of Lustre kernel patches, I have been studying the effects of the blkdev_tunables patch on I/O performance.

Here is a comprehensive description of what has been done, and what has been found out. In short, I was not able to see the benefit of this patch in Bull's typical IO server environment.

Initial objective:
==================

Prove usefulness of patch blkdev_tunables-3.0-sles11.patch when doing 1MB IOs. We expect to demonstrate the benefits of this patch in terms of performance when issuing 1MB IOs to disks.

Tests carried out:
==================

  • Hardware configuration:

Storage array:
DDN SFA10Kt, FC connection to the server.
7 TB LUN, RAID 8+2 10 disks.

Node:
2 sockets/16 cores Intel Xeon SandyBridge E5-2660 2.20GHz, 32GB memory
Emulex LPe12002-M8 8Gb 2-port PCIe Fibre Channel Adapter

  • Software configuration:

Kernels tested:
2.6.32-358.el6.x86_64 (RHEL6.4)
2.6.32-358.el6.x86_64 + blkdev_tunables-2.6-rhel6.patch

On all devices, for the two kernels tested, we apply the following tuning parameters:
queue/scheduler -> noop
queue/read_ahead_kb -> 0
queue/nr_requests -> 256
queue/max_sectors_kb -> 32767
Moreover, we set the following kernel module parameter values for both kernels:
options lpfc lpfc_sg_seg_cnt=256
options lpfc lpfc_hba_queue_depth=512
options lpfc lpfc_lun_queue_depth=30
options mpt2sas max_sgl_entries=256
options mpt2sas max_sectors=4096
Finally, with the patched kernel, we set:
default_max_sectors -> 2048
default_max_segments -> 256
CONFIG_SCSI_MAX_SG_SEGMENTS -> 256
and default values for others.

  • Benchmarks:

All benchmarks are run 10 times to get a relevant average value.

. fio on 1 device, varying block size (iodepth=16 ioengine=libaio direct=1)

See fio_sdck_block_size_read.png and fio_sdck_block_size_write.png.

. fio on 1 device, varying IO depth (blocksize=1m ioengine=libaio direct=1)

See fio_sdck_io_depth_read.png and fio_sdck_io_depth_write.png.

. sgpdd_survey on 16 devices, varying record size (crg=1 thr=16)

See sgpdd_16devs_rsz_read.png and sgpdd_16devs_rsz_write.png.
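
For reference, the first fio case corresponds to a command line along these lines (job name is arbitrary, /dev/sdck matches the attached graphs, --rw=read for the read runs, and the block size is swept over the range shown):

    fio --name=bs_sweep --filename=/dev/sdck --rw=write --bs=1m \
        --iodepth=16 --ioengine=libaio --direct=1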

  • Results:

No significant improvement was shown by the tests, when hitting just one device or the whole storage array.
With the blkdev_tunables patch applied, we even see a drop in write performance between 512KB and 1MB IOs with fio on a single device.

Code analysis:
==============

The patch is named blkdev_tunables-3.0-sles11.patch and is taken from the Lustre master branch. It does not correspond exactly to the RHEL sources but, unlike the blkdev patch available for RHEL in the Lustre source tree, this one introduces new kernel parameters instead of simply modifying defines.

  • Part related to max_sectors in the block layer:
--- a/block/blk-settings.c	2013-02-06 12:40:44.000000000 -0500
+++ b/block/blk-settings.c	2013-02-06 12:55:28.000000000 -0500
@@ -19,6 +19,12 @@
[...]
+int default_max_sectors = BLK_DEF_MAX_SECTORS;
+module_param(default_max_sectors, int, 0);
[...]
@@ -255,7 +261,7 @@

    	limits->max_hw_sectors = max_hw_sectors;
    	limits->max_sectors = min_t(unsigned int, max_hw_sectors,
-				    BLK_DEF_MAX_SECTORS);
+				    default_max_sectors);
    }
    EXPORT_SYMBOL(blk_limits_max_hw_sectors);

For information, note that (include/linux/blkdev.h):

BLK_DEF_MAX_SECTORS     = 1024

But in block/blk-settings.c, we have this interesting comment:

/**
    *    max_sectors is a soft limit imposed by the block layer for
    *    filesystem type requests.  This value can be overridden on a
    *    per-device basis in /sys/block/<device>/queue/max_sectors_kb.
    *    The soft limit can not exceed max_hw_sectors.
    **/

And in block/blk-sysfs.c we can find the corresponding function, which is:

static ssize_t
queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)

So adding a new 'default_max_sectors' kernel module option does not seem very useful, as max_sectors_kb under /sys can be used to change this value at runtime.
Note that at Bull we always set max_sectors_kb to its maximum value (max_hw_sectors_kb).
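
For example, the same effect as the default_max_sectors module parameter can be obtained at runtime (device name is an example):

    cat /sys/block/sdck/queue/max_hw_sectors_kb        # hardware limit
    echo 32767 > /sys/block/sdck/queue/max_sectors_kb  # raise the soft limit, as in the tuning above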

  • Part related to max_segments in the block layer:
--- a/block/blk-settings.c	2013-02-06 12:40:44.000000000 -0500
+++ b/block/blk-settings.c	2013-02-06 12:55:28.000000000 -0500
@@ -19,6 +19,12 @@
[...]
+int default_max_segments = BLK_MAX_SEGMENTS;
+module_param(default_max_segments, int, 0);
[...]
@@ -108,7 +114,7 @@
     */
    void blk_set_default_limits(struct queue_limits *lim)
    {
-	lim->max_segments = BLK_MAX_SEGMENTS;
+	lim->max_segments = default_max_segments;
    	lim->max_integrity_segments = 0;
    	lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
    	lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;

For information, note that (include/linux/blkdev.h):

BLK_MAX_SEGMENTS        = 256

But in drivers/scsi/scsi_lib.c, we have:

struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
                                            request_fn_proc *request_fn)
{
[...]
           blk_queue_max_segments(q, min_t(unsigned short,
                        shost->sg_tablesize,
SCSI_MAX_SG_CHAIN_SEGMENTS));
[...]
}

So when using SCSI devices, the default value of max_segments is always overwritten by the min of (shost->sg_tablesize, SCSI_MAX_SG_CHAIN_SEGMENTS).
Note that at Bull we always use SCSI devices when performance matters.

For information, note that (include/scsi/scsi.h):

/*
  * Like SCSI_MAX_SG_SEGMENTS, but for archs that have sg chaining. This limit
  * is totally arbitrary, a setting of 2048 will get you at least 8mb ios.
  */
#ifdef ARCH_HAS_SG_CHAIN
#define SCSI_MAX_SG_CHAIN_SEGMENTS      2048
#else
#define SCSI_MAX_SG_CHAIN_SEGMENTS      SCSI_MAX_SG_SEGMENTS
#endif
  • Part related to max_sg_segments in the scsi layer:
--- a/drivers/scsi/Kconfig	2013-02-07 09:25:49.000000000 -0500
+++ b/drivers/scsi/Kconfig	2013-02-07 09:30:15.000000000 -0500
@@ -245,6 +245,15 @@ config SCSI_SCAN_ASYNC
	  there should be no noticeable performance impact as long as you have
	  logging turned off.

+config SCSI_MAX_SG_SEGMENTS
+	int "Maximum SCSI scatter gather segment size"
+	range 32 256
+	default "128"
+	depends on SCSI
+	help
+	  Control the maximum limit for scatter gather buffers for the
+	  SCSI device.
+
    config SCSI_SCAN_ASYNC
    	bool "Asynchronous SCSI scanning"
    	depends on SCSI
--- a/include/scsi/scsi.h	2013-02-07 09:55:02.000000000 -0500
+++ b/include/scsi/scsi.h	2013-02-07 09:55:20.000000000 -0500
@@ -20,7 +20,7 @@ struct scsi_cmnd;
     * to SG_MAX_SINGLE_ALLOC to pack correctly at the highest order.  The
     * minimum value is 32
     */
-#define SCSI_MAX_SG_SEGMENTS	128
+#define SCSI_MAX_SG_SEGMENTS	CONFIG_SCSI_MAX_SG_SEGMENTS

    /*
     * Like SCSI_MAX_SG_SEGMENTS, but for archs that have sg chaining.

But, in include/scsi/scsi.h we can read this interesting comment:

/*
    * The maximum number of SG segments that we will put inside a
    * scatterlist (unless chaining is used). Should ideally fit inside a
    * single page, to avoid a higher order allocation.  We could define this
    * to SG_MAX_SINGLE_ALLOC to pack correctly at the highest order.  The
    * minimum value is 32
    */
#define SCSI_MAX_SG_SEGMENTS    256

And we do have sg chaining on our architecture, so increasing SCSI_MAX_SG_SEGMENTS has no direct impact.

However this modification can have a side effect as we have:

#define SG_ALL      SCSI_MAX_SG_SEGMENTS

But again, for all the device drivers we use at Bull, SG_ALL is not used, as sg_tablesize is always set via kernel module options (lpfc: lpfc_sg_seg_cnt, mpt2sas: max_sgl_entries).

  • Part related to isci driver:
--- a/drivers/scsi/isci/init.c	2013-02-08 10:13:00.000000000 -0500
+++ b/drivers/scsi/isci/init.c	2013-02-08 10:15:04.000000000 -0500
@@ -118,6 +118,10 @@ unsigned char phy_gen = 3;
    module_param(phy_gen, byte, 0);
    MODULE_PARM_DESC(phy_gen, "PHY generation (1: 1.5Gbps 2: 3.0Gbps 3: 6.0Gbps)");

+u16 sg_table_size = SG_ALL;
+module_param(sg_table_size, ushort, 0);
+MODULE_PARM_DESC(sg_table_size, "Size in KB of scatter gather table");
+
    unsigned char max_concurr_spinup = 1;
    module_param(max_concurr_spinup, byte, 0);
    MODULE_PARM_DESC(max_concurr_spinup, "Max concurrent device spinup");
@@ -155,7 +159,6 @@ static struct scsi_host_template isci_sh
    	.can_queue			= ISCI_CAN_QUEUE_VAL,
    	.cmd_per_lun			= 1,
    	.this_id			= -1,
-	.sg_tablesize			= SG_ALL,
    	.max_sectors			= SCSI_DEFAULT_MAX_SECTORS,
    	.use_clustering			= ENABLE_CLUSTERING,
    	.eh_device_reset_handler	= sas_eh_device_reset_handler,
@@ -407,6 +410,7 @@ static struct isci_host *isci_host_alloc
    	isci_host->pdev = pdev;
    	isci_host->id = id;

+	isci_sht.sg_tablesize = sg_table_size;
    	shost = scsi_host_alloc(&isci_sht, sizeof(void *));
    	if (!shost)
    		return NULL;

But isci (Intel(R) C600 series chipset) is not used at Bull.

  • Part related to MPT Fusion driver:
--- a/drivers/message/fusion/Kconfig	2013-02-08 10:21:25.000000000 -0500
+++ b/drivers/message/fusion/Kconfig	2013-02-08 10:22:37.000000000 -0500
@@ -61,9 +61,9 @@
    	  LSISAS1078

    config FUSION_MAX_SGE
-	int "Maximum number of scatter gather entries for SAS and SPI (16 - 128)"
-	default "128"
-	range 16 128
+	int "Maximum number of scatter gather entries for SAS and SPI (16 - 256)"
+	default "256"
+	range 16 256
    	help
    	  This option allows you to specify the maximum number of scatter-
    	  gather entries per I/O. The driver default is 128, which matches
--- a/drivers/message/fusion/mptbase.h	2013-02-08 10:32:45.000000000 -0500
+++ b/drivers/message/fusion/mptbase.h	2013-02-08 10:32:55.000000000 -0500
@@ -168,8 +168,8 @@
    #ifdef  CONFIG_FUSION_MAX_SGE
    #if     CONFIG_FUSION_MAX_SGE  < 16
    #define MPT_SCSI_SG_DEPTH	16
-#elif   CONFIG_FUSION_MAX_SGE  > 128
-#define MPT_SCSI_SG_DEPTH	128
+#elif   CONFIG_FUSION_MAX_SGE  > 256
+#define MPT_SCSI_SG_DEPTH	256
    #else
    #define MPT_SCSI_SG_DEPTH	CONFIG_FUSION_MAX_SGE
    #endif

But MPT Fusion driver is not used at Bull.

General conclusion:
===================

Bull will not require the adoption of this kernel patch in future RHEL releases.

Comment by Andreas Dilger [ 06/Mar/14 ]

Hi Sebastien, sorry for taking so long to look at your comments, I was very busy with patch landings before the 2.6 feature freeze. Some replies to your comments:

Prove usefulness of patch blkdev_tunables-3.0-sles11.patch when doing 1MB IOs.

So adding a new 'default_max_sectors' kernel module option does not seem very useful, as max_sectors_kb under /sys can be used to change this value at runtime.
Note that at Bull we always set max_sectors_kb to its maximum value (max_hw_sectors_kb).

One reason for the default_max_sectors change is historical: on MD RAID devices there was a bug where it was not possible to tune the block device to have a larger request size than the minimum of any of the underlying devices, and since MD RAID devices are configured at boot, changing the request size via the /sys tunable afterward was not possible. That bug looks to be fixed in Linux 3.2 via b1bd055d3, and is only relevant for systems that use MD RAID. It also wasn't always the case that mount.lustre would adjust the max_sectors_kb tunable for Lustre devices at startup time.

So when using SCSI devices, the default value of max_segments is always overwritten by the min of (shost->sg_tablesize, SCSI_MAX_SG_CHAIN_SEGMENTS).
Note that at Bull we always use SCSI devices when performance matters.
...
And we do have sg chaining on our architecture. So increasing SCSI_MAX_SG_SEGMENTS has no direct impact.
for all the device drivers we use at Bull, SG_ALL is not used, as sg_tablesize is always set via kernel module options (lpfc: lpfc_sg_seg_cnt, mpt2sas: max_sgl_entries).
But isci (Intel(R) C600 series chipset) is not used at Bull.
But MPT Fusion driver is not used at Bull.

What this means is that without setting sg_tablesize via module options, this will prevent 1MB RPC sizes from being submitted to disk. I agree that this is not a factor with your configuration and drivers, but the intent is to improve performance for all Lustre users and to make it easier for them to deploy a Lustre solution that performs as fast as possible. I suspect many sites do not have the expertise and understanding of the hardware configuration to increase the SG size via module parameters as you do.

It seems to me that it would be better to avoid setting lpfc_sg_seg_cnt=256 for the lpfc driver and max_sgl_entries=256 for the mpt2sas driver (and also for mpt3sas), and for each of the drivers separately, and instead have a single option such as default_sg_tablesize=256 or default_max_segments=256 that wouldn't have to be set in a different way for each driver. Unfortunately, there is no method for setting /sys/block/*/queue/max_segments at runtime, or we could do it for Lustre target devices at setup time via mount.lustre.
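
For illustration (device name is an example), the corresponding queue attribute can only be inspected, not changed:

    cat /sys/block/sdck/queue/max_segments           # readable
    echo 256 > /sys/block/sdck/queue/max_segments    # fails: the sysfs attribute is read-only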

Comment by Sebastien Buisson (Inactive) [ 06/Mar/14 ]

Hi Andreas,

Thanks for your insightful comments.

From what I understand :

  • default_max_sectors:
    A bug related to MD RAID has been fixed, so the part of the patch aiming at working around this issue is no longer necessary.
  • default_max_segments:
    This part of the patch has no impact with at least lpfc and mpt{2,3}sas, because modifying the #defines or adding the default_max_segments module option is always overridden at the driver level. I have also checked that ib_srp behaves the same as lpfc and mpt{2,3}sas in this respect. So with the patch, if people use these drivers but do not set their module options correctly, they would think they are doing 1MB IOs whereas that is not the case. In the end, the patch would be useful only when the driver used does not overwrite sg_tablesize and the SCSI layer is not involved.
    The other approach, setting these parameters for Lustre targets at setup time via mount.lustre, is not an option because, as you pointed out, the kernel does not offer any possibility to change max_segments at runtime (due to the implications on lower layers).

Thanks,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 02/Apr/14 ]

Hi,

Working toward Red Hat adoption of Lustre kernel patches, I have been studying the effects of the bh_lru_size_config.patch on metadata performance.

To do so, I ran tests with mdtest on one ext4-formatted device, creating, stat'ing and removing 1000000 files in the same directory. I have several test cases, varying the 'size' of the directory in which files are created:

  • target directory is empty
  • target directory already contains 100000 files
  • target directory already contains 500000 files
  • target directory already contains 2000000 files
  • target directory already contains 5000000 files
  • target directory already contains 10000000 files

To compare the effect of the patch, I ran the same series of tests with:

  • a standard RHEL 6.4 kernel
  • a RHEL 6.4 patched kernel with BH_LRU_SIZE set to 16
  • a RHEL 6.4 patched kernel with BH_LRU_SIZE set to 32
  • a RHEL 6.4 patched kernel with BH_LRU_SIZE set to 64

mdtest was run with 8 tasks, and quotas were enabled on the partition.
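
For reference, each run corresponds to an invocation along these lines (mount point is an example, 1000000 files split across the 8 tasks, and the exact flags may differ from the runs described here):

    mpirun -np 8 mdtest -F -n 125000 -d /mnt/ext4test/dir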

The performance I got is illustrated by figures in files mdtest_create_8thr.png, mdtest_remove_8thr.png and mdtest_stat_8thr.png.
Unfortunately, these figures do not demonstrate any tangible benefit from increasing BH_LRU_SIZE.

Considering the comments about BH_LRU_SIZE in fs/buffer.c, I had a look at the function __find_get_block(), and I used SystemTap to determine whether the buffer head was found directly via lookup_bh_lru() (FAST), or more slowly with __find_get_block_slow() (SLOW), or not found at all (__find_get_block() returns NULL) (FAIL). I got the following statistics, comparing the behavior when using an empty target directory and one containing 10000000 files.

                    empty dir    10000000 files
FAST (lookup_bh_lru())
  std rhel6.4         77.89%         75.93%
  bh_lru_size=16      78.99%         76.07%
  bh_lru_size=64      84.08%         76.98%
SLOW (__find_get_block_slow())
  std rhel6.4         20.60%         20.60%
  bh_lru_size=16      19.50%         20.46%
  bh_lru_size=64      14.41%         19.55%
FAIL (__find_get_block() returns NULL)
  std rhel6.4          1.51%          3.47%
  bh_lru_size=16       1.51%          3.47%
  bh_lru_size=64       1.51%          3.47%

These figures show a delta of 6% on the FAST path when using an empty target directory, and a delta of 1% when using a target directory populated with 10000000 files. The improvement with an empty directory is very modest, and it is even smaller with a large directory.

All these results could mean I am not benchmarking the bh_lru_size patch the right way.
Could you please advise, and give some hints on what you think could be the best way to demonstrate usefulness of this patch?

Thanks,
Sebastien.

Comment by Andreas Dilger [ 03/Apr/14 ]

Sebastien,
There are some important things to consider when testing this patch. It matters in cases where a number of buffers are being accessed repeatedly, which does not happen under default ext4 testing conditions. In particular, when running under Lustre, there are extra files that are modified with each file created: the last_rcvd file, the ChangeLog, the quota files, and the LAST_ID file.

It makes more sense to use the mds_survey script to run these tests so that all of these extra files are updated.
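
As a rough sketch (parameter names follow the lustre-iokit mds-survey script, and the MDT target name is an example), such a run might look like:

    thrlo=1 thrhi=16 dir_count=1 file_count=1000000 \
      tests_str="create lookup destroy" targets=lustre-MDT0000 mds-survey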

I think it was Liang that added this patch when he was working on the MDS SMP scaling patch. Maybe he can point to the bug and test results where this was posted.

Comment by Liang Zhen (Inactive) [ 03/Apr/14 ]

Unfortunately I cannot find historical data, but as far as I can remember, I worked out this patch for "parallel directory operations" (PDO). The major reason we have this patch is that the Lustre stack consumes more LRU slots than plain VFS file access; for example, file creation in the Lustre MDT stack:

  • name find of ldiskfs will consume about 3 slots
  • creating inode will take about 3 slots
  • IAM will still take about 3 slots
  • name insert of ldiskfs will consume another 3-4 slots.
  • we also have some attr_set/xattr_set, which will access the LRU as well.

This means we can barely hit cached buffers in the LRU if its size is 8. Based on my vague memory, I tested in some extreme environments (MDT on SSD or ramdisk, zero-stripe-count files), and it showed something like a 5%+ performance improvement after increasing the LRU size to 16.

Comment by Sebastien Buisson (Inactive) [ 03/Apr/14 ]

Thanks Liang and Andreas.

I will keep you posted with the results of my tests with mds-survey.
However, I am a bit worried about the justification of this bh_lru_size patch for adoption in upstream kernel and RedHat, if we can demonstrate benefit only when running Lustre. Do you have any idea that would help generalize the performance benefit from this patch?

Thanks,
Sebastien.

Comment by Andreas Dilger [ 03/Apr/14 ]

It may be that for vanilla ext4 this difference would be visible if a large ACL is enabled (to spill into a separate block) along with SELinux.
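
A minimal sketch of forcing the xattrs into an external block and checking the result (paths, UIDs and device are examples):

    # keep adding ACL entries until they no longer fit in the inode
    for i in $(seq 1000 1200); do setfacl -m u:$i:r-x /mnt/ext4/testdir || break; done
    # a non-zero "File ACL" in debugfs confirms the external xattr block
    debugfs -R "stat /testdir" /dev/sdck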

Comment by Sebastien Buisson (Inactive) [ 03/Apr/14 ]

Oh great, I will try that, thanks!

Comment by Sebastien Buisson (Inactive) [ 18/Apr/14 ]

Hi,

I set a high number of ACLs on my test directory (named 10000000) with setfacl (until I got the 'setfacl: /mydir/10000000: No space left on device' message), but when I look at the directory with debugfs I see:

debugfs:  stat 10000000
Inode: 83297492   Type: directory    Mode:  0755   Flags: 0x81000
Generation: 1051070824    Version: 0x00000000:04692681
User: 15314   Group:  1638   Size: 458170368
File ACL: 1332742176    Directory ACL: 0
Links: 2   Blockcount: 894872
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x533ebb8b:2b4a703c -- Fri Apr  4 16:02:51 2014
 atime: 0x533e6ebd:8bddae58 -- Fri Apr  4 10:35:09 2014
 mtime: 0x5335792e:4acd3ed8 -- Fri Mar 28 14:29:18 2014
crtime: 0x5332a834:143eda38 -- Wed Mar 26 11:13:08 2014
Size of extra inode fields: 28
EXTENTS:
(0-32767):1332742177-1332774944, (32768-65535):1332774945-1332807712, (65536-98303):1332807713-1332840480, (98304-111857):1332840481-1332854034

So it seems I do not really manage to spill ACLs into a separate block, do I?

Comment by Andreas Dilger [ 18/Apr/14 ]

Yes, the "File ACL: 1332742176" line shows you have an external xattr block with the ACLs.

Just to clarify, the "ACL" name is for historic reasons and only indicates an xattr block, it does not necessarily mean there are ACLs in that block. It just happens that in this test the xattrs ARE ACLs.

Comment by Sebastien Buisson (Inactive) [ 16/May/14 ]

Hi,

I finally had the opportunity to run the benchmarks you suggested.
The tests I launched were:
(a) mdtest on ramdisk device, single shared dir, with large ACL and SELinux
(b) mdtest on ramdisk device, single shared dir, with large ACL but NO SELinux
(c) mds-survey on ramdisk device, quota enabled, shared directory
(d) mds-survey on ramdisk device, quota enabled, directory per process

The tables below show performance gain (in percentage) when increasing BH_LRU_SIZE to 16 (default value is 8).

(a)

files tasks dir size (#files) Creation Stat Removal
1000000 1 0 -8,7 -2,7 -0,5
1000000 1 100000 -5,2 -0,5 -1,1
1000000 1 500000 -5,1 -3,7 -1,5
1000000 1 2000000 -5,1 -4,0 -8,5
1000000 1 5000000 -4,2 -5,3 -10,2
1000000 1 10000000 -3,5 -8,0 -10,9
1000000 8 0 -0,3 -3,8 -1,2
1000000 8 100000 -1,2 -3,7 -1,5
1000000 8 500000 0,5 -3,2 -5,3
1000000 8 2000000 -1,7 -6,1 -8,7
1000000 8 5000000 -5,9 -7,7 -11,9
1000000 8 10000000 -4,1 -8,8 -13,6

(b)

files tasks dir size (#files) Creation Stat Removal
1000000 1 0 0,0 -0,9 -1,1
1000000 1 100000 1,0 -3,0 -3,5
1000000 1 500000 3,7 -3,0 -2,4
1000000 1 2000000 1,1 3,6 -0,2
1000000 1 5000000 3,5 0,1 5,9
1000000 1 10000000 9,0 3,8 6,4
1000000 8 0 2,4 -1,2 -4,3
1000000 8 100000 -0,2 -1,8 -2,4
1000000 8 500000 1,1 -0,3 2,0
1000000 8 2000000 -0,3 -2,8 -3,3
1000000 8 5000000 0,3 -3,1 -1,3
1000000 8 10000000 1,5 0,0 0,7

(c)

files dir threads create lookup destroy
1000000 1 1 11,3 1,2 7,2
1000000 1 2 6,4 2,3 6,9
1000000 1 4 1,9 3,0 1,3
1000000 1 8 -0,6 4,3 0,7
1000000 1 16 0,5 4,4 0,6

(d)

files dir threads create lookup destroy
1000000 4 4 3,2 28,5 5,3
1000000 8 8 1,2 33,9 2,0
1000000 16 16 0,6 7,9 -0,2



To sum up briefly, it is very difficult to show a performance improvement with mdtest. The only positive case is Create without SELinux when using 1 thread. Strangely, the more threads we have, the smaller the gain in performance.
We can see more improvement with mds-survey, but the same nearly-no-gain phenomenon appears as the number of threads increases.

Do you have any comments on this?
Do you think the mds-survey results would be enough to justify integration of the bh_lru_size patch in the upstream kernel? One thing that could help would be to add a kernel parameter instead of a configure time option, but I do not know if it is feasible.

Thanks,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 04/Jun/14 ]

Hi,

In order to ease adoption of buffer head per-CPU LRU size tuning in the upstream kernel, I propose a new patch that makes the bh_lru_size a kernel parameter, instead of a kernel config option:
http://review.whamcloud.com/10588
Could you please have a look at it and review it?

Once we come with a satisfying patch, I will try to propose it to the upstream kernel.

Thanks,
Sebastien.

Comment by Andreas Dilger [ 27/Jun/14 ]

Based on the response from Andrew Morton, maybe it would be better to just submit the original patch upstream (possibly with an improved patch description), with a default of 16 entries in the LRU cache and see if he accepts it?

Comment by Sebastien Buisson (Inactive) [ 27/Jun/14 ]

Hi Andreas,

I am glad you saw the discussion in the upstream kernel mailing lists.
Indeed, Andrew Morton seems to support the idea of increasing the default bh_lru_size value to 16, which in his view is no more arbitrary a value than 8.
So you are right, I will try to submit the original patch.

Cheers,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 27/Jun/14 ]

The discussion about inclusion in the upstream kernel is taking place here:
http://marc.info/?t=140362683900001&r=1&w=2

Comment by James A Simmons [ 30/Jun/14 ]

This is excellent news about the first patch being considered for merging upstream. I think we have reached the point where we should push our kernel patches upstream. We have two main outstanding pieces of work to push upstream: the quota improvements (LU-3966) and the mmp unplug patch (LU-3406). Let's get them up to date and push them upstream.

Comment by James A Simmons [ 03/Nov/14 ]

What happened to the push for the bh_lru_size patch going upstream?

Comment by Sebastien Buisson (Inactive) [ 04/Nov/14 ]

Hi,

The patch landed upstream:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=86cf78d73de8c6bfa89804b91ee0ace71a459961

Cheers,
Sebastien.

Comment by James A Simmons [ 04/Nov/14 ]

Oh, that is excellent news. It means we can ask Red Hat to integrate it into their kernel.

Comment by Sebastien Buisson (Inactive) [ 04/Nov/14 ]

We already did, in private bugzilla 1053108. But the more people ask, the sooner we will get it, I guess.

Comment by Andreas Dilger [ 04/Nov/14 ]

It would be good to change the patches in our tree to match the upstream one. Not only is the upstream patch less complex, but this will also avoid confusion when the vendor kernel includes the patch, so that nobody thinks we need to keep a patch with just the extra config options around.

Comment by Sebastien Buisson (Inactive) [ 05/Nov/14 ]

bh_lru_size patch modification in the Lustre sources is managed here:
http://review.whamcloud.com/12577

Comment by Gerrit Updater [ 19/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12577/
Subject: LU-20 kernel: increase BH_LRU_SIZE to 16
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 539d4de9218a16c0c3d110d4f873f67864cb4f8f

Comment by James A Simmons [ 31/Dec/15 ]

I was thinking that for 2.9 we could at least make patchless server kernels the default. With what is left, we could just patch the server kernel when --with-test is enabled. Does this sound reasonable?

Comment by Andreas Dilger [ 02/Jan/16 ]

James, the work to allow patchless kernels is largely done. What is needed is to update and merge the dm-fail patch to allow block device fault injection so that we can test recovery without the dev_rdonly patch. Otherwise, I'm not a big fan of doing all our testing on a kernel that isn't what we are giving to users. That work is largely complete in LU-684 but just needs to be taken to completion.

The other area of uncertainty is whether the quota patches from the 2.6-rhel6 patch series are still needed. They improve quota performance in general, and significantly so with multiple MDTs on the same node. I don't know if that code was already fixed in the upstream kernels, or if the patches were just not ported to the new kernels.

Comment by James A Simmons [ 04/Jan/16 ]

The quota improvements were all ported upstream. For RHEL7 and SLES12 you will notice that no quota patches are needed. We also have tunable patches, which Bull showed are not useful; they should be dropped at this point. Lastly, we have the raid5-mmp-unplug-dev patch, which was submitted to the dm-devel maintainers; they pointed out that it has serious flaws, so a lot of work is needed there.

Comment by Christopher Morrone [ 04/Jan/16 ]
 I'm not a big fan of doing all our testing on a kernel that isn't what we are giving to users

A fairly large constituency of Lustre users is not using your kernel already, and I expect that constituency to continue to grow thanks to at least three factors: possible packaging by a major distro, the Lustre kernel upstreaming effort, and ZFS. Your concern is certainly valid, but I think it is just a reality that we need to face. We will just have to make our best effort to make reasonable choices about what we test against. I don't think we should let that concern delay moving to patchless by default.

If folks can push LU-684 to completion in the 2.9 timeframe, then great; I am all for that. But I think James's suggestion is a pretty reasonable alternative if the LU-684 work doesn't look like it is going to happen in time.

But if we can agree that is reasonable, then it might also make sense to just make the change now, and then hope that the LU-684 work makes it unnecessary when 2.9 comes out.

Comment by James A Simmons [ 27/Apr/16 ]

It appears the work for LU-684 is falling behind. Should we look at my suggestion?

Comment by Christopher Morrone [ 01/Jun/16 ]

Reading this a second time, I'm not as sure what you are suggesting, James. Do you want to have Intel's build farm continue as it is for patches pushed to gerrit, but for builds of tags have it build against stock kernels?

Comment by James A Simmons [ 06/Jun/16 ]

Yes Chris that is what I'm suggesting.

Comment by Peter Jones [ 12/Aug/16 ]

I have dropped the priority to Critical rather than Blocker to more accurately reflect the situation with this important enhancement.

Comment by Peter Jones [ 20/Oct/16 ]

Work is still ongoing for this effort but may not make 2.9

Comment by Gerrit Updater [ 28/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/26220
Subject: LU-20 osd-ldiskfs: Make readonly patches optional
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48df3575beb8a2928957cd6f15ce3d27eb1807ef

Comment by James A Simmons [ 13/Apr/17 ]

With the landing of LU-4017, should we consider closing this work? It is no longer possible to run patchless kernels on the server in the ldiskfs case.

Comment by Christopher Morrone [ 13/Apr/17 ]

What?? Seriously? That's terrible.

Comment by James A Simmons [ 13/Apr/17 ]

You can't even build ldiskfs anymore without a patched RHEL kernel  

Comment by Bob Glossman (Inactive) [ 13/Apr/17 ]

James,

I don't think that is true. The related base kernel and ldiskfs changes are only in the el7 patch series; they don't exist on el6 or on any SLES version, so those aren't impacted and still build.

 

Comment by James A Simmons [ 13/Apr/17 ]

Yeah, I just realized that, which is why I edited out that comment. The question I do have is: does the new project quota stuff even work on SLES?

Comment by Bob Glossman (Inactive) [ 13/Apr/17 ]

James,

I very recently asked that exact same question. The current implementation only supports the feature on el7. It is undetermined at this time whether it will be extended to other distros or versions.

Comment by Christopher Morrone [ 13/Apr/17 ]

OK, so if ldiskfs can be built on SLES without this feature, then we need a way to also make the feature optional on RHEL. We need to be able to build without the feature if the kernel is not patched to add said feature.

That way we can keep this ten-year goal of being kernel-patch-free alive, and it is an important goal, I think.

I would argue that if that can't be accomplished in time for 2.10, then the LU-4017 feature needs to be disabled altogether until the feature can be optional in the build.

Comment by Andreas Dilger [ 14/Apr/17 ]

Yes, the 2.10 release will be able to build without this feature, even on RHEL.

Comment by Christopher Morrone [ 24/Apr/17 ]

So just to be 100% clear, ldiskfs will still be buildable on RHEL7 with stock kernels?

Comment by James A Simmons [ 24/Apr/17 ]

Yes. I'm running with RHEL7 stock kernel right now with latest master and ldiskfs.

Comment by Andreas Dilger [ 24/Apr/17 ]

Yes, the fix for ldiskfs to build against a non-project-quota kernel was landed last week (patch https://review.whamcloud.com/26647 "LU-9339 ldiskfs: Make ldiskfs buildable on kernels with no project quota"), and I'm using this at home as well (RHEL 7.2 kernel).

In fact, with patch https://review.whamcloud.com/26220 "LU-20 osd-ldiskfs: Make readonly patches optional" and weak module support it will be possible to use the same ldiskfs code against a patched and unpatched kernel.

Comment by James A Simmons [ 15/May/17 ]

Chris, I have a question. With patch https://review.whamcloud.com/26220 we can build osd-ldiskfs modules that work with both patched and unpatched kernels without a rebuild. If I install the osd-ldiskfs modules on an unpatched kernel, I see no missing dev_read_only symbol errors. So the question is: how could we add an install script to the spec file that detects a patched kernel and installs the appropriate osd-ldiskfs module?
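Something roughly like the following is the kind of check I have in mind (purely an illustrative sketch, not an actual spec change; the symbol name being tested for is an assumption on my part and would need to come from the actual dev_read_only patch):

# illustrative only: detect whether the running kernel carries the
# dev_read_only patch by looking for a matching symbol in its System.map
if grep -q 'dev_rdonly' "/boot/System.map-$(uname -r)" 2>/dev/null; then
    echo "patched kernel detected"
else
    echo "unpatched kernel detected"
fi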

Comment by Christopher Morrone [ 15/May/17 ]

Sigh. That approach is almost entirely impractical when it comes time to do the packaging. Not only that, it would appear to break the install of kmod-lustre-tests for builds with ldiskfs but without a patched kernel, which is exactly the situation we're trying to keep support for.

Looks like someone should open another 2.10 blocker.

 

Comment by Christopher Morrone [ 15/May/17 ]

Oh, sorry, 26220 hasn't landed yet.  I'll just -1 it.

Comment by Peter Jones [ 23/May/17 ]

In order to go to patchless servers for ldiskfs deployments we need a version of RHEL 7.x which includes the fix for the upstream bug "jbd2: incorrect unlock on j_list_lock". We have been told that this will be available in the near future, but at this stage it seems more likely to be available in the 2.10.1 timeframe rather than 2.10.0.

Comment by Christopher Morrone [ 23/May/17 ]

Peter, I don't think that is how we should look at this ticket. While it is true that the kernel has a bug there that is a problem for lustre, lustre can build and run without that patch. We could, and should, move that patch out of the Lustre tree. That patch is a kernel patch, and should be housed in a kernel repository.

What is really holding up this ticket is still subtask 2, LU-684. The dev_rdonly patch is, as I understand, an entirely Lustre-specific patch for the kernel. It will never be upstreamed, and will always be a burden on the Lustre developers to maintain.

Once we finally finish LU-684, it will be possible to reasonably delete the "lustre/kernel_patches" directory from the Lustre repository, and make a much cleaner separation between building the kernel and building Lustre.

So LU-684 remains the real blocker to calling this ticket complete. But even when it is done, there should still be some minor work to remove lustre/kernel_patches from the tree before this ticket is closed.

Comment by Aurelien Degremont (Inactive) [ 24/May/17 ]

I'm backing Chris here. LU-684 is definitely THE ticket that needs to be closed to move forward with patchless servers!

Comment by Peter Jones [ 24/May/17 ]

Yes, of course, finding a resolution for LU-684 is our ultimate goal, and work on that continues. My strong preference would have been for it to be resolved ahead of the code freeze but, as that is by no means certain at present, I was looking at whether we could adopt a contingency of having two build options: one as today for use in testing, and the other patchless and, while usable in production, not usable for all tests. This is a suggestion that has been made by several community members who are anxious to take advantage of patchless servers. Unfortunately this relies on Red Hat's schedule to be practical.

Comment by Christopher Morrone [ 24/May/17 ]

I was with you until the last sentence. What relies on Red Hat's schedule, and why?

Comment by Andreas Dilger [ 25/May/17 ]

Per Peter's previous comment:

we need a version of RHEL 7.x which include the fix for the upstream bug "jbd2: incorrect unlock on j_list_lock "

Until that patch (http://review.whamcloud.com/23050 from LU-8685) is included into RHEL7, we will continue to patch the RHEL kernel shipped with Lustre to fix that bug.

Of course, it is possible for anyone to use an unpatched kernel today with ZFS, or to build and run Lustre with a RHEL6 kernel, and this has been true for a couple of releases at least. The presence of kernel patches in the Lustre tree doesn't prevent that. While Intel will continue to apply the kernel patches until such a time that LU-684 and LU-8685 are fixed, it doesn't prevent others from building their Lustre RPMs differently.

Comment by James A Simmons [ 06/Jun/17 ]

Great news!!!!! A new RHEL7.3 kernel has been released and it has the jbd2 fix. Time to move to kernel-3.10.0-514.21.1.el7. Patchless servers are again within our grasp.

Comment by Christopher Morrone [ 07/Jun/17 ]

Until that patch (http://review.whamcloud.com/23050 from LU-8685) is included into RHEL7, we will continue to patch the RHEL kernel shipped with Lustre to fix that bug.

I don't understand that logic at all. A patched kernel could have been built completely externally to the Lustre tree, which would have allowed us to continue forward with completing this ticket. I really don't understand why this was deemed a blocker, or had to happen sequentially.

Maybe I'll restate what I think the goal of this ticket really is: eliminate the need for a "Lustre kernel" by eliminating all of the Lustre-specific kernel patches (ignoring ldiskfs).

The jbd2 fix, while affecting Lustre, is not necessarily Lustre-specific. Therefore it does not need to live in lustre/kernel_patches, and we don't need infrastructure in Lustre's main build system to pause in the middle of building Lustre to go patch, build, and package a kernel. Instead, patching, building, and packaging the kernel can be a completely external process that takes place before, and independently of, each Lustre build.

That's the goal. Nothing that I can see really stands in the way of that goal, unless I'm missing something (and I did read LU-8685).

Comment by Christopher Morrone [ 07/Jun/17 ]

In any event, hopefully James' revelation that the RHEL kernel is now shipping with the fix means that we can drop change 26220 and just finish this ticket?

Comment by James A Simmons [ 07/Jun/17 ]

Actually, it's the fix for LU-8685 that has now landed in RHEL7. We no longer need the patch jbd2-fix-j_list_lock-unlock-3.10-rhel7.patch!!! We still need 26220 so osd-ldiskfs will work properly with patchless kernels. At LUG it was asked whether the Lustre tree needs to be patched to build ldiskfs against standard RHEL kernel RPMs. I tried it out, and building just works out of the box with patchless kernels. All I did was:

# install the kernel headers for the target (unpatched) kernel
rpm -ivh kernel-devel-3.10.0-514.21.1.el7.x86_64.rpm

If you want to build ldiskfs, also install the debuginfo-common package (it provides the ext4 sources under /usr/src/debug that ldiskfs is built from):

# provides the kernel source tree, including fs/ext4
rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-514.21.1.el7.x86_64.rpm

# build Lustre against the unpatched kernel headers
cd ~/lustre-release
sh ./autogen.sh
./configure --with-linux=/usr/src/kernels/3.10.0-514.21.1.el7.x86_64
make rpms

Then install the resulting Lustre RPMs and reboot.

I gave the explicit --with-linux path above for people like me who have multiple entries in /usr/src/kernels.
That is all that is needed now. We have arrived!

Comment by Christopher Morrone [ 07/Jun/17 ]

Oh, I was confused for a moment and thought that LU-8685 was the LU-684 blocker... but no, it is LU-8729 that blocks LU-684. Now I really don't understand why people thought LU-8685 was a blocker for this ticket.

As far as I can tell we're still in the same spot as always: we need LU-684 finished. The 26220 patch is just a temporary hack to ease a packaging/distribution issue because LU-684 has not yet been completed. But 26220 will not let us close this ticket.

Comment by Minh Diep [ 07/Jun/17 ]

James, what are the resulting RPMs? Did you get kmod-lustre-osd-ldiskfs? I don't see where your steps include the ext4 sources.

Comment by James A Simmons [ 07/Jun/17 ]

kmod-lustre-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
lustre-iokit-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
kmod-lustre-osd-ldiskfs-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
lustre-osd-ldiskfs-mount-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
kmod-lustre-tests-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
lustre-resource-agents-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
lustre-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
lustre-tests-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm
lustre-debuginfo-2.9.58_57_gc252b3b_dirty-1.el7.x86_64.rpm

The ext4 source can be found at:

/usr/src/debug/kernel-3.10.0-514.21.1.el7/linux-3.10.0-514.21.1.el7.x86_64/fs/ext4/*

which is provided by the kernel-debuginfo-common package installed above. If you look at the LB_EXT4_SRC_DIR macro in lustre-build-ldiskfs.m4, you will see it does the right thing by default.

Comment by Minh Diep [ 07/Jun/17 ]

Yup, just found that out too. Thanks, this is great news!

Comment by Christopher Morrone [ 07/Jun/17 ]

How many tests actually use the dev_rdonly/dm-flakey functionality?

If it is small, perhaps the best path forward is to simply disable those tests until LU-684 is complete. That would allow us to unblock this ticket, LU-20.

Comment by Andreas Dilger [ 08/Jun/17 ]

All of the recovery-* tests depend on this functionality to some extent, to allow clients to submit writes to the server that are dropped deterministically before the server restarts.

Note there is no reason that the presence of the dev-rdonly patch in our kernel prevents people from building patchless kernels. We need it for testing ldiskfs, but it is not needed for production with either ldiskfs or ZFS. Note that if you want project quota support for ldiskfs then kernel patches are needed regardless (project quota for ZFS will similarly need ZFS to be patched).

Comment by Gerrit Updater [ 09/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/27549
Subject: LU-20 osd-ldiskfs: Make readonly patches optional
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aface21135cc936be2cf72fc2e092a4784fbecc0

Comment by Gerrit Updater [ 16/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27549/
Subject: LU-20 osd-ldiskfs: Make readonly patches optional
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0f0a43b4ba6660a88f7922aadaba1a69c297142c

Comment by Bob Glossman (Inactive) [ 20/Jun/17 ]

The recent landing of 'LU-20 osd-ldiskfs: Make readonly patches optional', https://review.whamcloud.com/27549, has broken Lustre on el6. This mod added calls to kallsyms_lookup_name(), a kernel API not previously used. On el6 this API isn't globally visible to kernel modules; it has no EXPORT_SYMBOL() statement. This leads to install-time errors like:

WARNING: /lib/modules/2.6.32-696.3.2.el6_lustre.x86_64/extra/lustre-osd-ldiskfs/fs/osd_ldiskfs.ko needs unknown symbol kallsyms_lookup_name
WARNING: /lib/modules/2.6.32-696.3.1.el6_lustre.x86_64/weak-updates/lustre-osd-ldiskfs/fs/osd_ldiskfs.ko needs unknown symbol kallsyms_lookup_name
WARNING: /lib/modules/2.6.32-696.3.2.el6_lustre.x86_64/extra/lustre-osd-ldiskfs/fs/osd_ldiskfs.ko needs unknown symbol kallsyms_lookup_name
WARNING: /lib/modules/2.6.32-696.3.2.el6.x86_64/weak-updates/lustre-osd-ldiskfs/fs/osd_ldiskfs.ko needs unknown symbol kallsyms_lookup_name

and runtime errors like:

osd_ldiskfs: Unknown symbol kallsyms_lookup_name (err 0)
LustreError: 158-c: Can't load module 'osd-ldiskfs'

This flaw blocks any use of ldiskfs on el6. It's a pretty serious regression.
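For anyone trying to reproduce this, a quick illustrative check (not part of any fix) of whether an installed kernel-devel tree exports the symbol to out-of-tree modules is to look it up in Module.symvers:

# exported symbols are listed in Module.symvers; no match means the
# symbol is not available to out-of-tree modules on that kernel
grep -w kallsyms_lookup_name /usr/src/kernels/$(uname -r)/Module.symvers \
    || echo "kallsyms_lookup_name is not exported on this kernel"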

Comment by James A Simmons [ 20/Jun/17 ]

Technically, rhel6 on the server side is no longer supported. Testing of rhel6 servers should have been stopped by now.

Comment by Bob Glossman (Inactive) [ 20/Jun/17 ]

James,
Even if it is not technically supported, there's still a lot of el6 of various vintages out in the field. If we break it, I'm sure we will be making a lot of our users unhappy. I don't think we should.

Comment by Peter Jones [ 28/Feb/18 ]

I don't think that we can fully close out this ticket until servers only run on distros whose kernels include the project quota patches. However, the option remains for those who prefer to use patchless servers rather than project quotas.

Comment by Andreas Dilger [ 05/Nov/19 ]

All of the kernels can run patchless now, and our testing has moved over to dm-flakey so the dev_readonly patches are no longer needed (and have been removed from 4.x kernels).
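For reference, the dm-flakey style of fault injection looks roughly like this (an illustrative sketch only; the device name and intervals are made up, and the test framework wraps this in its own helpers):

# wrap /dev/sdb in a flakey device-mapper target: I/O passes through for
# 60 seconds, then writes are silently dropped for 5 seconds, repeating
SZ=$(blockdev --getsz /dev/sdb)
dmsetup create flakey-sdb --table "0 $SZ flakey /dev/sdb 0 60 5 1 drop_writes"
# ... run the recovery workload against /dev/mapper/flakey-sdb ...
dmsetup remove flakey-sdb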
