[LU-11200] CentOS 8 arm64 server support Created: 03/Aug/18 Updated: 11/Jun/20 Resolved: 11/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Oleg Drokin | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: | |
| Issue Links: | |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I am trying to bring up arm64 support for centos7, and it turns out centos7/rhel7 for arm64 uses a 4.14 kernel that we do not have any ldiskfs patches for. I checked, and the closest match is the ubuntu18 4.15 kernel; applying that ldiskfs series yields only a single trivial reject, in ext4-data-in-dirent.patch. We also need a way to select it in configure, which is a bit more complicated since I don't know how to do it easily yet. Then there are compile errors:
/home/green/git/lustre-release/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c:135:74: error: macro "DEFINE_TIMER" requires 4 arguments, but only 2 given
static CFS_DEFINE_TIMER(waiting_locks_timer, waiting_locks_callback, 0, 0);
^
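For context, this failure stems from the kernel timer API rework around v4.15, where DEFINE_TIMER() dropped its expires and data arguments and timer callbacks switched to taking a struct timer_list *. A compat shim along the following lines can cover both forms; this is only a sketch, and the HAVE_TIMER_SETUP guard is assumed to be set by a configure test rather than matching the real libcfs macros exactly.

/* Sketch only: HAVE_TIMER_SETUP is assumed to be defined by a configure
 * test when the kernel has the new two-argument DEFINE_TIMER(); the real
 * libcfs wrappers may differ. */
#include <linux/timer.h>

#ifdef HAVE_TIMER_SETUP
/* v4.15+ style: callback takes a struct timer_list *, no expires/data */
#define cfs_timer_cb_arg_t struct timer_list *
#define CFS_DEFINE_TIMER(name, func, expires, data) \
        DEFINE_TIMER((name), (func))
#else
/* pre-4.15 style: DEFINE_TIMER() still wants all four arguments */
#define cfs_timer_cb_arg_t unsigned long
#define CFS_DEFINE_TIMER(name, func, expires, data) \
        DEFINE_TIMER((name), (func), (expires), (data))
#endif

/* A declaration like the failing line above then compiles on either kernel: */
static void waiting_locks_callback(cfs_timer_cb_arg_t unused);
static CFS_DEFINE_TIMER(waiting_locks_timer, waiting_locks_callback, 0, 0);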
|
| Comments |
| Comment by Andreas Dilger [ 03/Aug/18 ] |
|
It probably makes sense to add a small patch to the 4.14 ldiskfs to make it match Ubuntu to resolve the conflict rather than creating a whole new set of patches. |
| Comment by Oleg Drokin [ 03/Aug/18 ] |
|
I just forked that one patch and reused the Ubuntu patches otherwise; that's what we typically did in the past. |
| Comment by James A Simmons [ 03/Aug/18 ] |
|
Actually the latest RHEL alt kernels for ARM/Power8 are 4.15 kernels. I was looking at doing this work since I have been assigned ARM server support. I have been running RHEL ARM clients for some time. |
| Comment by James A Simmons [ 03/Aug/18 ] |
|
Never mind, I was wrong. We are at 4.11. Need to look at updating our client kernels. |
| Comment by James A Simmons [ 03/Aug/18 ] |
|
I'm going to take over this ticket since I need to go to an ARM conference in November to present ARM Lustre server support. |
| Comment by Oleg Drokin [ 03/Aug/18 ] |
|
I am fine with that. Hopefully it does not take too long to get the patches needed to compile current master on RHEL7 arm64. I sent you the ldiskfs series update, but it still needs the configure magic, of course. Please make that a separate patch from all the other possible compile fixes (the DEFINE_TIMER issue and whatever else might crop up) so we can enable the arm64 builder permanently and make sure it no longer breaks. |
| Comment by Gerrit Updater [ 05/Aug/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/32939 |
| Comment by Gerrit Updater [ 05/Aug/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/32940 |
| Comment by James A Simmons [ 09/Aug/18 ] |
|
I have attached to this ticket the Lustre version of e2fsprogs, which is the latest in the lustre-master-test test branch. |
| Comment by Andreas Dilger [ 09/Aug/18 ] |
|
I assume from the e2fsprogs attachments that you didn't have any problems building the master-lustre-test branch on ARM? |
| Comment by James A Simmons [ 10/Aug/18 ] |
|
No problem. Same for Power8. Only have trouble with Ubuntu systems. I will track that down. |
| Comment by Peter Jones [ 10/Aug/18 ] |
|
James, is anything needed for clients to work - either on master or on b2_10? Peter |
| Comment by James A Simmons [ 10/Aug/18 ] |
|
Technically I'm testing on a Power8 which is at RHEL7.5 alt using the 4.14 kernel. The ARM system I have is in the process of moving up to RHEL7.5 alt. Both platforms use the same kernel. For my testing on Power8 I need the two patches from this ticket as well as the patch from |
| Comment by Andreas Dilger [ 10/Aug/18 ] |
|
You probably mean something different than |
| Comment by James A Simmons [ 10/Aug/18 ] |
|
Yes, a typo. I updated the comment to reflect the proper LU |
| Comment by Gerrit Updater [ 18/Aug/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32939/ |
| Comment by James A Simmons [ 19/Aug/18 ] |
|
Client support for ARM and Power8 restored. Now for server support. |
| Comment by Peter Jones [ 19/Aug/18 ] |
|
James, so what (if anything) would be needed on b2_10 to offer ARM/Power8 client support? Peter |
| Comment by James A Simmons [ 20/Aug/18 ] |
|
First we need to port a bunch of patches to support newer kernels. Then figure out what to do with the ko2iblnd stack with 64K pages and the map_on_demand work. |
| Comment by James A Simmons [ 28/Nov/18 ] |
|
Found an issue with RHEL7 alt kernel for my Power8 system. Now I can attempt to mount ldiskfs but I'm encountering:
[156572.269381] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[156794.172421] INFO: task mount.lustre:37664 blocked for more than 120 seconds.
[156794.172515] Tainted: P W OE ------------ 4.14.0-49.6.1.el7a.ppc64le #1
[156794.172524] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[156794.172556] mount.lustre D 0 37664 37663 0x00042080
[156794.172605] Call Trace:
[156794.172636] [c000001dd7f16ba0] [c00000000001cde0] __switch_to+0x330/0x660
[156794.172699] [c000001dd7f16c00] [c000000000c5ff64] __schedule+0x354/0xaf0
[156794.172758] [c000001dd7f16cd0] [c000000000c60748] schedule+0x48/0xc0
[156794.172817] [c000001dd7f16d00] [c000000000c65e88] rwsem_down_read_failed+0x148/0x1f0
[156794.172888] [c000001dd7f16d80] [c000000000c65038] down_read+0x78/0x80
[156794.172964] [c000001dd7f16db0] [d00000000dc31dc4] ldiskfs_readdir+0x704/0xa40 [ldiskfs]
[156794.173046] [c000001dd7f16ed0] [d00000000e94f718] osd_ios_general_scan+0x148/0x350 [osd_ldiskfs]
[156794.173136] [c000001dd7f16fb0] [d00000000e95ab28] osd_initial_OI_scrub+0x178/0x1770 [osd_ldiskfs]
[156794.173226] [c000001dd7f17150] [d00000000e95cdfc] osd_scrub_setup+0x8bc/0x1300 [osd_ldiskfs]
[156794.173314] [c000001dd7f172e0] [d00000000e923b18] osd_device_alloc+0x6e8/0xa90 [osd_ldiskfs]
[156794.173424] [c000001dd7f173c0] [d00000002821e478] class_setup+0xaf8/0x10d0 [obdclass]
[156794.173516] [c000001dd7f17510] [d000000028229924] class_process_config+0x1d64/0x3980 [obdclass]
[156794.173620] [c000001dd7f17640] [d000000028232dc8] do_lcfg+0x358/0x890 [obdclass]
[156794.173713] [c000001dd7f17770] [d000000028237af4] lustre_start_simple+0x1d4/0x460 [obdclass]
[156794.173817] [c000001dd7f17840] [d000000028279404] osd_start+0x714/0xd40 [obdclass]
[156794.173911] [c000001dd7f17960] [d000000028287508] server_fill_super+0x268/0x1cc0 [obdclass]
[156794.174015] [c000001dd7f17a60] [d00000002823cb40] lustre_fill_super+0x7c0/0x1050 [obdclass]
[156794.174098] [c000001dd7f17b20] [c0000000004494d0] mount_nodev+0x160/0x390
[156794.174178] [c000001dd7f17b90] [d000000028234ad4] lustre_mount+0x54/0x70 [obdclass]
[156794.174249] [c000001dd7f17be0] [c00000000044b97c] mount_fs+0x8c/0x230
[156794.174309] [c000001dd7f17c80] [c000000000487978] vfs_kern_mount+0x78/0x1b0
|
| Comment by Shuichi Ihara [ 03/Apr/19 ] |
|
Hi James, |
| Comment by James A Simmons [ 04/Apr/19 ] |
|
I think I resolved the locking issues for ldiskfs. The problem was that try_lookup_one_len() can return NULL. In the original ldiskfs scrub code the dentry was allocated in that case, so if try_lookup_one_len() doesn't find a dentry we have to create one ourselves. |
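To make that fallback concrete, a minimal sketch follows; the scrub_lookup_one() name is made up for illustration and this is not the actual osd-ldiskfs code. try_lookup_one_len() only consults the dcache and never allocates, so when it returns NULL the caller falls back to lookup_one_len(), which does allocate the dentry and call into the filesystem.

#include <linux/dcache.h>
#include <linux/fs.h>
#include <linux/namei.h>

/* Illustration only; not the real osd-ldiskfs helper.  The parent's
 * i_rwsem is held across both calls, as the in-kernel callers expect. */
static struct dentry *scrub_lookup_one(struct dentry *parent,
                                       const char *name, int len)
{
        struct dentry *dentry;

        inode_lock(parent->d_inode);
        /* dcache-only lookup: returns NULL when no dentry is cached */
        dentry = try_lookup_one_len(name, parent, len);
        if (dentry == NULL)
                /* not cached, so allocate one via a full lookup */
                dentry = lookup_one_len(name, parent, len);
        inode_unlock(parent->d_inode);

        return dentry;  /* may also be an ERR_PTR() on error */
}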
| Comment by James A Simmons [ 22/Apr/19 ] |
|
Shuichi Ihara try both: |
| Comment by Baptiste Gerondeau (Inactive) [ 12/Jun/19 ] |
|
Hi, thank you for your patches and efforts! I've tested the latest series on an ARM64 machine (a VM hosted on a ThunderX2 with InfiniBand, to be precise).
[bgerdeb@lustretest ~]$ sudo FSTYPE=ldiskfs LDISKFS_MKFS_OPTS="^metadata_csum" /usr/lib64/lustre/tests/llmount.sh
mkfs.lustre FATAL: Unable to build fs /dev/loop0 (256)
mkfs.lustre FATAL: mkfs failed 256
Furthermore, dmesg shows:
[ 164.229457] libcfs: loading out-of-tree module taints kernel.
Those last 'print_req_error' messages seem to be inherent to this precise version of the kernel/EXT4 drivers (Ubuntu's bug tracker is full of those errors). When compiling Lustre (master + both patches cherry-picked) against the latest CentOS 7.6 aarch64 kernel (yesterday's 4.14.0-115.8.1.el7a.aarch64, as well as 4.14.0-115.7.1.el7a.aarch64) with only ZFS enabled, I can llmount.sh just fine. |
| Comment by James A Simmons [ 12/Jun/19 ] |
|
metadata_csum is new. As for the loop device, you might need the new e2fsprogs being worked on for RHEL8. In fact, give me some time and I will build updated e2fsprogs for you. I do plan on updating the |
| Comment by Baptiste Gerondeau (Inactive) [ 13/Jun/19 ] |
|
Thanks a lot! Looking forward to trying out those patches! |
| Comment by James A Simmons [ 13/Jun/19 ] |
|
Updated the e2fsprogs. I hope to have the RHEL7.6alt patches ready over the next few days. |
| Comment by Baptiste Gerondeau (Inactive) [ 13/Jun/19 ] |
|
Thanks! Downloaded and reinstalled your e2fsprogs over mine, tested, and got the same result as above. |
| Comment by James A Simmons [ 17/Jun/19 ] |
|
Just updated the ldiskfs patch. Currently we are running 4.14.43-200, and that kernel is nearly identical to the Ubuntu 18 LTS kernel, so it's a really easy update. That patch plus the |
| Comment by Baptiste Gerondeau (Inactive) [ 18/Jun/19 ] |
|
I can't seem to find a 4.14.43-200 kernel (for aarch64). The closest I have is Fedora's 4.14.13-200. Is it that one? |
| Comment by James A Simmons [ 18/Jun/19 ] |
|
I saw 4.14.43-200 kernels for Fedora x86, but I can't seem to find one for ARM. The kernel I'm using comes directly from the vendor and is based on RedHat. Can you try to see if you have build issues with this version? I do see the latest Fedora for ARM is 30, and it is a 5.0 kernel, which is being worked on. |
| Comment by Baptiste Gerondeau (Inactive) [ 19/Jun/19 ] |
|
So I tried with the aforementioned kernel: the good news is that I only had to make minor changes to your rhel7.5alt patch (see this file; changes I apparently already had to make before but failed to mention) for it to apply to this kernel's EXT4 tree. The bad news is that it fails to compile with GCC 5.4.0 where it succeeded before; here are the errors:
In file included from include/linux/kernel.h:10:0,
[... make -j12 so it goes on ..]
/home/bgerdeb/lustre_build/lustre/lustrearm/lustre/osd-ldiskfs/osd_handler.c:3695:20: warning: ‘struct ldiskfs_dentry_param’ declared inside parameter list
else if (d_type & LDISKFS_DIRENT_LUFID) {
Note, I always cherry-pick your patches on top of master. Will rebase and retry.
EDIT: I tried to put the errors in bold; of course, thanks to Atlassian's wonderful text editor, it failed. |
| Comment by James A Simmons [ 20/Jun/19 ] |
|
It seems Fedora is a different beast than RHEL7.6alt. Would you be willing to try Fedora 30, since it's a 5.0 kernel and that support is being worked on? I'm hoping we can just use the same ldiskfs patch set as the 5.0 series. |
| Comment by Peter Jones [ 20/Jun/19 ] |
|
I was talking to Marvell at ISC this week, and they seemed to think that focusing on RHEL8 should be the first priority in this arena. |
| Comment by James A Simmons [ 20/Jun/19 ] |
|
Are you suggesting we drop RHEL7 alt support? While that seems like a good idea, people might be stuck with what their vendor supports. |
| Comment by Peter Jones [ 20/Jun/19 ] |
|
I am working on the assumption that it will take a little while to get ARM server support working, so focusing on something with a longer shelf life would be best. |
| Comment by James A Simmons [ 21/Jun/19 ] |
|
We are much closer than you think. The challenge has been getting people to test the needed changes. Now that I have worked out most of the issues for |
| Comment by James A Simmons [ 24/Jun/19 ] |
|
I have resolved the ldiskfs locking issues that ARM testers have reported. Should be in a good position to finish this work off. |
| Comment by Baptiste Gerondeau (Inactive) [ 26/Jun/19 ] |
|
I have tested your patches against Fedora 30's 5.1.1 kernel, and the patches do not apply to its ext4 tree. I'd advise using the upstream CentOS 7.6alt kernel (here), which is the one you get from a netinstall by default (if you don't run yum update afterwards). Concerning 7.6 vs RHEL8, I'm happy to test both. |
| Comment by Baptiste Gerondeau (Inactive) [ 25/Jul/19 ] |
|
I have tested out the latest lustre-release master on a RHEL8 ARM64 VM and can confirm that it builds, installs, runs, and passes (most of) the sanity test suite with LDISKFS (in an all-on-one-node configuration at the moment)! Here are the results: results-ldiskfs-rhel8-2507.yml |
| Comment by James A Simmons [ 10/Oct/19 ] |
|
Baptiste, I updated https://review.whamcloud.com/#/c/34714. This should make RHEL8 ARM fully functional. |
| Comment by Baptiste Gerondeau (Inactive) [ 28/Oct/19 ] |
|
Thanks a lot! |
| Comment by Andreas Dilger [ 11/Jun/20 ] |
|
The last patch for this ticket was landed in 2.14, and RHEL8 clients are working for all arches. |