Lustre / LU-12265

LustreError: 141027:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.0

    Description

      Hello,
      I have recently been running the IO500 benchmark suite across our all-flash NVMe-based filesystem, and I have now twice come across the following errors, which cause client I/O errors and run failure. I was hoping to find out more about what they indicate.

      The following are errors on one of the servers, which is a combined OSS & MDS, and is one of 24 such servers:

      May 06 09:29:43 dac-e-3 kernel: LustreError: 141015:0:(osd_iam_lfix.c:188:iam_lfix_init()) Skipped 11 previous similar messages
      May 06 09:29:43 dac-e-3 kernel: LustreError: 141015:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5
      May 06 08:49:09 dac-e-3 kernel: LustreError: 140855:0:(osd_iam_lfix.c:188:iam_lfix_init()) Skipped 9 previous similar messages
      May 06 08:49:09 dac-e-3 kernel: LustreError: 140855:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5
      May 06 08:47:25 dac-e-3 kernel: LustreError: 141027:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5
      

      I see no other Lustre errors on any of the other servers or on any of the clients, but the client application sees an error.

      These errors are also only rarely seen, so I'm not sure I can easily reproduce them. I have been running this benchmark suite very intensively over the past few days, and we are fairly frequently re-formatting and rebuilding filesystems on this hardware, as it is a pool that we use in a filesystem-on-demand style.

      At the time of the errors I was running an mdtest benchmark from the 'md easy' portion of the suite, with 2048 ranks across 128 client nodes, so a very large number of files were being created:

      mdtest-1.9.3 was launched with 2048 total task(s) on 128 node(s)
      Command line used: /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/mdtest "-C" "-n" "140000" "-u" "-L" "-F" "-d" "/dac/fs1/mjr208/job11312297-2019-05-05-2356/mdt_easy"
      Path: /dac/fs1/mjr208/job11312297-2019-05-05-2356
      FS: 412.6 TiB   Used FS: 24.2%   Inodes: 960.0 Mi   Used Inodes: 0.0%
      
      2048 tasks, 286720000 files
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      Abort(-1) on node 480 (rank 480 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 480
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      Abort(-1) on node 486 (rank 486 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 486
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      Abort(-1) on node 488 (rank 488 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 488
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      Abort(-1) on node 491 (rank 491 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 491
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      Abort(-1) on node 492 (rank 492 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 492
      ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
      Abort(-1) on node 493 (rank 493 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 493
      Abort(-1) on node 482 (rank 482 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 482
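
      For reference, the "rc = -5" in the server-side messages is -EIO, i.e. the same errno 5 ("Input/output error") that the clients report above, so the failed open64() calls line up with the iam_lfix_init() errors on the server. A quick way to confirm the errno mapping on any of the nodes (purely illustrative):

      # errno 5 is EIO, the "Input/output error" seen by the ior/mdtest clients
      grep -w EIO /usr/include/asm-generic/errno-base.h
      # expected output (roughly): #define EIO 5 /* I/O error */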
      

      The filesystem itself is configured using DNE; specifically, we are using DNE2 striped directories for all mdtest runs. We are using a large number of MDTs, currently 24, one per server (which, other than this problem, is working excellently), and the directory stripe count is '-1', so we are striping all directories over all 24 MDTs. Each server contains 12 NVMe drives, and we partition one of the drives so that it has both an OST and an MDT partition.
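
      For reference, a directory striped over all MDTs, as used for these mdtest runs, is typically created with lfs; a minimal sketch (the path is illustrative, the stripe count of -1 matches the configuration described above):

      # create a new directory striped across all available MDTs (stripe count -1)
      lfs mkdir -c -1 /dac/fs1/mjr208/mdt_easy_striped

      # verify the resulting directory stripe layout
      lfs getdirstripe /dac/fs1/mjr208/mdt_easy_striped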

      Lustre and Kernel versions are as follows:

      Server: kernel-3.10.0-957.el7_lustre.x86_64
      Server: lustre-2.12.0-1.el7.x86_64
      
      Clients: kernel-3.10.0-957.10.1.el7.x86_64
      Clients: lustre-client-2.10.7-1.el7.x86_64
      

      Could I get some advice on what this error indicates?


          Activity

            aboyko Alexander Boyko added a comment (edited)

            FYI, I pushed patch https://review.whamcloud.com/45072 "LU-12268 osd: BUG_ON for IAM corruption". It detects IAM buffer-head overflow early and fails the node. This prevents on-disk filesystem corruption and collects more data for analysis.


            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45071
            Subject: LU-12265 osd: fix corrupted OI file online
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c0b2d11c325e042f724447ee45bc1ca1d2ff5379


            adilger Andreas Dilger added a comment -

            Aside from determining and fixing the root cause of this IAM corruption, it makes sense for the IAM/OSD code to handle this in a more robust manner. If the IAM block is corrupted, the current remedy is only to delete and rebuild all the OI files. It would be useful (and not more disruptive) to just reset the corrupt IAM block and then trigger a full OI Scrub to verify/reinsert any missing FIDs. This makes the OI file at least somewhat self-healing.

            As part of this process, it might make sense to try and scan/repair the IAM file itself. However, since we need a full OI Scrub to find any FIDs affected by the corruption, it probably makes more sense to build a new "shadow OI" file (for the corrupted OI file only, because they can grow to tens of GB in size for a large MDT). That is what LU-15016 is about.

            Since there are other benefits to rebuilding the OI file (compact/free old entries, improve insertion speed) I don't think it is worthwhile to spend too much time on repairing the existing OI file, just enough to keep the system usable.
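
            For reference, a full OI Scrub as mentioned above can already be started and monitored manually with lctl on the MDS; a minimal sketch (the fsname "fs1" and target index are illustrative):

            # start an OI scrub on one MDT of filesystem "fs1"
            lctl lfsck_start -M fs1-MDT0000 -t scrub

            # check the scrub status/progress on that target
            lctl get_param osd-ldiskfs.fs1-MDT0000.oi_scrub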


            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            We faced the problem while having these patches applied:

            2291-kernel-locking-rwsem-Fix-possible-missed-wakeup.patch
            2290-kernel-futex-Fix-possible-missed-wakeup.patch
            2289-kernel-futex-Use-smp_store_release-in-mark_wake_fute.patch
            2288-kernel-sched-wake_q-Fix-wakeup-ordering-for-wake_q.patch 

            There are no other rwsem-related patches to apply and the problem still exists.


            adilger Andreas Dilger added a comment -

            See earlier comment in this ticket:

            r/w semaphores are broken in RHEL kernels up to RHEL 7.7, see https://access.redhat.com/solutions/3393611
            It would be good to check whether the problem still exists with kernel-3.10.0-1062.el7:

            Red Hat Enterprise Linux 7.7

            The issue was fixed in kernel-3.10.0-1062.el7 from Errata RHSA-2019:2029
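
            A quick way to check whether a server is still running an affected kernel (a sketch; the version strings are those referenced in this ticket):

            # affected in this report: 3.10.0-957.el7_lustre; fixed from 3.10.0-1062.el7 (RHEL 7.7, RHSA-2019:2029)
            uname -r
            rpm -q kernel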


            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            adilger, do you know the exact root cause of the problem? I am asking in order to know which patches we need to prevent this bug from happening again. Thanks.


            adilger Andreas Dilger added a comment -

            Artem, I think this problem was fixed in a later RHEL7 kernel. It was seen by a number of sites that had this same kernel, but upgrading to the later RHEL7 kernels fixed the problem.


            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            We faced this problem on one of our clusters. While researching the IAM code, I found that there is some dead code. I created LU-14188 and https://review.whamcloud.com/#/c/40890/, which removes this unused code.

            I wonder whether this semaphore was used somewhere in the past and could be useful now.


            zam Alexander Zarochentsev added a comment -

            r/w semaphores are broken in RHEL kernels up to RHEL 7.7, see https://access.redhat.com/solutions/3393611
            It would be good to check whether the problem still exists with kernel-3.10.0-1062.el7:

            Red Hat Enterprise Linux 7.7

            The issue was fixed in kernel-3.10.0-1062.el7 from Errata RHSA-2019:2029


            adilger Andreas Dilger added a comment -

            Matt has been hitting it regularly in his large-scale IO-500 runs in CAM-79.


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: mrb Matt Rásó-Barnett (Inactive)
              Votes: 0
              Watchers: 14