Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version: Lustre 2.12.0
Description
Hello,
I have been running the IO500 benchmark suite across our all-flash NVMe-based filesystem, and I have now twice hit the following errors, which cause client IO errors and run failure. I was hoping to find out more about what they indicate.
The following errors appeared on one of the servers (a combined OSS and MDS, one of 24 such servers):
May 06 09:29:43 dac-e-3 kernel: LustreError: 141015:0:(osd_iam_lfix.c:188:iam_lfix_init()) Skipped 11 previous similar messages
May 06 09:29:43 dac-e-3 kernel: LustreError: 141015:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5
May 06 08:49:09 dac-e-3 kernel: LustreError: 140855:0:(osd_iam_lfix.c:188:iam_lfix_init()) Skipped 9 previous similar messages
May 06 08:49:09 dac-e-3 kernel: LustreError: 140855:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5
May 06 08:47:25 dac-e-3 kernel: LustreError: 141027:0:(osd_iam_lfix.c:188:iam_lfix_init()) Bad magic in node 1861726 #34: 0xcc != 0x1976 or bad cnt: 0 170: rc = -5
I see no other Lustre errors on any other server or on any of the clients, yet the client application still sees an error.
These errors are seen only rarely, so I am not sure I can easily reproduce them. I have been running this benchmark suite very intensively over the past few days, and we fairly frequently re-format and rebuild filesystems on this hardware, as it is a pool we use in a filesystem-on-demand style.
At the time of the errors I was running an mdtest benchmark from the 'md easy' portion of the suite, with 128 clients and 2048 ranks in total (per the mdtest output below), so a very large number of files were being created:
mdtest-1.9.3 was launched with 2048 total task(s) on 128 node(s)
Command line used: /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/mdtest "-C" "-n" "140000" "-u" "-L" "-F" "-d" "/dac/fs1/mjr208/job11312297-2019-05-05-2356/mdt_easy"
Path: /dac/fs1/mjr208/job11312297-2019-05-05-2356
FS: 412.6 TiB   Used FS: 24.2%
Inodes: 960.0 Mi   Used Inodes: 0.0%
2048 tasks, 286720000 files
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
Abort(-1) on node 480 (rank 480 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 480
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
Abort(-1) on node 486 (rank 486 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 486
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
Abort(-1) on node 488 (rank 488 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 488
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
Abort(-1) on node 491 (rank 491 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 491
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
Abort(-1) on node 492 (rank 492 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 492
ior ERROR: open64() failed, errno 5, Input/output error (aiori-POSIX.c:376)
Abort(-1) on node 493 (rank 493 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 493
Abort(-1) on node 482 (rank 482 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 482
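For reference, the errno 5 reported by open64() above is EIO, i.e. the clients are surfacing a plain I/O error rather than anything more specific. A quick sanity check of that mapping (plain Python, nothing Lustre-specific):

```python
import errno
import os

# errno 5 as reported by open64() in the ior/mdtest output above
assert errno.EIO == 5

# On Linux this prints "Input/output error", matching the client message
print(os.strerror(errno.EIO))
```

This matches the rc = -5 (-EIO) in the server-side iam_lfix_init() messages, so the client error appears to be the same failure propagated back from the MDS.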
The filesystem itself is configured with DNE, and specifically we use DNE2 striped directories for all mdtest runs. We use a large number of MDTs, currently 24, one per server (which, other than this problem, is working excellently), with a directory stripe count of '-1', so every directory is striped across all 24 MDTs. Each server contains 12 NVMe drives, and we partition one of the drives so it holds both an OST and an MDT partition.
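For clarity, the striped-directory setup described above would look roughly like the following, using the standard `lfs` commands (the path is hypothetical; this is a sketch of our configuration, not the exact commands used):

```shell
# Create a directory striped across all MDTs (-c -1), as in our mdtest runs.
# /dac/fs1/mjr208/testdir is a made-up example path.
lfs mkdir -c -1 /dac/fs1/mjr208/testdir

# Verify the directory stripe layout (should list all 24 MDTs)
lfs getdirstripe /dac/fs1/mjr208/testdir
```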
Lustre and Kernel versions are as follows:
Server:  kernel-3.10.0-957.el7_lustre.x86_64
Server:  lustre-2.12.0-1.el7.x86_64
Clients: kernel-3.10.0-957.10.1.el7.x86_64
Clients: lustre-client-2.10.7-1.el7.x86_64
Could I get some advice on what this error indicates?