[LU-297] llverfs: Open '/mnt/lustre/dir00007/file015' failed:Input/output error Created: 09/May/11  Updated: 13/May/16  Resolved: 13/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.7

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Branch: b1_8
Lustre Build: http://newbuild.whamcloud.com/job/lustre-reviews/363/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
Test System: DDN SFA10000E (1Client+1MGS/MDT+1OST)
Network: tcp


Severity: 3
Bugzilla ID: 24017
Rank (Obsolete): 4918

 Description   

While running llverfs in full mode on a Lustre filesystem with a 1TB OST, it hit an -EIO error after writing (7*32+15)*4G = 956G of data into the filesystem, as follows:

llverfs: Open '/mnt/lustre/dir00007/file015' failed:Input/output error

The dmesg log showed:

Lustre: DEBUG MARKER: ======== run llverfs in full mode on the Lustre filesystem /mnt/lustre ========
LustreError: 21145:0:(mds_open.c:441:mds_create_objects()) error creating objects for inode 41943290: rc = -5
LustreError: 21145:0:(mds_open.c:826:mds_finish_open()) mds_create_objects: rc = -5

Before running llverfs, the Lustre disk space and inode usage were as follows:

# lfs df -h /mnt/lustre
UUID                       bytes        Used   Available Use% Mounted on
largefs-MDT0000_UUID      224.0G      461.0M      210.7G   0% /mnt/lustre[MDT:0]
largefs-OST0000_UUID     1007.9G      472.6M      956.3G   0% /mnt/lustre[OST:0]

filesystem summary:      1007.9G      472.6M      956.3G   0% /mnt/lustre


# lfs df -i /mnt/lustre
UUID                      Inodes       IUsed       IFree IUse% Mounted on
largefs-MDT0000_UUID    58597923          26    58597897   0% /mnt/lustre[MDT:0]
largefs-OST0000_UUID    67108864          72    67108792   0% /mnt/lustre[OST:0]

filesystem summary:     58597923          26    58597897   0% /mnt/lustre

After running llverfs, the Lustre disk space and inode usage were as follows:

# lfs df -h /mnt/lustre
UUID                       bytes        Used   Available Use% Mounted on
largefs-MDT0000_UUID      224.0G      461.0M      210.7G   0% /mnt/lustre[MDT:0]
largefs-OST0000_UUID     1007.9G      956.5G      276.8M 100% /mnt/lustre[OST:0]

filesystem summary:      1007.9G      956.5G      276.8M 100% /mnt/lustre


# lfs df -i /mnt/lustre
UUID                      Inodes       IUsed       IFree IUse% Mounted on
largefs-MDT0000_UUID    58598163         274    58597889   0% /mnt/lustre[MDT:0]
largefs-OST0000_UUID    13492969         328    13492641   0% /mnt/lustre[OST:0]

filesystem summary:     58598163         274    58597889   0% /mnt/lustre

Maloo report: https://maloo.whamcloud.com/test_sets/4b6fc288-7a43-11e0-a1b3-52540025f9af



 Comments   
Comment by Jian Yu [ 09/May/11 ]

The failure occurred while executing the following code in lustre/utils/llverfs.c:

static int dir_write(char *chunk_buf, size_t chunksize,
             time_t time_st, unsigned long dir_num)
{
    ......
    for (; dir_num < num_dirs; num_files++, file_num++) {
        ......

        fd = open_file(new_file(tempfile, tempdir, file_num),
                   O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE);
        ......
    }
    ......
}

I'll investigate further why mds_create_objects() returned -EIO instead of -ENOSPC while opening/creating a file with the OST out of space.

Comment by Andreas Dilger [ 09/May/11 ]

This EIO error likely involves the alloc_rr()->osc_precreate() codepath, where it checks whether the OST is full (OSCC_FLAG_NOSPC) before allocating more objects there. I agree that returning -ENOSPC is the better thing to do in this case; -EIO is not correct for OSCC_FLAG_NOSPC. I think it is still correct to return -EIO when the OSTs are unavailable due to inactive devices, as opposed to the case where the OSTs are merely full.

Because running llverfs takes such a long time, it does not stop the test when it hits this EIO error, just to allow testing to make progress. Looking at this code and the test results, llverfs doesn't consider an error from open_file() fatal (it just breaks out of the loop and begins the read test), and write_chunks() calls write_retry(), which records the error but continues trying to write until it hits too many errors, then moves on to the read verification. It only returns an error at the end of the test.
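
As an illustration of that control flow, here is a minimal standalone sketch (not the actual llverfs source; all demo_* names and the error threshold are hypothetical) of a write phase that treats an open() failure as the end of the phase rather than a fatal error, counts write errors until a threshold, and only reflects errors in the final return code:

#include <stdio.h>

#define DEMO_MAX_ERRORS 100             /* hypothetical error threshold */

static int demo_open(int file_num)
{
        return file_num < 15 ? 0 : -1;  /* simulate EIO at file 15 */
}

static int demo_write(int file_num)
{
        (void)file_num;
        return 0;                       /* simulate a successful write */
}

static int demo_write_phase(int num_files, int *errors)
{
        int file_num;

        for (file_num = 0; file_num < num_files; file_num++) {
                if (demo_open(file_num) < 0)
                        break;          /* not fatal: move on to reads */
                if (demo_write(file_num) < 0 &&
                    ++(*errors) >= DEMO_MAX_ERRORS)
                        break;          /* too many errors: stop writing */
        }
        return *errors ? -1 : 0;        /* error only reported at the end */
}

int main(void)
{
        int errors = 0;
        int rc = demo_write_phase(32, &errors);

        printf("write phase rc=%d, errors=%d; read verification runs next\n",
               rc, errors);
        return rc ? 1 : 0;
}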

Looking at https://maloo.whamcloud.com/test_sets/4b6fc288-7a43-11e0-a1b3-52540025f9af it does appear that the full test on Lustre ran until the filesystem was full, and the read verification then completed. This means it should be possible to start running llverfs on a 24TB ldiskfs filesystem while debugging this problem in Lustre, and if possible also on a 128TB+ ldiskfs filesystem, in order to speed up this process. My calculations show a 24TB test will take about 88 hours, and a 128TB test about 473 hours = 19 days (the 128TB estimate scales roughly linearly from the 24TB one), so it makes sense to start those tests as soon as possible after making the script adjustments described below and verifying them one last time on the 1TB filesystem.

I noticed something incorrect in the test script: after the partial llverfs run on ldiskfs, the test files were deleted before e2fsck was run. This makes the e2fsck check much less valuable, because it hides any kernel or e2fsck bugs related to how the files are allocated on disk. e2fsck should be run after each llverfs test finishes, but before the test files are deleted. There is currently no e2fsck check at all after the partial llverfs run on Lustre; one needs to be added.

Also, for proper problem isolation there should be a full run of llverfs on the ldiskfs-mounted OST filesystem after the partial ldiskfs test, because I suspect there may be bugs in ext4 or the ldiskfs patches themselves, even before we start testing obdfilter on large filesystems. Again, a full e2fsck should be run after llverfs finishes.

Comment by Andreas Dilger [ 10/May/11 ]

If you are modifying llverfs, it might be convenient to print out the current MB/s with each status line, and in a summary at the end, along with an estimate of how long the test will continue (based on total_size = file_size * num_files in partial mode, or total_size = free space in full mode).

The best way to do this, I think, is to keep a global "curr_bytes" counter for the number of bytes read or written since the start of the dir_write() or dir_read() phase. Inside show_filename() it should use gettimeofday() to track last_time, the time the previous message was printed (instead of the simple time() value used now), and it should also track the last_bytes value:

static struct timeval last_time;
static __u64 last_bytes;
static char last_op;

struct timeval curr_time;
double curr_delta, overall_delta, curr_rate, overall_rate;
double total_time, remain_time;
int remain_hours, remain_minutes, remain_seconds;

gettimeofday(&curr_time, NULL);

/* reset the "current rate" window when switching between read/write */
if (last_op != op[0]) {
        last_bytes = 0;
        last_time = start_time;
        last_op = op[0];
}

curr_delta = (curr_time.tv_sec - last_time.tv_sec) +
             (double)(curr_time.tv_usec - last_time.tv_usec) / 1000000;
overall_delta = (curr_time.tv_sec - start_time.tv_sec) +
                (double)(curr_time.tv_usec - start_time.tv_usec) / 1000000;
curr_rate = (curr_bytes - last_bytes) / curr_delta;
overall_rate = curr_bytes / overall_delta;

total_time = total_size / curr_rate;      /* estimated total runtime */
remain_time = total_time - overall_delta; /* minus time already elapsed */
remain_hours = remain_time / 3600;
remain_minutes = (remain_time - remain_hours * 3600) / 60;
remain_seconds = remain_time - remain_hours * 3600 - remain_minutes * 60;

printf("%s filename: %s, current: %g MB/s, overall: %g MB/s, "
       "est %d:%02d:%02d left\n", op, filename,
       curr_rate / (1024 * 1024), overall_rate / (1024 * 1024),
       remain_hours, remain_minutes, remain_seconds);

last_time = curr_time;
last_bytes = curr_bytes;

Comment by Jian Yu [ 25/May/11 ]

Status: I will work on this ticket after Lustre 1.8.6 pre-release/release testing.

Comment by Jian Yu [ 17/Jul/11 ]

If you are modifying llverfs, it might be convenient to print out the current MB/s with each status line, and in a summary at the end, along with an estimate of how long the test will continue (based on total_size = file_size * num_files in partial mode, or total_size = free space in full mode).

The patch for master branch is in http://review.whamcloud.com/1110.

Comment by Jian Yu [ 17/Jul/11 ]

For the EIO error while opening/creating a new file on a full OST, the lctl debug log on the master branch showed the following:

00020000:00000001:1.0:1310895705.290808:0:7325:0:(obd_class.h:742:obd_precreate()) Process entered
00000008:00000001:1.0:1310895705.290809:0:7325:0:(osc_create.c:354:osc_precreate()) Process entered
00000008:00000001:1.0:1310895705.290810:0:7325:0:(osc_create.c:365:osc_precreate()) Process leaving via out (rc=1000 : 1000 : 0x3e8)
00020000:00000001:1.0:1310895705.290811:0:7325:0:(obd_class.h:748:obd_precreate()) Process leaving (rc=1000 : 1000 : 3e8)
00020000:00000001:1.0:1310895705.290812:0:7325:0:(lov_qos.c:685:alloc_rr()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:1.0:1310895705.290813:0:7325:0:(obd_class.h:1101:obd_putref()) Process entered
00020000:00000001:1.0:1310895705.290814:0:7325:0:(obd_class.h:1106:obd_putref()) Process leaving
00020000:00000001:1.0:1310895705.290815:0:7325:0:(lov_qos.c:952:alloc_qos()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:1.0:1310895705.290816:0:7325:0:(lov_qos.c:983:alloc_idx_array()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:1.0:1310895705.290819:0:7325:0:(lov_qos.c:1062:qos_prep_create()) Process leaving via out_err (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)

So, the code path is qos_prep_create()->alloc_idx_array()->alloc_qos()->alloc_rr()->obd_precreate()->osc_precreate(). In osc_precreate(), the code did check that OSCC_FLAG_NOSPC was set, and returned "rc=1000":

int osc_precreate(struct obd_export *exp)
{
        //......
        if (oscc->oscc_flags & OSCC_FLAG_NOSPC ||
            oscc->oscc_flags & OSCC_FLAG_RDONLY ||
            oscc->oscc_flags & OSCC_FLAG_EXITING)
                GOTO(out, rc = 1000);
        //......
}

However, it returned the same value (1000) for every flag, so the calling functions could not tell what the real error was.

The EIO error itself was actually generated here:

int qos_prep_create(struct obd_export *exp, struct lov_request_set *set)
{
        //......
        stripes = alloc_idx_array(exp, lsm, newea, &idx_arr, &idx_cnt, flag);
        if (stripes <= 0)
                GOTO(out_err, rc = stripes ? stripes : -EIO);
        //......
}

alloc_idx_array() returned 0 because alloc_rr() did not properly handle the 1000 return value from osc_precreate().

Andreas, should I improve osc_precreate() to return specific error codes and make alloc_rr() handle them correctly?

Comment by Andreas Dilger [ 17/Jul/11 ]

One possibility is to return 1000+errno so that it can be decoded by the caller but does not affect the logic too much. The difficulty is deciding what to do if different stripes return different errors, but taking the first one, for example, is fine too.

It isn't a high priority if you have other tasks, however.
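
To make the suggestion concrete, here is a minimal standalone sketch of that convention (the demo_* names and DEMO_PRECREATE_BASE are hypothetical, not the landed patch): osc_precreate() would return 1000+errno for a full OST instead of a bare 1000, and a caller such as alloc_rr() could decode the real error rather than letting qos_prep_create() fall back to -EIO:

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define DEMO_PRECREATE_BASE 1000        /* hypothetical "not ready" base */

/* what osc_precreate() might return for a full OST (OSCC_FLAG_NOSPC) */
static int demo_precreate_nospc(void)
{
        return DEMO_PRECREATE_BASE + ENOSPC;
}

/* how a caller such as alloc_rr() could recover the underlying errno */
static int demo_decode(int rc)
{
        if (rc >= DEMO_PRECREATE_BASE)
                return -(rc - DEMO_PRECREATE_BASE);     /* e.g. -ENOSPC */
        return rc;
}

int main(void)
{
        int rc = demo_decode(demo_precreate_nospc());

        printf("decoded rc = %d (%s)\n", rc, strerror(-rc));
        return 0;
}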

Comment by Jian Yu [ 18/Jul/11 ]

One possibility is to return 1000+errno so that it can be decoded by the caller but does not affect the logic too much. The difficulty is deciding what to do if different stripes return different errors, but taking the first one, for example, is fine too.

It isn't a high priority if you have other tasks, however.

OK, I see. I'll work on my other, higher-priority tasks first, then.

Comment by Build Master (Inactive) [ 11/Aug/11 ]

Integrated in lustre-master #254 (all client and server build configurations: i686/x86_64, el5/el6/sles11/ubuntu1004, inkernel/ofa)
LU-297 improve llverfs to display the current read/write performance

Oleg Drokin : 64f0f965dc80556739a3f86c15c0bb158b6289a2
Files :

  • lustre/utils/llverdev.c
  • lustre/utils/llverfs.c
Comment by Andreas Dilger [ 13/May/16 ]

The patch landed for Lustre 2.0.67, and the OST object precreate code was completely rewritten in 2.4.0.

Generated at Sat Feb 10 01:05:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.