[LU-297] llverfs: Open '/mnt/lustre/dir00007/file015' failed:Input/output error Created: 09/May/11 Updated: 13/May/16 Resolved: 13/May/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.7 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre Branch: b1_8 |
| Severity: | 3 |
| Bugzilla ID: | 24017 |
| Rank (Obsolete): | 4918 |
| Description |
|
While running llverfs in full mode on a Lustre filesystem with a 1TB OST, it hit an -EIO error after writing (7*32+15)*4G = 956G of data into the filesystem:

llverfs: Open '/mnt/lustre/dir00007/file015' failed:Input/output error

The dmesg log showed:

Lustre: DEBUG MARKER: ======== run llverfs in full mode on the Lustre filesystem /mnt/lustre ========
LustreError: 21145:0:(mds_open.c:441:mds_create_objects()) error creating objects for inode 41943290: rc = -5
LustreError: 21145:0:(mds_open.c:826:mds_finish_open()) mds_create_objects: rc = -5

Before running llverfs, the Lustre disk space and inode usage were as follows:

# lfs df -h /mnt/lustre
UUID                    bytes     Used  Available  Use%  Mounted on
largefs-MDT0000_UUID   224.0G   461.0M     210.7G    0%  /mnt/lustre[MDT:0]
largefs-OST0000_UUID  1007.9G   472.6M     956.3G    0%  /mnt/lustre[OST:0]
filesystem summary:   1007.9G   472.6M     956.3G    0%  /mnt/lustre

# lfs df -i /mnt/lustre
UUID                    Inodes  IUsed     IFree  IUse%  Mounted on
largefs-MDT0000_UUID  58597923     26  58597897     0%  /mnt/lustre[MDT:0]
largefs-OST0000_UUID  67108864     72  67108792     0%  /mnt/lustre[OST:0]
filesystem summary:   58597923     26  58597897     0%  /mnt/lustre

After running llverfs, the Lustre disk space and inode usage were as follows:

# lfs df -h /mnt/lustre
UUID                    bytes     Used  Available  Use%  Mounted on
largefs-MDT0000_UUID   224.0G   461.0M     210.7G    0%  /mnt/lustre[MDT:0]
largefs-OST0000_UUID  1007.9G   956.5G     276.8M  100%  /mnt/lustre[OST:0]
filesystem summary:   1007.9G   956.5G     276.8M  100%  /mnt/lustre

# lfs df -i /mnt/lustre
UUID                    Inodes  IUsed     IFree  IUse%  Mounted on
largefs-MDT0000_UUID  58598163    274  58597889     0%  /mnt/lustre[MDT:0]
largefs-OST0000_UUID  13492969    328  13492641     0%  /mnt/lustre[OST:0]
filesystem summary:   58598163    274  58597889     0%  /mnt/lustre

Maloo report: https://maloo.whamcloud.com/test_sets/4b6fc288-7a43-11e0-a1b3-52540025f9af |
| Comments |
| Comment by Jian Yu [ 09/May/11 ] |
|
The failure occurred while performing the following code in lustre/utils/llverfs.c:

static int dir_write(char *chunk_buf, size_t chunksize,
                     time_t time_st, unsigned long dir_num)
{
        ......
        for (; dir_num < num_dirs; num_files++, file_num++) {
                ......
                fd = open_file(new_file(tempfile, tempdir, file_num),
                               O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE);
                ......
        }
        ......
}

I'll investigate further why mds_create_objects() returned -EIO instead of -ENOSPC when opening/creating a file with the OST out of space. |
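To see why the distinction matters on the client side, here is a minimal sketch (a hypothetical helper, not llverfs code) of how a caller can only classify the failure through the errno that open() surfaces; if the MDS maps a full OST to -EIO, a full filesystem becomes indistinguishable from real I/O corruption:

#define _GNU_SOURCE             /* exposes O_LARGEFILE on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* hypothetical helper, not part of llverfs */
int open_or_classify(const char *path)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE, 0644);

        if (fd < 0) {
                if (errno == ENOSPC)    /* expected when the OST fills up */
                        fprintf(stderr, "%s: filesystem full\n", path);
                else                    /* EIO here looks like real corruption */
                        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        }
        return fd;
}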
| Comment by Andreas Dilger [ 09/May/11 ] |
|
This EIO error likely involves the alloc_rr()->osc_precreate() codepath, which checks whether the OST is full (OSCC_FLAG_NOSPC) before allocating more objects there. I agree that returning -ENOSPC is the better thing to do in this case; -EIO is not correct for OSCC_FLAG_NOSPC. It is still correct to return -EIO in the case where the OSTs are unavailable due to inactive devices, as opposed to the case where the OSTs are full.

Because running llverfs takes such a long time, it does not stop the test when it hits this EIO error, which allows testing to make progress. Looking at this code and the test results, llverfs doesn't consider an error from open_file() fatal (it just breaks out of the loop and begins the read test), and write_chunks() calls write_retry(), which should record the error but continue trying to write until it hits too many errors, then move on to the read verification. It only returns an error at the end of the test. Looking at https://maloo.whamcloud.com/test_sets/4b6fc288-7a43-11e0-a1b3-52540025f9af it does appear that the full test on Lustre ran until the filesystem was full, and then the read verification completed.

This means it should be possible to start running llverfs on a 24TB ldiskfs filesystem while debugging this problem in Lustre, and if possible also on a 128TB+ ldiskfs filesystem in order to speed up this process. My calculations show a 24TB test will take about 88 hours, and a 128TB test about 473 hours = 19 days, so it makes sense to start those tests as soon as possible after making the script adjustments described below and verifying them one last time on the 1TB filesystem.

I noticed something incorrect in the test script: after the partial llverfs run on ldiskfs, the test files were deleted before e2fsck was run. This makes the e2fsck check much less valuable, because it hides any kernel or e2fsck bugs related to how the files are allocated on disk. e2fsck should be run after each llverfs test finishes, but before the test files are deleted. There is currently no e2fsck check at all after the partial llverfs run on Lustre; this needs to be added. Also, for proper problem isolation, there should be a full run of llverfs on the ldiskfs-mounted OST filesystem after the partial ldiskfs test, because I suspect there may be some bugs in ext4 or the ldiskfs patches themselves, even before we start testing obdfilter on large filesystems. Again, a full e2fsck run should follow the llverfs run. |
| Comment by Andreas Dilger [ 10/May/11 ] |
|
If you are modifying llverfs, it might be convenient to print out the current MB/s with each status line, and in a summary at the end, along with an estimate of how long the test will continue (based on total_size = file_size * num_files to create in partial mode, or total_size = free space in full mode).

The best way to do this, I think, is to keep a global "curr_bytes" counter for the number of bytes read or written since the start of the dir_write() or dir_read() phase. Inside show_filename() it should use gettimeofday() for the last_time a message was printed (instead of the simple time() value used now), and it should track the last_bytes value:

static struct timeval last_time;

if (last_op != op[0]) {
        last_bytes = 0;
        last_time = start_time;
        last_op = op[0];
}
curr_delta = (curr_time.tv_sec - last_time.tv_sec) +
             (curr_time.tv_usec - last_time.tv_usec) / 1000000.0;
total_time = total_size / curr_rate;
"%s Filename %s current %g MB/s, overall %g MB/s, est %u:%u:%u left", op,
last_time = curr_time; |
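Filling in that sketch, a self-contained version of the rate/ETA calculation could look as follows. This assumes hypothetical names (curr_bytes, last_bytes, total_size, show_rate); it is an illustration of the suggestion, not the actual llverfs patch:

#include <stdio.h>
#include <sys/time.h>

/* seconds elapsed between two gettimeofday() samples */
static double tv_delta(const struct timeval *a, const struct timeval *b)
{
        return (b->tv_sec - a->tv_sec) +
               (b->tv_usec - a->tv_usec) / 1000000.0;
}

/* Print current and overall MB/s plus an ETA for the remaining bytes.
 * curr_bytes counts bytes moved since the phase started; last_bytes
 * and last_time remember the previous status line. */
static void show_rate(const char *op, const char *filename,
                      unsigned long long curr_bytes,
                      unsigned long long *last_bytes,
                      unsigned long long total_size,
                      const struct timeval *start_time,
                      struct timeval *last_time)
{
        struct timeval curr_time;
        double curr_delta, total_delta, curr_rate, total_rate;
        unsigned int left;

        gettimeofday(&curr_time, NULL);
        curr_delta = tv_delta(last_time, &curr_time);
        total_delta = tv_delta(start_time, &curr_time);
        if (curr_delta <= 0 || total_delta <= 0)
                return;         /* avoid divide-by-zero on back-to-back calls */

        curr_rate = (curr_bytes - *last_bytes) / curr_delta / 1048576.0;
        total_rate = curr_bytes / total_delta / 1048576.0;
        left = total_rate > 0 ?
               (unsigned int)((total_size - curr_bytes) /
                              (total_rate * 1048576.0)) : 0;

        printf("%s filename %s current %g MB/s, overall %g MB/s, "
               "est %u:%02u:%02u left\n", op, filename, curr_rate, total_rate,
               left / 3600, (left / 60) % 60, left % 60);

        *last_bytes = curr_bytes;
        *last_time = curr_time;
}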
| Comment by Jian Yu [ 25/May/11 ] |
|
Status: I will work on this ticket after Lustre 1.8.6 pre-release/release testing. |
| Comment by Jian Yu [ 17/Jul/11 ] |
The patch for the master branch is in http://review.whamcloud.com/1110. |
| Comment by Jian Yu [ 17/Jul/11 ] |
|
For the EIO error while opening/creating a new file with a full OST, the lctl debug log on the master branch showed:

00020000:00000001:1.0:1310895705.290808:0:7325:0:(obd_class.h:742:obd_precreate()) Process entered
00000008:00000001:1.0:1310895705.290809:0:7325:0:(osc_create.c:354:osc_precreate()) Process entered
00000008:00000001:1.0:1310895705.290810:0:7325:0:(osc_create.c:365:osc_precreate()) Process leaving via out (rc=1000 : 1000 : 0x3e8)
00020000:00000001:1.0:1310895705.290811:0:7325:0:(obd_class.h:748:obd_precreate()) Process leaving (rc=1000 : 1000 : 3e8)
00020000:00000001:1.0:1310895705.290812:0:7325:0:(lov_qos.c:685:alloc_rr()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:1.0:1310895705.290813:0:7325:0:(obd_class.h:1101:obd_putref()) Process entered
00020000:00000001:1.0:1310895705.290814:0:7325:0:(obd_class.h:1106:obd_putref()) Process leaving
00020000:00000001:1.0:1310895705.290815:0:7325:0:(lov_qos.c:952:alloc_qos()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:1.0:1310895705.290816:0:7325:0:(lov_qos.c:983:alloc_idx_array()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:1.0:1310895705.290819:0:7325:0:(lov_qos.c:1062:qos_prep_create()) Process leaving via out_err (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)

So the code path is qos_prep_create() -> alloc_idx_array() -> alloc_qos() -> alloc_rr() -> obd_precreate() -> osc_precreate(). In osc_precreate(), it did check that OSCC_FLAG_NOSPC was set and returned "rc = 1000":

int osc_precreate(struct obd_export *exp)
{
        //......
        if (oscc->oscc_flags & OSCC_FLAG_NOSPC ||
            oscc->oscc_flags & OSCC_FLAG_RDONLY ||
            oscc->oscc_flags & OSCC_FLAG_EXITING)
                GOTO(out, rc = 1000);
        //......
}

However, it did not return a distinct error code for each flag, so the calling functions could not tell what the real error was. The EIO error was actually generated here:

int qos_prep_create(struct obd_export *exp, struct lov_request_set *set)
{
        //......
        stripes = alloc_idx_array(exp, lsm, newea, &idx_arr, &idx_cnt, flag);
        if (stripes <= 0)
                GOTO(out_err, rc = stripes ? stripes : -EIO);
        //......
}

alloc_idx_array() returned 0 because alloc_rr() did not handle the return value of osc_precreate() properly.

Andreas, should I improve osc_precreate() to return specific error codes and make alloc_rr() handle them correctly? |
| Comment by Andreas Dilger [ 17/Jul/11 ] |
|
One possibility is to return 1000+(errno) so that it can be decoded by the caller without affecting the logic too much. The difficulty is deciding what to do if different stripes return different errors, but, e.g., taking the first one is OK too. It isn't a high priority if you have other tasks, however. |
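As an illustration only (hypothetical names, not the actual osc/lov patch), the 1000+errno convention could look like this: the producer encodes the reason above the existing "cannot precreate" threshold, and the consumer decodes it back into a negative errno:

#include <errno.h>

#define OSC_PRECREATE_BASE 1000 /* rc >= 1000 already means "cannot precreate" */

/* producer side: encode why this OST cannot precreate objects */
static int precreate_check(int nospc, int rdonly)
{
        if (nospc)
                return OSC_PRECREATE_BASE + ENOSPC;
        if (rdonly)
                return OSC_PRECREATE_BASE + EROFS;
        return 0;               /* OST is usable */
}

/* consumer side: recover a negative errno from an encoded rc */
static int precreate_errno(int rc)
{
        if (rc >= OSC_PRECREATE_BASE)
                return -(rc - OSC_PRECREATE_BASE);
        return rc;
}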
| Comment by Jian Yu [ 18/Jul/11 ] |
OK, I see. I'll work on my other tasks, which have higher priority, then. |
| Comment by Build Master (Inactive) [ 11/Aug/11 ] |
|
Integrated in Oleg Drokin : 64f0f965dc80556739a3f86c15c0bb158b6289a2
|
| Comment by Andreas Dilger [ 13/May/16 ] |
|
The patch landed for Lustre 2.0.67, and the OST object precreate code was completely rewritten in 2.4.0. |