[LU-413] performance-sanity test_8: rank 0: open(f124836) error: Input/output error Created: 14/Jun/11 Updated: 23/Apr/14 Resolved: 23/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.7 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Johann Lombardi (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: v1_8_6_RC2 |
||
| Severity: | 3 |
| Bugzilla ID: | 23206 |
| Rank (Obsolete): | 4972 |
| Description |
|
performance-sanity test_8 failed as follows:

===== mdsrate-stat-large.sh Test preparation: creating 125125 files.
+ /usr/lib64/lustre/tests/mdsrate --create --dir /mnt/lustre/mdsrate --nfiles 125125 --filefmt 'f%%d'
UUID                 Inodes   IUsed   IFree IUse% Mounted on
lustre-MDT0000_UUID  415069      50  415019    0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID  125184      89  125095    0% /mnt/lustre[OST:0]
lustre-OST0001_UUID  125184      89  125095    0% /mnt/lustre[OST:1]
lustre-OST0002_UUID  125184      89  125095    0% /mnt/lustre[OST:2]
lustre-OST0003_UUID  125184      89  125095    0% /mnt/lustre[OST:3]
lustre-OST0004_UUID  125184      89  125095    0% /mnt/lustre[OST:4]
lustre-OST0005_UUID  125184      89  125095    0% /mnt/lustre[OST:5]
filesystem summary:  415069      50  415019    0% /mnt/lustre
+ chmod 0777 /mnt/lustre
drwxrwxrwx 5 root root 4096 Jun 13 13:41 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -np 2 -machinefile /tmp/mdsrate-stat-large.machines /usr/lib64/lustre/tests/mdsrate --create --dir /mnt/lustre/mdsrate --nfiles 125125 --filefmt 'f%%d' "
0: client-10-ib starting at Mon Jun 13 13:49:30 2011
rank 0: open(f124836) error: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 4468 on node client-10-ib exiting without calling "finalize".
This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
rank 1: open(f124837) error: Input/output error
UUID                 Inodes   IUsed   IFree IUse% Mounted on
lustre-MDT0000_UUID  500096  124886  375210   25% /mnt/lustre[MDT:0]
lustre-OST0000_UUID  125184  124985     199  100% /mnt/lustre[OST:0]
lustre-OST0001_UUID  125184  125184       0  100% /mnt/lustre[OST:1]
lustre-OST0002_UUID  125184  125184       0  100% /mnt/lustre[OST:2]
lustre-OST0003_UUID  125184  123961    1223   99% /mnt/lustre[OST:3]
lustre-OST0004_UUID  125184  124633     551  100% /mnt/lustre[OST:4]
lustre-OST0005_UUID  125184  124377     807   99% /mnt/lustre[OST:5]
filesystem summary:  500096  124886  375210   25% /mnt/lustre
status script Total(sec) E(xcluded) S(low)
------------------------------------------------------------------------------------
test-framework exiting on error
performance-sanity test_8: @@@@@@ FAIL: test_8 failed with 1

Dmesg on the MDS node:

Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Test preparation: creating 125125 files.
Lustre: 8659:0:(lov_qos.c:459:qos_shrink_lsm()) using fewer stripes for object 278662: old 6 new 5
Lustre: 8681:0:(lov_qos.c:459:qos_shrink_lsm()) using fewer stripes for object 278663: old 6 new 5
Lustre: 8663:0:(lov_qos.c:459:qos_shrink_lsm()) using fewer stripes for object 279300: old 6 new 5
Lustre: 8663:0:(lov_qos.c:459:qos_shrink_lsm()) Skipped 636 previous similar messages
Lustre: 8662:0:(lov_qos.c:459:qos_shrink_lsm()) using fewer stripes for object 280599: old 6 new 3
Lustre: 8662:0:(lov_qos.c:459:qos_shrink_lsm()) Skipped 1298 previous similar messages
LustreError: 8685:0:(mds_open.c:441:mds_create_objects()) error creating objects for inode 281132: rc = -5
LustreError: 8685:0:(mds_open.c:826:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 8681:0:(mds_open.c:441:mds_create_objects()) error creating objects for inode 281132: rc = -5
LustreError: 8681:0:(mds_open.c:826:mds_finish_open()) mds_create_objects: rc = -5
Lustre: DEBUG MARKER: performance-sanity test_8: @@@@@@ FAIL: test_8 failed with 1

Dmesg on the OSS node:

Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Test preparation: creating 125125 files.
LustreError: 25861:0:(filter.c:3449:filter_precreate()) create failed rc = -28
LustreError: 27807:0:(filter.c:3449:filter_precreate()) create failed rc = -28
LustreError: 27804:0:(filter.c:3449:filter_precreate()) create failed rc = -28
LustreError: 27804:0:(filter.c:3449:filter_precreate()) Skipped 2 previous similar messages
Lustre: DEBUG MARKER: performance-sanity test_8: @@@@@@ FAIL: test_8 failed with 1

Maloo report: https://maloo.whamcloud.com/test_sets/9b2e5a46-964f-11e0-9a27-52540025f9af

This is a known issue: bug 23206 |
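The inode listings above are lfs df -i output: each OST starts with about 125095 free inodes, and by the end of the run they are essentially exhausted, which is why filter_precreate() returns -28 (ENOSPC) on the OSS and the MDS returns -5 (EIO) to the client's open(). As a quick sanity check before a run (not part of the test script itself), the per-target inode headroom can be inspected with:

    # free inodes per MDT/OST; the smallest OST IFree value bounds how many
    # more objects that OST can still hold
    lfs df -i /mnt/lustre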
| Comments |
| Comment by Andreas Dilger [ 14/Jun/11 ] |
|
I think the prime issue here is that the "mdsrate_inodes_available()" function incorrectly assumes it can create min(num_ost_objects) files across all of the OSTs with wide striping. The MDS does not allocate objects perfectly evenly, in order to avoid waiting for slow OSTs. I've uploaded http://review.whamcloud.com/#change,941 for master and http://review.whamcloud.com/942 for b1_8. It would also be a good idea to land Dmitry's patch from bugzilla 23206. That may also resolve the issue, but it carries much higher risk and is not suitable for 1.8.6-RC. |
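For illustration only (this is not the content of the patches above), a more conservative bash check in the spirit of mdsrate_inodes_available() could bound the file count by the least-free OST rather than assuming a perfectly even spread; the variable names below are hypothetical:

    # hypothetical sketch: pick a file count that the least-free OST can hold,
    # with a margin because QOS object allocation is deliberately uneven
    min_ost_ifree=$(lfs df -i $MOUNT | awk '/OST/ { print $4 }' | sort -n | head -n 1)
    mdt_ifree=$(lfs df -i $MOUNT | awk '/MDT/ { print $4 }' | head -n 1)
    nfiles=$(( (min_ost_ifree < mdt_ifree ? min_ost_ifree : mdt_ifree) * 9 / 10 ))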
| Comment by Peter Jones [ 16/Jun/11 ] |
|
Andreas has provided initial patches for this. We can reassign if another engineer takes over this effort. |
| Comment by Build Master (Inactive) [ 30/Jun/11 ] |
|
Integrated in Oleg Drokin : fc73791d9bd7e71538a96f8700a8cca737598e1a
|
| Comment by Build Master (Inactive) [ 26/Jul/11 ] |
|
Integrated in Johann Lombardi : 930243348131214ede3376790dbcdab50335d3ee
|
| Comment by Andreas Dilger [ 26/Jul/11 ] |
|
Landed to both master and b1_8. |
| Comment by Andreas Dilger [ 26/Jul/11 ] |
|
Reopening this issue, because it is also tracking the landing of the bug 23206 patch from bugzilla. Assigning to Johann for further reassignment. |
| Comment by Jian Yu [ 16/May/12 ] |
|
Lustre Tag: v1_8_8_WC1_RC1
performance-sanity test_8 failed with the same issue: |
| Comment by Jian Yu [ 16/May/12 ] |
|
Lustre Tag: v1_8_8_WC1_RC1
The compilebench runs in the parallel-scale-nfsv{3,4} tests also failed with this issue: |
| Comment by Jian Yu [ 31/May/12 ] |
|
Lustre client: 1.8.8-wc1
performance-sanity test_8 failed with the same issue: |
| Comment by Vladimir V. Saveliev [ 18/Sep/12 ] |
| Comment by Jian Yu [ 14/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC1
The compilebench tests in parallel-scale{,-nfsv3,-nfsv4}.sh all hit the same issue: |
| Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ] |
|
Failures come from -28/ENOSPC errors during filter_precreate() on the OSTs, so the compilebench tests must protect themselves against the number of available inodes, as is already done for performance-sanity. Also, I see in the ticket history that the compilebench/parallel-scale-nfsv{3,4} tests already failed in May 2012; do we remember what was done at that time? Some cleanup beforehand, or a change in the number of OSTs/inodes? On the other hand, if we want to do the same pre-check in our compilebench-based tests, we need to be able to evaluate its file count/inode consumption, but I don't find that in our sources; does it come from the public benchmark sources? |
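One possible shape for such a guard in the compilebench wrapper, assuming it sources test-framework.sh and that an inode-consumption estimate for compilebench can be obtained separately (COMPILEBENCH_EST_FILES below is an assumed placeholder, not a number taken from the benchmark):

    # hypothetical pre-check: skip compilebench when free inodes look too low
    COMPILEBENCH_EST_FILES=${COMPILEBENCH_EST_FILES:-200000}   # assumed estimate
    free_inodes=$(lfs df -i $MOUNT | awk '/summary/ { print $5 }')
    if [ "$free_inodes" -lt "$COMPILEBENCH_EST_FILES" ]; then
        skip_env "only $free_inodes inodes free, need ~$COMPILEBENCH_EST_FILES for compilebench"
        return 0
    fi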
| Comment by Jian Yu [ 20/Feb/13 ] |
|
Hi Bruno, FYI: the original compilebench source is at https://oss.oracle.com/~mason/compilebench/. In addition, I found that Vladimir Saveliev uploaded a patch at http://review.whamcloud.com/4025. |
| Comment by Jian Yu [ 20/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC2
The compilebench tests in parallel-scale{,-nfsv3,-nfsv4}.sh all hit the same issue: |