[LU-6747] Intermittent rc=-EROFS from lod_statfs_and_check
Created: 19/Jun/15  Updated: 13/Feb/16  Resolved: 10/Jul/15
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | lustre 2.7.54 |
| Severity: | 3 |
| Description |
Running mdtest I get failures like this:

06/18/2015 16:22:12: Process 26(zwicky29): FAILED in create_remove_items_helper, unable to create file file.mdtest.26.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.26.0): No space left on device

The three servers involved are using the ZFS backend and their pools have lots of free space; all are <1% full. In the Lustre debug log on the MDS, I'm seeing many functions report exiting with -28 (-ENOSPC), for example:

lod_object.c:2104:lod_declare_xattr_set()

I've attached a few thousand lines of debug output from the MDS with both debug and debug_subsys set to -1. I can reproduce easily, so I can get debug output with specific subsystems turned off or on.
| Comments |
| Comment by Olaf Faaland [ 19/Jun/15 ] |
mdtest is being run as follows:

mdtest-1.8.3 was launched with 32 total task(s) on 32 nodes
32 tasks, 32000 files/directories

Operation   Duration   Rate
This occurs intermittently; sometimes mdtest will create tens of thousands of files before this occurs, other times none or very few are created before it happens. When it does occur, all the clients in the mdtest job (32 in the example I gave above) report the same failure. There is no console log output on any of the servers between the time I start running mdtest and after the error has occurred.

The problem disappears on its own; re-running mdtest after it fails, with a new target directory on the same filesystem, resumes with no problem. This is true whether mdtest runs on any of the same nodes as the previous invocation or not.

I sampled prealloc_status, prealloc_next_id, and prealloc_last_id for each of the two OSTs on a 1-second interval, starting when I ran mdtest and stopping after the job failed. prealloc_status was 0 the entire time for both OSTs, and $(cat prealloc_last_id) - $(cat prealloc_next_id) got as low as 5030 for one OST.

I also noticed that files are not created evenly on the OSTs according to ltop; it shows the lock count increasing on one OST for several seconds, then that OST's lock count stops changing and the other OST's lock count starts increasing. In the case of one 32-node mdtest run, it created all 32000 files on OST0001 and none on OST0000. Other times it finishes with the same number of locks on each OST, but they see-saw up.
| Comment by Olaf Faaland [ 19/Jun/15 ] |
osp_statfs() reports many free blocks and many free files for OST0000 when called by lod_statfs_and_check():

274 00020000:00000001:19.0:1434585022.784732:0:63759:0:(lod_qos.c:193:lod_statfs_and_check()) Process entered
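For context on why a space-rich statfs reply can still make this check fail: the allocator does not look only at the free counts, it also looks at the state flags in the reply, so a readonly-type flag disqualifies the OST regardless of how much space is reported. The following is a minimal illustrative model of that behaviour; the struct, flag value, and check are invented for the example and are not the actual lod_qos.c code.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_STATE_READONLY 0x2         /* illustrative flag value, not from the Lustre tree */

struct demo_statfs {
        uint64_t os_bfree;      /* free blocks */
        uint64_t os_ffree;      /* free objects */
        uint32_t os_state;      /* state flags returned by the OST */
};

static int check_ost(const struct demo_statfs *sfs)
{
        /* a readonly-type flag disqualifies the OST no matter how much space is free */
        if (sfs->os_state & DEMO_STATE_READONLY)
                return -EROFS;
        if (sfs->os_bfree == 0 || sfs->os_ffree == 0)
                return -ENOSPC;
        return 0;
}

int main(void)
{
        struct demo_statfs ok  = { .os_bfree = 1000, .os_ffree = 1000, .os_state = 0 };
        struct demo_statfs bad = { .os_bfree = 1000, .os_ffree = 1000,
                                   .os_state = DEMO_STATE_READONLY };

        printf("clean state:       %d\n", check_ost(&ok));   /* 0 */
        printf("readonly flag set: %d\n", check_ost(&bad));  /* -EROFS, i.e. -30 on Linux */
        return 0;
}

Compiled standalone, the second call fails with -EROFS even though the free counts are identical to the first.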
| Comment by Gerrit Updater [ 19/Jun/15 ] |
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/15346
| Comment by Andreas Dilger [ 19/Jun/15 ] |
The -EROFS return is because the OST returned os_state=OS_STATE_READONLY to the MDS; if all of the OSTs are returning this, then there are no available OST objects and the MDS needs to return -ENOSPC. The main point of investigation should be why the OST(s) are returning OS_STATE_READONLY.

As far as I can see, the only place that sets OS_STATE_READONLY is in osd-ldiskfs, but you are testing ZFS OSTs, so it isn't clear where that is coming from (there is only a commented-out section in osd-zfs/osd_handler.c which references it). If you run with D_CACHE enabled on the OSS (lctl set_param debug=+cache) it would print the os_state field on the OST during statfs processing.

Looking at the osd-zfs osd_statfs() method, it seems possible that os_state is not being initialized properly and is just holding random data. I'd think that this would have triggered earlier if that were the case, or maybe I'm mistaken and it is initialized somewhere else, but in the osd-ldiskfs case it is initialized within its osd_statfs() call.
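To illustrate the uninitialized-field theory, here is a minimal user-space sketch (invented struct and field names, not the osd-zfs code) of how a partially-filled reply leaks stale memory through os_state, and how zeroing the buffer before filling it, which is what the proposed patch appears to do, avoids that.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct demo_statfs {
        uint64_t os_bfree;      /* free blocks */
        uint64_t os_ffree;      /* free objects */
        uint32_t os_state;      /* never touched by the buggy fill path below */
};

/* buggy path: fills only the counters, so os_state keeps whatever was in memory */
static void fill_statfs_buggy(struct demo_statfs *sfs)
{
        sfs->os_bfree = 1000;
        sfs->os_ffree = 1000;
}

/* fixed path: zero the whole reply first, then fill in the known fields */
static void fill_statfs_fixed(struct demo_statfs *sfs)
{
        memset(sfs, 0, sizeof(*sfs));
        sfs->os_bfree = 1000;
        sfs->os_ffree = 1000;
}

int main(void)
{
        struct demo_statfs sfs;

        memset(&sfs, 0xA5, sizeof(sfs));        /* stand-in for stale stack/heap contents */
        fill_statfs_buggy(&sfs);
        printf("buggy os_state = 0x%" PRIx32 "\n", sfs.os_state);       /* 0xa5a5a5a5 */

        memset(&sfs, 0xA5, sizeof(sfs));
        fill_statfs_fixed(&sfs);
        printf("fixed os_state = 0x%" PRIx32 "\n", sfs.os_state);       /* 0 */
        return 0;
}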
| Comment by Andreas Dilger [ 19/Jun/15 ] |
Totally untested patch, but I think it is pretty reasonable and will be tested out soon enough.
| Comment by Peter Jones [ 19/Jun/15 ] |
Assigning to Nathaniel for any follow-on questions.
| Comment by Olaf Faaland [ 19/Jun/15 ] |
Both OSTs are indeed reporting a nonzero state intermittently:

zwicky-lburn-oss1.out:00002000:00000020:0.0:1434584977.251447:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176813342 avail; 5658325578 objects: 5658303680 free; state 9125230e

I'll try the patch.
| Comment by Olaf Faaland [ 20/Jun/15 ] |
Nathaniel,

The patch had an error in its usage of offsetof(), which I fixed, but the patch still fails to build on SLES12. It builds fine on RHEL. I looked at the SLES12 build console output but am not able to tell what the problem is. Can you take a look?

thanks,
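For reference, a hypothetical example of the offsetof() idiom such a patch might use to clear only part of a structure; the struct and member names are made up here and this is not the actual patch. The length passed to memset() has to be sizeof(struct) minus offsetof(member), which is easy to get wrong.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_statfs {
        uint64_t os_blocks;
        uint64_t os_bfree;
        uint32_t os_state;      /* everything from here down should be zeroed */
        uint32_t os_spare[8];
};

static void zero_tail(struct demo_statfs *sfs)
{
        /* zero os_state and every member after it, leaving the counters intact */
        memset((char *)sfs + offsetof(struct demo_statfs, os_state), 0,
               sizeof(*sfs) - offsetof(struct demo_statfs, os_state));
}

int main(void)
{
        struct demo_statfs sfs;

        memset(&sfs, 0xA5, sizeof(sfs));        /* simulate stale memory */
        sfs.os_blocks = 100;
        sfs.os_bfree = 90;

        zero_tail(&sfs);
        assert(sfs.os_state == 0);
        assert(sfs.os_spare[7] == 0);
        assert(sfs.os_blocks == 100);
        return 0;
}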
| Comment by Olaf Faaland [ 20/Jun/15 ] |
The patch appears to have resolved the problem; mdtest has now been running for several hours without encountering the issue, and ran for several hours last night as well.

I'm surprised I didn't see this problem earlier. I had run mdtest in the same manner, against the same SPL/ZFS/Lustre versions, on the same clients and servers, with SPL and ZFS built using the --enable-debug configure option, and didn't observe the problem. I then rebuilt SPL and ZFS without debug and rebuilt Lustre against them, which is when I began to see this.
| Comment by Gerrit Updater [ 10/Jul/15 ] |
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15346/
| Comment by Peter Jones [ 10/Jul/15 ] |
Landed for 2.8