[LU-6492] less +F doesn't work Created: 24/Apr/15 Updated: 10/Oct/21 Resolved: 10/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nick Tan (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 6.4 x86_64 |
||
| Epic/Theme: | lustre-2.5.3 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
node 1 streams data into an ever-growing file (e.g. a log file). node 2 can see the file changing with stat: $ stat logs/*.3 $ stat logs/*.3 On node 2, a tail -f (which calls fstat() to determine if the file has changed) works: 13:34:45.548212 read(3, ""..., 8192) = 5118 <0.005387> 13:34:45.555566 nanosleep({1, 0}, NULL) = 0 <1.000124> 13:34:46.555811 fstat(3, {st_mode=S_IFREG|0664, st_size=33712023, ...} ) = 0 <0.000185> 13:34:47.556374 fstat(3, {st_mode=S_IFREG|0664, st_size=33712023, ...}) = 0 <0.000175> 13:34:47.556684 nanosleep({1, 0} , NULL) = 0 <1.000134> 13:34:48.557234 nanosleep({1, 0}, NULL) = 0 <1.000110> 13:34:49.557444 fstat(3, {st_mode=S_IFREG|0664, st_size=33712023, ...} ) = 0 <0.037993> ) = 0 <0.000105> Interestingly, even though the file size didn't change (we can see the st_size there), it evidently indicated some kind of change (perhaps mtime?) because tail knew to try the read() again, and the read() succeeded. After the successful read, it did another read/fstat, and at that point the st_size was updated. However "less" in Follow mode (which doesn't do fstat() before read()) doesn't work: 13:33:11.949002 nanosleep( {1, 0}, 0x7fff832a7c30) = 0 <1.000156>13:33:12.949280 read(4, "", 5324) = 0 <0.000028> 13:33:12.949752 nanosleep({1, 0} , 0x7fff832a7c30) = 0 <1.000155> 13:33:14.950767 read(4, "", 5324) = 0 <0.000029> 13:33:14.951249 nanosleep({1, 0} , 0x7fff832a7c30) = 0 <1.000150> 13:33:16.952287 read(4, "", 5324) = 0 <0.000029> 13:33:16.952738 nanosleep({1, 0} , 0x7fff832a7c30) = 0 <1.000148> Despite the fact that the file is growing (as seen by stat in another terminal), the read() acts like the file descriptor is still at the end of the file. The logfile grows every 10 seconds, but you can leave that less running for hours and hours, it'll never see it. |
| Comments |
| Comment by Phil Schwan (Inactive) [ 18/Nov/15 ] |
|
This doesn't seem all that minor to me – it's a pretty fundamental violation of the POSIX semantics on which Lustre was based, non? This bug is basically saying that if you're extending a file on one node, you can't rely on read() returning the correct result on another node. That's major! In fact, I'm sure we had automated tests for cases exactly like this, back in 2005. I'm surprised a bug like this could even make it into a release. |
| Comment by Robert Read (Inactive) [ 18/Nov/15 ] |
|
Hey, Phil, nice to see you! I was curious so I tried to reproduce this on 2.7.0 and CentOS 6.7 using this: [ec2-user@client00 ~]$ for i in $(seq 100); do echo "$i $(date)"; sleep 10; done > /mnt/scratch/test/watchme Then ran "less +F" on a different client, and I am seeing the updates as they are written: ~ ~ 1 Wed Nov 18 17:21:41 UTC 2015 2 Wed Nov 18 17:21:51 UTC 2015 3 Wed Nov 18 17:22:01 UTC 2015 4 Wed Nov 18 17:22:11 UTC 2015 5 Wed Nov 18 17:22:21 UTC 2015 6 Wed Nov 18 17:22:31 UTC 2015 7 Wed Nov 18 17:22:41 UTC 2015 8 Wed Nov 18 17:22:51 UTC 2015 9 Wed Nov 18 17:23:01 UTC 2015 10 Wed Nov 18 17:23:11 UTC 2015 11 Wed Nov 18 17:23:21 UTC 2015 12 Wed Nov 18 17:23:31 UTC 2015 13 Wed Nov 18 17:23:41 UTC 2015 14 Wed Nov 18 17:23:51 UTC 2015 15 Wed Nov 18 17:24:01 UTC 2015 Waiting for data... (interrupt to abort) So either a bug crept into 2.5.3 (seems unlikely) or we'll need a more sophisticated reproducer. |
| Comment by Phil Schwan (Inactive) [ 02/Dec/15 ] |
|
Hóla, Roberto! That's... interesting. It's 100% reproducible on my machine with a particular workload, but we've poked around a few others and some workstations behave differently – despite being (by all appearances) the same (same kernel, same version of Lustre, same mounts, same config). So it is indeed not quite so simple as it first appeared. Fair enough. |