[LU-8441] Text file busy error after overwriting file Created: 26/Jul/16 Updated: 22/Nov/17 Resolved: 22/Nov/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Teresa Kamakea | Assignee: | Jian Yu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl, llnlfixready | ||
| Environment: |
lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Here's our reproducer: sh -c 'cd /p/lscratchd/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)'
This looks similiar to |
| Comments |
| Comment by Andreas Dilger [ 05/Aug/16 ] |
|
I tried running the reproducer (slightly enhanced to show what it is doing) several times in a row and didn't see any problems (on 2.5.3.90 and 2.8.0 clients, 2.5.41.1 server), but I'm not sure what it is supposed to show: [mythtv@twoshoes ~]$ sh -c 'cd /myth/tmp/mythtv && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f -li; echo > $f; ls -li $f; rm -f $f)' 0000000 000177 0000001 total 1 144116715221943228 -rwxr-xr-x 1 mythtv mythtv 114024 2016-08-05 11:41 toss-3321 144116715221943228 -rwxr-xr-x 1 mythtv mythtv 1 2016-08-05 11:41 toss-3321 [mythtv@twoshoes ~]$ sh -c 'cd /myth/tmp/mythtv && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f -li; echo > $f; ls -li $f; rm -f $f)' 0000000 000177 0000001 total 1 144116715221943230 -rwxr-xr-x 1 mythtv mythtv 114024 2016-08-05 11:41 toss-3321 144116715221943230 -rwxr-xr-x 1 mythtv mythtv 1 2016-08-05 11:41 toss-3321 Should this be reporting "text flie busy" at the "echo > $f" stage? |
| Comment by Oleg Drokin [ 05/Aug/16 ] |
|
Sounds like |
| Comment by Ned Bass [ 05/Aug/16 ] |
|
Andreas, yes the echo is what fails with ETXTBSY. $ bass6@cab690 ~ /dev/pts/49 Fri Aug 05 14:10:59 >
sh -x -c 'cd /p/lscratchd/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)'
+ cd /p/lscratchd/bass6
+ f=toss-3321
+ rm -f toss-3321
+ cp /bin/ls toss-3321
+ od -N1 toss-3321
0000000 000177
0000001
+ ./toss-3321
bin tmp toss-3321 y z
+ echo
sh: toss-3321: Text file busy
+ rm -f toss-3321
|
| Comment by Peter Jones [ 08/Aug/16 ] |
|
Ned So it seems that this issue will not be present when your clients move to 2.8.x. Is this issue serious enough that you would consider upgrading your 2.5.x clients to a newer version or will you just wait to eradicate this when you upgrade to 2.8.x? Peter |
| Comment by Ned Bass [ 08/Aug/16 ] |
|
Peter, we do not plan to upgrade our clients to 2.8 until sometime in 2017. This bug impacts the workflow of at least one of our users. While not serious, bugs like this force our users to adopt workarounds that tend to propagate and long outlive the original problem. So it would be nice to not have to wait for 2.8 for a fix, and I would rank it as low priority. |
| Comment by Peter Jones [ 08/Aug/16 ] |
|
Sure Ned. Jian Can you please port the fix for Thanks Peter |
| Comment by Ned Bass [ 08/Aug/16 ] |
To clarify, that comment applies only to our existing client clusters. We will be bringing some new client and server clusters online running Lustre 2.8 over the next few weeks and months. |
| Comment by Ned Bass [ 08/Jun/17 ] |
|
Peter, this issue is still affecting our Lustre 2.8 clients. |
| Comment by Jian Yu [ 09/Jun/17 ] |
|
Hi Ned, I set up a test cluster with the latest 2.8 FE build on both clients and servers, tried to run the reproducer but didn't see the "Text file busy" issue: [bass6@onyx-21vm5 ~]$ lctl get_param -n version lustre: 2.8.0.51 kernel: patchless_client build: 2.8.0.51-g9bda4c1-CHANGED-2.6.32-642.15.1.el6.x86_64 [bass6@onyx-21vm5 ~]$ sh -x -c 'cd /mnt/lustre/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)' + cd /mnt/lustre/bass6 + f=toss-3321 + rm -f toss-3321 + cp /bin/ls toss-3321 + od -N1 toss-3321 0000000 000177 0000001 + ./toss-3321 toss-3321 + echo + rm -f toss-3321 |
| Comment by Jian Yu [ 15/Aug/17 ] |
|
Hi Ned, Could you please reproduce the failure and gather Lustre debug logs as follows? # lctl set_param debug=-1 debug_mb=1024 # lctl clear 2. On Client node, run the following command to reproduce the failure: # sh -x -c 'cd /p/lscratchd/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)' 3. On Client and MDS nodes, run the following command to gather debug logs: # lctl dk > /tmp/debug-$(hostname -s)-$(date +%s) 4. Upload the logs to this ticket for investigation. Thank you. |
| Comment by Teresa Kamakea [ 17/Aug/17 ] |
|
Jian, I tar'd up the log files and attached them. Let me know if you need anything else. Thanks |
| Comment by Jian Yu [ 18/Aug/17 ] |
|
Thank you for the logs, Teresa. Let me look into them. |
| Comment by Jian Yu [ 18/Aug/17 ] |
|
Hi Teresa, "Text file busy" is error code "-ETXTBSY" (-26). In Client debug log, the error code was returned as follows: 00000080:00000001:0.0:1502922949.651746:0:179938:0:(namei.c:506:ll_lookup_it()) Process entered 00000080:00200000:0.0:1502922949.651746:0:179938:0:(namei.c:513:ll_lookup_it()) VFS Op:name=toss-3321,dir=144115205255725787/33554436(ffff8802c5555638),intent=open|creat ------8<------ 00000002:00100000:0.0:1502922949.652793:0:179938:0:(mdc_locks.c:643:mdc_finish_enqueue()) @@@ op: 3 disposition: b, status: -26 req@ffff88031c0cf800 x1570675218372588/t0(0) o101->lsd-MDT0000-mdc-ffff88041e231000@172.19.2.102@o2ib100:12/10 lens 2464/600 e 0 to 0 dl 1502923055 ref 1 fl Complete:R/0/0 rc 301/301 The error status -26 was returned from MDS, while it seems the MDS debug log is inadequate because there is no -26 error code in the log. By looking into Lustre codes, I found "-ETXTBSY" was returned from mdt_write_get() and mdt_write_deny() in lustre/mdt/mdt_open.c. And both of the two functions were called from mdt_mfd_open(). It seems there is a race between the execution and write of the file toss-3321. Could you please try to add a "sync" between "./$f;" and "echo > $f;" in the command line to see if the error still occurs? If it's still reproducible, could you please gather the Lustre debug log on MDS again? Right before running the "sh -x -c ..." command, please run "lctl dk >/dev/null" on both Client and MDS to clear the debug buffer so that we can get the exact debug logs. Thank you. |
| Comment by Jian Yu [ 19/Aug/17 ] |
|
Hi Teresa, The similar issue was reported in |
| Comment by Teresa Kamakea [ 24/Aug/17 ] |
|
Thank you Jian. I'll check in with Ned on trying the back-ported patch. |
| Comment by Jian Yu [ 24/Aug/17 ] |
|
You're welcome, Teresa. |
| Comment by Ned Bass [ 19/Sep/17 ] |
|
Hi Jian Yu, while we haven't tested your backported patch, |
| Comment by Oleg Drokin [ 22/Sep/17 ] |
|
Hm, it's weird that the ticket is marked as resolved while apparently the problem is still present? Anyway I was looking into the client log provided to see if potentially |
| Comment by Peter Jones [ 22/Sep/17 ] |
|
Oleg We host the LLNL tree now - fs/lustre-release-fe-llnl.git Peter |
| Comment by Oleg Drokin [ 22/Sep/17 ] |
|
so I pulled that repo and tested 2.8.0_12.chaos tag and llnl/2.8.0-llnl branch (both appear to point at the same point) and still things don't match. In particular in file.c: 00000080:00000001:20.0:1502922949.651650:0:179942:0:(file.c:331:ll_md_close()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:20.0:1502922949.651651:0:179942:0:(file.c:392:ll_file_release()) Process leaving (rc=0 : 0 : 0) But: #ifdef CONFIG_FS_POSIX_ACL
if (sbi->ll_flags & LL_SBI_RMT_CLIENT &&
inode == inode->i_sb->s_root->d_inode) {
struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
LASSERT(fd != NULL);
331>>> if (unlikely(fd->fd_flags & LL_FILE_RMTACL)) {
fd->fd_flags &= ~LL_FILE_RMTACL;
rct_del(&sbi->ll_rct, current_pid());
et_search_free(&sbi->ll_et, current_pid());
}
}
#endif
}
392>>>>
op_data = ll_prep_md_op_data(NULL, parent->d_inode, de->d_inode,
name, len, 0, LUSTRE_OPC_ANY, NULL);
if (IS_ERR(op_data))
RETURN(PTR_ERR(op_data));
|
| Comment by Olaf Faaland [ 11/Oct/17 ] |
|
Hi Oleg,
What two things you are comparing - what doesn't match what? |
| Comment by Teresa Kamakea [ 12/Oct/17 ] |
|
The version on the client when the debug log was generated: Lustre: Lustre: Build Version: -13chaos-CHANGED-2.6.32-696.3.1.1chaos.ch5.6.x86_64 The version on the MDS when the debug log was generated: Lustre: Lustre: Build Version: -13chaos-CHANGED-2.6.32-696.3.1.1chaos.ch5.6.x86_64 |
| Comment by Olaf Faaland [ 12/Oct/17 ] |
|
Oleg, Sorry for the confusion. Those look like Lustre 2.5 builds (I believe all our 2.8 builds are based on weak modules). I'll huddle with Teresa and get back to you.
|
| Comment by Oleg Drokin [ 16/Oct/17 ] |
|
Olaf: the code does not match the logs. I think the whole idea or reopening this ticket was due to 2.8 still exhibiting the ETXTBUSY issue? If that is so, please provide 2.8 logs with the problem. |
| Comment by Olaf Faaland [ 16/Oct/17 ] |
|
Oleg, Teresa will test with 2.8 and post here. |
| Comment by Teresa Kamakea [ 14/Nov/17 ] |
|
I have attached the debug logs. There are 3 different scenarios: lustre 2.8 client/lustre2.8 server, lustre 2.5 client/lustre2.5 server, and lustre2.8 client/lustre2.5 server. The text file busy error shows up in the 2.5/2.5 scenario but not the others. The specific version information is included in the log files. |
| Comment by Olaf Faaland [ 15/Nov/17 ] |
|
Ned was mistaken when he stated we still see the problem on Lustre 2.8. The cluster he tested on, and that the end-user reported the issue on, had been accidentally switched to the Lustre 2.5 client. We see the problem only in the 2.5 client. |
| Comment by Peter Jones [ 20/Nov/17 ] |
|
Olaf So am I correct in thinking that this means that the original theory that this is a duplicate of Peter |
| Comment by Olaf Faaland [ 20/Nov/17 ] |
No, the patch for Without that patch, a request is sent to the MDT which then causes side-affects on other clients. It has a similar change on the MDT, for the case where an MDT receives an open request from a client which has mounted read-only. In our case, no clients are mounted read-only, so neither of those changes would have any effect.
Correct, when all our systems are at 2.8.x we do not expect to see this. Since the issue does not occur in 2.8.x, it may well be that there is an existing patch that could be backported, but |
| Comment by Peter Jones [ 20/Nov/17 ] |
|
Ah yes. I did not read far enough back in the comments. So, yujian I think that the situation is now that this is believed to be a duplicate of |
| Comment by Oleg Drokin [ 21/Nov/17 ] |
|
It looks like |
| Comment by Jian Yu [ 21/Nov/17 ] |
|
Hi Olaf, |
| Comment by Olaf Faaland [ 21/Nov/17 ] |
|
Hi Jian and Oleg, Thank you for investigating. Given the complexity and risk we can close this notfix and we will do the same in our local ticket. For my education, can you tell me where the EBUSY is coming from in our broken case, and describe of some of the factors that lead to this? It need not be a complete and perfect description, just some hints that help understand the relevant code paths. |
| Comment by Jian Yu [ 21/Nov/17 ] |
|
Hi Olaf, Hi Oleg, |
| Comment by Oleg Drokin [ 22/Nov/17 ] |
|
This is mostly due to lingering file opens for write that got cached on the client. So when the exec comes it sees the file is opened for write and bails out (server side). We tried to just obtain a necessary ldlm lock before opening, but that proved to be very expensive. The cached open on the other hand is the real problem here, originally mostly aimed at nfs opened files, it managed to be enabled for other types of opens at times leading to such problems. |
| Comment by Olaf Faaland [ 22/Nov/17 ] |
|
Thanks, Jian and Oleg. That's all LLNL needs for this. You can close notfix (or whatever your normal process is). |
| Comment by Jian Yu [ 22/Nov/17 ] |
|
Thank you, Olaf. I'm closing this ticket as a duplicate of |