[LU-2380] Hang and eviction scenario when multiple tasks/nodes do ftruncate() on the same file in parallel Created: 23/Nov/12 Updated: 17/Mar/22 Resolved: 17/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Diego Moreno (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5649 |
| Description |
|
We have a scenario where some Lustre clients hang and are consequently evicted, causing frequent job failures. The issue arises when ftruncate() is called on the same file by several tasks/nodes in parallel. Even though this is clearly not good practice, the pattern is heavily used by a middle-tier layer at the customer site, and I think Lustre should be more resilient to this kind of issue.

This bug has been reproduced on Lustre 2.1.3 with 36 nodes running 576 MPI tasks. In that configuration the test failed about 50% of the time, but the problem could probably also be reproduced with fewer nodes/tasks.

This is the reproducer:

% cat truncate.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int npes, myrank, i;
    int *buf;
    struct stat *bufstat;
    int fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf(" My rank is %d of %d \n", myrank, npes);

    buf = (int *) malloc(sizeof(int) * 10000 * npes);
    for (i = 0; i < 10000; i++)
        buf[i] = myrank * i;
    bufstat = (struct stat *) malloc(sizeof(struct stat));

    /* Every rank repeatedly stats, truncates and rewrites the same shared file. */
    for (i = 0; i < 10; i++) {
        if ((fd = open("/ptmp/user/mytrunk", O_CREAT | O_RDWR, 0666)) < 0) {
            perror("open()");
            exit(-1);
        }
        if (fstat(fd, bufstat) < 0) {
            perror("fstat()");
            exit(-1);
        }
        if (lseek(fd, 0, SEEK_SET) < 0) {
            perror("lseek(SEEK_SET)");
            exit(-1);
        }
        if (lseek(fd, 0, SEEK_CUR) < 0) {
            perror("lseek(SEEK_CUR)");
            exit(-1);
        }
        if (ftruncate(fd, 0) < 0) {
            perror("ftruncate()");
            exit(-1);
        }
        if (write(fd, buf, 10000 * npes) < 0) {
            perror("write()");
            exit(-1);
        }
        close(fd);
    }

    MPI_Finalize();
    return 0;
}

And this is what happened:

% mpicc truncate.c -o truncate
% srun -n 576 -N 36 --exclusive --resv-ports ./truncate
srun: job 269744 queued and waiting for resources
srun: job 269744 has been allocated resources
<<<<<< Job/run stuck for minutes !!... >>>>>>>
fstat(): Interrupted system call
fstat(): Interrupted system call
fstat(): Interrupted system call
fstat(): Interrupted system call
ftruncate(): Interrupted system call
My rank is 10 of 16
My rank is 11 of 16
My rank is 13 of 16
My rank is 14 of 16
My rank is 15 of 16
fstat(): Interrupted system call
My rank is 8 of 16
ftruncate(): Cannot send after transport endpoint shutdown
My rank is 9 of 16
srun: error: lascaux2825: tasks 8-11,13-15: Exited with exit code 255
srun: Terminating job step 269744.0
slurmd[lascaux2825]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
slurmd[lascaux2824]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lascaux2825]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
slurmd[lascaux2824]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***

One of the client nodes reported the following messages:

1336119249 2012 May 4 10:14:09 cartan1174 kern warning kernel Lustre: ptmp-OST001a-osc-ffff880464a98000: Connection to service ptmp-OST001a via nid JO.BOO.ZO.LF@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 45831:0:(ldlm_request.c:1152:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 45831:0:(ldlm_request.c:1714:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 167-0: This client was evicted by ptmp-OST001a; in progress operations using this service will fail.
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 16559:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff880375166000 x1398228708669187/t0(0) o101->ptmp-OST001a_UUID@JO.BOO.ZO.LF@o2ib3:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 16559:0:(client.c:1057:ptlrpc_import_delay_req()) Skipped 31 previous similar messages
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 16559:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff880d6719c800 x1398228708669192/t0(0) o101->ptmp-OST001a_UUID@JO.BOO.ZO.LF@o2ib3:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
1336119249 2012 May 4 10:14:09 cartan1174 kern info kernel Lustre: ptmp-OST001a-osc-ffff880464a98000: Connection restored to service ptmp-OST001a using nid JO.BOO.ZO.LF@o2ib3.
=============================================================

The OSS serving one of the OSTs hosting the file also reported the following messages/errors on its side:

====================================================
1336119248 2012 May 4 10:14:08 cartan204 kern err kernel LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at JO.BOO.ZL.BFP@o2ib3 ns: filter-ptmp-OST001a_UUID lock: ffff880d1d021b40/0x2f7b35259ae3e490 lrc: 3/0,0 mode: PW/PW res: 20974735/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80010020 remote: 0x307f20f32f0be585 expref: 700 pid: 28661 timeout 4562700589
1336119249 2012 May 4 10:14:09 cartan204 kern err kernel LustreError: 4578:0:(ldlm_lockd.c:1970:ldlm_cancel_handler()) operation 103 from 12345-JO.BOO.ZL.BFP@o2ib3 with bad export cookie 3421386777158859540
1336119249 2012 May 4 10:14:09 cartan204 kern warning kernel Lustre: 28674:0:(ldlm_lib.c:866:target_handle_connect()) ptmp-OST001a: connection from 700d60eb-8a82-cf9e-f029-a5868449ec78@JO.BOO.ZL.BFP@o2ib3 t0 exp (null) cur 1336119249 last 0
1336119249 2012 May 4 10:14:09 cartan204 kern warning kernel Lustre: 28674:0:(filter.c:2813:filter_connect()) ptmp-OST001a: Received MDS connection (0x2f7b35259ae430da); group 0
1336119249 2012 May 4 10:14:09 cartan204 kern warning kernel Lustre: 28674:0:(filter.c:2813:filter_connect()) Skipped 5 previous similar messages
==================================================== |
| Comments |
| Comment by Peter Jones [ 23/Nov/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Andreas Dilger [ 23/Nov/12 ] |
|
We had a similar problem in the past with every process truncating the same file. However, in that case I don't think it was also doing the stat and write of the file.

From an application point of view, it doesn't make any sense for the application to be truncating and then writing the same offset of the same file from all ranks of the program. This will cause all of the data to be written over top of itself many times, probably resulting in garbage data, or at best the data from some random rank being the last one written. The major problem is that the write(), along with the stat(), forces the lock to be held on the client, but then the truncate from a different node forces that lock to be revoked. In essence this is a giant engine for causing locking conflicts between all of the clients, and there is no simple way for the operations from the clients to be parallelized at all, since they are always overlapping.

It is worthwhile to check whether the server is detecting this situation and forcing the clients to use lockless writes and lockless truncates. That would at least minimize or eliminate lock holding on the clients. It is also worth checking that glimpse locks are not granted to the client in the case of high contention, where the server has entered lockless operation mode, even though no client is holding the lock.

On a programming note, open(O_TRUNC) would be more efficient, and the lseek() calls are both redundant and useless, since newly opened files are already at offset 0. Or is it a programming bug in the reproducer or the library, and these offsets should in fact be related to the process rank? |
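A minimal sketch of the open(O_TRUNC) variant suggested in the comment above (illustrative only, not part of the original comment; it reuses the buf and npes variables and the file path from the reproducer in the description, and keeps the reproducer's error handling style):

/* Replace the open()/fstat()/lseek()/ftruncate() sequence in the
 * reproducer's loop body with a single open(O_TRUNC); a newly opened
 * descriptor is already at offset 0, so no lseek() is needed. */
int fd = open("/ptmp/user/mytrunk", O_CREAT | O_RDWR | O_TRUNC, 0666);
if (fd < 0) {
    perror("open()");
    exit(-1);
}
if (write(fd, buf, 10000 * npes) < 0) {
    perror("write()");
    exit(-1);
}
close(fd);

This avoids the separate ftruncate() RPC round trip per iteration, although, as noted above, the write() from each rank still conflicts with every other rank's lock.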
| Comment by Jinshan Xiong (Inactive) [ 23/Nov/12 ] |
|
Did you get a chance to check whether there is any deadlock on the client nodes? |
| Comment by Diego Moreno (Inactive) [ 06/Dec/12 ] |
|
To help make progress on this ticket, do you need any more information from us? |
| Comment by Diego Moreno (Inactive) [ 17/Jan/13 ] |
|
Coming back to this issue: we checked and there isn't any deadlock on the client nodes. The issue is seen quite often, even though the access pattern may seem illogical. Some users run this kind of pattern when they enable debugging logs in their programs. The problem is particularly annoying when the program is written in Fortran, whose truncation behavior is quite different from C. So we have this scenario: users run this strange truncate pattern on 128 nodes and it provokes the eviction of about 500 Lustre clients. Even if users are not supposed to do that, I think we have a potential DoS issue if we don't fix it. What do you think? |
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
Diego, regarding the reproducer test - is this just the C equivalent of what is happening in Fortran?

I think user education would also help here: debug logs written this way are useless anyway. Having a different log file per MPI rank, or writing to a different offset within the debug file, are both useful options.

The open() + stat() + truncate() + write() combination is a difficult one to improve. open(O_TRUNC) + write() would be a bit better, and open() + truncate() + write() would also be better, but in both cases the write() will still cause the client to get a DLM lock, which has to be revoked immediately by the next client. There is no way to have parallel locking or IO, since the IO range always conflicts with every other client.

In order to handle this on the OST, we would need to build some "memory" into the OST objects so that the OST can detect high contention on the DLM locking and simply not grant the DLM lock to the client for the stat() and write() calls. This could already be done by only returning a glimpse for stat(), but then there is still a lock conflict for the writes. Doing server-side locking for writes might help, as long as the writes are relatively small in size. |
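A minimal sketch of the per-rank debug log idea mentioned above (illustrative only; the directory and file name pattern are placeholders, not from the ticket):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0;
    char path[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One debug log per rank, so no two ranks ever truncate or write
     * the same Lustre object and there is no DLM lock ping-pong between
     * clients. "/ptmp/user" and the name pattern are placeholders. */
    snprintf(path, sizeof(path), "/ptmp/user/debug.%d.log", rank);

    FILE *log = fopen(path, "w");   /* truncates only this rank's own file */
    if (log != NULL) {
        fprintf(log, "rank %d: debug output goes here\n", rank);
        fclose(log);
    }

    MPI_Finalize();
    return 0;
}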
| Comment by Patrick Farrell [ 17/Mar/22 ] |
|
Lockless truncate was removed in |