Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.1.3
- Severity: 3
Description
We have a scenario where some Lustre clients hang and are consequently evicted, causing frequent job failures.
This issue arises when ftruncate() is called on the same file by several tasks/nodes in parallel. Although this is clearly not good practice, the pattern is common in a middle-tier layer that is heavily used at the customer site. As a consequence, the customer suffers frequent job failures.
I think Lustre should be more resilient to this kind of issue.
This bug has been reproduced on Lustre 2.1.3 with 36 nodes running 576 MPI tasks. In this configuration the test failed about 50% of the time, but the problem could probably also be reproduced with fewer nodes/tasks.
This is the reproducer:
% cat truncate.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int npes, myrank, l, i, j, k, n, ierr;
    int *len;
    char *rank, *name;
    int *buf;
    struct stat *bufstat;
    int fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf(" My rank is %d of %d \n", myrank, npes);

    buf = (int *) malloc(sizeof(int) * 10000 * npes);
    for (i = 0; i < 10000; i++)
        buf[i] = myrank * i;
    bufstat = (struct stat *) malloc(sizeof(struct stat));

    /* Each rank repeatedly opens the shared file, truncates it to zero,
     * and rewrites it, so many clients issue ftruncate() concurrently. */
    for (i = 0; i < 10; i++) {
        if ((fd = open("/ptmp/user/mytrunk", O_CREAT|O_RDWR, 0666)) < 0) {
            perror("open()"); exit(-1);
        }
        if (fstat(fd, bufstat) < 0) {
            perror("fstat()"); exit(-1);
        }
        if (lseek(fd, 0, SEEK_SET) < 0) {
            perror("lseek(SEEK_SET)"); exit(-1);
        }
        if (lseek(fd, 0, SEEK_CUR) < 0) {
            perror("lseek(SEEK_CUR)"); exit(-1);
        }
        if ((ierr = ftruncate(fd, 0)) < 0) {
            perror("ftruncate()"); exit(-1);
        }
        if (write(fd, buf, 10000 * npes) < 0) {
            perror("write()"); exit(-1);
        }
        close(fd);
    }

    MPI_Finalize();
    return 0;
}
And this is what happened:
% mpicc truncate.c -o truncate
% srun -n 576 -N 36 --exclusive --resv-ports ./truncate
srun: job 269744 queued and waiting for resources
srun: job 269744 has been allocated resources
<<<<<< Job/run stuck for minutes !!... >>>>>>>
fstat(): Interrupted system call
fstat(): Interrupted system call
fstat(): Interrupted system call
fstat(): Interrupted system call
ftruncate(): Interrupted system call
 My rank is 10 of 16
 My rank is 11 of 16
 My rank is 13 of 16
 My rank is 14 of 16
 My rank is 15 of 16
fstat(): Interrupted system call
 My rank is 8 of 16
ftruncate(): Cannot send after transport endpoint shutdown
 My rank is 9 of 16
srun: error: lascaux2825: tasks 8-11,13-15: Exited with exit code 255
srun: Terminating job step 269744.0
slurmd[lascaux2825]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
slurmd[lascaux2824]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lascaux2825]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
slurmd[lascaux2824]: *** STEP 269744.0 KILLED AT 2012-06-12T16:09:25 WITH SIGNAL 9 ***
One of the two nodes/clients reported the following messages:
=============================================================
1336119249 2012 May 4 10:14:09 cartan1174 kern warning kernel Lustre: ptmp-OST001a-osc-ffff880464a98000: Connection to service ptmp-OST001a via nid JO.BOO.ZO.LF@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 45831:0:(ldlm_request.c:1152:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 45831:0:(ldlm_request.c:1714:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 167-0: This client was evicted by ptmp-OST001a; in progress operations using this service will fail.
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 16559:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff880375166000 x1398228708669187/t0(0) o101->ptmp-OST001a_UUID@JO.BOO.ZO.LF@o2ib3:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 16559:0:(client.c:1057:ptlrpc_import_delay_req()) Skipped 31 previous similar messages
1336119249 2012 May 4 10:14:09 cartan1174 kern err kernel LustreError: 16559:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff880d6719c800 x1398228708669192/t0(0) o101->ptmp-OST001a_UUID@JO.BOO.ZO.LF@o2ib3:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
1336119249 2012 May 4 10:14:09 cartan1174 kern info kernel Lustre: ptmp-OST001a-osc-ffff880464a98000: Connection restored to service ptmp-OST001a using nid JO.BOO.ZO.LF@o2ib3.
=============================================================
The OSS handling one of the OSTs hosting the file also reported the following messages/errors on its side:
====================================================
1336119248 2012 May 4 10:14:08 cartan204 kern err kernel LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at JO.BOO.ZL.BFP@o2ib3 ns: filter-ptmp-OST001a_UUID lock: ffff880d1d021b40/0x2f7b35259ae3e490 lrc: 3/0,0 mode: PW/PW res: 20974735/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80010020 remote: 0x307f20f32f0be585 expref: 700 pid: 28661 timeout 4562700589
1336119249 2012 May 4 10:14:09 cartan204 kern err kernel LustreError: 4578:0:(ldlm_lockd.c:1970:ldlm_cancel_handler()) operation 103 from 12345-JO.BOO.ZL.BFP@o2ib3 with bad export cookie 3421386777158859540
1336119249 2012 May 4 10:14:09 cartan204 kern warning kernel Lustre: 28674:0:(ldlm_lib.c:866:target_handle_connect()) ptmp-OST001a: connection from 700d60eb-8a82-cf9e-f029-a5868449ec78@JO.BOO.ZL.BFP@o2ib3 t0 exp (null) cur 1336119249 last 0
1336119249 2012 May 4 10:14:09 cartan204 kern warning kernel Lustre: 28674:0:(filter.c:2813:filter_connect()) ptmp-OST001a: Received MDS connection (0x2f7b35259ae430da); group 0
1336119249 2012 May 4 10:14:09 cartan204 kern warning kernel Lustre: 28674:0:(filter.c:2813:filter_connect()) Skipped 5 previous similar messages
====================================================
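For reference, below is a minimal sketch of how the concurrent truncates could be avoided at the application level while the underlying Lustre issue remains open: only rank 0 truncates the shared file, and the other ranks wait at a barrier before writing. This is only an illustration of a possible workaround, not part of the reproducer above and not a fix for the eviction itself; it reuses the same file path and buffer sizes as the reproducer.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int npes, myrank, i, fd;
    int *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    buf = (int *) malloc(sizeof(int) * 10000 * npes);
    for (i = 0; i < 10000; i++)
        buf[i] = myrank * i;

    for (i = 0; i < 10; i++) {
        /* Only rank 0 truncates the shared file; the barrier afterwards
         * ensures no other rank is writing while the truncate is issued,
         * so the OST does not see concurrent truncate requests. */
        if (myrank == 0) {
            if ((fd = open("/ptmp/user/mytrunk", O_CREAT | O_RDWR, 0666)) < 0) {
                perror("open()");
                MPI_Abort(MPI_COMM_WORLD, -1);
            }
            if (ftruncate(fd, 0) < 0) {
                perror("ftruncate()");
                MPI_Abort(MPI_COMM_WORLD, -1);
            }
            close(fd);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        /* All ranks then open and write as in the reproducer. */
        if ((fd = open("/ptmp/user/mytrunk", O_CREAT | O_RDWR, 0666)) < 0) {
            perror("open()");
            MPI_Abort(MPI_COMM_WORLD, -1);
        }
        if (write(fd, buf, 10000 * npes) < 0) {
            perror("write()");
            MPI_Abort(MPI_COMM_WORLD, -1);
        }
        close(fd);
        /* Make sure all writes of this iteration finish before the next truncate. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}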
Attachments
Issue Links
- is related to: LU-4881 Lustre client: eviction on open/truncate (Resolved)