[LU-563] Client evictions seen when running mdtest and repeated failover Created: 02/Aug/11 Updated: 13/Dec/11 Resolved: 13/Dec/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CHAOS 5, RHEL 6, Lustre 2.0.66-2chaos |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6571 |
| Description |
|
We have seen client evictions while running mdtest on 55 nodes, with 8 tasks per node. The command being run is the following:
srun -v -O -l -D /p/lcrater2/surya1/tmp/ -N 55 -n $((55*8)) -- \
mdtest /p/lcrater2/surya1/tmp -I 10000 >MDTEST.stdout 2>MDTEST.stderr
Out of the 3 times this has been run with OSS failover/failback happening in the background, we have seen these client evictions. Also, for what it's worth, these issues have not yet appeared when failover is not being exercised. |
| Comments |
| Comment by Prakash Surya (Inactive) [ 02/Aug/11 ] |
|
This archive contains some of the results obtained by running the above command. Each directory should contain 4 files.
Also, the root directory should contain the fail.sh script. This script was used to drive the OSS failover/failback during the run. |
| Comment by Christopher Morrone [ 02/Aug/11 ] |
|
It would be helpful to enable the rpctrace debug level and enable /proc/sys/lustre/dump_on_eviction. Then reproduce the test and collect and examine the Lustre logs after the eviction occurs. With a little luck (and perhaps by enlarging the log buffer), we'll be able to see when the MDS sends the lock callback and trace what happens to it. Does the RPC make it to the client? Does the client try to send the reply?
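For reference, a minimal sketch of that setup on a client (the flag and path names are the stock Lustre ones; this would need to be run on every client, e.g. via pdsh, and adjusted to the local configuration):
lctl set_param debug=+rpctrace                 # add rpctrace to the existing debug mask
echo 1 > /proc/sys/lustre/dump_on_eviction     # dump the debug log automatically on eviction
# after reproducing the problem, grab whatever is still in the in-memory buffer:
lctl dk /tmp/lustre-debug.$(hostname).log
The automatic dump files should land under /tmp (lustre-log.* by default) on each evicted client. |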
| Comment by Peter Jones [ 02/Aug/11 ] |
|
Oleg will look into this one |
| Comment by Prakash Surya (Inactive) [ 03/Aug/11 ] |
|
I ran the test again twice yesterday and didn't see the eviction issues; the command was able to complete successfully. Also, I enabled the rpctrace and dlmtrace debug flags on the clients and am running it again. Hopefully the issue arises with this extra debug info enabled. Will the output be directed to the console, or will I need to run lctl dk to get at it? |
| Comment by Prakash Surya (Inactive) [ 05/Aug/11 ] |
|
Results from another run of this test. The same MDTEST.* and lustre.log-$DATE.gz files from this run are included. Additionally, a log dump from one of the evicted clients (zwicky136) with rpctrace and dlmtrace debug flags enabled is included. |
| Comment by Zhenyu Xu [ 10/Aug/11 ] |
|
Prakash, from MDTEST.stderr, I got:
srun: Node zwicky115, 8 tasks started
srun: Node zwicky120, 8 tasks started
099: srun: mvapich: 2011-08-03T21:58:37: ABORT from MPI rank 99 [on zwicky114]
[99] [MPI Abort by user] Aborting Program!
099: [99:zwicky114] Abort: MPI_Abort() code: 1, rank 99, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
097: [97] [MPI Abort by user] Aborting Program!
097: [97:zwicky114] Abort: MPI_Abort() code: 1, rank 97, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
100: [100] [MPI Abort by user] Aborting Program!
100: [100:zwicky114] Abort: MPI_Abort() code: 1, rank 100, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
048: slurmd[zwicky108]: *** STEP 1000293.0 KILLED AT 2011-08-03T21:58:37 WITH SIGNAL 9 ***
416: slurmd[zwicky160]: *** STEP 1000293.0 KILLED AT 2011-08-03T21:58:37 WITH SIGNAL 9 ***
...
240: slurmd[zwicky136]: *** STEP 1000293.0 KILLED AT 2011-08-03T21:58:37 WITH SIGNAL 9 ***
What made zwicky114 abort the test (it said "MPI Abort by user")? And what made the other clients' tasks get killed with SIGNAL 9? |
| Comment by Prakash Surya (Inactive) [ 10/Aug/11 ] |
|
I'm not entirely certain what caused the abort, but I imagine it is related to the eviction. My guess would be that slurm killed the job. I know slurm does that at times, although I don't know how it decides when to kill a job. Is Chris watching this issue? He might be able to give a better answer; my slurm and mdtest experience is still pretty limited. |
| Comment by Peter Jones [ 10/Aug/11 ] |
|
Chris was not watching but I just added him so hopefully he will now chip in and comment |
| Comment by Peter Jones [ 10/Aug/11 ] |
|
Bobijam, could you please try and reproduce this issue? Thanks, Peter |
| Comment by Zhenyu Xu [ 11/Aug/11 ] |
|
Would you mind also uploading zwicky114's debug log? zwicky136's client debug log shows:
00000100:00080000:1.0:1312433918.391712:0:3861:0:(import.c:929:ptlrpc_connect_interpret()) @@@ lc2-MDT0000-mdc-ffff880213030400: evicting (reconnect/recover flags not set: 4) req@ffff8801e6267c00 x1375691751479664/t0(0) o-1->lc2-MDT0000_UUID@10.1.1.212@o2ib9:12/10 lens 368/264 e 0 to 0 dl 1312433938 ref 1 fl Interpret:RN/ffffffff/ffffffff rc 0/-1
zwicky136 was evicted by the MDS. Your test does not reboot the MDS, does it? |
| Comment by Zhenyu Xu [ 11/Aug/11 ] |
|
Maybe we also need the MDS and/or OSS debug logs (depending on which server evicts the clients) to see what makes the server evict them. |
| Comment by Zhenyu Xu [ 11/Aug/11 ] |
|
From lustre.log-20110804:
$ grep "evicting" lustre.log-20110804
13912:Aug 3 07:18:10 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 71505a8c-ba89-4584-7b3d-bc9313e6a77e (at 10.1.1.49@o2ib9) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8807ed5bb400, cur 1312381090 expire 1312380940 last 1312380862
I guess zwicky47/48/49/50 are OSS nodes, aren't they? The MDS evicts the OSSes because they are failing over, while the nodes at 192.168.65.159/192.168.65.113/192.168.65.129/192.168.65.136/192.168.65.130 (guessing they are zwicky159/113/129/136/130) could not respond to the MDS's lock callbacks in time, and hence got evicted by the MDS. |
| Comment by Prakash Surya (Inactive) [ 11/Aug/11 ] |
|
Yes, zwicky[47-50] are the OSS nodes undergoing failover and failback, so I think the evictions involving those nodes are to be expected. The MDS, on the other hand, is not being rebooted. So the question is, why can't the clients respond back to the MDS in time? Perhaps it is an adaptive timeout issue, where the MDS is getting stuck for some reason and doesn't get back to the client in a reasonable amount of time.
I will work on reproducing this issue again and enable the dump_on_eviction flag on the MDS as well as the clients. Then we may be able to correlate the logs from the server and the client and get a better overall picture of what is happening.
Besides 'ha', are there any other flags I should set? Currently I am just using the defaults 'ioctl neterror warning error emerg config console' and adding '+rpctrace' and '+dlmtrace'. And would you still like me to dig up the console logs for zwicky114? If so, for what day? |
| Comment by Zhenyu Xu [ 11/Aug/11 ] |
|
I think '+ha' '+rpctrace' '+dlmtrace' is good enough. Yes, I just don't understand what really happened to make mdtest.stderr state that zwicky114 was aborted by the user; I'm wondering whether it relates to the eviction issue we are seeing here.
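For the record, setting that on a client would boil down to something like the following (a sketch; the same string goes to every client, and the '+' prefix keeps the existing defaults in place):
echo "+ha +rpctrace +dlmtrace" > /proc/sys/lnet/debug
cat /proc/sys/lnet/debug                 # confirm the resulting flag set
# lctl set_param debug="+ha +rpctrace +dlmtrace" should be equivalent
That keeps the default 'ioctl neterror warning error emerg config console' set and just adds the three tracing flags on top. |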
| Comment by Peter Jones [ 12/Aug/11 ] |
|
Prakash, when do you expect to be able to gather this information? Thanks, Peter |
| Comment by Prakash Surya (Inactive) [ 12/Aug/11 ] |
|
Here are the results from another run of 'mdtest'. The directory structure is a little different than before, but I imagine it shouldn't be too hard to grok. Ping me if you have any questions. I had to remove some log files to get the archive under the 10MB limit.
Also, this run of mdtest was actually able to complete successfully, although it looks like some of the clients still saw evictions. The MDS didn't output any dump log files even with 'dump_on_eviction' enabled. Is this expected? Let me know if you need me to upload any of the console logs, as these are not in the archive. |
| Comment by Prakash Surya (Inactive) [ 12/Aug/11 ] |
|
Sorry for the delay Peter. |
| Comment by Zhenyu Xu [ 14/Aug/11 ] |
|
Prakash Surya, your latest logs show that the clients' reconnections to the MGS timed out; they were evicted by the MGS and later recovered. So I doubt your original test failure is down to the eviction. Also, only clients got evicted, so it is the clients, not the MDS, that dump logs when a client is evicted. |
| Comment by Prakash Surya (Inactive) [ 15/Aug/11 ] |
|
This run is from a failed run of mdtest. It contains a dump from the MDS along with a few clients. |
| Comment by Prakash Surya (Inactive) [ 15/Aug/11 ] |
|
Hi bobijam, I just uploaded the results from another failed run which contains both client and MDS dump files. I haven't been able to look too closely at the dumps yet, but hopefully we can piece together a better picture of the problem with the information from the MDS. |
| Comment by Oleg Drokin [ 15/Aug/11 ] |
|
I looked through the logs, in particular this bit from client 151:
00010000:00010000:0.0:1313196271.425540:0:3932:0:(ldlm_request.c:1085:ldlm_cancel_pack()) ### packing ns: lc2-MDT0000-mdc-ffff88021302d000 lock: ffff8801eab96d80/0x3fe5176e75f9e59f lrc: 2/0,0 mode: --/PR res: 8589956318/75033 bits 0x3 rrc: 1 type: IBT flags: 0x30002090 remote: 0x7fc03742338bfc8b expref: -99 pid: 15372 timeout: 0
So we did EINTR right away without even attempting to send anything:
00010000:00020000:0.0:1313196271.445106:0:3932:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Naturally, since we did not send anything, the server later evicted this client. It seems that the "going to sleep for 0 seconds" is unusual; all other instances of the blocking callback were going to sleep for 23 seconds and succeeded.
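For reference, the same pattern can be checked in the other client dumps with something like this (a sketch; the dump file names and location are assumptions, and lctl df is just the decoder for the binary debug dumps):
for f in /tmp/lustre-log.*; do
    lctl df "$f" "$f.txt"                       # decode the binary dump to text
    grep -H "ldlm_cli_cancel_list" "$f.txt"     # failed cancels, e.g. rc -11
    grep -H "going to sleep for" "$f.txt"       # blocking callback wait times
done
Clients whose dumps show the cancel failing without anything being sent are the ones we would expect to show up in the server-side eviction messages. |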
| Comment by Oleg Drokin [ 15/Aug/11 ] |
|
Ok, further looking into the log revealed this: the import is not in the full state, which prevents us from sending the cancel out. The no_delay bit of logic was added by a patch from bug 11974. In particular, comment #34 from Andreas asks exactly about this scenario. Supposedly this could not happen, and I explain why in comment #35, but of course the reality is slightly different from the theory.
Basically, what happened with at least some of the clients is that they sent some requests to the MDS and those requests got stuck waiting on a lock that was not released. Then the clients timed out (deadline set at 73 seconds apparently) and are trying to reconnect. Now, the original reason why some of the requests got stuck on the MDS is still unclear at this point, but I have only reviewed two client logs. Also, the MDS log does not go far enough into the past to see it on that end (possibly the MDS getting stuck waiting for objects from the OSC with locks held?).
I guess the "MDS waiting for objects with locks held" scenario needs to be investigated first and, if true, needs to be relieved, as that was a massive source of failures in the past too. |
| Comment by Prakash Surya (Inactive) [ 15/Aug/11 ] |
|
Thanks for the analysis, Oleg. Your description definitely helps my understanding of the problem. As for the MDS log, would it help to increase the buffer size on the MDS and reproduce the problem? Currently the size of the buffer on the MDS is:
# zwicky-mds2 /root > cat /proc/sys/lnet/debug_mb
81 |
| Comment by Oleg Drokin [ 15/Aug/11 ] |
|
Well, 82MB is definitely not enough, as we can see. If you can spare the memory, the maximum size for the log is 2GB.
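If we go that route, bumping the buffer on the MDS would look something like this (a sketch; 2048 corresponds to the 2GB maximum, and the value does not survive a reboot unless it is added to the local tuning scripts):
lctl set_param debug_mb=2048
# or, equivalently:
echo 2048 > /proc/sys/lnet/debug_mb
cat /proc/sys/lnet/debug_mb              # verify the new size
The larger buffer mainly buys more history between the interesting event on the MDS and the moment the log gets dumped. |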
| Comment by Prakash Surya (Inactive) [ 16/Aug/11 ] |
|
Here's the memory info for the MDS while it is essentially idle:
# zwicky-mds2 /root > free
total used free shared buffers cached
Mem: 32881440 26163084 6718356 0 23873776 215796
-/+ buffers/cache: 2073512 30807928
Swap: 0 0 0
There's about 6.5G available, but I'm not familiar enough with the memory usage of the MDS to really know if it could spare the 2G for the log when it is under load. I will go ahead and try it and see what happens. Let me know if you think another size would be better. |
| Comment by Christopher Morrone [ 16/Aug/11 ] |
|
Prakash, the 6.5GB number by itself is not particularly useful. With Linux, the "free" value always tends to be fairly low because the kernel will always employ much of the free memory for buffers and cache. The kernel there is using almost 24GB for buffers and 215MB for cache. I don't think that 2GB for the log will be a significant problem. |
| Comment by Peter Jones [ 12/Dec/11 ] |
|
Is this still a live issue or has it become obsolete as testing has moved forward onto more current code? |
| Comment by Prakash Surya (Inactive) [ 13/Dec/11 ] |
|
On our end, we've moved on to testing more current 2.1 code. Unless Chris chimes in and says otherwise, I think it's safe to obsolete this issue. |
| Comment by Peter Jones [ 13/Dec/11 ] |
|
Thanks, Prakash. I will close out the ticket for now; you can always reopen it if Chris feels it would be worth pursuing this further under this ticket (rather than opening a new one). |