[LU-563] Client evictions seen when running mdtest and repeated failover Created: 02/Aug/11  Updated: 13/Dec/11  Resolved: 13/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Zhenyu Xu
Resolution: Incomplete Votes: 0
Labels: None
Environment:

CHAOS 5, RHEL 6, Lustre 2.0.66-2chaos


Attachments: File MDTEST7.tar.gz     File mdtest-results.tar.gz     File results.tar.gz     File results.tar.gz    
Severity: 3
Rank (Obsolete): 6571

 Description   

We have seen client evictions while running mdtest on 55 nodes, with 8 tasks
per node and continuous OSS failover taking place.

The command being run is the following:

srun -v -O -l -D /p/lcrater2/surya1/tmp/ -N 55 -n $((55*8))  -- \
        mdtest /p/lcrater2/surya1/tmp -I 10000 >MDTEST.stdout 2>MDTEST.stderr

Out of the 3 times this has been run with OSS failover/failback happening in
the background, none have completed successfully. Two of the runs suffered from
client evictions, and the third suffered from a node failure (it is unclear
whether this was due to Lustre or not).

Also, for what it's worth, these issues have not yet appeared when failover is
not running in the background. I have run the above command a couple of times
without simultaneously failing over the OSS nodes, and those runs completed
successfully.



 Comments   
Comment by Prakash Surya (Inactive) [ 02/Aug/11 ]

This archive contains some of the results obtained by running the tests
mentioned above. The results for each test are contained in their own separate
directory: the results for the first test run are in directory run1, the second
test in run2, and so on.

Each directory should contain 4 files:

  • MDTEST.cli - contains the command line which was executed to kick
    off the test.
  • MDTEST.stdout - The stdout output of the command executed.
  • MDTEST.stderr - The stderr output of the command executed.
  • lustre.log-$DATE.gz - A collection of the Lustre console log messages for
    the day the test was run.

Also, the root directory should contain the fail.sh script. This script was run
while the aforementioned tests were in progress in order to stress the
failover/failback capabilities of the filesystem.
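
For anyone reading along without the archive: purely as a hypothetical sketch, that kind of failover stress loop might look like the following (the node names and the power-control step are placeholders, not the contents of the actual fail.sh):

#!/bin/bash
# Hypothetical outline only -- see the attached fail.sh for what was actually run.
OSS_NODES="zwicky47 zwicky48 zwicky49 zwicky50"   # placeholder OSS failover pairs
while true; do
    for node in $OSS_NODES; do
        echo "$(date): forcing failover of $node"
        # <site-specific power-cycle / shutdown command for $node goes here>
        sleep 1200   # give the partner OSS time to fail over, recover, and fail back
    done
done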

Comment by Christopher Morrone [ 02/Aug/11 ]

It would be helpful to enable the rpctrace debug level and turn on /proc/sys/lustre/dump_on_eviction.

Then reproduce the problem and collect and examine the Lustre logs after the eviction occurs. With a little luck (and perhaps by enlarging the log buffer), we'll be able to see when the MDS sends the lock callback and trace what happens to it.

Does the RPC make it to the client? Does the client try to send the reply?
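
For example, something along these lines on the client nodes should do it (a sketch only; the dump files land under the default debug_path, typically /tmp/lustre-log.*):

lctl set_param debug=+rpctrace               # add rpctrace to the current debug mask
echo 1 > /proc/sys/lustre/dump_on_eviction   # dump the debug buffer automatically on eviction
# After reproducing the eviction, collect the dump files (or run "lctl dk <file>" for a manual dump).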

Comment by Peter Jones [ 02/Aug/11 ]

Oleg will look into this one

Comment by Prakash Surya (Inactive) [ 03/Aug/11 ]

I ran the test twice more yesterday and didn't see the eviction issue; both runs completed successfully.

Also, I enabled the rpctrace and dlmtrace debug flags on the clients and am running the test again. Hopefully the issue arises with this extra debug info enabled. Will the output be directed to the console, or will I need to run lctl dk to get at it?

Comment by Prakash Surya (Inactive) [ 05/Aug/11 ]

Results from another run of this test. The same MDTEST.* and lustre.log-$DATE.gz files from this run are included. Additionally, a log dump from one of the evicted clients (zwicky136) with rpctrace and dlmtrace debug flags enabled is included.

Comment by Zhenyu Xu [ 10/Aug/11 ]

Prakash,

From MDTEST.stderr, I see the following:

srun: Node zwicky115, 8 tasks started
srun: Node zwicky120, 8 tasks started
099: srun: mvapich: 2011-08-03T21:58:37: ABORT from MPI rank 99 [on zwicky114]
[99] [MPI Abort by user] Aborting Program!
099: [99:zwicky114] Abort: MPI_Abort() code: 1, rank 99, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
097: [97] [MPI Abort by user] Aborting Program!
097: [97:zwicky114] Abort: MPI_Abort() code: 1, rank 97, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
100: [100] [MPI Abort by user] Aborting Program!
100: [100:zwicky114] Abort: MPI_Abort() code: 1, rank 100, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
048: slurmd[zwicky108]: *** STEP 1000293.0 KILLED AT 2011-08-03T21:58:37 WITH SIGNAL 9 ***
416: slurmd[zwicky160]: *** STEP 1000293.0 KILLED AT 2011-08-03T21:58:37 WITH SIGNAL 9 ***
...
240: slurmd[zwicky136]: *** STEP 1000293.0 KILLED AT 2011-08-03T21:58:37 WITH SIGNAL 9 ***

What made zwicky114 abort the test (it said "MPI Abort by user")? And what caused the tasks on the other clients to be killed with SIGNAL 9?

Comment by Prakash Surya (Inactive) [ 10/Aug/11 ]

I'm not entirely certain what caused the abort, but I imagine it is related to the eviction. My guess would be that slurm killed the job. I know slurm does that at times, although I don't know how it decides when to kill a job. Is Chris watching this issue? He might be able to give a better answer. My slurm and mdtest experience is still pretty limited.

Comment by Peter Jones [ 10/Aug/11 ]

Chris was not watching, but I have just added him, so hopefully he will now chip in and comment.

Comment by Peter Jones [ 10/Aug/11 ]

Bobijam

Could you please try and reproduce this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 11/Aug/11 ]

Would you mind also uploading zwicky114's debug log?

zwicky136's client debug log shows:

00000100:00080000:1.0:1312433918.391712:0:3861:0:(import.c:929:ptlrpc_connect_interpret()) @@@ lc2-MDT0000-mdc-ffff880213030400: evicting (reconnect/recover flags not set: 4) req@ffff8801e6267c00 x1375691751479664/t0(0) o-1->lc2-MDT0000_UUID@10.1.1.212@o2ib9:12/10 lens 368/264 e 0 to 0 dl 1312433938 ref 1 fl Interpret:RN/ffffffff/ffffffff rc 0/-1
00000100:00100000:0.0:1312433918.391715:0:3860:0:(client.c:1392:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:496561de-a766-0837-467a-298d86c35661:3860:1375691751479700:10.1.1.50@o2ib9:400
00000100:00080000:1.0:1312433918.391717:0:3861:0:(import.c:932:ptlrpc_connect_interpret()) ffff8801fae1a000 lc2-MDT0000_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:1.0:1312433918.391720:0:3861:0:(import.c:1312:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lc2-MDT0000; in progress operations using this service will fail.

zwicky136 was evicted by the MDS; your test does not reboot the MDS, does it?

Comment by Zhenyu Xu [ 11/Aug/11 ]

We may also need the MDS and/or OSS debug logs (depending on which server evicted the clients) to see what caused the server to evict them.

Comment by Zhenyu Xu [ 11/Aug/11 ]

From lustre.log-20110804:

$ grep "evicting" lustre.log-20110804

13912:Aug 3 07:18:10 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 71505a8c-ba89-4584-7b3d-bc9313e6a77e (at 10.1.1.49@o2ib9) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8807ed5bb400, cur 1312381090 expire 1312380940 last 1312380862
15337:Aug 3 07:40:11 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 5d784239-e35d-b7d2-39aa-cd2ac325d166 (at 10.1.1.50@o2ib9) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88082f24a000, cur 1312382411 expire 1312382261 last 1312382184
16868:Aug 3 08:02:09 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client bef2aaa1-8742-9da3-c1bb-ca4469b02bfa (at 10.1.1.47@o2ib9) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88082f27b800, cur 1312383729 expire 1312383579 last 1312383502
18391:Aug 3 08:24:08 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client b99c22a5-3135-69ed-4913-98d6ef6890a8 (at 10.1.1.48@o2ib9) in 228 seconds. I think it's dead, and I am evicting it. exp ffff88082f244400, cur 1312385048 expire 1312384898 last 1312384820
18392:Aug 3 08:24:11 zwicky-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:353:waiting_locks_callback()) ### lock callback timer expired after 121s: evicting client at 192.168.65.159@o2ib6 ns: mdt-ffff88062e597800 lock: ffff880215a09900/0x7fc037415eb6f2fa lrc: 3/0,0 mode: PR/PR res: 8589955869/128531 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x63f5e2e4272fe344 expref: 6 pid: 11575 timeout: 4720731264
18406:Aug 3 08:24:14 zwicky-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:353:waiting_locks_callback()) ### lock callback timer expired after 124s: evicting client at 192.168.65.113@o2ib6 ns: mdt-ffff88062e597800 lock: ffff8802233066c0/0x7fc037415eb6f1fe lrc: 3/0,0 mode: PR/PR res: 8589955877/128725 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x1f6ab0dd71fd7054 expref: 6 pid: 11588 timeout: 4720734633
18410:Aug 3 08:24:15 zwicky-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:353:waiting_locks_callback()) ### lock callback timer expired after 125s: evicting client at 192.168.65.129@o2ib6 ns: mdt-ffff88062e597800 lock: ffff8807e1e5b240/0x7fc037415eb6f23d lrc: 3/0,0 mode: PR/PR res: 8589955863/130615 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x8b1f51c6458607ef expref: 6 pid: 11614 timeout: 4720735521
19895:Aug 3 08:46:08 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 1899ad40-4369-3323-c979-64ddae8268dd (at 10.1.1.49@o2ib9) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880623bae000, cur 1312386368 expire 1312386218 last 1312386141
....
72070:Aug 3 21:36:40 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 5069322b-5692-cb6b-00fe-d476448b97d0 (at 10.1.1.48@o2ib9) in 228 seconds. I think it's dead, and I am evicting it. exp ffff88022e7fc000, cur 1312432600 expire 1312432450 last 1312432372
73623:Aug 3 21:58:16 zwicky-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:353:waiting_locks_callback()) ### lock callback timer expired after 177s: evicting client at 192.168.65.136@o2ib6 ns: mdt-ffff88062e597800 lock: ffff8804f8c9db40/0x7fc0374182497777 lrc: 3/0,0 mode: PR/PR res: 8589955938/117939 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x2a0a36f5907d39a9 expref: 6 pid: 11581 timeout: 4769576674
73624:Aug 3 21:58:25 zwicky-mds2 kernel: LustreError: 11581:0:(ldlm_lockd.c:353:waiting_locks_callback()) ### lock callback timer expired after 186s: evicting client at 192.168.65.130@o2ib6 ns: mdt-ffff88062e597800 lock: ffff880358a02b40/0x7fc0374182497bb4 lrc: 3/0,0 mode: PR/PR res: 8589955935/118438 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x86f348938dde31b9 expref: 6 pid: 11591 timeout: 4769585675
73654:Aug 3 21:58:41 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 6d540fa0-c8e4-467f-8ef8-a356affe3ffb (at 10.1.1.49@o2ib9) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88061ccb0400, cur 1312433921 expire 1312433771 last 1312433694
75122:Aug 3 22:20:49 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 80f3e272-b241-545d-dd92-03f431d17b56 (at 10.1.1.50@o2ib9) in 232 seconds. I think it's dead, and I am evicting it. exp ffff8804e172b400, cur 1312435249 expire 1312435099 last 1312435017
76644:Aug 3 22:42:49 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 283fce0a-2dfd-7651-7efa-8642eb937fb0 (at 10.1.1.47@o2ib9) in 228 seconds. I think it's dead, and I am evicting it. exp ffff88042df07000, cur 1312436569 expire 1312436419 last 1312436341
78079:Aug 3 23:04:46 zwicky-mds2 kernel: Lustre: MGS: haven't heard from client 2be415db-52d4-9d69-d61a-eef3f6ace520 (at 10.1.1.48@o2ib9) in 228 seconds. I think it's dead, and I am evicting it. exp ffff88061cec0000, cur 1312437886 expire 1312437736 last 1312437658
...

I guess zwicky47/48/49/50 are OSS nodes, aren't they? The MDS evicts them because they are failing over.

Meanwhile, the nodes at 192.168.65.159/113/129/136/130 (I am guessing they are zwicky159/113/129/136/130) could not respond to the MDS's lock callbacks in time, and hence were evicted by the MDS.
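
For reference, the two classes of evictions can be picked apart in the console log with, e.g.:

grep "haven't heard from client" lustre.log-20110804      # MGS evicting the failing-over OSS nodes
grep "lock callback timer expired" lustre.log-20110804    # MDS evicting clients that missed a lock callback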

Comment by Prakash Surya (Inactive) [ 11/Aug/11 ]

Yes, zwicky[47-50] are the OSS nodes undergoing failover and failback, so I think the evictions involving those nodes are to be expected. The MDS, on the other hand, is not being rebooted. So the question is: why can't the clients respond to the MDS in time?

Perhaps it is an adaptive timeout issue, where the MDS is getting stuck for some reason and doesn't get back to the client in a reasonable amount of time.

I will work on reproducing this issue again and enable the dump_on_eviction flag on the MDS as well as the clients. Then we may be able to correlate the logs from the server and the client and get a better overall picture of what is happening.

Besides 'ha', are there any other flags I should set? Currently I am just using the defaults 'ioctl neterror warning error emerg config console' and adding '+rpctrace' and '+dlmtrace'.

And would you still like me to dig up the console logs for zwicky114? If so, for what day?

Comment by Zhenyu Xu [ 11/Aug/11 ]

I think '+ha', '+rpctrace', and '+dlmtrace' are good enough. Yes, I just don't understand what really happened to make MDTEST.stderr state that zwicky114 was aborted by the user, and I am wondering whether it relates to the eviction issue we are seeing here.
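
Something like the following on the clients would set them all in one pass (a sketch; the pdsh host range is a placeholder for your client list, and the '+' prefix adds to the existing mask rather than replacing it):

pdsh -w zwicky[100-160] 'echo "+ha +rpctrace +dlmtrace" > /proc/sys/lnet/debug'   # add the flags
pdsh -w zwicky[100-160] 'cat /proc/sys/lnet/debug'                                # verify the resulting mask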

Comment by Peter Jones [ 12/Aug/11 ]

Prakash

When do you expect to be able to gather this information?

Thanks

Peter

Comment by Prakash Surya (Inactive) [ 12/Aug/11 ]

Here are the results from another run of 'mdtest'. The directory structure is a little different than before, but I imagine it shouldn't be too hard to grok. Ping me if you have any questions. I had to remove some log files to get the archive under the 10MB limit.

Also, this run of mdtest was actually able to complete successfully, although it looks like some of the clients still saw evictions.

The MDS didn't output any dump log files even with 'dump_on_eviction' enabled. Is this expected?

Let me know if you need me to upload any of the console logs, as these are not in the archive.

Comment by Prakash Surya (Inactive) [ 12/Aug/11 ]

Sorry for the delay Peter.

Comment by Zhenyu Xu [ 14/Aug/11 ]

Prakash Surya,

Your latest logs show that the clients' reconnections to the MGS timed out; they were evicted by the MGS and later recovered. So I doubt your original test failure can be blamed on these evictions.

Also, only the clients were evicted, so it is the clients, not the MDS, that dump logs when a client gets evicted.

Comment by Prakash Surya (Inactive) [ 15/Aug/11 ]

These results are from a failed run of mdtest. They contain a dump from the MDS along with dumps from a few of the clients.

Comment by Prakash Surya (Inactive) [ 15/Aug/11 ]

Hi bobijam, I just uploaded the results from another failed run which contains both client and MDS dump files. I haven't been able to look too closely at the dumps yet, but hopefully we can piece together a better picture of the problem with the information from the MDS.

Comment by Oleg Drokin [ 15/Aug/11 ]

I looked through the logs, in particular this bit from client 151:

00010000:00010000:0.0:1313196271.425540:0:3932:0:(ldlm_request.c:1085:ldlm_cancel_pack()) ### packing ns: lc2-MDT0000-mdc-ffff88021302d000 lock: ffff8801eab96d80/0x3fe5176e75f9e59f lrc: 2/0,0 mode: --/PR res: 8589956318/75033 bits 0x3 rrc: 1 type: IBT flags: 0x30002090 remote: 0x7fc03742338bfc8b expref: -99 pid: 15372 timeout: 0
00010000:00010000:0.0:1313196271.425545:0:3932:0:(ldlm_request.c:1089:ldlm_cancel_pack()) 1 locks packed
00000100:00100000:0.0:1313196271.425548:0:3932:0:(client.c:1990:ptlrpc_set_wait()) set ffff8801fd74ccc0 going to sleep for 0 seconds
00000100:00100000:0.0:1313196271.425550:0:3932:0:(client.c:1726:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_01:a4a6aee8-112a-dc35-e195-4b88f07a26e6:3932:1376864457802283:10.1.1.212@o2ib9:103
00010000:00020000:0.0:1313196271.425552:0:3932:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway

So we did EINTR right away without even attempting to send anything.
I don't see the interrupted set CERROR message that's right at the beginning of it, though.

00010000:00020000:0.0:1313196271.445106:0:3932:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
00010000:00010000:0.0:1313196271.463005:0:3932:0:(ldlm_lockd.c:1506:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lc2-MDT0000-mdc-ffff88021302d000 lock: ffff8801eab96d80/0x3fe5176e75f9e59f lrc: 1/0,0 mode: --/PR res: 8589956318/75033 bits 0x3 rrc: 1 type: IBT flags: 0x30002090 remote: 0x7fc03742338bfc8b expref: -99 pid: 15372 timeout: 0

Naturally since we did not send anything, the server later evicted this client.

It seems that the "going to sleep for 0 seconds" is unusual, all other instances of blocking callback were going to sleep for 23 seconds and succeeded.
So this is likely the error that needs to be found.

Comment by Oleg Drokin [ 15/Aug/11 ]

Ok, further looking into the log revealed this:
00000100:02020000:1.0:1313196264.669907:0:3915:0:(client.c:1122:ptlrpc_check_status()) 11-0: an error occurred while communicating with 10.1.1.212@o2ib9. The mds_connect operation failed with -16
00000100:00080000:1.0:1313196264.669909:0:3915:0:(import.c:1125:ptlrpc_connect_interpret()) ffff8801fa85f800 lc2-MDT0000_UUID: changing import state from CONNECTING to DISCONN

So the import is not in the FULL state, which prevents us from sending the cancel out.

The no_delay bit of logic was added by a patch from bug 11974. In particular, comment #34 from Andreas asks exactly about this scenario. Supposedly this could not happen, and I explain why in comment #35, but of course reality is slightly different from theory.

Basically, what happened with at least some of the clients is that they sent some requests to the MDS and those requests got stuck waiting on a lock that was not released. The clients then timed out (the deadline was apparently set at 73 seconds) and tried to reconnect.
At the same time the server still maintains the connection and is able to send things like ASTs back to the client. Of course this does not help the client at all, since the client cannot send anything back (even if it wanted to) until it reconnects, but the server will not let the client reconnect because it is already processing a request from that client.

Now, the original reason why some of the requests got stuck on the MDS is still unclear at this point, but I have only reviewed two client logs. Also, the MDS log does not go far enough into the past to see it on that end (possibly the MDS getting stuck waiting for objects from the OSC with locks held?).

I think the "MDS waiting for objects with locks held" scenario needs to be investigated first and, if true, relieved, as that was a massive source of failures in the past too.

Comment by Prakash Surya (Inactive) [ 15/Aug/11 ]

Thanks for the analysis Oleg. Your description definitely helps my understanding of the problem.

As far as the MDS log goes, would it help to increase the buffer size on the MDS and reproduce the problem? Currently the size of the buffer on the MDS is:

# zwicky-mds2 /root > cat /proc/sys/lnet/debug_mb
81
Comment by Oleg Drokin [ 15/Aug/11 ]

Well, 81MB is definitely not enough, as we can see. If you can spare the memory, the maximum size for the log is 2GB.
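
A sketch of bumping it (either form should work, since debug_mb is exposed both through lctl and /proc):

lctl set_param debug_mb=2048            # grow the kernel debug buffer to 2GB
# or, equivalently:
echo 2048 > /proc/sys/lnet/debug_mb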

Comment by Prakash Surya (Inactive) [ 16/Aug/11 ]

Here's the memory info for the MDS while it is essentially idle:

# zwicky-mds2 /root > free
             total       used       free     shared    buffers     cached
Mem:      32881440   26163084    6718356          0   23873776     215796
-/+ buffers/cache:    2073512   30807928
Swap:            0          0          0

There's about 6.5G available, but I'm not familiar enough with the memory usage of the MDS to really know if it could spare the 2G for the log when it is under load. I will go ahead and try it and see what happens. Let me know if you think another size would be better.

Comment by Christopher Morrone [ 16/Aug/11 ]

Prakash, the 6.5GB number by itself is not particularly useful. With Linux, the "free" value tends to be fairly low because the kernel will employ much of the otherwise-unused memory for buffers and cache. The kernel there is using almost 24GB for buffers and 215MB for cache. I don't think that 2GB for the log will be a significant problem.

Comment by Peter Jones [ 12/Dec/11 ]

Is this still a live issue or has it become obsolete as testing has moved forward onto more current code?

Comment by Prakash Surya (Inactive) [ 13/Dec/11 ]

On our end, we've moved on to testing more current 2.1 code. Unless Chris chimes in and says otherwise, I think it's safe to obsolete this issue.

Comment by Peter Jones [ 13/Dec/11 ]

Thanks Prakash. I will close out the ticket for now; you can always reopen it if Chris feels it would be worth pursuing this further under this ticket (rather than opening a new one).
