[LU-9493] Recovered clients + evicted clients != total clients Created: 11/May/17  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After an MDT completed recovery, the accounting of recovered clients appears to be off. Only 2651 of 2827 clients were recovered, yet 0 were evicted. Is there a third category not listed that would account for the other 176 clients? Or is this a bookkeeping bug? Perhaps not coincidentally, we see 176 clients get evicted over two hours after recovery completed.

May 10 09:17:07 zinc1 kernel: Lustre: lsh-MDT0000: Will be in recovery for at least 5:00, or until 2827 clients reconnect
...
May 10 09:22:07 zinc1 kernel: Lustre: lsh-MDT0000: Recovery over after 5:01, of 2827 clients 2651 recovered and 0 were evicted.
...
May 10 11:30:36 zinc1 kernel: Lustre: lsh-MDT0000: haven't heard from client 81e4b56c-c961-07f6-b732-42fce10b4acf (at 192.168.120.150@o2ib20) in 227 seconds. I think it's dead, and I am evicting it. exp ffff887f33220c00, cur 1494441035 expire 1494440885 last 1494440808
May 10 11:30:36 zinc1 kernel: Lustre: Skipped 175 previous similar messages




 Comments   
Comment by Peter Jones [ 12/May/17 ]

Mike

Can you please advise on this one?

Thanks

Peter

Comment by John Hammond [ 12/May/17 ]

Recovery over after 5:01, of 2827 clients 2651 recovered and 0 were evicted.

2827 is obd->obd_max_recoverable_clients
2651 is obd->obd_connected_clients
0 is obd->obd_stale_clients

obd->obd_max_recoverable_clients is incremented in target_handle_connect() and in tgt_clients_data_init() but never decremented.

Comment by Andreas Dilger [ 12/May/17 ]

Ned, just to clarify - when you write that clients are evicted after recovery completes, is that based only on the eviction messages on the server console, or were these clients running and then evicted at a later time (based on messages from the clients)? It is entirely possible that the 175 clients were evicted during recovery but nothing was printed to the console due to rate limiting, and that no further client was evicted until a few hours later, which is what caused the "Skipped" message to be printed.

Comment by Christopher Morrone [ 12/May/17 ]

It was an MDS/MGS node, not an OSS. And it was the MDT recovery on the MDS/MGS node that is in question here.

Here is an unedited excerpt of the console log.  The evictions certainly seem to happen quite a bit after recovery, because there is an intermediate message between the recovery and the client evictions.

2017-05-10 09:22:06 [  513.754005] Lustre: lsh-MDT0000: Recovery over after 5:01, of 2827 clients 2651 recovered and 0 were evicted.
2017-05-10 09:22:06 [  514.049819] LustreError: 25982:0:(lod_object.c:1370:lod_verify_md_striping()) lsh-MDT0000-mdtlov: invalid lmv_user_md: magic = cd20cd0, stripe_offset = 2, stripe_count = 1: rc = -22
2017-05-10 09:22:08 [  515.942796] LustreError: 26212:0:(lod_object.c:1370:lod_verify_md_striping()) lsh-MDT0000-mdtlov: invalid lmv_user_md: magic = cd20cd0, stripe_offset = 1, stripe_count = 1: rc = -22
2017-05-10 09:22:08 [  515.962992] LustreError: 26212:0:(lod_object.c:1370:lod_verify_md_striping()) Skipped 1 previous similar message
2017-05-10 09:23:22 [  589.467482] Lustre: lsh-MDT0000: haven't heard from client c956df00-bd83-1e8b-a6f7-0913a42e04c7 (at (no nid)) in 377 seconds. I think it's dead, and I am evicting it. exp ffff883f7cf5f000, cur 1494433402 expire 1494433252 last 1494433025

<ConMan> Console [zinc1] log at 2017-05-10 10:00:00 PDT.
2017-05-10 10:40:52 [ 5239.602926] Lustre: MGS: Connection restored to 0d629026-0570-e357-c21e-1b205de673d5 (at 192.168.120.113@o2ib20)
2017-05-10 10:40:52 [ 5239.615440] Lustre: Skipped 53571 previous similar messages

<ConMan> Console [zinc1] log at 2017-05-10 11:00:00 PDT.
2017-05-10 11:30:35 [ 8222.170802] Lustre: lsh-MDT0000: haven't heard from client 81e4b56c-c961-07f6-b732-42fce10b4acf (at 192.168.120.150@o2ib20) in 227 seconds. I think it's dead, and I am evicting it. exp ffff887f33220c00, cur 1494441035 expire 1494440885 last 1494440808
2017-05-10 11:30:35 [ 8222.197727] Lustre: Skipped 175 previous similar messages

<ConMan> Console [zinc1] log at 2017-05-10 12:00:00 PDT.
2017-05-10 12:08:11 [10478.406270] Lustre: MGS: Connection restored to 5dc9e673-caec-e6ab-fdf5-bc77862035df (at 192.168.120.150@o2ib20)
2017-05-10 12:08:11 [10478.418793] Lustre: Skipped 311 previous similar messages

<ConMan> Console [zinc1] log at 2017-05-10 13:00:00 PDT.

<ConMan> Console [zinc1] log at 2017-05-10 14:00:00 PDT.
2017-05-10 14:15:32 [18118.782832] Lustre: lsh-MDT0000: haven't heard from client ae2100bd-5f45-ab33-f023-6176fa0a2654 (at 192.168.120.33@o2ib20) in 227 seconds. I think it's dead, and I am evicting it. exp ffff887f0ae13800, cur 1494450932 expire 1494450782 last 1494450705
2017-05-10 14:15:32 [18118.809645] Lustre: Skipped 1 previous similar message
2017-05-10 14:45:36 [19922.706809] Lustre: lsh-MDT0000: haven't heard from client 892d358b-efb0-a427-e3e3-fbc91f9c6c7a (at 192.168.120.164@o2ib20) in 227 seconds. I think it's dead, and I am evicting it. exp ffff887eeb1f9800, cur 1494452736 expire 1494452586 last 1494452509
2017-05-10 14:45:36 [19922.733748] Lustre: Skipped 1 previous similar message

<ConMan> Console [zinc1] log at 2017-05-10 15:00:00 PDT.

 

The one client that I know was evicted, 192.168.120.150@o2ib20, appears to have been rebooting during this time, for reasons unknown.  I see the boot begin at 2017-05-10 09:03:41, and the node is up by 2017-05-10 09:08:11.  No Lustre messages appear, however, so the Lustre client may not have been started at all.  Another reboot of that node happens at 2017-05-10 10:25:32, and this time Lustre modules are loaded at 2017-05-10 10:40:57 and mounts proceed as usual.

So it looks to me like the MDT really should have evicted this node during recovery.  No?

As a side note, does the rate limiting code really not print out the first line until the following Skipped message is printed? It seems like we would want to print that line immediately so that we get the timestamp of the first one correct...

Comment by Andreas Dilger [ 14/May/17 ]

As a side note, does the rate limiting code really not print out the first line until the following Skipped message is printed? It seems like we would want to print that line immediately so that we get the timestamp of the first one correct...

The console rate limiting should always print out the first instance of the message, but later ones would be throttled if they are arriving too quickly afterward. The "skipped" message is not printed until the throttle timeout is exceeded and the message is hit again.

Looking at the code, it may be possible to print out when throttling was enabled, something like:

Lustre: Skipped 311 previous similar messages in the past 365s
Comment by Christopher Morrone [ 15/May/17 ]

If the rate limiting prints out the first instance of the message without delay, then it isn't terribly necessary to add the extra time information.  The first line would have the initial time stamp, and the "Skipped X previous similar messages" line would have a later timestamp.  So are you really sure that the first line is printed immediately?  If that was the case, then Ned's information would not have been ambiguous about how much time passed before the first eviction, right?  Or am I missing something?

 

Comment by Mikhail Pershin [ 16/Jan/22 ]

Looks outdated

Generated at Sat Feb 10 02:26:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.