[LU-1629] MDS waiting for 0 clients in recovery Created: 13/Jul/12  Updated: 01/Aug/12  Resolved: 01/Aug/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Li Wei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment: Hyperion
Attachments: File failog    
Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 4511

 Description   

Hyperion cluster, 100+ mounted clients.
Powered off the entire cluster. Upon restart, the MDS is in recovery, waiting for 0 clients.
Example message: Jul 13 07:01:06 ehyperion-rst6 kernel: Lustre: lustre-MDT0000: Denying connection for new client 192.168.117.50@o2ib1 (at 11e711ab-a329-f07a-8312-6a40af7fc5a4), waiting for 0 clients in recovery for 2:38
At the end of the recovery timeout, the clients mount:

Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 112 stale clients
Lustre: lustre-MDT0000: Recovery over after 5:00, of 112 clients 0 recovered and 112 were evicted.



 Comments   
Comment by Cliff White (Inactive) [ 13/Jul/12 ]

System log from MGS/MDS for the recovery

Comment by Andreas Dilger [ 13/Jul/12 ]

I thought I saw a very similar bug report a week or two ago. Did you do a search for this first? I believe there was already a patch for it.

Comment by Mikhail Pershin [ 13/Jul/12 ]

This is the same as ORI-668; there is discussion about that issue.

Comment by Mikhail Pershin [ 13/Jul/12 ]

This is not a recovery issue but a reporting one: we currently report only the number of clients in recovery, which is why it shows 0. There is no patch for this yet.

Comment by Jodi Levi (Inactive) [ 16/Jul/12 ]

Mike,
Would you be able to look into this one or assign to someone that could?
Thank you!

Comment by Ian Colle (Inactive) [ 25/Jul/12 ]

Li Wei - Mikhail is swamped with the rebase. Can you please work on this? He says the fix should be to change the message and output slightly different values.

Comment by Mikhail Pershin [ 25/Jul/12 ]

The message can be changed as follows:

                        LCONSOLE_WARN("%s: Denying connection for new client "
                                      "%s (at %s), total clients to recover %d,"
                                      " %d clients in recovery for %d:%.02d\n",
                                      target->obd_name,
                                      libcfs_nid2str(req->rq_peer.nid),
                                      cluuid.uuid,
                                      target->obd_max_recoverable_clients,
                                      cfs_atomic_read(&target->obd_lock_replay_clients),
                                      (int)t / 60, (int)t % 60);

It outputs obd_max_recoverable_clients as the total number of clients expected to recover, and uses the obd_lock_replay_clients counter to show how many of them are already participating in recovery.
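
To see what the proposed message would look like, here is a minimal user-space sketch that runs the same format string through snprintf(). The struct recovery_state type and the format_deny_msg() helper are hypothetical stand-ins for illustration only; they are not part of Lustre, and only the field names mirror the snippet above.

```c
#include <stdio.h>

/* Hypothetical stand-in for the fields the kernel snippet reads from the
 * target device; this is NOT the real Lustre structure. */
struct recovery_state {
        const char *obd_name;
        int obd_max_recoverable_clients;  /* total clients expected to recover */
        int obd_lock_replay_clients;      /* clients already in recovery */
};

/* Format the proposed console message into buf.
 * t is the remaining recovery time in seconds.
 * Returns the value returned by snprintf(). */
static int format_deny_msg(char *buf, size_t len,
                           const struct recovery_state *target,
                           const char *nid, const char *uuid, long t)
{
        return snprintf(buf, len,
                        "%s: Denying connection for new client "
                        "%s (at %s), total clients to recover %d,"
                        " %d clients in recovery for %d:%.02d\n",
                        target->obd_name, nid, uuid,
                        target->obd_max_recoverable_clients,
                        target->obd_lock_replay_clients,
                        (int)t / 60, (int)t % 60);
}
```

Calling format_deny_msg() with the name, NID and UUID from the log line in the description, t = 158 seconds, 0 clients in recovery, and an assumed total of 112 (taken from the "disconnecting 112 stale clients" line, not from the original message) would give a line ending in "total clients to recover 112, 0 clients in recovery for 2:38", which makes it clear that the 0 is a progress counter rather than the number of clients being waited for.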

Comment by Li Wei (Inactive) [ 27/Jul/12 ]

http://review.whamcloud.com/3485

Comment by Li Wei (Inactive) [ 01/Aug/12 ]

The patch has landed to master.
