[LU-3929] 2.1.6->2.4.1 rolling upgrade: lustre-MDT0000: recovery is timed out, evict stale exports Created: 11/Sep/13  Updated: 07/Jul/14  Resolved: 06/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: mn4

Issue Links:
Related
is related to LU-5298 The lwp device cannot be started when... Resolved
Severity: 3
Rank (Obsolete): 10379

 Description   

While performing a rolling upgrade from Lustre 2.1.6 to 2.4.1 RC2 along the path OSS->MDS->Client, one node at a time, the test failed after upgrading the MDS:

Starting the MDS service on fat-amd-3...
----------------
fat-amd-3
----------------
debug=-1
subsystem_debug=all -lnet -lnd -pinger
debug_mb=100
pdsh -l root -t 100 -S -w fat-amd-3 "mkdir -p /mnt/mds1 && mount -t lustre -o user_xattr /dev/sdc1 /mnt/mds1"
Waiting 895 secs for fat-amd-3 recovery done. status: RECOVERING
<~snip~>
Waiting 5 secs for fat-amd-3 recovery done. status: RECOVERING
Waiting 0 secs for fat-amd-3 recovery done. status: RECOVERING
fat-amd-3 recovery not done in 900 sec. status: RECOVERING

On MDS fat-amd-3, "lctl get_param -n ..recovery_status" showed that:

----------------
fat-amd-3
----------------
status: RECOVERING
recovery_start: 1378874775
time_remaining: 0
connected_clients: 2/4
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 2
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 4294967297
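
For anyone scripting a check for this state, the output above can be parsed as plain "key: value" lines. The following is a minimal sketch, not part of the test framework: the function names `parse_recovery_status` and `recovery_stalled` are hypothetical, and the "stalled" condition (status still RECOVERING, time_remaining at 0, fewer clients connected than expected) is an assumption drawn from the values in this report rather than an official Lustre definition.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from recovery_status output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def recovery_stalled(fields):
    """Assumed heuristic: recovery is still running although its timer has
    expired and not all expected clients have connected."""
    connected, expected = (int(n) for n in fields["connected_clients"].split("/"))
    return (
        fields["status"] == "RECOVERING"
        and int(fields["time_remaining"]) == 0
        and connected < expected
    )


# Sample copied from the recovery_status output in this report.
sample = """\
status: RECOVERING
recovery_start: 1378874775
time_remaining: 0
connected_clients: 2/4
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 2
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 4294967297
"""

fields = parse_recovery_status(sample)
print(recovery_stalled(fields))
```

The field names, including the `lock_repay_clients` spelling, are kept exactly as they appear in the output so that the parser matches what the MDS prints.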

Console log on MDS showed that:

Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 4 clients reconnect
Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 2 stale clients
Lustre: lustre-MDT0000: recovery is timed out, evict stale exports

Maloo reports:
https://maloo.whamcloud.com/test_sets/d91a2b68-1aa1-11e3-88ff-52540035b04c
https://maloo.whamcloud.com/test_sets/dae9450c-1a86-11e3-8ceb-52540035b04c

The same failure also occurred during a rolling upgrade from Lustre 2.1.6 to 2.4.0:
https://maloo.whamcloud.com/test_sets/c70af506-1ab5-11e3-8898-52540035b04c



 Comments   
Comment by Peter Jones [ 12/Sep/13 ]

Hongchao

Could you please make an assessment of this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 13/Sep/13 ]

This issue is related to the LWP (Light Weight Proxy) connection.
In b2_1_*, the LWP connection is treated as a normal client and client data ("tg_export_data") is allocated for it, so there is one more export
to recover during recovery at the MDT. In b2_4_*, the MDT does not allocate client data for the LWP connection, which does not need recovery.
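
The mismatch described above can be illustrated with a toy model. This is only a sketch of the counting problem, not Lustre code: the `Connection` class, both helper functions, and the connection names are hypothetical, and the counts are chosen to mirror the "connected_clients: 2/4" figure from the description (the ticket does not state that both missing exports were LWP connections).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    name: str
    is_lwp: bool


def recorded_exports(connections, records_lwp):
    """Exports for which the MDT keeps per-client recovery data."""
    return [c for c in connections if records_lwp or not c.is_lwp]


def recovering_clients(connections):
    """Connections that take part in recovery; LWP connections do not."""
    return [c for c in connections if not c.is_lwp]


# Hypothetical set of connections seen by the MDT before the upgrade.
connections = [
    Connection("client-1", is_lwp=False),
    Connection("client-2", is_lwp=False),
    Connection("oss-lwp-1", is_lwp=True),
    Connection("oss-lwp-2", is_lwp=True),
]

# b2_1 behaviour: LWP connections get client data, so they are expected back.
expected = len(recorded_exports(connections, records_lwp=True))
completed = len(recovering_clients(connections))
print(f"{completed}/{expected}")

# b2_4 behaviour: no client data for LWP connections, so the counts agree.
expected_b2_4 = len(recorded_exports(connections, records_lwp=False))
print(f"{completed}/{expected_b2_4}")
```

In the first case the MDT waits for exports that never go through recovery, which is consistent with the recovery timing out instead of completing.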

Comment by Sebastien Buisson (Inactive) [ 21/Oct/13 ]

Hi,

We are suffering from this error, which is very annoying when a customer wants to upgrade its OSSes first and then its MDSes and clients.

This ticket was opened a month ago, but has not made any progress since then. This is surprising, as I would tend to consider it a major issue on the upgrade path from 2.1 to 2.4 (and 2.5 too). Am I missing something?

Sebastien.

Comment by Hongchao Zhang [ 24/Oct/13 ]

status update:
the patch is under testing and will be pushed to Gerrit soon. Thanks

Comment by Hongchao Zhang [ 29/Oct/13 ]

the patch is against b2_1, and is tracked at http://review.whamcloud.com/#/c/8086/

Comment by Hongchao Zhang [ 19/Nov/13 ]

the patch against master is tracked at http://review.whamcloud.com/#/c/8328/

Comment by Sebastien Buisson (Inactive) [ 19/Nov/13 ]

Hi, I have just tested patch http://review.whamcloud.com/#/c/8086/ for b2_1, and it works fine: a rolling upgrade from Lustre 2.1.6 plus this patch to 2.4.1 went off smoothly.

So now I am wondering what the purpose of the new patch http://review.whamcloud.com/#/c/8328/ for master is.

Sebastien.

Comment by Hongchao Zhang [ 22/Nov/13 ]

With the patch against master, more previous Lustre versions could be upgraded to the new version.

Comment by Sebastien Buisson (Inactive) [ 22/Nov/13 ]

Do you mean the master patch alone would be enough to be able to successfully upgrade from 2.1 with the path OSS->MDS->Client?

Comment by Oleg Drokin [ 23/Nov/13 ]

Yes, the master patch alone should be enough to allow upgrades from unpatched 2.1 OSTs (i.e. those that do not have the 8086 patch present). Can you give such a combination a try, please?

We believe it's a better way, since it saves you the extra step of upgrading all your OSTs to 2.1.6+patch before you can update your MDS to 2.4+ and then update your OSTs again to 2.4+ (which is kind of overkill).

Comment by Sebastien Buisson (Inactive) [ 26/Nov/13 ]

Hi,

Here is the test I carried out:

It went off smoothly. So I confirm that the master patch is enough. And, as Oleg explained, having the patch in the target version simplifies the upgrade.

Cheers,
Sebastien.

Comment by Peter Jones [ 06/Jan/14 ]

Landed for 2.5.1
