[LU-8086] client eviction after MDT restart or failover Created: 29/Apr/16  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frank Heckes (Inactive) Assignee: Lai Siyao
Resolution: Not a Bug Votes: 0
Labels: soak
Environment:

lola
build: master commit 71d2ea0fde17ecde0bf237f486d4bafb5d54fe3f + patches


Attachments: File console-lola-11.bz2     File console-lola-20.bz2     File console-lola-3.bz2     File lola-11-lustre-log.1461863436.19099.bz2     File lola-11-lustre-log.1461863718.28215.bz2     File lola-11-lustre-log.1461863825.28262.bz2     File lola-20-lustre-log.1461863720.182550.bz2     File lola-3-lustre-log.1461863731.75227.bz2     File lola-3-lustre-log.1461863838.75334.bz2     File messages-lola-11.bz2     File messages-lola-20.bz2     File messages-lola-3.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurs during soak testing of build '20160427' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160427). DNE is enabled. OSTs are formatted with ZFS; MDTs use ldiskfs as the storage backend. OSS and MDT nodes are configured in an HA active-active failover configuration. For debugging purposes, the parameter dump_on_eviction=1 was set.

The configuration, especially the mapping of nodes to roles, can be found here: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration

After every MDS restart or failover, a large number of Lustre nodes (very often the majority) are evicted.

The following sequence of events is 100% reproducible:

  • 2016-04-28 10:07:15,956 mds_failover lola-8 ---> lola-9 started
  • 2016-04-28 10:16:23,738:fsmgmt.fsmgmt:INFO Node lola-9: 'soaked-MDT0000' recovery completed
  • 2016-04-28 10:16:23,739:fsmgmt.fsmgmt:INFO Unmounting soaked-MDT0000 on lola-9 ...
  • 2016-04-28 10:16:48,995:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-8

2016-04-28 10:16:48,996 mds_failover (failback completed; lola-8 runs its own resource mdt-0 again)

2016-04-28 10:17:32 recovery of mdt-0 finished on lola-8:

Apr 28 10:17:32 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:43, of 21 clients 21 recovered and 0 were evicted.
  • 2016-04-28 10:17:* most clients get evicted, although the Lustre message above states otherwise:
    lola-10.log:Apr 28 10:17:24 lola-10 kernel: LustreError: 48860:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-11.log:Apr 28 10:17:05 lola-11 kernel: LustreError: 28261:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-13.log:Apr 28 10:17:06 lola-13 kernel: LustreError: 81063:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-16.log:Apr 28 10:17:10 lola-16 kernel: LustreError: 229277:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-18.log:Apr 28 10:17:08 lola-18 kernel: LustreError: 110914:0:(import.c:1405:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-19.log:Apr 28 10:17:25 lola-19 kernel: LustreError: 233525:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-20.log:Apr 28 10:17:14 lola-20 kernel: LustreError: 182741:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-21.log:Apr 28 10:17:14 lola-21 kernel: LustreError: 155091:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-22.log:Apr 28 10:17:05 lola-22 kernel: LustreError: 171992:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-23.log:Apr 28 10:17:34 lola-23 kernel: LustreError: 158263:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-24.log:Apr 28 10:17:21 lola-24 kernel: LustreError: 160657:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-25.log:Apr 28 10:17:11 lola-25 kernel: LustreError: 196242:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-26.log:Apr 28 10:17:07 lola-26 kernel: LustreError: 153478:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-27.log:Apr 28 10:17:20 lola-27 kernel: LustreError: 158888:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-29.log:Apr 28 10:17:25 lola-29 kernel: LustreError: 29326:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-2.log:Apr 28 10:17:10 lola-2 kernel: LustreError: 16891:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-2.log:Apr 28 10:17:17 lola-2 kernel: LustreError: 16899:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-2.log:Apr 28 10:17:42 lola-2 kernel: LustreError: 16907:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-30.log:Apr 28 10:17:14 lola-30 kernel: LustreError: 34608:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-31.log:Apr 28 10:17:21 lola-31 kernel: LustreError: 17749:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-32.log:Apr 28 10:17:02 lola-32 kernel: LustreError: 152914:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-33.log:Apr 28 10:17:14 lola-33 kernel: LustreError: 165946:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-34.log:Apr 28 10:17:16 lola-34 kernel: LustreError: 152469:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-3.log:Apr 28 10:17:18 lola-3 kernel: LustreError: 75334:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-4.log:Apr 28 10:17:08 lola-4 kernel: LustreError: 34658:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-5.log:Apr 28 10:17:07 lola-5 kernel: LustreError: 32477:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-6.log:Apr 28 10:17:02 lola-6 kernel: LustreError: 75888:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-7.log:Apr 28 10:17:24 lola-7 kernel: LustreError: 20063:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
    lola-9.log:Apr 28 10:17:31 lola-9 kernel: LustreError: 11783:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction
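
A listing like the one above can be summarised per node, which makes repeated dumps on a single node (such as lola-2) easier to spot. The following is a minimal Python sketch, assuming the input is grep-style output of the form "<file>:<syslog line>" as shown above; the function name count_eviction_dumps and the regular expression are illustrative only and are not part of Lustre or of the soak framework.

```python
import re
from collections import Counter

# Assumed input format (taken from the listing above):
#   <file>:<Mon DD HH:MM:SS> <node> kernel: LustreError: ... dump the log upon eviction
EVICTION_RE = re.compile(
    r"^(?P<file>[\w.-]+):"
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<node>\S+)\s+kernel:\s+LustreError:.*dump the log upon eviction"
)


def count_eviction_dumps(lines):
    """Return a Counter mapping node name to number of eviction dump messages."""
    counts = Counter()
    for line in lines:
        match = EVICTION_RE.match(line.strip())
        if match:
            counts[match.group("node")] += 1
    return counts


# Three lines copied from the listing above.
sample = [
    "lola-2.log:Apr 28 10:17:10 lola-2 kernel: LustreError: 16891:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction",
    "lola-2.log:Apr 28 10:17:17 lola-2 kernel: LustreError: 16899:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction",
    "lola-3.log:Apr 28 10:17:18 lola-3 kernel: LustreError: 75334:0:(import.c:1406:ptlrpc_invalidate_import_thread()) dump the log upon eviction",
]
counts = count_eviction_dumps(sample)
for node, n in sorted(counts.items()):
    print(node, n)
```

Note that this only counts "dump the log upon eviction" messages; it does not tell which import (MGS, MDT, or OST) was evicted, which requires the debug logs.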
    

Attached are the messages, console, and debug log files for one node of each Lustre node type:
OSS : lola-3
MDS : lola-11
client : lola-20

As stated above, the effect can be reproduced reliably in case additional information is needed.

The IB fabric and LNet routers didn't indicate any errors or malfunctions during the time intervals in which the error occurred, nor before or after.



 Comments   
Comment by Di Wang [ 02/May/16 ]

It seems most of the evictions happened between the MGC and the MGS in lola-20-lustre-log.1461863720.182550, which is normal in this test.

ptlrpc_invalidate_import_thread dump the log upon eviction
ptlrpc_invalidate_import_thread ffff880821fd8000 MGS: changing import state from EVICTED to RECOVER
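
To check which imports were actually evicted, the "changing import state" lines in a decoded debug log can be grouped by target. Below is a minimal Python sketch, assuming each such line contains "<target>: changing import state from <OLD> to <NEW>" as in the excerpt above; the helper name evicted_targets is illustrative, and the second sample line (with an MDT target) is hypothetical and not taken from the attached logs.

```python
import re
from collections import Counter

# Assumed line shape, based on the excerpt above:
#   ... <target>: changing import state from <OLD> to <NEW>
STATE_RE = re.compile(
    r"(?P<target>\S+): changing import state from (?P<old>\w+) to (?P<new>\w+)"
)


def evicted_targets(lines):
    """Count, per target, the transitions that leave the EVICTED state."""
    targets = Counter()
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group("old") == "EVICTED":
            targets[match.group("target")] += 1
    return targets


sample = [
    # From the excerpt above.
    "ptlrpc_invalidate_import_thread ffff880821fd8000 MGS: changing import state from EVICTED to RECOVER",
    # Hypothetical line, for illustration only.
    "ptlrpc_invalidate_import_thread ffff880000000000 soaked-MDT0000_UUID: changing import state from EVICTED to RECOVER",
]
targets = evicted_targets(sample)
for target, n in sorted(targets.items()):
    print(target, n)
```

If nearly all entries are for the MGS, the dumps reflect MGC/MGS evictions rather than clients being evicted from the MDT.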
Generated at Sat Feb 10 02:14:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.