[LU-1142] MDS recovery fails due to client evictions
Created: 25/Feb/12  Updated: 14/Sep/17  Resolved: 25/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Cliff White (Inactive)
Assignee: Oleg Drokin
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Hyperion - RHEL 5


Severity: 3
Rank (Obsolete): 6445

 Description   

MDS recovery fails; a single client is evicted.
Client:
---------------

Lustre: lustre-MDT0000-mdc-ffff81021ccdac00: Connection restored to service lustre-MDT0000 using nid 192.168.120.126@o2ib.
Lustre: DEBUG MARKER: mds has failed over 2 times, and counting...
LustreError: 11-0: an error occurred while communicating with 192.168.120.126@o2ib. The ldlm_enqueue operation failed with -107
Lustre: lustre-MDT0000-mdc-ffff81021ccdac00: Connection to service lustre-MDT0000 via nid 192.168.120.126@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
Lustre: Server lustre-MDT0000_UUID version (2.1.1.0) is much newer than client version (1.8.7)
LustreError: 20567:0:(mdc_locks.c:652:mdc_enqueue()) ldlm_cli_enqueue error: -4
LustreError: 20567:0:(file.c:3329:ll_inode_revalidate_fini()) failure -4 inode 222298113
LustreError: 20742:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810217f9ec00 x1394757178766278/t0 o101->lustre-MDT0000_UUID@192.168.120.126@o2ib:12/10 lens 544/1232 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: lustre-MDT0000-mdc-ffff81021ccdac00: Connection restored to service lustre-MDT0000 using nid 192.168.120.126@o2ib.
Lustre: DEBUG MARKER: Duration: 86400
LustreError: 17920:0:(o2iblnd_cb.c:2532:kiblnd_rejected()) 192.168.117.3@o2ib rejected: o2iblnd fatal error
LustreError: 17920:0:(o2iblnd_cb.c:2532:kiblnd_rejected()) Skipped 39 previous similar messages

MDS:
---------------

Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: lustre-MDT0000: sending delayed replies to recovered clients
Lustre: 25439:0:(mds_lov.c:1024:mds_notify()) MDS mdd_obd-lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0000_UUID
Lustre: 25439:0:(mds_lov.c:1024:mds_notify()) Skipped 7 previous similar messages
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
Lustre: Skipped 7 previous similar messages
Lustre: DEBUG MARKER: mds has failed over 2 times, and counting...
md: rebuild md1 throttled due to IO
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 192.168.114.116@o2ib  ns: mdt-ffff81091498f800 lock: ffff810fbf9f66c0/0xcb280298ce1d3c25 lrc: 3/0,0 mode: PR/PR res: 222298113/3922531948 bits 0x3 rrc: 217 type: IBT flags: 0x20 remote: 0x3c5e7588abacbec3 expref: 8 pid: 25553 timeout: 4299068451
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 192.168.114.51@o2ib  ns: mdt-ffff81091498f800 lock: ffff810fbf9f6480/0xcb280298ce1d3c17 lrc: 3/0,0 mode: PR/PR res: 222298113/3922531948 bits 0x3 rrc: 217 type: IBT flags: 0x20 remote: 0x5711f697b9a89693 expref: 8 pid: 25553 timeout: 4299068451
LustreError: 25588:0:(ldlm_lockd.c:1210:ldlm_handle_enqueue0()) ### lock on destroyed export ffff81054ec6c000 ns: mdt-ffff81091498f800 lock: ffff810cef6a4480/0xcb280298ce1d3f2e lrc: 3/0,0 mode: PR/PR res: 222298113/3922531948 bits 0x3 rrc: 193 type: IBT flags: 0x4000000 remote: 0xfb40c962a891f585 expref: 3 pid: 25588 timeout: 0
LustreError: 25588:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810397453000 x1394757210221710/t0(0) o-1->7a66717e-dbe2-1092-ecee-6263c3bca713@NET_0x50000c0a8728f_UUID:0/0 lens 544/536 e 2 to 0 dl 1330146049 ref 1 fl Interpret:/ffffffff/ffffffff rc -107/-1
LustreError: 25616:0:(ldlm_lockd.c:1210:ldlm_handle_enqueue0()) ### lock on destroyed export ffff810550486000 ns: mdt-ffff81091498f800 lock: ffff8105542e5d80/0xcb280298ce1d40d2 lrc: 3/0,0 mode: PR/PR res: 222298113/3922531948 bits 0x3 rrc: 168 type: IBT flags: 0x4000000 remote: 0x3c5e7588abacbed1 expref: 3 pid: 25616 timeout: 0
LustreError: 25588:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 96 previous similar messages
Lustre: 25570:0:(ldlm_lib.c:877:target_handle_connect()) lustre-MDT0000: connection from bb5f6103-fd47-8201-1084-9a41a87168fe@192.168.114.116@o2ib t8590090887 exp 0000000000000000 cur 1330145971 last 0
Lustre: 25570:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 127 previous similar messages
Lustre: 25582:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-MDT0000->NET_0x50000c0a87291_UUID netid 50000: select flavor null
Lustre: 25582:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 136 previous similar messages
Lustre: DEBUG MARKER: Duration: 86400
md: rebuild md1 throttled due to IO
md: rebuild md1 throttled due to IO
md: rebuild md1 throttled due to IO
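
The MDS log above reports each eviction in a waiting_locks_callback() message that names the client NID, the callback timeout, and the lock resource. As a rough aid for triage, the sketch below pulls those three fields out of such lines. The regular expression is an assumption based only on the two eviction lines quoted in this ticket, and the names EVICT_RE, find_evictions, and SAMPLE are hypothetical; other Lustre versions may format the message differently.

```python
import re

# Pattern inferred from the eviction lines quoted in this ticket (an assumption,
# not a documented log format).
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)"
    r".*?res: (?P<res>\S+)"
)


def find_evictions(log_text):
    """Return a list of (nid, timeout_seconds, resource) for each eviction line."""
    evictions = []
    for line in log_text.splitlines():
        match = EVICT_RE.search(line)
        if match:
            evictions.append(
                (match.group("nid"), int(match.group("secs")), match.group("res"))
            )
    return evictions


# Lines copied from the MDS log in this ticket.
SAMPLE = """\
md: rebuild md1 throttled due to IO
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 192.168.114.116@o2ib  ns: mdt-ffff81091498f800 lock: ffff810fbf9f66c0/0xcb280298ce1d3c25 lrc: 3/0,0 mode: PR/PR res: 222298113/3922531948 bits 0x3 rrc: 217 type: IBT flags: 0x20 remote: 0x3c5e7588abacbec3 expref: 8 pid: 25553 timeout: 4299068451
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 192.168.114.51@o2ib  ns: mdt-ffff81091498f800 lock: ffff810fbf9f6480/0xcb280298ce1d3c17 lrc: 3/0,0 mode: PR/PR res: 222298113/3922531948 bits 0x3 rrc: 217 type: IBT flags: 0x20 remote: 0x5711f697b9a89693 expref: 8 pid: 25553 timeout: 4299068451
"""

if __name__ == "__main__":
    for nid, secs, res in find_evictions(SAMPLE):
        print(f"evicted {nid} after {secs}s on resource {res}")
```

In the quoted log, both eviction lines carry the resource 222298113/3922531948, whose first component matches the inode number in the client's ll_inode_revalidate_fini() error.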


 Comments   
Comment by Cliff White (Inactive) [ 25/Feb/12 ]

Clients were on 1.8.7; retrying.

Comment by Peter Jones [ 25/Feb/12 ]

Oleg

Any thoughts?

Peter

Comment by songzhlong [ 14/Sep/17 ]

Has the problem been fixed?
What are the details?

Generated at Sat Feb 10 01:13:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.