[LU-12283] Lustre client can not mount filesystem Created: 10/May/19  Updated: 02/Mar/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Major
Reporter: sebg-crd-pm (Inactive) Assignee: Peter Jones
Resolution: Unresolved Votes: 1
Labels: None
Environment:

lustre 2.10.6


Attachments: Text File lustre_12283.log    

 Description   

1. Other Lustre clients have the filesystem mounted and I/O access works fine.

2. The affected client can successfully lctl ping the MGS node, but cannot mount the filesystem; the client logs show these errors:

[Wed May  8 02:44:47 2019] LustreError: 166-1: MGC172.20.0.201@o2ib1: Connection to MGS (at 172.20.0.201@o2ib1) was lost; in progress operations using this service will fail
[Wed May  8 02:44:47 2019] LustreError: 1906:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1557283188, 300s ago), entering recovery for MGS@MGC172.20.0.201@o2ib1_0 ns: MGC172.20.0.201@o2ib1 lock: ffff880a08a8e600/0x4c09d5e9a043c1dd lrc: 4/1,0 mode: --/CR res: [0x736d61726170:0x3:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2c2f58d4e075de74 expref: -99 pid: 1906 timeout: 0 lvb_type: 0
[Wed May  8 02:44:47 2019] LustreError: 1906:0:(mgc_request.c:2242:mgc_process_config()) MGC172.20.0.201@o2ib1: can't process params llog: rc = -5
[Wed May  8 02:44:47 2019] Lustre: MGC172.20.0.201@o2ib1: Connection restored to MGC172.20.0.201@o2ib1_0 (at 172.20.0.201@o2ib1)
[Wed May  8 02:44:47 2019] LustreError: 15c-8: MGC172.20.0.201@o2ib1: The configuration from log 'hpcfs-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[Wed May  8 02:44:47 2019] LustreError: 1920:0:(lov_obd.c:878:lov_cleanup()) hpcfs-clilov-ffff880a258ff000: lov tgt 0 not cleaned! deathrow=0, lovrc=1
[Wed May  8 02:44:47 2019] Lustre: Unmounted hpcfs-client
[Wed May  8 02:44:47 2019] LustreError: 1906:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-5)

 

3. The Lustre client can mount the filesystem again after the MGT is remounted on the MGS server (see the command sketch below).
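
For reference, a minimal command-level sketch of the symptom and the workaround described above, assuming the fsname and NID from the log excerpt (hpcfs, 172.20.0.201@o2ib1); the MGT device and mount points are placeholders, not taken from this report:

# On the affected client: LNet reachability to the MGS is OK
lctl ping 172.20.0.201@o2ib1

# ...but mounting the filesystem fails with -5 (EIO), matching the log above
mount -t lustre 172.20.0.201@o2ib1:/hpcfs /mnt/hpcfs

# Workaround observed in this ticket: remount the MGT on the MGS node
# (/dev/mgt_dev and /mnt/mgt are placeholders for the actual device and mount point)
umount /mnt/mgt
mount -t lustre /dev/mgt_dev /mnt/mgt

# After the MGT is back, the client mount succeeds
mount -t lustre 172.20.0.201@o2ib1:/hpcfs /mnt/hpcfs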



 Comments   
Comment by Peter Jones [ 10/May/19 ]

Could you please provide logs as per the comment on LU-11972?

Comment by sebg-crd-pm (Inactive) [ 13/May/19 ]

lustre_12283.log

The attached file is the MGS server log.

It looks like something went wrong from May 7 14:09:09 until I restarted the MGT on May 8.

Comment by Thomas Roth [ 02/Mar/21 ]

We seem to hit this with 2.12.5.

Server lxmds19 has the combined MGS + MDT.  Every 10 minutes, the connection to the MGS is lost and restored:

 

Feb 28 11:24:31 lxmds19 kernel: LustreError: 166-1: MGC10.20.3.0@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Feb 28 11:24:31 lxmds19 kernel: LustreError: 2917:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1614507571, 300s ago), entering recovery for MGS@MGC10.20.3.0@o2ib5_0 ns: MGC10.20.3.0@o2ib5 lo
Feb 28 11:24:31 lxmds19 kernel: LustreError: 13227:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.20.3.0@o2ib5: namespace resource [0x65626568:0x2:0x0].0x0 (ffff96354efad8c0) refcount nonzero (1) after lock cleanup; forcing cleanup.

 

The other servers and clients see the same thing; as a result, new mounts often fail, though not always.
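
For anyone hitting the same pattern, a sketch of parameters worth checking on a node that sees these timeouts; the NID below is the one from this report, and the parameter names are standard lctl ones, so adjust to the local setup:

# State of the MGC import (connection to the MGS) on the affected node
lctl get_param mgc.*.import

# Config-lock counts in the MGC namespace; the "refcount nonzero" complaint
# above suggests a lock that is not being released cleanly
lctl get_param ldlm.namespaces.MGC*.lock_count
lctl get_param ldlm.namespaces.MGC*.resource_count

# Confirm the MGS NID is still reachable over LNet when the drop occurs
lctl ping 10.20.3.0@o2ib5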

 
