[LU-9748] DNE recovery hangs, blocks Lustre recovery Created: 07/Jul/17  Updated: 09/Aug/17  Resolved: 09/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak test cluster, lustre-master build 3606 version=2.9.59_32_g62bc3af


Attachments: File soak-10.2.log.gz     File soak-10.lustre.log.3.txt.gz     File soak-10.lustre.log.4.txt.gz     File soak-10.lustre.log.5.txt.gz     File soak-10.lustre.log.6.txt.gz     File soak-10.lustre.log.7.txt.gz     File soak-10.lustre.log.txt.gz     File soak-10.postMGSreboot.log.gz     File soak-10.stacks.and.console.txt.gz     File soak-11.2.log.gz     File soak-11.lustre.log.3.txt.gz     File soak-11.lustre.log.4.txt.gz     File soak-11.lustre.log.5.txt.gz     File soak-11.lustre.log.6.txt.gz     File soak-11.lustre.log.7.txt.gz     File soak-11.lustre.log.txt.gz     File soak-11.postreboot.log.gz     File soak-11.stacks.and.console.txt.gz     File soak-8.2.log.gz     File soak-8.lustre.log.3.txt.gz     File soak-8.lustre.log.4.txt.gz     File soak-8.lustre.log.5.txt.gz     File soak-8.lustre.log.6.txt.gz     File soak-8.lustre.log.7.txt.gz     File soak-8.lustre.log.txt.gz     File soak-8.postreboot.log.gz     File soak-8.stacks.and.console.txt.gz     File soak-9.2.log.gz     File soak-9.lustre.log.3.txt.gz     File soak-9.lustre.log.4.txt.gz     File soak-9.lustre.log.5.txt.gz     File soak-9.lustre.log.6.txt.gz     File soak-9.lustre.log.7.txt.gz     File soak-9.lustre.log.txt.gz     File soak-9.postreboot.log.gz     File soak-9.stacks.and.console.txt.gz    
Issue Links:
Related
is related to LU-9274 LBUG: (recover.c:157:ptlrpc_replay_ne... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Sequence:

  • MDS failover occurs.
  • The failover nodes complete startup.
  • Recovery blocks across all MDSs:
    Jul  7 15:34:17 soak-9 kernel: LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
    Jul  7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): recovery complete
    Jul  7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
    Jul  7 15:35:06 soak-9 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
    Jul  7 15:35:06 soak-9 kernel: Lustre: soaked-MDT0001: Not available for connect from 192.168.1.132@o2ib (not set up)
    Jul  7 15:35:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
    Jul  7 15:35:07 soak-9 kernel: Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
    Jul  7 15:35:09 soak-9 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 37 clients reconnect
    

    The failover node stays in a WAITING state:

    soak-10
    ----------------
    mdt.soaked-MDT0002.recovery_status=
    status: WAITING
    non-ready MDTs:  0003
    recovery_start: 1499451258
    time_waited: 2147
    
    Jul  7 18:29:12 soak-10 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0002: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
    Jul  7 18:29:12 soak-10 kernel: LustreError: Skipped 11 previous similar messages
    Jul  7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
    Jul  7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 9 previous similar messages
    Jul  7 18:29:29 soak-10 kernel: Lustre: soaked-MDT0002: Recovery already passed deadline 0:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
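
    If waiting does not help, recovery can be aborted manually on the stuck target, as the console message suggests. A minimal sketch, assuming it is run as root on the MDS currently serving MDT0002 (recovery was not actually aborted at this point in the run):

        lctl --device soaked-MDT0002 abort_recovery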
    
    

Lustre debug logs were dumped on the MDSs multiple times during this period, along with stack traces; all are attached.
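
For reference, these are the sort of commands typically used to capture this state on each MDS; the exact invocations used in this run are an assumption:

    # per-target recovery state
    lctl get_param mdt.*.recovery_status
    # dump the Lustre kernel debug log to a file for attachment
    lctl dk /tmp/lustre.log.txt
    # dump kernel stacks of all tasks to the console log
    echo t > /proc/sysrq-trigger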



 Comments   
Comment by Cliff White (Inactive) [ 07/Jul/17 ]

Further details in LU-9274

Comment by Cliff White (Inactive) [ 07/Jul/17 ]

Also, during this recovery period clients are blocked and the entire system is wedged waiting for recovery to complete.

Comment by Cliff White (Inactive) [ 07/Jul/17 ]

After rebooting soak-8 (the MGS) and soak-11 (mentioned in the soak-10 errors), soak-10 completed recovery.
However, soak-8 and soak-11 are now wedged waiting.

----------------
soak-8
----------------
mdt.soaked-MDT0000.recovery_status=
status: WAITING
non-ready MDTs:  0000 0001 0002 0003
recovery_start: 1499455762
time_waited: 832
----------------
soak-9
----------------
mdt.soaked-MDT0001.recovery_status=
status: COMPLETE
recovery_start: 1499446112
recovery_duration: 47
completed_clients: 7/37
replayed_requests: 0
last_transno: 1245546027187
VBR: DISABLED
IR: ENABLED
----------------
soak-10
----------------
mdt.soaked-MDT0002.recovery_status=
status: COMPLETE
recovery_start: 1499451258
recovery_duration: 4604
completed_clients: 35/37
replayed_requests: 0
last_transno: 1189768212435
VBR: ENABLED
IR: ENABLED
----------------
soak-11
----------------
mdt.soaked-MDT0003.recovery_status=
status: WAITING
non-ready MDTs:  0001
recovery_start: 1499455317
time_waited: 1277
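
For convenience, the same per-MDT status can be collected in one pass from all MDS nodes; a sketch assuming pdsh is available, using the node names of this cluster:

  pdsh -w soak-[8-11] "lctl get_param mdt.*.recovery_status" | dshbak -c
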
Comment by Cliff White (Inactive) [ 07/Jul/17 ]

A further set of Lustre/console logs is attached. I will leave the system in this state in case anyone wishes to examine it.

Comment by Cliff White (Inactive) [ 07/Jul/17 ]

With the system completely idle, recovery finally completes:

soak-8
----------------
mdt.soaked-MDT0000.recovery_status=
status: COMPLETE
recovery_start: 1499455762
recovery_duration: 1500
completed_clients: 36/37
replayed_requests: 0
last_transno: 1297102100712
VBR: DISABLED
IR: DISABLED
----------------
soak-9
----------------
mdt.soaked-MDT0001.recovery_status=
status: COMPLETE
recovery_start: 1499446112
recovery_duration: 47
completed_clients: 7/37
replayed_requests: 0
last_transno: 1245546027187
VBR: DISABLED
IR: ENABLED
----------------
soak-10
----------------
mdt.soaked-MDT0002.recovery_status=
status: COMPLETE
recovery_start: 1499451258
recovery_duration: 4604
completed_clients: 35/37
replayed_requests: 0
last_transno: 1189768212435
VBR: ENABLED
IR: ENABLED
----------------
soak-11
----------------
mdt.soaked-MDT0003.recovery_status=
status: COMPLETE
recovery_start: 1499455317
recovery_duration: 1678
completed_clients: 37/37
replayed_requests: 0
last_transno: 1262744109605
VBR: DISABLED
IR: ENABLED
Comment by Lai Siyao [ 11/Jul/17 ]

'grep lod_sub_recovery_ *' shows:

soak-10.lustre.log.txt:00000004:00080000:16.0:1499451252.837946:0:3678:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0002-osd retrieve update log: rc = 0
soak-10.2.log:00000004:00080000:19.0:1499351735.916949:0:3600:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 retrieve update log: rc = 0
soak-10.2.log:00000004:00000010:19.0:1499351735.916950:0:3600:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0ba0.
soak-10.2.log:00000004:00000001:19.0:1499351735.916982:0:3600:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0)
soak-10.2.log:00000004:00080000:16.0:1499351736.870806:0:3599:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 retrieve update log: rc = 0
soak-10.2.log:00000004:00000010:16.0:1499351736.870813:0:3599:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0b80.
soak-10.2.log:00000004:00000001:16.0:1499351736.870858:0:3599:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0)
soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692338:0:3681:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 retrieve update log: rc = 0
soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692340:0:3681:0:(lod_dev.c:456:lod_sub_recovery_thread()) soaked-MDT0002 got update logs from all MDTs.
soak-10.lustre.log.4.txt:00000004:00000010:3.0:1499455861.692358:0:3681:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff88082c4bccc0.
soak-10.lustre.log.4.txt:00000004:00000001:3.0:1499455861.692412:0:3681:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0)

MDT0002 retrieved the remote update logs from MDT0000 and MDT0001 at 1499351735, its local update log at 1499451252, and the log from MDT0003 only at 1499455861. This means it took roughly 28 hours to fetch the update logs from the other MDTs, which is really unacceptable. Because this took so long, the logs from MDT0000, MDT0001 and MDT0003 no longer contain any useful information.
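
For clarity, the gap works out as follows from the debug-log timestamps above (plain shell arithmetic as a rough check):

  echo $(( 1499455861 - 1499351735 ))   # 104126 seconds between the first and last retrieval
  echo $(( 104126 / 3600 ))             # about 28 hours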

Comment by Gerrit Updater [ 12/Jul/17 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28000
Subject: LU-9748 lod: safely access update log stat
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ceaae52b221d0986599796420add36305f01eff3
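
For anyone wanting to test the change before it lands, a Gerrit patch set is normally fetched by its change ref; a sketch for this change, assuming the standard refs/changes/<NN>/<change>/<patchset> layout and anonymous HTTP access to the review server:

  git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/00/28000/1
  git checkout FETCH_HEAD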

Comment by Gerrit Updater [ 09/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28000/
Subject: LU-9748 lod: safely access update log stat
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a446cbb8b1e2fa73c30938d043f79f644c13efe7

Comment by Minh Diep [ 09/Aug/17 ]

Landed in 2.11
