[LU-9748] DNE recovery hangs, blocks Lustre recovery Created: 07/Jul/17 Updated: 09/Aug/17 Resolved: 09/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak test cluster, lustre-master build 3606 version=2.9.59_32_g62bc3af |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Sequence:
dumped lustre-logs on the MDS multiple times during this, dumped stacks, attached |
| Comments |
| Comment by Cliff White (Inactive) [ 07/Jul/17 ] |
|
Further details in |
| Comment by Cliff White (Inactive) [ 07/Jul/17 ] |
|
Also during this recovery time, clients are blocked, entire system is wedged waiting for recovery to complete. |
| Comment by Cliff White (Inactive) [ 07/Jul/17 ] |
|
After rebooting soak-8 (MGS) and soak-11 (mentioned in soak-10 errors) soak-10 completed recovery. ---------------- soak-8 ---------------- mdt.soaked-MDT0000.recovery_status= status: WAITING non-ready MDTs: 0000 0001 0002 0003 recovery_start: 1499455762 time_waited: 832 ---------------- soak-9 ---------------- mdt.soaked-MDT0001.recovery_status= status: COMPLETE recovery_start: 1499446112 recovery_duration: 47 completed_clients: 7/37 replayed_requests: 0 last_transno: 1245546027187 VBR: DISABLED IR: ENABLED ---------------- soak-10 ---------------- mdt.soaked-MDT0002.recovery_status= status: COMPLETE recovery_start: 1499451258 recovery_duration: 4604 completed_clients: 35/37 replayed_requests: 0 last_transno: 1189768212435 VBR: ENABLED IR: ENABLED ---------------- soak-11 ---------------- mdt.soaked-MDT0003.recovery_status= status: WAITING non-ready MDTs: 0001 recovery_start: 1499455317 time_waited: 1277 |
| Comment by Cliff White (Inactive) [ 07/Jul/17 ] |
|
Further set of lustre/console logs attached, I will leave the system in this state if anyone wishes to examine it. |
| Comment by Cliff White (Inactive) [ 07/Jul/17 ] |
|
With system completely idle, recovery finally completes: soak-8 ---------------- mdt.soaked-MDT0000.recovery_status= status: COMPLETE recovery_start: 1499455762 recovery_duration: 1500 completed_clients: 36/37 replayed_requests: 0 last_transno: 1297102100712 VBR: DISABLED IR: DISABLED ---------------- soak-9 ---------------- mdt.soaked-MDT0001.recovery_status= status: COMPLETE recovery_start: 1499446112 recovery_duration: 47 completed_clients: 7/37 replayed_requests: 0 last_transno: 1245546027187 VBR: DISABLED IR: ENABLED ---------------- soak-10 ---------------- mdt.soaked-MDT0002.recovery_status= status: COMPLETE recovery_start: 1499451258 recovery_duration: 4604 completed_clients: 35/37 replayed_requests: 0 last_transno: 1189768212435 VBR: ENABLED IR: ENABLED ---------------- soak-11 ---------------- mdt.soaked-MDT0003.recovery_status= status: COMPLETE recovery_start: 1499455317 recovery_duration: 1678 completed_clients: 37/37 replayed_requests: 0 last_transno: 1262744109605 VBR: DISABLED IR: ENABLED |
| Comment by Lai Siyao [ 11/Jul/17 ] |
|
'grep lod_sub_recovery_ *' shows: soak-10.lustre.log.txt:00000004:00080000:16.0:1499451252.837946:0:3678:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0002-osd retrieve update log: rc = 0 soak-10.2.log:00000004:00080000:19.0:1499351735.916949:0:3600:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 retrieve update log: rc = 0 soak-10.2.log:00000004:00000010:19.0:1499351735.916950:0:3600:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0ba0. soak-10.2.log:00000004:00000001:19.0:1499351735.916982:0:3600:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0) soak-10.2.log:00000004:00080000:16.0:1499351736.870806:0:3599:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 retrieve update log: rc = 0 soak-10.2.log:00000004:00000010:16.0:1499351736.870813:0:3599:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0b80. soak-10.2.log:00000004:00000001:16.0:1499351736.870858:0:3599:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0) soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692338:0:3681:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 retrieve update log: rc = 0 soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692340:0:3681:0:(lod_dev.c:456:lod_sub_recovery_thread()) soaked-MDT0002 got update logs from all MDTs. soak-10.lustre.log.4.txt:00000004:00000010:3.0:1499455861.692358:0:3681:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff88082c4bccc0. soak-10.lustre.log.4.txt:00000004:00000001:3.0:1499455861.692412:0:3681:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0) MDT0002 retrieved remote update log from MDT0000 and MDT0001 at 1499351735, local update log at 1499451252, and from MDT0003 at 1499455861. It means it took 28hours to fetch update logs from other MDTs, this is really unacceptable. Since this last too long, logs of MDT0, MDT1 and MDT3 doesn't contain any useful information. |
| Comment by Gerrit Updater [ 12/Jul/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28000 |
| Comment by Gerrit Updater [ 09/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28000/ |
| Comment by Minh Diep [ 09/Aug/17 ] |
|
Landed in 2.11 |