[LU-14048] Unable to stop all OSTs due to obd_unlinked_exports Created: 20/Oct/20 Updated: 15/Jul/21 Resolved: 21/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12 |
| Environment: | 2.12.5_7.srcc (https://github.com/stanford-rc/lustre/commits/b2_12_5) on servers + 2.13 clients |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Since our Oak upgrade from Lustre 2.10.8 to 2.12.5, it is no longer always possible to stop OSTs (when doing a manual failover, for example). When stopping OSTs, we see these messages on the servers:

Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011a is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 6. Is it stuck?
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff887ca4215400 f2e15550-2e62-4 10.51.1.18@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de7f82bc00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de717a7c00 f04b9922-9d5a-4 10.51.1.63@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88deb8acd000 79a83f3f-8d8d-4 10.51.1.42@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff8833ab100c00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff88de6d629400 5f8df1bb-2246-4 10.51.1.39@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0

This was never an issue with 2.10. We tried to manually fail back OSTs this morning and waited about 10 minutes before rebooting the servers, because it was taking too long and starting to impact production. It would be nice to have a short timeout on this. Restart and recovery of the OSTs went well (in the worst case, perhaps only one client was evicted).

Attaching some Lustre debug logs from 3 different OSS oak-io[4-6]-s2:

Attaching kernel logs from the same 3 servers:

This could be similar to the issue reported in LU-12319, which doesn't seem to be fixed at this time. Thanks for the help! |
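A short timeout could also be approximated from the operator side until the barrier itself gets one. Something along these lines, for example, where the mount point and the 10-minute limit are purely illustrative and not taken from our configuration:

#!/bin/bash
# Rough sketch, assuming the OST is mounted at a known path: unmount it in
# the background and cap how long we wait, since the obd_unlinked_exports
# barrier keeps doubling its own wait (8s, 16s, ... 512s) with no upper bound.
OST_MNT=/mnt/oak-OST011a    # hypothetical mount point, adjust per site
LIMIT=600                   # stop waiting after 10 minutes
umount "$OST_MNT" &
pid=$!
elapsed=0
while kill -0 "$pid" 2>/dev/null; do
    if [ "$elapsed" -ge "$LIMIT" ]; then
        echo "umount of $OST_MNT still blocked after ${LIMIT}s"
        dmesg | grep 'waiting for obd_unlinked_exports' | tail -n 5
        break
    fi
    sleep 10
    elapsed=$((elapsed + 10))
done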
| Comments |
| Comment by Peter Jones [ 21/Oct/20 ] |
|
Stephane, does this reproduce reliably or just occur occasionally?

Yang Sheng, could you please investigate?

Thanks

Peter |
| Comment by Stephane Thiell [ 21/Oct/20 ] |
|
Hi Peter, Thanks! It did happen on 3 different servers, but at about the same time. I cannot say at this point whether this was just a one-time thing, as the upgrade is very recent. I don't remember having seen this on Fir before (which has been using 2.12 for a while), but Oak also has even more clients. |
| Comment by Etienne Aujames [ 11/Jan/21 ] |
|
Hello,

We encountered the same type of issue during the migration from Lustre 2.12.0.3_cray to Lustre 2.12.0.4_cray:

2020-12-07 08:47:53 [5791062.978004] Lustre: Failing over fs-OST0017
2020-12-07 08:48:07 [5791076.983534] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:48:07 [5791076.996600] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
2020-12-07 08:48:23 [5791093.010738] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:48:23 [5791093.024450] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
2020-12-07 08:48:55 [5791125.038060] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:48:55 [5791125.051938] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
2020-12-07 08:49:32 [5791162.000809] LustreError: 9629:0:(qsd_reint.c:56:qsd_reint_completion()) fs-OST0017: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
2020-12-07 08:49:59 [5791189.064757] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:49:59 [5791189.078677] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
2020-12-07 08:51:32 [5791282.006030] LustreError: 13737:0:(qsd_reint.c:56:qsd_reint_completion()) fs-OST0017: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
2020-12-07 08:52:07 [5791317.088242] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:52:07 [5791317.102361] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:

The logs above occurred during a filesystem umount with Lustre 2.12.0.3_cray on an OSS. |
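The wait in these messages doubles each time (8s, 16s, 32s, ...), so one quick way to see how long a target has already been stuck is to pull just those lines out of the console log, e.g. (the log path and target name are only examples):

grep 'is waiting for obd_unlinked_exports' /var/log/messages | grep fs-OST0017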
| Comment by Stephane Thiell [ 17/Feb/21 ] |
|
Again, hit the same problem with Lustre 2.12.5 OSTs when stopping:

00000020:02000400:6.0:1613582408.575023:0:230973:0:(genops.c:1837:obd_exports_barrier()) oak-OST000d is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
00000020:02000400:6.0:1613582408.587840:0:230973:0:(genops.c:1802:print_export_data()) oak-OST000d: UNLINKED ffff9bc55a67c000 aa46df42-0a4c-4 10.50.16.5@o2ib2 1 (0 0 0) 1 0 1 0: (null) 0 stale:0

# cd /proc/fs/lustre/obdfilter/oak-OST000d/exports/10.50.16.5@o2ib2
[root@oak-io1-s1 10.50.16.5@o2ib2]# for f in *; do echo ===$f===:; cat $f; done
===export===:
===fmd_count===:
===hash===:
===ldlm_stats===:
snapshot_time 1613582626.251044421 secs.nsecs
ldlm_enqueue 42 samples [reqs]
ldlm_cancel 22 samples [reqs]
===nodemap===:
===reply_data===:
===stats===:
snapshot_time 1613582626.253618983 secs.nsecs
read_bytes 41 samples [bytes] 4096 4096 167936
statfs 1 samples [reqs]
set_info 5132 samples [reqs]
===uuid===: |
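For what it's worth, the same inspection can be repeated for every lingering export on the OSS at once rather than one NID at a time. A minimal variation of the loop above, assuming only the exports/ proc layout shown in this comment (everything else is illustrative):

for exp in /proc/fs/lustre/obdfilter/*/exports/*@*; do
    echo "=== $exp ==="
    for f in "$exp"/uuid "$exp"/ldlm_stats "$exp"/stats; do
        [ -r "$f" ] && { echo "--- ${f##*/} ---"; cat "$f"; }
    done
done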
| Comment by Gerrit Updater [ 18/Feb/21 ] |
|
Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41687 |
| Comment by Gerrit Updater [ 21/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41687/ |
| Comment by Peter Jones [ 21/Apr/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 05/May/21 ] |
|
Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43542 |