[LU-14048] Unable stop all OSTs due to obd_unlinked_exports Created: 20/Oct/20  Updated: 15/Jul/21  Resolved: 21/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: LTS12
Environment:

2.12.5_7.srcc (https://github.com/stanford-rc/lustre/commits/b2_12_5) on servers + 2.13 clients


Attachments: File oststopissue_oak-io4-s2.dk.log.gz     Text File oststopissue_oak-io4-s2.kern.log     File oststopissue_oak-io5-s2.dk.log.gz     Text File oststopissue_oak-io5-s2.kern.log     File oststopissue_oak-io6-s2.dk.log.gz     Text File oststopissue_oak-io6-s2.kern.log    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Since our Oak upgrade from Lustre 2.10.8 to 2.12.5, it has no longer always been possible to stop OSTs (for example, when doing a manual failover).

When stopping OSTs, we can see these messages on the servers:

Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011a is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 6. Is it stuck?
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff887ca4215400 f2e15550-2e62-4 10.51.1.18@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de7f82bc00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de717a7c00 f04b9922-9d5a-4 10.51.1.63@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88deb8acd000 79a83f3f-8d8d-4 10.51.1.42@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff8833ab100c00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff88de6d629400 5f8df1bb-2246-4 10.51.1.39@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
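Each UNLINKED line above is printed by print_export_data() in lustre/obdclass/genops.c. Our rough reading of the b2_12 format string (field names are taken from our reading of the source and may differ slightly by version):

/* print_export_data() format (approximate, b2_12):
 *   "%s: %s %p %s %s %d (%d %d %d) %d %d %d %d: %p %s %llu stale:%d"
 * so a line such as
 *   oak-OST011e: UNLINKED ffff887ca4215400 f2e15550-2e62-4 10.51.1.18@o2ib3 1 (0 0 0) 1 0 1 0: (null) 0 stale:0
 * would decode as:
 *   obd name, export state, struct obd_export pointer, client UUID, client NID,
 *   exp_refcount = 1,
 *   (exp_rpc_count exp_cb_count exp_locks_count) = (0 0 0),
 *   exp_disconnected = 1, exp_delayed = 0, exp_failed = 1, outstanding replies = 0,
 *   first reply = (null), exp_last_committed = 0, stale uncommitted replies = 0
 */

If that reading is correct, the stuck exports are already disconnected and failed, with no outstanding RPCs, callbacks, locks or replies, yet each still holds one reference that is never released.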

This was never an issue with 2.10. We tried to manually fail back OSTs this morning and waited about 10 minutes before rebooting the servers, because the wait was too long and started to impact production. It would be nice to have a shorter timeout on this. Restart and recovery of the OSTs went well (at worst, perhaps a single client was evicted).
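For context, the "waiting for obd_unlinked_exports" message comes from obd_exports_barrier() in lustre/obdclass/genops.c, which the target stop path uses to wait until all unlinked exports have dropped their last reference. A condensed sketch, paraphrased from our reading of the b2_12 source (details may differ by version):

void obd_exports_barrier(struct obd_device *obd)
{
        int waited = 2;

        LASSERT(list_empty(&obd->obd_exports));
        spin_lock(&obd->obd_dev_lock);
        /* Wait until every unlinked export has been freed; there is no
         * upper bound, so the stop/umount can block indefinitely. */
        while (!list_empty(&obd->obd_unlinked_exports)) {
                spin_unlock(&obd->obd_dev_lock);
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(cfs_time_seconds(waited));
                if (waited > 5 && is_power_of_2(waited)) {
                        /* Exponential backoff: warn at 8, 16, 32, ... seconds */
                        LCONSOLE_WARN("%s is waiting for obd_unlinked_exports more than %d seconds. The obd refcount = %d. Is it stuck?\n",
                                      obd->obd_name, waited,
                                      atomic_read(&obd->obd_refcount));
                        dump_exports(obd, 1, D_CONSOLE | D_WARNING);
                }
                waited *= 2;
                spin_lock(&obd->obd_dev_lock);
        }
        spin_unlock(&obd->obd_dev_lock);
}

As far as we can tell there is no timeout at all in this loop, which is why our only way out was to reboot the servers.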

Attaching Lustre debug logs from 3 different OSS, oak-io[4-6]-s2:

Attaching kernel logs from the same 3 servers:

 

This could be similar to the issue reported in LU-12319, which doesn't seem to be fixed at this time.

Thanks for the help!
Stephane



 Comments   
Comment by Peter Jones [ 21/Oct/20 ]

Stephane

Does this reproduce reliably or just occur occasionally?

Yang Sheng

Could you please investigate?

Thanks

Peter

Comment by Stephane Thiell [ 21/Oct/20 ]

Hi Peter,

Thanks! It happened on 3 different servers, but at about the same time. I cannot say at this point whether this was just a one-time thing, as the upgrade is very recent. I don't remember having seen this on Fir before (which has been running 2.12 for a while), but Oak also has even more clients.

Comment by Etienne Aujames [ 11/Jan/21 ]

Hello,

We encountered the same type of issue during the migration from Lustre 2.12.0.3_cray to Lustre 2.12.0.4_cray:

2020-12-07 08:47:53 [5791062.978004] Lustre: Failing over fs-OST0017
2020-12-07 08:48:07 [5791076.983534] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:48:07 [5791076.996600] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
2020-12-07 08:48:23 [5791093.010738] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:48:23 [5791093.024450] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
2020-12-07 08:48:55 [5791125.038060] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:48:55 [5791125.051938] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
2020-12-07 08:49:32 [5791162.000809] LustreError: 9629:0:(qsd_reint.c:56:qsd_reint_completion()) fs-OST0017: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
2020-12-07 08:49:59 [5791189.064757] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:49:59 [5791189.078677] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
2020-12-07 08:51:32 [5791282.006030] LustreError: 13737:0:(qsd_reint.c:56:qsd_reint_completion()) fs-OST0017: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
2020-12-07 08:52:07 [5791317.088242] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 4. Is it stuck?
2020-12-07 08:52:07 [5791317.102361] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:   

The logs above occurred during a filesystem umount with Lustre 2.12.0.3_cray on an OSS.
11 OSSes froze with the same type of logs.

Comment by Stephane Thiell [ 17/Feb/21 ]

We hit the same problem again when stopping Lustre 2.12.5 OSTs:

00000020:02000400:6.0:1613582408.575023:0:230973:0:(genops.c:1837:obd_exports_barrier()) oak-OST000d is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
00000020:02000400:6.0:1613582408.587840:0:230973:0:(genops.c:1802:print_export_data()) oak-OST000d: UNLINKED ffff9bc55a67c000 aa46df42-0a4c-4 10.50.16.5@o2ib2 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
# cd /proc/fs/lustre/obdfilter/oak-OST000d/exports/10.50.16.5@o2ib2
[root@oak-io1-s1 10.50.16.5@o2ib2]# for f in *; do echo ===$f===:; cat $f; done
===export===:
===fmd_count===:
===hash===:
===ldlm_stats===:
snapshot_time             1613582626.251044421 secs.nsecs
ldlm_enqueue              42 samples [reqs]
ldlm_cancel               22 samples [reqs]
===nodemap===:
===reply_data===:
===stats===:
snapshot_time             1613582626.253618983 secs.nsecs
read_bytes                41 samples [bytes] 4096 4096 167936
statfs                    1 samples [reqs]
set_info                  5132 samples [reqs]
===uuid===:
Comment by Gerrit Updater [ 18/Feb/21 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41687
Subject: LU-14048 obd: fix race between connect vs disconnect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 647b9b3444244c8a7f8b1df077b92ccf770483e5

Comment by Gerrit Updater [ 21/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41687/
Subject: LU-14048 obd: fix race between connect vs disconnect
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 08f1d2961361f0f6c253b6fbd429ca7b61a3def2

Comment by Peter Jones [ 21/Apr/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 05/May/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43542
Subject: LU-14048 obd: fix race between connect vs disconnect
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7f54a720c64409c1d9d710cfe885b2f0a07970a0
