Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: Lustre 2.12.0
Fix Version/s: None
Environment: CentOS 7.6 - 3.10.0-957.1.3.el7_lustre.x86_64
Severity: 3
Description
We just got some kind of deadlock on fir-md1-s2, which serves MDT0001 and MDT0003.
I took a crash dump because the MDS was not usable and the filesystem was hanging.
Attaching the output of foreach bt as bt.all, along with vmcore-dmesg.txt.
Also, this is from the crash:
crash> foreach bt >bt.all
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65891316     251.4 GB         ----
         FREE  13879541      52.9 GB   21% of TOTAL MEM
         USED  52011775     198.4 GB   78% of TOTAL MEM
       SHARED  32692923     124.7 GB   49% of TOTAL MEM
      BUFFERS  32164243     122.7 GB   48% of TOTAL MEM
       CACHED    781150         3 GB    1% of TOTAL MEM
         SLAB  13801776      52.6 GB   20% of TOTAL MEM

   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE

   TOTAL SWAP   1048575         4 GB         ----
    SWAP USED         0            0    0% of TOTAL SWAP
    SWAP FREE   1048575         4 GB  100% of TOTAL SWAP

 COMMIT LIMIT  33994233     129.7 GB         ----
    COMMITTED    228689     893.3 MB    0% of TOTAL LIMIT

crash> ps | grep ">"
>      0      0   0  ffffffffb4218480  RU   0.0       0      0  [swapper/0]
>      0      0   1  ffff8ec129410000  RU   0.0       0      0  [swapper/1]
>      0      0   2  ffff8ed129f10000  RU   0.0       0      0  [swapper/2]
>      0      0   3  ffff8ee129ebe180  RU   0.0       0      0  [swapper/3]
>      0      0   4  ffff8eb1a9ba6180  RU   0.0       0      0  [swapper/4]
>      0      0   5  ffff8ec129416180  RU   0.0       0      0  [swapper/5]
>      0      0   6  ffff8ed129f16180  RU   0.0       0      0  [swapper/6]
>      0      0   7  ffff8ee129eba080  RU   0.0       0      0  [swapper/7]
>      0      0   8  ffff8eb1a9ba30c0  RU   0.0       0      0  [swapper/8]
>      0      0   9  ffff8ec129411040  RU   0.0       0      0  [swapper/9]
>      0      0  10  ffff8ed129f11040  RU   0.0       0      0  [swapper/10]
>      0      0  11  ffff8ee129ebd140  RU   0.0       0      0  [swapper/11]
>      0      0  12  ffff8eb1a9ba5140  RU   0.0       0      0  [swapper/12]
>      0      0  13  ffff8ec129415140  RU   0.0       0      0  [swapper/13]
>      0      0  14  ffff8ed129f15140  RU   0.0       0      0  [swapper/14]
>      0      0  15  ffff8ee129ebb0c0  RU   0.0       0      0  [swapper/15]
>      0      0  16  ffff8eb1a9ba4100  RU   0.0       0      0  [swapper/16]
>      0      0  17  ffff8ec129412080  RU   0.0       0      0  [swapper/17]
>      0      0  19  ffff8ee129ebc100  RU   0.0       0      0  [swapper/19]
>      0      0  20  ffff8eb1a9408000  RU   0.0       0      0  [swapper/20]
>      0      0  21  ffff8ec129414100  RU   0.0       0      0  [swapper/21]
>      0      0  22  ffff8ed129f14100  RU   0.0       0      0  [swapper/22]
>      0      0  23  ffff8ee129f38000  RU   0.0       0      0  [swapper/23]
>      0      0  24  ffff8eb1a940e180  RU   0.0       0      0  [swapper/24]
>      0      0  25  ffff8ec1294130c0  RU   0.0       0      0  [swapper/25]
>      0      0  26  ffff8ed129f130c0  RU   0.0       0      0  [swapper/26]
>      0      0  27  ffff8ee129f3e180  RU   0.0       0      0  [swapper/27]
>      0      0  28  ffff8eb1a9409040  RU   0.0       0      0  [swapper/28]
>      0      0  29  ffff8ec129430000  RU   0.0       0      0  [swapper/29]
>      0      0  30  ffff8ed129f50000  RU   0.0       0      0  [swapper/30]
>      0      0  31  ffff8ee129f39040  RU   0.0       0      0  [swapper/31]
>      0      0  32  ffff8eb1a940d140  RU   0.0       0      0  [swapper/32]
>      0      0  33  ffff8ec129436180  RU   0.0       0      0  [swapper/33]
>      0      0  34  ffff8ed129f56180  RU   0.0       0      0  [swapper/34]
>      0      0  35  ffff8ee129f3d140  RU   0.0       0      0  [swapper/35]
>      0      0  36  ffff8eb1a940a080  RU   0.0       0      0  [swapper/36]
>      0      0  37  ffff8ec129431040  RU   0.0       0      0  [swapper/37]
>      0      0  38  ffff8ed129f51040  RU   0.0       0      0  [swapper/38]
>      0      0  39  ffff8ee129f3a080  RU   0.0       0      0  [swapper/39]
>      0      0  40  ffff8eb1a940c100  RU   0.0       0      0  [swapper/40]
>      0      0  41  ffff8ec129435140  RU   0.0       0      0  [swapper/41]
>      0      0  42  ffff8ed129f55140  RU   0.0       0      0  [swapper/42]
>      0      0  43  ffff8ee129f3c100  RU   0.0       0      0  [swapper/43]
>      0      0  44  ffff8eb1a940b0c0  RU   0.0       0      0  [swapper/44]
>      0      0  45  ffff8ec129432080  RU   0.0       0      0  [swapper/45]
>      0      0  46  ffff8ed129f52080  RU   0.0       0      0  [swapper/46]
>      0      0  47  ffff8ee129f3b0c0  RU   0.0       0      0  [swapper/47]
> 109549 109543  18  ffff8ee05406c100  RU   0.0  115440   2132  bash
I noticed a lot of threads blocked on quota commands.
So the issue is that quota acquisition is supposed to allocate on-disk space for the given ID; the existing infrastructure can specify that space is already allocated, in which case there is no need to start a transaction.
I think it's safe to just remove the quota adjustment from the osd_object_delete() path, but that wouldn't be optimal in some cases, as we want to keep the local vs. global quota grant balanced.
It doesn't need to be done in the context of a process releasing memory; rather, it puts extra pressure on the VM subsystem, since a quota rebalance may need additional memory (locks, RPCs, etc.).
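To make that concrete, here is a minimal sketch of the delete side doing nothing but recording the affected IDs; every name in it is hypothetical (none of these helpers exist in Lustre today), and the recording/draining side is sketched after the comments below.

#include <linux/types.h>

/*
 * Hypothetical sketch only, not existing Lustre code: the delete path,
 * which can run while the lu_object cache is being shrunk under memory
 * pressure, no longer starts a transaction or contacts the quota master;
 * it just records which IDs were affected and returns.
 */
void osd_quota_defer_adjust(u32 uid, u32 gid);	/* sketched further below */

/* would be called from osd_object_delete() in place of the inline adjustment */
static void osd_object_delete_quota(u32 uid, u32 gid)
{
	osd_quota_defer_adjust(uid, gid);
}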
I'd say it would be nice to schedule the rebalance and run it periodically in a separate context.
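One simple way to get that separate periodic context would be an ordinary kernel delayed work item, roughly as below. quota_defer_drain() is the hypothetical routine that actually rebalances the recorded IDs (see the data-structure sketch further down), and the interval is an arbitrary placeholder.

#include <linux/init.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

void quota_defer_drain(void);			/* hypothetical, sketched below */

#define QUOTA_REBALANCE_INTERVAL	(5 * HZ)	/* placeholder value */

static void quota_rebalance_workfn(struct work_struct *work);
static DECLARE_DELAYED_WORK(quota_rebalance_work, quota_rebalance_workfn);

static void quota_rebalance_workfn(struct work_struct *work)
{
	/* runs in process context, outside memory reclaim, so taking locks,
	 * allocating memory and sending RPCs is acceptable here */
	quota_defer_drain();

	/* re-arm for the next periodic pass */
	schedule_delayed_work(&quota_rebalance_work, QUOTA_REBALANCE_INTERVAL);
}

static int __init quota_rebalance_start(void)
{
	schedule_delayed_work(&quota_rebalance_work, QUOTA_REBALANCE_INTERVAL);
	return 0;
}

static void __exit quota_rebalance_stop(void)
{
	cancel_delayed_work_sync(&quota_rebalance_work);
}

A dedicated kthread would work as well; the point is only that the rebalance runs off the reclaim path.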
Not sure what exact structure to use. The simplest one would be some sort of ID list, probably batched on a per-CPU basis to save a bit of memory on list_heads, though if millions of objects sharing a few IDs are being removed, we'd waste quite an amount of memory.
A smarter structure, like a radix-tree-based bitmap, would do better, at the cost of more development effort.
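For the simple variant, the per-CPU ID list could look roughly like this; structure and function names are made up for illustration. The important properties are that the enqueue side uses GFP_NOWAIT so it can never block or recurse into reclaim, and that dropping an entry on allocation failure only delays a rebalance rather than breaking correctness.

#include <linux/init.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct quota_defer_entry {
	struct list_head	qde_link;
	u32			qde_uid;
	u32			qde_gid;
};

struct quota_defer_pcpu {
	spinlock_t		qdp_lock;
	struct list_head	qdp_list;
};

static DEFINE_PER_CPU(struct quota_defer_pcpu, quota_defer_lists);

static int __init quota_defer_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct quota_defer_pcpu *p = per_cpu_ptr(&quota_defer_lists, cpu);

		spin_lock_init(&p->qdp_lock);
		INIT_LIST_HEAD(&p->qdp_list);
	}
	return 0;
}

/* Called from the object-delete path: must never sleep or recurse into
 * reclaim, hence GFP_NOWAIT; on failure we simply skip the record. */
void osd_quota_defer_adjust(u32 uid, u32 gid)
{
	struct quota_defer_entry *e;
	struct quota_defer_pcpu *p;

	e = kmalloc(sizeof(*e), GFP_NOWAIT);
	if (e == NULL)
		return;		/* the rebalance will just happen later */
	e->qde_uid = uid;
	e->qde_gid = gid;

	p = get_cpu_ptr(&quota_defer_lists);
	spin_lock(&p->qdp_lock);
	list_add_tail(&e->qde_link, &p->qdp_list);
	spin_unlock(&p->qdp_lock);
	put_cpu_ptr(&quota_defer_lists);
}

/* Called from the periodic worker: splice each per-CPU list out and
 * hand the IDs to the real quota rebalance (not shown). */
void quota_defer_drain(void)
{
	struct quota_defer_entry *e, *tmp;
	LIST_HEAD(batch);
	int cpu;

	for_each_possible_cpu(cpu) {
		struct quota_defer_pcpu *p = per_cpu_ptr(&quota_defer_lists, cpu);

		spin_lock(&p->qdp_lock);
		list_splice_init(&p->qdp_list, &batch);
		spin_unlock(&p->qdp_lock);
	}

	list_for_each_entry_safe(e, tmp, &batch, qde_link) {
		/* rebalance quota for e->qde_uid / e->qde_gid here */
		list_del(&e->qde_link);
		kfree(e);
	}
}

A radix-tree- or bitmap-based variant would additionally collapse duplicate IDs, so deleting millions of objects owned by a handful of users would not grow the pending set.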