  Lustre / LU-3442

MDS performance degraded by reading of ZFS spacemaps

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0
    • Environment:
      server: lustre-2.4.0-RC2_2chaos_2.6.32_358.6.1.3chaos.ch5.1.ch5.1.x86_64
      clients: mix of PPC/Lustre 2.4 and x86_64/Lustre 2.1
    • 3
    • 8581

    Description

      We started to experience degraded performance on our MDS with a ZFS backend. Certain RPCs were taking many seconds or even minutes to service, so users saw very slow interactive responsiveness. On investigation, this turned out to be due to ZFS transaction groups taking a very long time to sync, which blocked request handlers that needed to write out an llog record. That, in turn, was caused by zio processing threads waiting in space_map_load_wait():

      [<ffffffffa038cdad>] cv_wait_common+0xed/0x100 [spl]                              
      [<ffffffffa038ce15>] __cv_wait+0x15/0x20 [spl]                                    
      [<ffffffffa0480f2f>] space_map_load_wait+0x2f/0x40 [zfs]                          
      [<ffffffffa046ab47>] metaslab_activate+0x77/0x160 [zfs]                           
      [<ffffffffa046b67e>] metaslab_alloc+0x4fe/0x950 [zfs]                             
      [<ffffffffa04c801a>] zio_dva_allocate+0xaa/0x350 [zfs]                            
      [<ffffffffa04c93e0>] zio_ready+0x3c0/0x460 [zfs]                                  
      [<ffffffffa04c93e0>] zio_ready+0x3c0/0x460 [zfs]                                  
      [<ffffffffa04c6293>] zio_execute+0xb3/0x130 [zfs]                                 
      [<ffffffffa0389277>] taskq_thread+0x1e7/0x3f0 [spl]                               
      [<ffffffff81096c76>] kthread+0x96/0xa0                                            
      [<ffffffff8100c0ca>] child_rip+0xa/0x20                                           
      [<ffffffffffffffff>] 0xffffffffffffffff          
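
      To make the chain of events above concrete, here is a deliberately simplified, self-contained model of the control flow shown in the stack trace. It is not the real ZFS code: the structures and the simulated disk read are stand-ins made up for illustration, and only the shape of metaslab_alloc() -> metaslab_activate() -> "load the space map before you can search it" is taken from the trace.

      /*
       * Toy model of the stall in the trace above (NOT real ZFS code).
       * The point: a metaslab cannot hand out space until its space map
       * has been read from disk and turned into an in-core free-segment
       * structure, so a "cold" space map turns an allocation into
       * synchronous I/O and stalls the zio pipeline (and the txg sync).
       */
      #include <stdio.h>
      #include <stdint.h>
      #include <unistd.h>

      struct space_map {            /* on-disk record of alloc/free extents */
          int      loaded;          /* is the in-core copy populated?       */
          uint64_t nsegs;           /* free segments, once loaded           */
      };

      struct metaslab {
          struct space_map sm;
          int active;
      };

      /* Stand-in for space_map_load(): synchronously read the space map
       * object and build the in-core free list.  The calling zio thread
       * makes no progress while this runs. */
      static void space_map_load(struct space_map *sm)
      {
          sleep(1);                 /* pretend: many random disk reads */
          sm->nsegs = 1000;
          sm->loaded = 1;
      }

      /* Stand-in for metaslab_activate(): allocation from a metaslab is
       * only possible once its space map is resident in memory. */
      static void metaslab_activate(struct metaslab *ms)
      {
          if (!ms->sm.loaded)
              space_map_load(&ms->sm);   /* <-- where the trace is blocked */
          ms->active = 1;
      }

      /* Stand-in for metaslab_alloc(): activate, then search the free
       * segments for a fit.  Every time the chosen metaslab's map has
       * been dropped from memory, the reload cost is paid again. */
      static uint64_t metaslab_alloc(struct metaslab *ms, uint64_t size)
      {
          metaslab_activate(ms);
          return (size <= ms->sm.nsegs) ? 0 : UINT64_MAX;   /* fake offset */
      }

      int main(void)
      {
          struct metaslab ms = { .sm = { .loaded = 0 } };

          printf("allocating (space map cold)...\n");
          metaslab_alloc(&ms, 128);      /* slow: must load the space map */
          printf("allocating (space map warm)...\n");
          metaslab_alloc(&ms, 128);      /* fast: map already resident    */
          return 0;
      }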
      

      We are able to mitigate this problem by setting the zfs module option metaslab_debug=1, which forces all spacemaps to stay resident in memory. However, this solution is a bit heavy-handed, and we'd like to gain a better understanding of why we're reading spacemaps from disk so often, and what should be done about it.
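
      For reference, besides the persistent "options zfs metaslab_debug=1" line in /etc/modprobe.d/, the tunable can usually also be flipped at run time, assuming the running ZFS build exports it writable under /sys/module/zfs/parameters/ (this varies by release). A trivial sketch of the run-time toggle:

      /* Sketch only: write "1" to the metaslab_debug module parameter,
       * assuming it is exposed writable in sysfs on this ZFS release. */
      #include <stdio.h>

      int main(void)
      {
          const char *path = "/sys/module/zfs/parameters/metaslab_debug";
          FILE *fp = fopen(path, "w");

          if (fp == NULL) {
              perror(path);          /* parameter absent or read-only */
              return 1;
          }
          fputs("1\n", fp);          /* keep all space maps resident  */
          fclose(fp);
          return 0;
      }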

      Our first thought was that pool fragmentation was the underlying cause, forcing the block allocator to search all of the spacemaps to find a suitable free interval. Our thinking was that llog cancellation promotes fragmentation by punching holes in otherwise contiguously allocated regions. But I'm not sure this theory is consistent with how llogs actually work, or with how the ZFS allocator works, for that matter.

      Another idea is that a concurrent write and unlink workload could cause this behaviour, but it's all just speculation until we better understand the workload and how ZFS manages spacemaps.

      The most appealing approach we've discussed so far is to modify ZFS to use the ARC to cache spacemap objects. I believe ZFS currently keeps only one spacemap (per vdev?) active in memory at a time, and it bypasses the ARC for these objects. Using the ARC would keep the hot spacemaps in memory while still allowing them to be evicted under memory pressure.
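
      To make the caching idea concrete, here is a minimal, self-contained sketch of the general mechanism being proposed: keep recently used space maps in a small cache and evict the least-recently-used one when a (stand-in) memory limit is reached. It is not ZFS code and not a proposed patch; the ARC's real MRU/MFU lists and memory-pressure callbacks are far more involved, and the fixed slot count below just stands in for "evict under memory pressure".

      /*
       * Toy LRU cache of "loaded space maps" (NOT the ZFS ARC).
       * Hot maps stay resident across allocations; cold ones get
       * evicted instead of pinning memory forever.
       */
      #include <stdio.h>
      #include <stdint.h>

      #define CACHE_SLOTS 4                  /* stand-in for a memory limit */

      struct sm_entry {
          uint64_t ms_id;                    /* which metaslab's space map  */
          uint64_t last_used;                /* LRU clock                   */
          int      valid;
      };

      static struct sm_entry cache[CACHE_SLOTS];
      static uint64_t lru_clock;

      /* Pretend to read a space map from disk (the expensive step). */
      static void sm_load_from_disk(uint64_t ms_id)
      {
          printf("  load space map for metaslab %llu from disk\n",
                 (unsigned long long)ms_id);
      }

      /* Return the cached space map for ms_id, loading it (and evicting
       * the least recently used entry) on a miss. */
      static struct sm_entry *sm_cache_lookup(uint64_t ms_id)
      {
          struct sm_entry *victim = &cache[0];

          for (int i = 0; i < CACHE_SLOTS; i++) {
              if (cache[i].valid && cache[i].ms_id == ms_id) {
                  cache[i].last_used = ++lru_clock;     /* hit: mark hot */
                  return &cache[i];
              }
              if (!cache[i].valid ||
                  cache[i].last_used < victim->last_used)
                  victim = &cache[i];                   /* empty or LRU slot */
          }

          if (victim->valid)
              printf("  evict space map for metaslab %llu\n",
                     (unsigned long long)victim->ms_id);

          sm_load_from_disk(ms_id);                     /* miss: pay the I/O */
          victim->ms_id = ms_id;
          victim->last_used = ++lru_clock;
          victim->valid = 1;
          return victim;
      }

      int main(void)
      {
          /* Metaslabs 1-3 are hot and stay resident; 9 and 10 churn. */
          uint64_t workload[] = { 1, 2, 3, 1, 2, 9, 1, 2, 3, 10, 1 };

          for (size_t i = 0; i < sizeof(workload) / sizeof(workload[0]); i++) {
              printf("alloc from metaslab %llu\n",
                     (unsigned long long)workload[i]);
              sm_cache_lookup(workload[i]);
          }
          return 0;
      }

      In the real proposal the ARC would supply the eviction policy and memory-pressure integration for free, so only the space map read path would need to change.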

      So, I'm not sure there's a Lustre bug here, but it's an issue to be aware of when using ZFS backends.

    People

      Assignee: Niu Yawei (Inactive)
      Reporter: Ned Bass (Inactive)