Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.7.0
-
single-node testing on master (5c4f68be57 + http://review.whamcloud.com/11258)
kernel: 2.6.32-358.23.2.el6_lustre.gc9be53c.x86_64
combined MDS+MGS+OSS, 2x MDT, 3x OST on LVM
-
3
-
16062
Description
While running sanity-benchmark.sh dbench, I hit the following memory allocation deadlock under mdc_read_page_remote():
dbench D 0000000000000001 0 14532 1 0x00000004
Call Trace:
 resched_task+0x68/0x80
 __mutex_lock_slowpath+0x13e/0x180
 mutex_lock+0x2b/0x50
 lu_cache_shrink+0x203/0x310 [obdclass]
 shrink_slab+0x11a/0x1a0
 do_try_to_free_pages+0x3f7/0x610
 try_to_free_pages+0x92/0x120
 __alloc_pages_nodemask+0x478/0x8d0
 alloc_pages_current+0xaa/0x110
 __page_cache_alloc+0x87/0x90
 mdc_read_page_remote+0x13c/0xd90 [mdc]
 do_read_cache_page+0x7b/0x180
 read_cache_page_async+0x19/0x20
 read_cache_page+0xe/0x20
 mdc_read_page+0x192/0x950 [mdc]
 lmv_read_page+0x1e0/0x1210 [lmv]
 ll_get_dir_page+0xbc/0x370 [lustre]
 ll_dir_read+0x9e/0x300 [lustre]
 ll_readdir+0x12a/0x4d0 [lustre]
 vfs_readdir+0xc0/0xe0
 sys_getdents+0x89/0xf0
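The page allocation in mdc_read_page_remote() enters direct reclaim and recurses back into Lustre through the slab shrinker (lu_cache_shrink() in the trace above), which then blocks in mutex_lock() on a mutex that is already held. Presumably the allocation needs to use GFP_NOFS so that reclaim does not call back into filesystem code? I didn't actually check which locks were held, since the machine hung as I was trying to get more info.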
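To illustrate the suspected pattern, here is a minimal userspace sketch in C. It is not Lustre or kernel code: every name in it (SIM_GFP_FS, SIM_GFP_NOFS, sim_fs_shrinker(), sim_alloc_page(), sim_read_page()) is hypothetical, and it assumes, without confirmation from the trace, that the read path already holds the same lock the shrinker wants. The simulated allocator only calls the filesystem shrinker when the caller's flags permit it, which is the role GFP_NOFS is expected to play here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for allocation flags; NOT the real gfp_t values. */
#define SIM_GFP_FS   0x1u  /* reclaim may call back into filesystem code */
#define SIM_GFP_NOFS 0x0u  /* reclaim must skip filesystem shrinkers */

enum sim_result { SIM_OK, SIM_WOULD_DEADLOCK };

/* Models a mutex that the filesystem shrinker needs (assumption). */
static bool sim_lock_held;

/* Models a filesystem slab shrinker that must take the lock to run. */
static enum sim_result sim_fs_shrinker(void)
{
	/* A real mutex_lock() would sleep forever here, since the
	 * holder is the very task that triggered reclaim. */
	if (sim_lock_held)
		return SIM_WOULD_DEADLOCK;
	return SIM_OK;
}

/* Models a page allocation that falls into direct reclaim. */
static enum sim_result sim_alloc_page(unsigned int gfp)
{
	/* Only re-enter filesystem code if the caller allowed it. */
	if (gfp & SIM_GFP_FS)
		return sim_fs_shrinker();
	return SIM_OK;
}

/* Models a read path that allocates a page while holding the lock. */
static enum sim_result sim_read_page(unsigned int gfp)
{
	enum sim_result rc;

	sim_lock_held = true;
	rc = sim_alloc_page(gfp);
	sim_lock_held = false;
	return rc;
}
```

Calling sim_read_page(SIM_GFP_FS) reports the would-be deadlock, while sim_read_page(SIM_GFP_NOFS) completes because the shrinker is never entered. Whether the real fix belongs in the allocation flags or in the shrinker's locking depends on which lock is actually held, which this report does not establish.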