[LU-5461] mdt_readpage returning -ENOMEM causes directory to be unreadable Created: 07/Aug/14  Updated: 30/Mar/17  Resolved: 23/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: Robert Read (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 15215

 Description   

While testing the HSM copytool in a single VM with 512 MB of memory, I saw page allocation failures in mdt_readpage, and subsequent I/O errors on the client when trying to read that directory again. It appears the client caches the error page and does not allow ll_get_dir_page() to fetch it again. Here are the errors on the client side:

LustreError: 18907:0:(dir.c:422:ll_get_dir_page()) read cache page: [0x200000402:0x27a:0x0] at 0: rc -12
LustreError: 18907:0:(dir.c:584:ll_dir_read()) error reading dir [0x200000402:0x27a:0x0] at 0: rc -12
LustreError: 18912:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0x200000402:0x27a:0x0] at 0: rc -5
LustreError: 18912:0:(dir.c:584:ll_dir_read()) error reading dir [0x200000402:0x27a:0x0] at 0: rc -5
LustreError: 7358:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0x200000402:0x27a:0x0] at 0: rc -5
LustreError: 7358:0:(dir.c:584:ll_dir_read()) error reading dir [0x200000402:0x27a:0x0] at 0: rc -5

And this is the allocation failure on the MDT:

mdt_rdpg00_001: page allocation failure. order:0, mode:0xc0
Pid: 4794, comm: mdt_rdpg00_001 Not tainted 2.6.32-431.17.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff8112f64a>] ? __alloc_pages_nodemask+0x74a/0x8d0
[<ffffffffa0653d10>] ? lustre_swab_mdt_body+0x0/0x140 [ptlrpc]
[<ffffffff8116769a>] ? alloc_pages_current+0xaa/0x110
[<ffffffffa0c9f3c0>] ? mdt_readpage+0x1d0/0x940 [mdt]
[<ffffffffa0c8f58a>] ? mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0ccb735>] ? mds_readpage_handle+0x15/0x20 [mdt]
[<ffffffffa0660bc5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa03713cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa06582a9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffffa0661f2d>] ? ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffffa0661440>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
[<ffffffff8109ab56>] ? kthread+0x96/0xa0
[<ffffffff8100c20a>] ? child_rip+0xa/0x20
[<ffffffff8109aac0>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  72
active_anon:9625 inactive_anon:10498 isolated_anon:0
active_file:24454 inactive_file:28863 isolated_file:0
unevictable:0 dirty:4370 writeback:0 unstable:0
free:1018 slab_reclaimable:5718 slab_unreclaimable:18294
mapped:2472 shmem:132 pagetables:1268 bounce:0
Node 0 DMA free:2020kB min:84kB low:104kB high:124kB active_anon:464kB inactive_anon:2924kB active_file:1412kB inactive_file:7108kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15368kB mlocked:0kB dirty:408kB writeback:0kB mapped:268kB shmem:308kB slab_reclaimable:264kB slab_unreclaimable:632kB kernel_stack:40kB pagetables:704kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 489 489 489
Node 0 DMA32 free:2052kB min:2784kB low:3480kB high:4176kB active_anon:38036kB inactive_anon:39068kB active_file:96404kB inactive_file:108344kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:500896kB mlocked:0kB dirty:17072kB writeback:0kB mapped:9620kB shmem:220kB slab_reclaimable:22608kB slab_unreclaimable:72544kB kernel_stack:1608kB pagetables:4368kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:64 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2020kB
Node 0 DMA32: 183*4kB 5*8kB 4*16kB 2*32kB 2*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2052kB
53571 total pagecache pages
112 pages in swap cache
Swap cache stats: add 245, delete 133, find 25/32
Free swap  = 834836kB
Total swap = 835576kB
131055 pages RAM
5534 pages reserved
66002 pages shared
75464 pages non-shared


 Comments   
Comment by Jodi Levi (Inactive) [ 11/Aug/14 ]

Lai,
Can you comment on this one and let us know the priority with regards to 2.7?
Thank you!

Comment by Lai Siyao [ 14/Aug/14 ]

Patch is on http://review.whamcloud.com/#/c/11450/

Comment by Gerrit Updater [ 17/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11450/
Subject: LU-5461 mdc: don't add to page cache upon failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2e1472489481ddef9956db8008d63a78c7c84289

Comment by Lai Siyao [ 23/Dec/14 ]

Patch landed to master.

Generated at Sat Feb 10 01:51:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.