
ior test on nfs over lustre causes OOM in nfs server (lustre client)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version: Lustre 2.2.0
    • None
    • 3
    • 4233

    Description

      When we run the IOR test on NFS over Lustre, we consistently hit an OOM on the NFS server (Lustre client).

      10:04:52:Lustre: DEBUG MARKER: == sanity test iorssf: iorssf ======================================================================== 10:04:49 (1323367489)
      10:05:17:Lustre: 2881:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323367509/real 0] req@ffff8100048c3800 x1387647840891989/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:17:Lustre: lustre-OST0000-osc-MDT0000: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:17:Lustre: 2881:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323367509/real 0] req@ffff81002a5f7400 x1387647840891991/t0(0) o400->lustre-OST0002-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:17:Lustre: lustre-OST0002-osc-MDT0000: Connection to service lustre-OST0002 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:18:Lustre: 2949:0:(ldlm_lib.c:634:target_handle_reconnect()) MGS: db6f2199-0ea2-5617-813c-4e0e918000c0 reconnecting
      10:05:18:Lustre: 2949:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from db6f2199-0ea2-5617-813c-4e0e918000c0@10.10.4.187@tcp t0 exp ffff81007617a400 cur 1323367517 last 1323367517
      10:05:18:LustreError: 2949:0:(obd_class.h:509:obd_set_info_async()) obd_set_info_async: dev 0 no operation
      10:05:22:Lustre: 2879:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323367516/real 1323367517] req@ffff81003dd67c00 x1387647840893130/t0(0) o8->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 368/512 e 0 to 1 dl 1323367522 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:22:Lustre: 2879:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      10:05:22:Lustre: lustre-OST0005-osc-ffff81004a393800: Connection to service lustre-OST0005 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:22:Lustre: Skipped 5 previous similar messages
      10:05:22:Lustre: 2882:0:(import.c:525:import_select_connection()) lustre-OST0000-osc-MDT0000: tried all connections, increasing latency to 6s
      10:05:22:Lustre: 2882:0:(import.c:525:import_select_connection()) lustre-OST0001-osc-MDT0000: tried all connections, increasing latency to 6s
      10:05:25:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323367516/real 1323367518] req@ffff81004d676c00 x1387647840893137/t0(0) o400->lustre-OST0000-osc-ffff81004a393800@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367524 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:25:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
      10:05:25:Lustre: lustre-OST0000-osc-ffff81004a393800: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:26:Lustre: Skipped 2 previous similar messages
      10:05:29:Lustre: 2949:0:(ldlm_lib.c:634:target_handle_reconnect()) MGS: db6f2199-0ea2-5617-813c-4e0e918000c0 reconnecting
      10:05:29:Lustre: 2949:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from db6f2199-0ea2-5617-813c-4e0e918000c0@10.10.4.187@tcp t0 exp ffff81007617a400 cur 1323367528 last 1323367528
      10:05:29:LustreError: 2949:0:(obd_class.h:509:obd_set_info_async()) obd_set_info_async: dev 0 no operation
      10:05:30:LustreError: 11-0: an error occurred while communicating with 10.10.4.187@tcp. The ost_connect operation failed with -16
      10:05:30:LustreError: 11-0: an error occurred while communicating with 10.10.4.187@tcp. The ost_connect operation failed with -16
      INIT: version 2.86 reloading
      10:34:10:automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
      10:34:10:
      10:34:10:Call Trace:
      10:34:11: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
      10:34:11: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
      10:34:11: [<ffffffff80012fe4>] __do_page_cache_readahead+0x96/0x179
      10:34:11: [<ffffffff8001391f>] filemap_nopage+0x14c/0x360
      10:34:11: [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
      10:34:12: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
      10:34:12: [<ffffffff80062ff0>] thread_return+0x62/0xfe
      10:34:12: [<ffffffff8005dde9>] error_exit+0x0/0x84
      10:34:12:
      10:34:12:Node 0 DMA per-cpu:
      10:34:12:cpu 0 hot: high 0, batch 1 used:0
      10:34:12:cpu 0 cold: high 0, batch 1 used:0
      10:34:12:Node 0 DMA32 per-cpu:
      10:34:12:cpu 0 hot: high 186, batch 31 used:43
      10:34:12:cpu 0 cold: high 62, batch 15 used:49
      10:34:12:Node 0 Normal per-cpu: empty
      10:34:12:Node 0 HighMem per-cpu: empty
      10:34:12:Free pages: 8648kB (0kB HighMem)
      10:34:12:Active:32 inactive:7830 dirty:0 writeback:987 unstable:0 free:2162 slab:477062 mapped-file:1061 mapped-anon:599 pagetables:905
      10:34:12:Node 0 DMA free:3028kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9728kB pages_scanned:0 all_unreclaimable? yes
      10:34:12:lowmem_reserve[]: 0 2003 2003 2003
      10:34:12:Node 0 DMA32 free:5620kB min:5712kB low:7140kB high:8568kB active:128kB inactive:31320kB present:2052060kB pages_scanned:82448 all_unreclaimable? yes
      10:34:12:lowmem_reserve[]: 0 0 0 0
      10:34:12:Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:12:lowmem_reserve[]: 0 0 0 0
      10:34:13:Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:13:lowmem_reserve[]: 0 0 0 0
      10:34:13:Node 0 DMA: 5*4kB 4*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3028kB
      10:34:13:Node 0 DMA32: 1*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5620kB
      10:34:13:Node 0 Normal: empty
      10:34:13:Node 0 HighMem: empty
      10:34:13:7862 pagecache pages
      10:34:13:Swap cache: add 8522, delete 7923, find 82/144, race 0+0
      10:34:13:Free swap = 4072336kB
      10:34:13:Total swap = 4104596kB
      10:34:13:Out of memory: Killed process 1942, UID 51, (sendmail).
      10:34:15:automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
      10:34:15:
      10:34:15:Call Trace:
      10:34:15: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
      10:34:15: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
      10:34:15: [<ffffffff80012fe4>] __do_page_cache_readahead+0x96/0x179
      10:34:16: [<ffffffff8001391f>] filemap_nopage+0x14c/0x360
      10:34:16: [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
      10:34:16: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
      10:34:16: [<ffffffff8005dde9>] error_exit+0x0/0x84
      10:34:16:
      10:34:16:Node 0 DMA per-cpu:
      10:34:16:cpu 0 hot: high 0, batch 1 used:0
      10:34:16:cpu 0 cold: high 0, batch 1 used:0
      10:34:16:Node 0 DMA32 per-cpu:
      10:34:16:cpu 0 hot: high 186, batch 31 used:113
      10:34:16:cpu 0 cold: high 62, batch 15 used:49
      10:34:16:Node 0 Normal per-cpu: empty
      10:34:16:Node 0 HighMem per-cpu: empty
      10:34:16:Free pages: 8648kB (0kB HighMem)
      10:34:16:Active:6 inactive:7856 dirty:0 writeback:987 unstable:0 free:2162 slab:476992 mapped-file:1061 mapped-anon:599 pagetables:905
      10:34:16:Node 0 DMA free:3028kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9728kB pages_scanned:0 all_unreclaimable? yes
      10:34:16:lowmem_reserve[]: 0 2003 2003 2003
      10:34:16:Node 0 DMA32 free:5620kB min:5712kB low:7140kB high:8568kB active:24kB inactive:31424kB present:2052060kB pages_scanned:189044 all_unreclaimable? yes
      10:34:16:lowmem_reserve[]: 0 0 0 0
      10:34:17:Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:17:lowmem_reserve[]: 0 0 0 0
      10:34:17:Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:17:lowmem_reserve[]: 0 0 0 0
      10:34:17:Node 0 DMA: 5*4kB 4*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3028kB
      10:34:17:Node 0 DMA32: 1*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5620kB
      10:34:17:Node 0 Normal: empty
      10:34:17:Node 0 HighMem: empty
      10:34:17:7862 pagecache pages
      10:34:17:Swap cache: add 8522, delete 7923, find 82/146, race 0+0
      10:34:17:Free swap = 4072336kB
      10:34:17:Total swap = 4104596kB
      10:34:21:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323369251/real 0] req@ffff8100766e3400 x1387647842440889/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323369258 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      10:34:21:Lustre: lustre-OST0000-osc-MDT0000: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:34:21:audispd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
      10:34:21:
      10:34:21:Call Trace:
      10:34:21: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
      10:34:21: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
      10:34:21: [<ffffffff80032905>] read_swap_cache_async+0x45/0xd8
      10:34:21: [<ffffffff800cf48e>] swapin_readahead+0x60/0xd3
      10:34:21: [<ffffffff800092cb>] __handle_mm_fault+0xb62/0x1039
      10:34:21: [<ffffffff80153e83>] __next_cpu+0x19/0x28
      10:34:21: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
      10:34:21: [<ffffffff8008dc34>] dequeue_task+0x18/0x37
      10:34:21: [<ffffffff80062ff0>] thread_return+0x62/0xfe
      10:34:21: [<ffffffff8005dde9>] error_exit+0x0/0x84
      10:34:21: [<ffffffff8003055f>] unix_poll+0x0/0x99
      10:34:34: [<ffffffff8002f8ee>] do_sys_poll+0x2b7/0x360
      10:34:35: [<ffffffff8002f815>] do_sys_poll+0x1de/0x360
      10:34:35: [<ffffffff8001ecfd>] __pollwait+0x0/0xe2
      10:34:35: [<ffffffff8008e438>] default_wake_function+0x0/0xe
      10:34:35: [<ffffffff8008e438>] default_wake_function+0x0/0xe
      10:34:35: [<ffffffff8006e608>] do_gettimeoffset_kvm+0x9/0x1d
      10:34:35: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
      10:34:35: [<ffffffff8005aa0e>] getnstimeofday+0x10/0x28
      10:34:35: [<ffffffff800a4939>] ktime_get_ts+0x1a/0x4e
      10:34:35: [<ffffffff800bfea4>] delayacct_end+0x5d/0x86
      10:34:35: [<ffffffff8004fca9>] finish_wait+0x32/0x5d
      10:34:35: [<ffffffff80063933>] __wait_on_bit_lock+0x5b/0x66
      10:34:35: [<ffffffff8003ff5b>] __lock_page+0x5e/0x64
      10:34:35: [<ffffffff800a2931>] wake_bit_function+0x0/0x23
      10:34:35: [<ffffffff800d4820>] swap_info_get+0x6a/0xa3
      10:34:35: [<ffffffff800095b9>] __handle_mm_fault+0xe50/0x1039
      10:34:35: [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
      10:34:35: [<ffffffff8004c6c6>] sys_poll+0x2d/0x34
      10:34:35: [<ffffffff8005d116>] system_call+0x7e/0x83
      10:34:35:
      10:34:35:Node 0 DMA per-cpu:

      Attached is the Maloo log. I can't get the report because the test hung due to the OOM.


          Activity


            adilger Andreas Dilger added a comment -

            Close old ticket.
            laisiyao Lai Siyao added a comment - edited

            Oleg, I was busy on other bugs, sorry to reply so late.

            For NFS export, the call sequence on the Lustre client for a normal write looks like this:

            .permission
              -> take LOOKUP|UPDATE dlm lock
            .open
              -> revoke previous lock, take LOOKUP|OPEN lock
            .getattr
              -> revoke previous lock, take LOOKUP|UPDATE lock
            

            This happens for each NFS write, and it shows the open lock doesn't take any effect but actually worsens the situation: the UPDATE lock is not cached at all. So IMO the real problem is that the open lock cache doesn't work.
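            To make the revocation pattern concrete: inodebit locks conflict when their bit masks overlap (given incompatible modes), and every step above shares the LOOKUP bit with the previous one. Below is a minimal userspace sketch, assuming the MDS_INODELOCK_* bit values from lustre_idl.h and deliberately ignoring lock-mode compatibility; it is not Lustre source:

            #include <stdio.h>

            #define MDS_INODELOCK_LOOKUP 0x01
            #define MDS_INODELOCK_UPDATE 0x02
            #define MDS_INODELOCK_OPEN   0x04

            /* simplified conflict rule: real LDLM also checks lock modes */
            static int bits_overlap(int cached, int requested)
            {
                return (cached & requested) != 0;
            }

            int main(void)
            {
                int perm_bits    = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE; /* .permission */
                int open_bits    = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_OPEN;   /* .open */
                int getattr_bits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE; /* .getattr */

                /* every adjacent pair shares LOOKUP, so each step revokes the last */
                printf(".open revokes .permission lock: %d\n", bits_overlap(perm_bits, open_bits));
                printf(".getattr revokes .open lock:    %d\n", bits_overlap(open_bits, getattr_bits));
                return 0;
            }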

            Besides, for open(O_RDONLY) a LOOKUP|OPEN lock with mode LCK_CR is fetched. If another client then calls setattr(LA_MODE), this LCK_CR LOOKUP lock won't be revoked, because setattr takes a lock with mode LCK_PW, which is compatible with LCK_CR. This is wrong! So a LOOKUP lock with LCK_CR mode doesn't make sense. If the OPEN lock is split from the LOOKUP lock, this problem can be solved.
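            The mode-compatibility claim can be checked against the classic VAX DLM matrix that LDLM implements; the table below is a reconstruction for illustration (cf. lockmode_compat() in the Lustre tree), not a copy of the source:

            #include <stdio.h>

            enum mode { EX, PW, PR, CW, CR, NL, NMODES };

            static const int compat[NMODES][NMODES] = {
                /*         EX PW PR CW CR NL */
                /* EX */ {  0, 0, 0, 0, 0, 1 },
                /* PW */ {  0, 0, 0, 0, 1, 1 },
                /* PR */ {  0, 0, 1, 0, 1, 1 },
                /* CW */ {  0, 0, 0, 1, 1, 1 },
                /* CR */ {  0, 1, 1, 1, 1, 1 },
                /* NL */ {  1, 1, 1, 1, 1, 1 },
            };

            int main(void)
            {
                /* setattr's LCK_PW is compatible with the reader's cached
                 * LCK_CR LOOKUP lock, so the stale lock is never revoked */
                printf("CR compatible with PW? %s\n", compat[CR][PW] ? "yes" : "no");
                return 0;
            }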

            Finally, if only the OPEN lock is fetched, the dentry won't be found by other operations. But as I said at the beginning, for NFS .permission is always called before .open, so generally a valid dentry is already there.

            In brief, if the OPEN lock is mainly for NFS export usage, it's best to split the OPEN lock from the LOOKUP lock.

            green Oleg Drokin added a comment -

            If the problem is a lot of uncommitted file opens, then we should attack the real problem and truncate the open requests in the uncommitted queue so that they only include enough data for the current request, not all the empty space reserved in the request buffer "just in case", since we already know the striping at that point.
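            A hypothetical sketch of this idea follows; the structure and helper names are invented for illustration (they are not the Lustre API), and ea_size_for() only stands in for what lov_mds_md_size() computes:

            #include <stdio.h>
            #include <stdlib.h>

            struct open_request {          /* stand-in for a saved replayable request */
                size_t ea_buf_len;         /* reserved for the maximum stripe count */
                void  *ea_buf;
            };

            /* models lov_mds_md_size(): a header plus one object entry per stripe */
            static size_t ea_size_for(unsigned stripe_count)
            {
                return 32 + (size_t)stripe_count * 24;
            }

            static void shrink_open_request(struct open_request *req, unsigned actual_stripes)
            {
                size_t need = ea_size_for(actual_stripes);
                void *tmp;

                if (need >= req->ea_buf_len)
                    return;
                tmp = realloc(req->ea_buf, need);  /* keep only what replay will resend */
                if (tmp != NULL) {
                    req->ea_buf = tmp;
                    req->ea_buf_len = need;
                }
            }

            int main(void)
            {
                struct open_request req = {
                    .ea_buf_len = ea_size_for(160),     /* max-stripe "just in case" reserve */
                    .ea_buf     = malloc(ea_size_for(160)),
                };

                shrink_open_request(&req, 2);           /* actual striping is known by now */
                printf("kept %zu bytes instead of %zu\n", req.ea_buf_len, ea_size_for(160));
                free(req.ea_buf);
                return 0;
            }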

            laisiyao Lai Siyao added a comment -

            sanityn.sh test 13 failed; I found two more issues for NFS support:

            • an open lock is taken for directories.
            • if a file is opened O_RDONLY, an LCK_CR lock is fetched, but LOOKUP|UPDATE with LCK_CR doesn't make sense, because setattr takes LCK_PW and getattr takes LCK_PR.

            The patch is updated.

            laisiyao Lai Siyao added a comment -

            The review is on http://review.whamcloud.com/#change,1921

            Minh, could you help verify the fix?

            laisiyao Lai Siyao added a comment -

            The debug log shows the memory allocated above is uncommitted open requests; after looking further into the code, I found that open only fetches a LOOKUP|OPEN lock, and a later getattr takes a LOOKUP|UPDATE lock, which conflicts with the cached open lock, so the open lock cache for NFS export doesn't take effect at all (the 1.8 code takes an UPDATE lock for the non-create case).

            I've made a fix for this, and will submit it for review after testing.

            laisiyao Lai Siyao added a comment -

            I captured /proc/slabinfo before the OOM:

            ...
            
            size-2048         229724 229724   2048    2    1 : tunables   24   12    8 : slabdata 114862 114862     84
            size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    8 : slabdata      0      0      0
            size-1024         914900 914900   1024    4    1 : tunables   54   27    8 : slabdata 228725 228725      0
            size-512(DMA)          0      0    512    8    1 : tunables   54   27    8 : slabdata      0      0      0
            size-512          229328 229328    512    8    1 : tunables   54   27    8 : slabdata  28666  28666      0
            size-256(DMA)          0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
            size-256            1890   1920    256   15    1 : tunables  120   60    8 : slabdata    128    128      0
            size-128(DMA)          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
            size-64(DMA)           0      0     64   59    1 : tunables  120   60    8 : slabdata      0      0      0
            size-64           345858 345858     64   59    1 : tunables  120   60    8 : slabdata   5862   5862      0
            size-32(DMA)           0      0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
            size-128           32820  32820    128   30    1 : tunables  120   60    8 : slabdata   1094   1094      0
            size-32           233968 233968     32  112    1 : tunables  120   60    8 : slabdata   2089   2089      0
            

            It shows the generic slab caches consume a lot of memory (1.5G out of 2G total), and if I stop the test and unmount Lustre, this slab memory is freed right away. I'll try to collect some memory debug logs to find where this memory is allocated but not freed.
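            As a rough cross-check of the 1.5G figure (num_objs × objsize from the slabinfo above): size-1024 contributes 914,900 × 1 KiB ≈ 894 MiB, size-2048 contributes 229,724 × 2 KiB ≈ 449 MiB, size-512 contributes 229,328 × 512 B ≈ 112 MiB, size-64 contributes 345,858 × 64 B ≈ 21 MiB, and size-32 plus size-128 add roughly 11 MiB more, for a total of ≈ 1.45 GiB on a 2 GiB node.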

            laisiyao Lai Siyao added a comment -

            Yes, I also observed this. In my test environment I installed the client and the MDS on different nodes, and found the OOM on the client.

            mdiep Minh Diep added a comment -

            Actually, we mount the Lustre client on the MDS and make that node the NFS server, so the OOM relates more to the Lustre client.
            Note: if we reduce the filesize/blocksize to 1G instead of 6G, we don't see the OOM anymore.

            laisiyao Lai Siyao added a comment -

            Let me clarify one thing first: this OOM happened on the MDS, not on a Lustre client. I'll try to reproduce and collect memory usage on the MDS first.


            People

              Assignee: laisiyao Lai Siyao
              Reporter: mdiep Minh Diep
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: