Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version: Lustre 2.4.0
- Severity: 2
- 7462
Description
This LBUG happens quite often when running racer under memory pressure. For example, on a VMware guest with 2 GB of memory, I set up Lustre with:

OSTSIZE=$((512*1024)) REFORMAT=1 sh racer.sh

It can hit the LBUG as follows:
ll_ost_io01_005: page allocation failure. order:4, mode:0x50
Pid: 22344, comm: ll_ost_io01_005 Not tainted 2.6.32-279.19.1.el6.x86_64.debug #1
Call Trace:
 [<ffffffff81139b2a>] ? __alloc_pages_nodemask+0x6aa/0xa20
 [<ffffffff81175dfe>] ? kmem_getpages+0x6e/0x170
 [<ffffffff8117884b>] ? fallback_alloc+0x1cb/0x2b0
 [<ffffffff811780a9>] ? cache_grow+0x4c9/0x530
 [<ffffffff8117852b>] ? ____cache_alloc_node+0xab/0x200
 [<ffffffff81179d08>] ? __kmalloc+0x288/0x330
 [<ffffffffa048ebb0>] ? cfs_alloc+0x30/0x60 [libcfs]
 [<ffffffffa048ebb0>] ? cfs_alloc+0x30/0x60 [libcfs]
 [<ffffffffa0e35ad8>] ? ost_io_thread_init+0x48/0x300 [ost]
 [<ffffffffa0888703>] ? ptlrpc_main+0xa3/0x1810 [ptlrpc]
 [<ffffffff810aebad>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8105960d>] ? finish_task_switch+0x7d/0x110
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffff810097dc>] ? __switch_to+0x1ac/0x320
 [<ffffffffa0888660>] ? ptlrpc_main+0x0/0x1810 [ptlrpc]
 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8100bb10>] ? restore_args+0x0/0x30
 [<ffffffffa0888660>] ? ptlrpc_main+0x0/0x1810 [ptlrpc]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 40
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 38
active_anon:74087 inactive_anon:150981 isolated_anon:0
active_file:14723 inactive_file:15191 isolated_file:5
unevictable:0 dirty:1405 writeback:0 unstable:0
free:57272 slab_reclaimable:6236 slab_unreclaimable:76201
mapped:1475 shmem:214401 pagetables:2281 bounce:0
Node 0 DMA free:8756kB min:332kB low:412kB high:496kB active_anon:656kB inactive_anon:1672kB active_file:68kB inactive_file:1488kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15300kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:2328kB slab_reclaimable:8kB slab_unreclaimable:3028kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:231280kB min:44720kB low:55900kB high:67080kB active_anon:298048kB inactive_anon:602252kB active_file:58824kB inactive_file:59020kB unevictable:0kB isolated(anon):0kB isolated(file):20kB present:2052192kB mlocked:0kB dirty:5804kB writeback:0kB mapped:5900kB shmem:855276kB slab_reclaimable:24936kB slab_unreclaimable:287616kB kernel_stack:4200kB pagetables:9860kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 203*4kB 104*8kB 24*16kB 6*32kB 5*64kB 3*128kB 3*256kB 2*512kB 2*1024kB 1*2048kB 0*4096kB = 8812kB
Node 0 DMA32: 35286*4kB 9007*8kB 1989*16kB 87*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 251840kB
244228 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
524272 pages RAM
77789 pages reserved
65618 pages shared
325369 pages non-shared
LustreError: 47322:0:(vvp_io.c:1086:vvp_io_commit_write()) Write page 17112 of inode ffff880038147728 failed -28
LustreError: 39799:0:(vvp_io.c:1086:vvp_io_commit_write()) Write page 1022 of inode ffff8800598402a8 failed -28
LustreError: 44035:0:(vvp_io.c:1086:vvp_io_commit_write()) Write page 255 of inode ffff880046224268 failed -28
LustreError: 46598:0:(vvp_io.c:1086:vvp_io_commit_write()) Write page 304 of inode ffff8800203b02a8 failed -28
Buffer I/O error on device loop0, logical block 17163
lost page write due to I/O error on loop0
LustreError: 48203:0:(file.c:2707:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000401:0x6f4d:0x0] error: rc = -116
Buffer I/O error on device loop0, logical block 29955
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 29956
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 29960
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 12699
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 12723
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 12783
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 15429
lost page write due to I/O error on loop0
Buffer I/O error on device loop0, logical block 15430
lost page write due to I/O error on loop0
LDISKFS-fs error (device loop0): ldiskfs_find_entry: reading directory #50271 offset 0
Aborting journal on device loop0-8.
LustreError: 4993:0:(file.c:158:ll_close_inode_openhandle()) inode 144115205272531443 mdc close failed: rc = -30
LustreError: 4972:0:(llite_lib.c:1292:ll_md_setattr()) md_setattr fails: rc = -30
LustreError: 3150:0:(osd_handler.c:635:osd_trans_commit_cb()) transaction @0xffff88005eef12a0 commit error: 2
LDISKFS-fs error (device loop0): ldiskfs_journal_start_sb: Detected aborted journal
LDISKFS-fs (loop0): Remounting filesystem read-only
LustreError: 3150:0:(osd_handler.c:635:osd_trans_commit_cb()) transaction @0xffff880058789818 commit error: 2
LustreError: 3498:0:(osd_io.c:997:osd_ldiskfs_read()) lustre-MDT0000: can't read 4096@172032 on ino 110: rc = -5
LustreError: 3498:0:(llog_osd.c:562:llog_osd_next_block()) lustre-MDT0000-osd: can't read llog block from log [0x1:0xc:0x0] offset 172032: rc = -5
LustreError: 3498:0:(osp_sync.c:866:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 29 changes, 26 in progress, 7 in flight: -5
LustreError: 3498:0:(osp_sync.c:866:osp_sync_thread()) LBUG
Pid: 3498, comm: osp-syn-1
Call Trace:
 [<ffffffffa048d8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa048dec7>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0ea4a73>] osp_sync_thread+0x783/0x800 [osp]
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8105960d>] ? finish_task_switch+0x7d/0x110
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffffa0ea42f0>] ? osp_sync_thread+0x0/0x800 [osp]
LustreError: 3496:0:(osd_io.c:997:osd_ldiskfs_read()) lustre-MDT0000: can't read 4096@180224 on ino 108: rc = -5
LustreError: 3496:0:(llog_osd.c:562:llog_osd_next_block()) lustre-MDT0000-osd: can't read llog block from log [0x1:0xa:0x0] offset 180224: rc = -5
LustreError: 3496:0:(osp_sync.c:866:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 14 changes, 30 in progress, 6 in flight: -5
LustreError: 3496:0:(osp_sync.c:866:osp_sync_thread()) LBUG
Pid: 3496, comm: osp-syn-0
Call Trace:
 [<ffffffffa048d8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa048dec7>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0ea4a73>] osp_sync_thread+0x783/0x800 [osp]
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8105960d>] ? finish_task_switch+0x7d/0x110
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffffa0ea42f0>] ? osp_sync_thread+0x0/0x800 [osp]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8100bb10>] ? restore_args+0x0/0x30
 [<ffffffffa0ea42f0>] ? osp_sync_thread+0x0/0x800 [osp]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
LustreError: 50716:0:(file.c:158:ll_close_inode_openhandle()) inode 144115205255753869 mdc close failed: rc = -30
LustreError: 50716:0:(file.c:158:ll_close_inode_openhandle()) Skipped 1 previous similar message
LDISKFS-fs (loop0): Remounting filesystem read-only
Kernel panic - not syncing: LBUG
Pid: 3496, comm: osp-syn-0 Not tainted 2.6.32-279.19.1.el6.x86_64.debug #1
Call Trace:
 [<ffffffff8151baba>] ? panic+0xa0/0x16d
 [<ffffffffa048df1b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0ea4a73>] ? osp_sync_thread+0x783/0x800 [osp]
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8105960d>] ? finish_task_switch+0x7d/0x110
 [<ffffffff810595d8>] ? finish_task_switch+0x48/0x110
 [<ffffffffa0ea42f0>] ? osp_sync_thread+0x0/0x800 [osp]
 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
 [<ffffffff8151f690>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8100bb10>] ? restore_args+0x0/0x30
 [<ffffffffa0ea42f0>] ? osp_sync_thread+0x0/0x800 [osp]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Issue Links
- is duplicated by: LU-3091 Interop 2.1.4<->2.4 failure on test suite parallel-scale test_iorssf: page allocation failure (Closed)