[LU-2818] Failure on test suite parallel-scale-nfsv4 test_compilebench: (mdt_lvb.c:126:mdt_lvbo_fill()) ASSERTION( rc == 0 ) failed Created: 15/Feb/13 Updated: 17/Mar/21 Resolved: 11/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 6822 | ||||||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/25d39028-755e-11e2-bf59-52540035b04c. The sub-test test_compilebench failed with the following error:
MDS console: 12:44:26:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench ============================================== 12:44:25 (1360701865) 12:44:27:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej 12:44:27:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej 13:06:42:mdt00_003: page allocation failure. order:5, mode:0x50 13:06:42:Pid: 22716, comm: mdt00_003 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 13:06:42:Call Trace: 13:06:42: [<ffffffff811231ff>] ? __alloc_pages_nodemask+0x77f/0x940 13:06:42: [<ffffffff8115d1a2>] ? kmem_getpages+0x62/0x170 13:06:42: [<ffffffff8115ddba>] ? fallback_alloc+0x1ba/0x270 13:06:42: [<ffffffff8115d80f>] ? cache_grow+0x2cf/0x320 13:06:42: [<ffffffff8115db39>] ? ____cache_alloc_node+0x99/0x160 13:06:42: [<ffffffffa04d8b60>] ? cfs_alloc+0x30/0x60 [libcfs] 13:06:42: [<ffffffff8115e909>] ? __kmalloc+0x189/0x220 13:06:42: [<ffffffffa04d8b60>] ? cfs_alloc+0x30/0x60 [libcfs] 13:06:42: [<ffffffffa0d3b69e>] ? osd_key_init+0x1e/0x670 [osd_ldiskfs] 13:06:42: [<ffffffffa066622f>] ? keys_fill+0x6f/0x190 [obdclass] 13:06:42: [<ffffffffa0669e6b>] ? lu_context_init+0xab/0x260 [obdclass] 13:06:42: [<ffffffffa066a03e>] ? lu_env_init+0x1e/0x30 [obdclass] 13:06:42: [<ffffffffa0e775ff>] ? mdt_lvbo_fill+0x1af/0x800 [mdt] 13:06:42: [<ffffffffa0e77450>] ? mdt_lvbo_fill+0x0/0x800 [mdt] 13:06:42: [<ffffffffa07c6591>] ? ldlm_handle_enqueue0+0x641/0x1080 [ptlrpc] 13:06:42: [<ffffffffa0e4e376>] ? mdt_enqueue+0x46/0x110 [mdt] 13:06:42: [<ffffffffa0e42fb8>] ? mdt_handle_common+0x628/0x1620 [mdt] 13:06:42: [<ffffffffa0e7a5c5>] ? mds_regular_handle+0x15/0x20 [mdt] 13:06:42: [<ffffffffa07f800c>] ? ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 13:06:42: [<ffffffffa04d85de>] ? cfs_timer_arm+0xe/0x10 [libcfs] 13:06:42: [<ffffffffa07ef739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 13:06:42: [<ffffffff81052223>] ? __wake_up+0x53/0x70 13:06:42: [<ffffffffa07f9556>] ? ptlrpc_main+0xb76/0x1870 [ptlrpc] 13:06:42: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 13:06:42: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 13:06:42: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 13:06:42: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 13:06:42: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 13:06:42:Mem-Info: 13:06:42:Node 0 DMA per-cpu: 13:06:42:CPU 0: hi: 0, btch: 1 usd: 0 13:06:42:Node 0 DMA32 per-cpu: 13:06:42:CPU 0: hi: 186, btch: 31 usd: 152 13:06:42:active_anon:4078 inactive_anon:1377 isolated_anon:0 13:06:42: active_file:109040 inactive_file:162532 isolated_file:0 13:06:42: unevictable:0 dirty:14064 writeback:512 unstable:0 13:06:42: free:14243 slab_reclaimable:32585 slab_unreclaimable:116609 13:06:42: mapped:3106 shmem:41 pagetables:752 bounce:0 13:06:42:Node 0 DMA free:8348kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:184kB inactive_file:3376kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:556kB slab_unreclaimable:3268kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no 13:06:42:lowmem_reserve[]: 0 2003 2003 2003 13:06:42:Node 0 DMA32 free:48624kB min:44720kB low:55900kB high:67080kB active_anon:16312kB inactive_anon:5508kB active_file:435976kB inactive_file:646752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:56256kB writeback:2048kB mapped:12424kB shmem:164kB slab_reclaimable:129784kB slab_unreclaimable:463168kB kernel_stack:1960kB pagetables:3008kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no 13:06:42:lowmem_reserve[]: 0 0 0 0 13:06:42:Node 0 DMA: 33*4kB 5*8kB 3*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8348kB 13:06:42:Node 0 DMA32: 5780*4kB 2228*8kB 4*16kB 28*32kB 3*64kB 3*128kB 2*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48624kB 13:06:42:195649 total pagecache pages 13:06:42:0 pages in swap cache 13:06:42:Swap cache stats: add 0, delete 0, find 0/0 13:06:42:Free swap = 4128760kB 13:06:42:Total swap = 4128760kB 13:06:42:524284 pages RAM 13:06:42:43608 pages reserved 13:06:42:207968 pages shared 13:06:42:267301 pages non-shared 13:06:42:LustreError: 22716:0:(mdt_lvb.c:126:mdt_lvbo_fill()) ASSERTION( rc == 0 ) failed: 13:06:42:LustreError: 22716:0:(mdt_lvb.c:126:mdt_lvbo_fill()) LBUG 13:06:42:Pid: 22716, comm: mdt00_003 13:06:42: 13:06:42:Call Trace: 13:06:42: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 13:06:42: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs] 13:06:42: [<ffffffffa0e77bd5>] mdt_lvbo_fill+0x785/0x800 [mdt] 13:06:42: [<ffffffffa0e77450>] ? mdt_lvbo_fill+0x0/0x800 [mdt] 13:06:42: [<ffffffffa07c6591>] ldlm_handle_enqueue0+0x641/0x1080 [ptlrpc] 13:06:42: [<ffffffffa0e4e376>] mdt_enqueue+0x46/0x110 [mdt] 13:06:43: [<ffffffffa0e42fb8>] mdt_handle_common+0x628/0x1620 [mdt] 13:06:43: [<ffffffffa0e7a5c5>] mds_regular_handle+0x15/0x20 [mdt] 13:06:43: [<ffffffffa07f800c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 13:06:43: [<ffffffffa04d85de>] ? cfs_timer_arm+0xe/0x10 [libcfs] 13:06:43: [<ffffffffa07ef739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 13:06:43: [<ffffffff81052223>] ? __wake_up+0x53/0x70 13:06:43: [<ffffffffa07f9556>] ptlrpc_main+0xb76/0x1870 [ptlrpc] 13:06:43: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 13:06:43: [<ffffffff8100c0ca>] child_rip+0xa/0x20 13:06:43: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 13:06:43: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 13:06:43: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 13:06:43: 13:06:43:Kernel panic - not syncing: LBUG |
| Comments |
| Comment by Hongchao Zhang [ 18/Feb/13 ] |
|
this bug can be mitigated greatly by the patches(http://review.whamcloud.com/#change,5323, http://review.whamcloud.com/#change,5444) in |
| Comment by Hongchao Zhang [ 21/Feb/13 ] |
|
the additional patch is tracked at http://review.whamcloud.com/#change,5497 |
| Comment by Hongchao Zhang [ 18/Mar/13 ] |
|
status update: the patch which adding "lu_env" to "LDLM" modules is not ready yet for its complexity, and it affects lots of funtions. |
| Comment by Alex Zhuravlev [ 18/Mar/13 ] |
|
we're not allocating large chunks for osd_thread_info anymore, so this isn't a blocker already. |
| Comment by Hongchao Zhang [ 27/Mar/13 ] |
|
the patch passing the "env" into ldlm module has been pushed to Gerrit. |
| Comment by Oleg Drokin [ 23/Jul/13 ] |
|
I am now frequently hitting this again while running sanity in a loop, this didi not happen bfore in this sort of a test on same config node, so we probably need to take another look at this? <4>[13497.371686] mdt00_004: page allocation failure. order:1, mode:0x40 <4>[13497.372009] Pid: 11831, comm: mdt00_004 Not tainted 2.6.32-rhe6.4-debug #2 <4>[13497.372016] Call Trace: <4>[13497.372016] [<ffffffff8112a666>] ? __alloc_pages_nodemask+0x7c6/0x980 <4>[13497.372016] [<ffffffff814fe0ee>] ? _spin_unlock+0xe/0x10 <4>[13497.372016] [<ffffffff811658f2>] ? kmem_getpages+0x62/0x170 <4>[13497.372016] [<ffffffff8116834a>] ? fallback_alloc+0x1ba/0x270 <4>[13497.372016] [<ffffffff81167bf7>] ? cache_grow+0x4d7/0x520 <4>[13497.372016] [<ffffffff81168038>] ? ____cache_alloc_node+0xa8/0x200 <4>[13497.372016] [<ffffffff81168943>] ? kmem_cache_alloc_trace+0x1c3/0x250 <4>[13497.372016] [<ffffffffa0b4f2a5>] ? osd_key_init+0x25/0x4e0 [osd_ldiskfs] <4>[13497.372016] [<ffffffffa0b4f2a5>] ? osd_key_init+0x25/0x4e0 [osd_ldiskfs] <4>[13497.372016] [<ffffffffa04f8f8f>] ? keys_fill+0x6f/0x190 [obdclass] <4>[13497.372016] [<ffffffffa04fd00e>] ? lu_context_init+0x4e/0x240 [obdclass] <4>[13497.372016] [<ffffffffa04fd063>] ? lu_context_init+0xa3/0x240 [obdclass] <4>[13497.372016] [<ffffffffa04fd21e>] ? lu_env_init+0x1e/0x30 [obdclass] <4>[13497.372016] [<ffffffffa0bf131b>] ? mdt_lvbo_fill+0x1ab/0x840 [mdt] <4>[13497.372016] [<ffffffffa0bf1170>] ? mdt_lvbo_fill+0x0/0x840 [mdt] <4>[13497.372016] [<ffffffffa0664033>] ? ldlm_handle_enqueue0+0x623/0x10c0 [ptlrpc] <4>[13497.372016] [<ffffffffa0bb6a96>] ? mdt_enqueue+0x46/0xe0 [mdt] <4>[13497.372016] [<ffffffffa0bbcc57>] ? mdt_handle_common+0x647/0x16d0 [mdt] <4>[13497.372016] [<ffffffffa0bf54a5>] ? mds_regular_handle+0x15/0x20 [mdt] <4>[13497.372016] [<ffffffffa0693898>] ? ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc] <4>[13497.372016] [<ffffffffa0a2155e>] ? cfs_timer_arm+0xe/0x10 [libcfs] <4>[13497.372016] [<ffffffffa0a32b9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] <4>[13497.372016] [<ffffffffa068afb1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] <4>[13497.372016] [<ffffffff81054613>] ? __wake_up+0x53/0x70 <4>[13497.372016] [<ffffffffa0694b82>] ? ptlrpc_main+0xa22/0x1650 [ptlrpc] <4>[13497.372016] [<ffffffffa0694160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc] <4>[13497.372016] [<ffffffff81094606>] ? kthread+0x96/0xa0 <4>[13497.372016] [<ffffffff8100c10a>] ? child_rip+0xa/0x20 <4>[13497.372016] [<ffffffff81094570>] ? kthread+0x0/0xa0 <4>[13497.372016] [<ffffffff8100c100>] ? child_rip+0x0/0x20 <6>[13497.372016] Mem-Info: <4>[13497.372016] Node 0 DMA per-cpu: <4>[13497.372016] CPU 0: hi: 0, btch: 1 usd: 0 <4>[13497.385289] CPU 1: hi: 0, btch: 1 usd: 0 <4>[13497.385289] CPU 2: hi: 0, btch: 1 usd: 0 <4>[13497.385289] CPU 3: hi: 0, btch: 1 usd: 0 <4>[13497.385289] Node 0 DMA32 per-cpu: <4>[13497.385289] CPU 0: hi: 186, btch: 31 usd: 26 <4>[13497.385289] CPU 1: hi: 186, btch: 31 usd: 75 <4>[13497.385289] CPU 2: hi: 186, btch: 31 usd: 177 <4>[13497.385289] CPU 3: hi: 186, btch: 31 usd: 170 <4>[13497.385289] active_anon:83027 inactive_anon:83006 isolated_anon:0 <4>[13497.385289] active_file:31104 inactive_file:29174 isolated_file:0 <4>[13497.385289] unevictable:0 dirty:59 writeback:55 unstable:0 <4>[13497.385289] free:13805 slab_reclaimable:79792 slab_unreclaimable:363470 <4>[13497.385289] mapped:669 shmem:163871 pagetables:341 bounce:0 <4>[13497.388538] Node 0 DMA free:9700kB min:136kB low:168kB high:204kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:9312kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes <4>[13497.389526] lowmem_reserve[]: 0 2967 2967 2967 <4>[13497.389526] Node 0 DMA32 free:45520kB min:44916kB low:56144kB high:67372kB active_anon:332000kB inactive_anon:332388kB active_file:124252kB inactive_file:116696kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3039080kB mlocked:0kB dirty:236kB writeback:220kB mapped:2676kB shmem:655484kB slab_reclaimable:319168kB slab_unreclaimable:1453880kB kernel_stack:2640kB pagetables:1364kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:861 all_unreclaimable? no <4>[13497.389526] lowmem_reserve[]: 0 0 0 0 <4>[13497.389526] Node 0 DMA: 3*4kB 1*8kB 3*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 2*4096kB = 9700kB <4>[13497.389526] Node 0 DMA32: 10258*4kB 18*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 45272kB <4>[13497.389526] 194553 total pagecache pages <4>[13497.389526] 0 pages in swap cache <4>[13497.389526] Swap cache stats: add 1550, delete 1550, find 434/448 <4>[13497.389526] Free swap = 2091392kB <4>[13497.389526] Total swap = 2097144kB <6>[13497.389526] 774397 pages RAM <6>[13497.389526] 38457 pages reserved <6>[13497.389526] 19663 pages shared <6>[13497.389526] 697781 pages non-shared <0>[13497.402539] LustreError: 11831:0:(mdt_lvb.c:125:mdt_lvbo_fill()) ASSERTION( rc == 0 ) failed: <0>[13497.403365] LustreError: 11831:0:(mdt_lvb.c:125:mdt_lvbo_fill()) LBUG <4>[13497.403820] Pid: 11831, comm: mdt00_004 <4>[13497.404356] <4>[13497.404356] Call Trace: <4>[13497.405032] [<ffffffffa0a208a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] <4>[13497.405510] [<ffffffffa0a20ea7>] lbug_with_loc+0x47/0xb0 [libcfs] <4>[13497.405979] [<ffffffffa0bf18d4>] mdt_lvbo_fill+0x764/0x840 [mdt] <4>[13497.406426] [<ffffffffa0bf1170>] ? mdt_lvbo_fill+0x0/0x840 [mdt] <4>[13497.406735] [<ffffffffa0664033>] ldlm_handle_enqueue0+0x623/0x10c0 [ptlrpc] <4>[13497.407043] [<ffffffffa0bb6a96>] mdt_enqueue+0x46/0xe0 [mdt] <4>[13497.407317] [<ffffffffa0bbcc57>] mdt_handle_common+0x647/0x16d0 [mdt] <4>[13497.407604] [<ffffffffa0bf54a5>] mds_regular_handle+0x15/0x20 [mdt] <4>[13497.407908] [<ffffffffa0693898>] ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc] <4>[13497.408502] [<ffffffffa0a2155e>] ? cfs_timer_arm+0xe/0x10 [libcfs] <4>[13497.408814] [<ffffffffa0a32b9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] <4>[13497.409138] [<ffffffffa068afb1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] <4>[13497.409425] [<ffffffff81054613>] ? __wake_up+0x53/0x70 <4>[13497.409706] [<ffffffffa0694b82>] ptlrpc_main+0xa22/0x1650 [ptlrpc] <4>[13497.410010] [<ffffffffa0694160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc] <4>[13497.410288] [<ffffffff81094606>] kthread+0x96/0xa0 <4>[13497.410537] [<ffffffff8100c10a>] child_rip+0xa/0x20 <4>[13497.410790] [<ffffffff81094570>] ? kthread+0x0/0xa0 <4>[13497.411045] [<ffffffff8100c100>] ? child_rip+0x0/0x20 <4>[13497.411299] <0>[13497.681303] Kernel panic - not syncing: LBUG |
| Comment by Alex Zhuravlev [ 23/Jul/13 ] |
|
I think this can stop to happen once we start to fetch the layouts with explicit getxattr.. |
| Comment by Oleg Drokin [ 21/Jan/14 ] |
|
Because I am really tired of hittign this issue, here's a patch to at least making mdt not crash when it happens: |
| Comment by Peter Jones [ 11/Feb/14 ] |
|
Landed for 2.6 |