[LU-15078] sanity-sec test_26: nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed Created: 11/Oct/21  Updated: 30/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Elena Gryaznova Assignee: Sebastien Buisson
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Zip Archive 1633704046-sanity-sec-dectet_L300-26__2.zip.zip    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[  394.285494] Lustre: DEBUG MARKER: == sanity-sec test 26: test transferring very large nodemap ========================================================== 17:02:01 (1633539721)
[ 1774.903911] LustreError: 11647:0:(nodemap_handler.c:1695:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed: 
[ 1774.908810] LustreError: 11647:0:(nodemap_handler.c:1695:nodemap_config_set_active()) LBUG
[ 1774.911038] Pid: 11647, comm: ll_cfg_requeue 4.18.0-193.19.1.el8_2.x86_64 #1 SMP Mon Sep 14 14:37:00 UTC 2020
[ 1774.913568] Call Trace TBD:
[ 1774.915018] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[ 1774.916666] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[ 1774.918650] [<0>] nodemap_config_set_active+0x2a6/0x2b0 [ptlrpc]
[ 1774.920492] [<0>] nodemap_config_set_active_mgc+0x3a/0x210 [ptlrpc]
[ 1774.922355] [<0>] mgc_process_recover_nodemap_log+0x10d1/0x1200 [mgc]
[ 1774.924222] [<0>] mgc_process_log+0x76e/0x800 [mgc]
[ 1774.925804] [<0>] mgc_requeue_thread+0x29e/0x740 [mgc]
[ 1774.927470] [<0>] kthread+0x112/0x130
[ 1774.928857] [<0>] ret_from_fork+0x35/0x40
[ 1774.930299] Kernel panic - not syncing: LBUG
[ 1774.931747] CPU: 0 PID: 11647 Comm: ll_cfg_requeue Tainted: G           OE    ---------r-t - 4.18.0-193.19.1.el8_2.x86_64 #1
[ 1774.934471] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1774.936179] Call Trace:
[ 1774.937345]  dump_stack+0x5c/0x80
[ 1774.938612]  panic+0xe7/0x2a9
[ 1774.939788]  ? ret_from_fork+0x35/0x40
[ 1774.941091]  lbug_with_loc.cold.10+0x18/0x18 [libcfs]
[ 1774.942677]  ? nodemap_config_set_active+0x29a/0x2b0 [ptlrpc]
[ 1774.944384]  nodemap_config_set_active+0x2a6/0x2b0 [ptlrpc]
[ 1774.946044]  nodemap_config_set_active_mgc+0x3a/0x210 [ptlrpc]
[ 1774.947719]  ? kmem_cache_free+0x18c/0x1b0
[ 1774.949043]  mgc_process_recover_nodemap_log+0x10d1/0x1200 [mgc]
[ 1774.950710]  mgc_process_log+0x76e/0x800 [mgc]
[ 1774.952075]  mgc_requeue_thread+0x29e/0x740 [mgc]
[ 1774.953524]  ? finish_wait+0x80/0x80
[ 1774.954730]  ? mgc_process_config+0xe60/0xe60 [mgc]
[ 1774.956172]  kthread+0x112/0x130
[ 1774.957304]  ? kthread_flush_work_fn+0x10/0x10
[ 1774.958636]  ret_from_fork+0x35/0x40
[ 1774.960354] Kernel Offset: 0x2cc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1774.962825] ---[ end Kernel panic - not syncing: LBUG ]---


 Comments   
Comment by Alena Nikitenko [ 23/Nov/21 ]

I found similar crash in 2.12.8 tests https://testing.whamcloud.com/test_sets/a6e88cc1-a192-40de-8a3a-1afe126db426

[22430.721269] LustreError: 1203:0:(nodemap_handler.c:1663:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed: 
[22430.723320] LustreError: 1203:0:(nodemap_handler.c:1663:nodemap_config_set_active()) LBUG
[22430.724640] Pid: 1203, comm: ll_cfg_requeue 3.10.0-1160.45.1.el7_lustre.x86_64 #1 SMP Thu Nov 18 19:16:15 UTC 2021
[22430.726275] Call Trace:
[22430.726741]  [<ffffffffc09a87cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[22430.727910]  [<ffffffffc09a887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[22430.728940]  [<ffffffffc0f1a497>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
[22430.730566]  [<ffffffffc0f24728>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
[22430.731873]  [<ffffffffc114a796>] mgc_process_recover_nodemap_log+0x316/0xe70 [mgc]
[22430.733151]  [<ffffffffc114d8de>] mgc_process_log+0x7be/0x8f0 [mgc]
[22430.734184]  [<ffffffffc114f988>] mgc_requeue_thread+0x2c8/0x870 [mgc]
[22430.735270]  [<ffffffffabec5e61>] kthread+0xd1/0xe0
[22430.736131]  [<ffffffffac595df7>] ret_from_fork_nospec_end+0x0/0x39
[22430.737192]  [<ffffffffffffffff>] 0xffffffffffffffff
[22430.738233] Kernel panic - not syncing: LBUG
[22430.738932] CPU: 1 PID: 1203 Comm: ll_cfg_requeue Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.45.1.el7_lustre.x86_64 #1
[22430.740918] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[22430.741827] Call Trace:
[22430.742248]  [<ffffffffac583539>] dump_stack+0x19/0x1b
[22430.743084]  [<ffffffffac57d241>] panic+0xe8/0x21f
[22430.743859]  [<ffffffffc09a88cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[22430.744865]  [<ffffffffc0f1a497>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
[22430.746064]  [<ffffffffc0f24728>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
[22430.747310]  [<ffffffffc0e77d80>] ? ptlrpc_request_cache_free+0x90/0x1d0 [ptlrpc]
[22430.748525]  [<ffffffffc0e79045>] ? __ptlrpc_req_finished+0x515/0x790 [ptlrpc]
[22430.749681]  [<ffffffffc114a796>] mgc_process_recover_nodemap_log+0x316/0xe70 [mgc]
[22430.750888]  [<ffffffffc114d8de>] mgc_process_log+0x7be/0x8f0 [mgc]
[22430.751878]  [<ffffffffac586d90>] ? schedule_timeout+0x170/0x2d0
[22430.752845]  [<ffffffffc09af177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[22430.753912]  [<ffffffffc114f988>] mgc_requeue_thread+0x2c8/0x870 [mgc]
[22430.754965]  [<ffffffffabedadf0>] ? wake_up_state+0x20/0x20
[22430.755857]  [<ffffffffc114f6c0>] ? mgc_process_config+0x13f0/0x13f0 [mgc]
[22430.756935]  [<ffffffffabec5e61>] kthread+0xd1/0xe0
[22430.757712]  [<ffffffffabec5d90>] ? insert_kthread_work+0x40/0x40
[22430.758675]  [<ffffffffac595df7>] ret_from_fork_nospec_begin+0x21/0x21
[22430.759703]  [<ffffffffabec5d90>] ? insert_kthread_work+0x40/0x40 
Comment by Peter Jones [ 26/Nov/21 ]

Sebastien

Should this issue be a concern?

Peter

Comment by Sebastien Buisson [ 29/Nov/21 ]

I do not think this issue should be considered a blocker for 2.15. On master, it occurred only once in the past 4 weeks, and only when running full-dne-part-2 test session. All other sessions that run sanity-sec, like review-dne-part-2, review-dne-zfs-part-2 and review-dne-selinux-ssk-part-2, did not hit this bug.
So I would tend to think this issue is provoked by something not totally clean left by a predecessor in the test list of full-dne-part-2 test session.

Comment by Xinliang Liu [ 19/Jun/23 ]

I found that I can hit this issue easily on arm AlmaLinux 8.8 with 4k page size kernel , master v2.15.56.9.

```

[28814.645272] Lustre: DEBUG MARKER: == sanity-sec test 26: test transferring very large nodemap ========================================================== 18:02:24 (1687024944)
[35582.906885] LustreError: 272997:0:(nodemap_handler.c:1819:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed: 
[35582.910595] LustreError: 272997:0:(nodemap_handler.c:1819:nodemap_config_set_active()) LBUG
[35582.912690] Pid: 272997, comm: ll_cfg_requeue 4.18.0-477.10.1.el8_4k.aarch64 #1 SMP Mon Jun 12 10:35:59 UTC 2023
[35582.915128] Call Trace:
[35582.916038] [<0>] libcfs_call_trace+0xac/0x120 [libcfs]
[35582.917554] [<0>] lbug_with_loc+0x50/0x90 [libcfs]
[35582.918914] [<0>] nodemap_config_set_active+0x2d0/0x3c0 [ptlrpc]
[35582.920759] [<0>] nodemap_config_set_active_mgc+0x54/0x360 [ptlrpc]
[35582.922581] [<0>] mgc_process_nodemap_log+0x758/0xef8 [mgc]
[35582.924065] [<0>] mgc_process_log+0xabc/0xb90 [mgc]
[35582.925422] [<0>] mgc_requeue_thread+0x338/0xc38 [mgc]
[35582.926928] [<0>] kthread+0x150/0x160
[35582.928030] [<0>] ret_from_fork+0x10/0x18
[35582.929201] Kernel panic - not syncing: LBUG
[35582.930462] CPU: 0 PID: 272997 Comm: ll_cfg_requeue Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_4k.aarch64 #1
[35582.934355] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[35582.936202] Call trace:
[35582.937113]  dump_backtrace+0x0/0x178
[35582.938250]  show_stack+0x28/0x38
[35582.939301]  dump_stack+0x5c/0x74
[35582.940360]  panic+0x140/0x30c
[35582.941355]  param_set_delay_minmax.isra.0+0x0/0xd8 [libcfs]
[35582.942934]  nodemap_config_set_active+0x2d0/0x3c0 [ptlrpc]
[35582.944670]  nodemap_config_set_active_mgc+0x54/0x360 [ptlrpc]
[35582.946430]  mgc_process_nodemap_log+0x758/0xef8 [mgc]
[35582.947865]  mgc_process_log+0xabc/0xb90 [mgc]
[35582.949142]  mgc_requeue_thread+0x338/0xc38 [mgc]
[35582.950455]  kthread+0x150/0x160
[35582.951454]  ret_from_fork+0x10/0x18
[35582.952507] SMP: stopping secondary CPUs
[35582.953675] Kernel Offset: 0x3c8886500000 from 0xffff800008000000
[35582.955245] PHYS_OFFSET: 0xffff92a3c0000000
[35582.956493] CPU features: 0x00400408,8c804201
[35582.957918] Memory Limit: none
[35582.959017] {}{}[ end Kernel panic - not syncing: LBUG ]{}{}

```

Note that, the test start running time is [28814.645272]s and  crash happen time is [35582.906885]s , it seems stuck for quite a long time 112.8 Mins before crash.

Generated at Sat Feb 10 03:15:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.