[LU-11961] nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed Created: 12/Feb/19 Updated: 17/Feb/21 Resolved: 17/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Sebastien Buisson |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Looks like:

[ 4228.011703] Lustre: DEBUG MARKER: == sanity-sec test 26: test transferring very large nodemap ========================================== 06:23:44 (1549952624)
[ 4271.913653] LustreError: 11659:0:(nodemap_handler.c:1663:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed:
[ 4271.915235] LustreError: 11659:0:(nodemap_handler.c:1663:nodemap_config_set_active()) LBUG
[ 4271.916117] Pid: 11659, comm: ll_cfg_requeue 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Wed Feb 6 06:53:01 UTC 2019
[ 4271.917149] Call Trace:
[ 4271.917451] [<ffffffffc09417cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 4271.918283] [<ffffffffc094187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 4271.918961] [<ffffffffc0e0a6f7>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
[ 4271.920041] [<ffffffffc0e13498>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
[ 4271.920969] [<ffffffffc0cf2e12>] mgc_process_recover_nodemap_log+0x352/0xe40 [mgc]
[ 4271.921830] [<ffffffffc0cf588e>] mgc_process_log+0x7be/0x8f0 [mgc]
[ 4271.922527] [<ffffffffc0cf7938>] mgc_requeue_thread+0x2c8/0x870 [mgc]
[ 4271.923365] [<ffffffffa22c1c31>] kthread+0xd1/0xe0
[ 4271.923935] [<ffffffffa2974c37>] ret_from_fork_nospec_end+0x0/0x39
[ 4271.924670] [<ffffffffffffffff>] 0xffffffffffffffff

https://testing.whamcloud.com/test_sessions/a68fa72d-794e-45bb-ac5b-114917878f9c |
| Comments |
| Comment by Oleg Drokin [ 12/Feb/19 ] |
|
There was another crash in b2_10 in January: https://testing.whamcloud.com/test_sessions/2988eeb4-34a4-4ba9-8a21-048b8394097d
So this likely means the issue is not caused by a recent landing. |
| Comment by Sebastien Buisson [ 13/Feb/19 ] |
|
I do not think the origin of the problem is the same as in the other ticket. Here the issue happened with ldiskfs, so there might be another cause. In the 2.10 branch occurrence, we have the following messages:

[12153.223611] Lustre: DEBUG MARKER: == sanity-sec test 26: test transferring very large nodemap ========================================== 03:21:40 (1548300100)
[12155.289243] Lustre: 1848:0:(nodemap_handler.c:1076:nodemap_create()) adding nodemap 'c291' to config without default nodemap
[12155.291275] Lustre: 1848:0:(nodemap_handler.c:1076:nodemap_create()) Skipped 25 previous similar messages
[12254.129250] LustreError: 1848:0:(nodemap_handler.c:1524:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed:
[12254.131468] LustreError: 1848:0:(nodemap_handler.c:1524:nodemap_config_set_active()) LBUG

So this is yet another hint that the default nodemap is not present yet. Although it does no harm to add regular nodemap entries to a config that has no default one, the situation lasts too long here, and we end up trying to activate the config while there is still no default nodemap entry. As it is very difficult to investigate this issue without debug logs, I will push a debug patch that returns an error instead of asserting. |
| Comment by Gerrit Updater [ 13/Feb/19 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/34244 |
| Comment by Sebastien Buisson [ 13/Feb/19 ] |
|
I have noticed that nmc_default_nodemap is modified outside of nodemap_create(). Not sure that it is causing the problem reported in this ticket, but it should be fixed anyway. |
| Comment by Gerrit Updater [ 13/Feb/19 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/34245 |
| Comment by Jeremy Filizetti [ 04/Feb/20 ] |
|
I don't think it's related, but shouldn't the LASSERT(active_config != config) in nodemap_config_set_active() come after taking the lock? |
| Comment by Gerrit Updater [ 14/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34245/ |
| Comment by Peter Jones [ 14/Feb/20 ] |
|
Is work still ongoing under this ticket? |
| Comment by Sebastien Buisson [ 14/Feb/20 ] |
|
The patch that landed might not be enough to fix the issue reported here. But it has been a while since I have seen this assertion, so I am not sure if it is still occurring on master. |
| Comment by Alex Zhuravlev [ 04/Apr/20 ] |
|
Might be related?

Lustre: DEBUG MARKER: == sanity test 60c: unlink file when mds full ======================================================== 22:23:16 (1586020996)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 27f329067 P4D 27f329067 PUD 231175067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9139 Comm: ll_cfg_requeue Tainted: P W O X --------- --- 4.18.0 #4
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: 0010:nodemap_config_set_active_mgc+0x77/0x1d0 [ptlrpc]
Code: e8 4e 5d 10 e1 48 8b 05 c7 4f 0b 00 48 8d 58 f0 48 3d 60 ea 69 a0 0f 84 92 00 00 00 49 bc 01 00 00 00 47 04 00 00 48 8b 73 08 <48> 8b 6e 08 48 85 ed 74 39 48 8b 45 08 f6 00 02 75 30 48 c7 c7 a0
RSP: 0018:ffff880288eabd80 EFLAGS: 00010287
RAX: ffff88027c4c0c50 RBX: ffff88027c4c0c40 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000282
RBP: ffff880254872000 R08: 0000000000000000 R09: ffff880258d3e600
R10: 0000000000000000 R11: 0000000000000000 R12: 0000044700000001
R13: ffff88026b8b4588 R14: ffff8802a6122ec0 R15: ffff8802a6122e00
FS: 0000000000000000(0000) GS:ffff8802b2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000024dd81000 CR4: 00000000000006a0
Call Trace:
mgc_process_log+0x266c/0x26c0 [mgc]
mgc_requeue_thread+0x2ad/0x720 [mgc]
? wait_woken+0x90/0x90
kthread+0x105/0x140
? mgc_process_config+0x17c0/0x17c0 [mgc]
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x24/0x30 |
| Comment by James Nunez (Inactive) [ 13/Jan/21 ] |
|
I don't see the assertion, but is the crash at https://testing.whamcloud.com/test_sets/caac4a91-55b0-45ab-af99-356db88af611 the same as, or related to, this ticket? On the OSS:

[74388.707838] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[74388.709261] PGD 0 P4D 0
[74388.709696] Oops: 0000 [#1] SMP PTI
[74388.710293] CPU: 0 PID: 7056 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[74388.712339] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[74388.713761] RIP: 0010:nodemap_config_set_active_mgc+0x92/0x210 [ptlrpc]
[74388.714844] Code: 00 48 8d 58 f0 48 3d 40 ac 22 c1 0f 84 9f 00 00 00 49 c7 c7 a0 2f 25 c1 49 c7 c6 c8 b8 1d c1 49 c7 c5 30 17 18 c1 48 8b 73 08 <4c> 8b 66 08 4d 85 e4 74 3a 49 8b 54 24 08 f6 02 02 75 30 48 c7 c7
[74388.717857] RSP: 0018:ffffa43f40b8fd10 EFLAGS: 00010287
[74388.718712] RAX: ffff8e2db75694d0 RBX: ffff8e2db75694c0 RCX: 0000000000000000
[74388.719871] RDX: ffff8e2d8ed10000 RSI: 0000000000000000 RDI: ffffffffc122ac20
[74388.721023] RBP: ffff8e2dac4c1500 R08: ffff8e2dbda2f160 R09: ffff8e2d4fc02a00
[74388.722184] R10: ffff8e2d88aa3000 R11: 0000000000000001 R12: ffff8e2d77ef4800
[74388.723342] R13: ffffffffc1181730 R14: ffffffffc11db8c8 R15: ffffffffc1252fa0
[74388.724496] FS: 0000000000000000(0000) GS:ffff8e2dbda00000(0000) knlGS:0000000000000000
[74388.725804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[74388.726748] CR2: 0000000000000008 CR3: 0000000078c0a006 CR4: 00000000000606f0
[74388.727905] Call Trace:
[74388.728378] mgc_process_recover_nodemap_log+0xe14/0xeb0 [mgc]
[74388.729344] mgc_process_log+0x7a7/0x840 [mgc]
[74388.730088] mgc_requeue_thread+0x29a/0x720 [mgc]
[74388.730896] ? finish_wait+0x80/0x80
[74388.731494] ? mgc_process_config+0xe10/0xe10 [mgc]
[74388.732314] kthread+0x112/0x130
[74388.732869] ? kthread_flush_work_fn+0x10/0x10
[74388.733627] ret_from_fork+0x35/0x40 |
| Comment by Sebastien Buisson [ 13/Jan/21 ] |
|
I am not sure it is the same issue; we would need a proper dump in order to investigate the origin of the NULL pointer with the crash utility. |
| Comment by Peter Jones [ 17/Feb/21 ] |
|
Landed for 2.14 |