[LU-8824] sanity-sec test_9: ASSERTION( config->nmc_default_nodemap ) Created: 10/Nov/16 Updated: 06/Jul/21 Resolved: 23/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Kit Westneat |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2.

The sub-test test_9 failed with the following error:

trevis-34vm4:LBUG/LASSERT detected

02:04:11:[15785.994958] Lustre: DEBUG MARKER: == sanity-sec test 9: nodemap range add ============================================================== 02:02:49 (1478656969)
02:04:11:[15792.826885] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) adding nodemap '27295_7' to config without default nodemap
02:04:11:[15792.830823] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) Skipped 3 previous similar messages
02:04:11:[15800.705743] Lustre: 10421:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC10.9.5.176@tcp: error processing nodemap log nodemap: rc = -2
02:04:11:[15800.709914] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed:
02:04:11:[15800.714076] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) LBUG
02:04:11:[15800.716317] Pid: 10421, comm: ll_cfg_requeue
02:04:11:[15800.718308]
02:04:11:[15800.718308] Call Trace:
02:04:11:[15800.721741] [<ffffffffa09387d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
02:04:11:[15800.723818] [<ffffffffa0938d75>] lbug_with_loc+0x45/0xc0 [libcfs]
02:04:11:[15800.725837] [<ffffffffa0d34a17>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
02:04:11:[15800.727873] [<ffffffffa0d3d908>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
02:04:11:[15800.729985] [<ffffffffa0ca28f0>] ? ptlrpc_request_cache_free+0x90/0x1d0 [ptlrpc]
02:04:11:[15800.732071] [<ffffffffa0ca35d5>] ? __ptlrpc_req_finished+0x475/0x690 [ptlrpc]
02:04:11:[15800.734162] [<ffffffffa0c43e6b>] mgc_process_recover_nodemap_log+0x34b/0xe10 [mgc]
02:04:11:[15800.736195] [<ffffffffa0c46894>] mgc_process_log+0x754/0x880 [mgc]
02:04:11:[15800.738132] [<ffffffff816399cd>] ? schedule_timeout+0x17d/0x2d0
02:04:11:[15800.740126] [<ffffffffa09439d7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
02:04:11:[15800.742013] [<ffffffffa0c48908>] mgc_requeue_thread+0x2b8/0x880 [mgc]
02:04:11:[15800.744113] [<ffffffff810b8940>] ? default_wake_function+0x0/0x20
02:04:11:[15800.746313] [<ffffffffa0c48650>] ? mgc_requeue_thread+0x0/0x880 [mgc]
02:04:11:[15800.748437] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
02:04:11:[15800.750331] [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
02:04:11:[15800.752203] [<ffffffff81646c98>] ret_from_fork+0x58/0x90
02:04:11:[15800.754097] [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
02:04:11:[15800.755897]

Please provide additional information about the failure here.

Info required for matching: sanity-sec 9 |
| Comments |
| Comment by Nathaniel Clark [ 10/Nov/16 ] |
|
https://testing.hpdd.intel.com/test_sets/b06c944a-9a63-11e6-a5e5-5254006e85c2 |
| Comment by Nathaniel Clark [ 10/Nov/16 ] |
|
Uncaught failures in past 4 weeks (console log of OST vm shows LBUG): |
| Comment by Gerrit Updater [ 10/Nov/16 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/23706 |
| Comment by Peter Jones [ 10/Nov/16 ] |
|
Kit, could you please advise on this issue and whether we could live with this in 2.9? Peter |
| Comment by Kit Westneat [ 11/Nov/16 ] |
|
Thanks for gathering the logs on this, Nathaniel. It looks like there is an error handling issue in mgc_process_recover_nodemap_log. I can fix up the error handling for the nodemap portion, but the handling for the recovery log part is also missing, and I don't feel confident adding it there. Is there someone who can look at that portion? I'm not sure what the root cause is yet, though it looks like something to do with the default nodemap not getting transferred correctly. Does this LBUG happen on all ZFS full group tests, or is it more random? Fixing up the error handling should be enough for 2.9, though it means that nodemap will be only partially functional on ZFS systems. |
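To illustrate the kind of error handling discussed in the comment above, here is a minimal Python sketch of the idea: a newly loaded config replaces the active one only if loading succeeded and the new config contains a default nodemap. This is not the patch that landed and not Lustre code; `Config`, `choose_active`, and the dictionary layout are hypothetical names invented for the example. Only the `rc = -2` value and the missing-default condition come from the console log in the description.

```python
from typing import Dict, Optional


class Config:
    """Hypothetical stand-in for a nodemap config (not a Lustre type)."""

    def __init__(self, nodemaps: Dict[str, dict]):
        self.nodemaps = dict(nodemaps)

    @property
    def default_nodemap(self) -> Optional[dict]:
        # None when the "default" entry was never loaded.
        return self.nodemaps.get("default")


def choose_active(current: Config, new: Optional[Config], rc: int) -> Config:
    """Return the config that should be active after a load attempt.

    Keep the current config when the load failed (rc != 0) or the new
    config is incomplete, instead of asserting on the incomplete one.
    """
    if rc != 0 or new is None or new.default_nodemap is None:
        return current
    return new
```

With this shape, a failed or partial load leaves the previous config in place, which matches the "partially functional" trade-off described above: the node keeps running, but it may not see the newest nodemap settings.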
| Comment by Nathaniel Clark [ 11/Nov/16 ] |
|
It seems to happen anytime sanity-sec is run on ZFS, but not when run on ldiskfs. |
| Comment by Kit Westneat [ 11/Nov/16 ] |
|
I think I've figured out what's going on. The config load code expects the index file to return the key/values in key-sorted order, which the ldiskfs index files do. The ZFS index files, however, appear to return the keys in hash-sorted order, at least according to this comment:
We currently embed the config record type in the key so that create records are processed before update records, and so not having the records sent in key-order breaks this. I'm going to investigate how easy it would be to modify the config load/send operation to have it do a two-pass load, where the create records would be loaded first, and then the other records could be loaded after. |
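The two-pass load proposed above can be sketched in Python as follows. This is an illustration of the idea only, under stated assumptions: the record layout, the `CREATE`/`UPDATE` constants, the payload fields, and `load_config_two_pass` are all hypothetical and do not reflect the actual Lustre record format or the patch that landed. The point it shows is that once create records are applied in a first pass, the order in which the index returns keys no longer matters.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record types; the names are illustrative, not Lustre's.
CREATE = "create"
UPDATE = "update"

# (record type, nodemap name, payload)
Record = Tuple[str, str, dict]


def load_config_two_pass(records: Iterable[Record]) -> Dict[str, dict]:
    """Build a nodemap table from records that may arrive in any order."""
    records = list(records)  # the record set is walked twice
    nodemaps: Dict[str, dict] = {}

    # Pass 1: create every nodemap before applying anything else.
    for rtype, name, payload in records:
        if rtype == CREATE:
            nodemaps[name] = dict(payload)

    # Pass 2: apply the remaining records; their targets now exist.
    for rtype, name, payload in records:
        if rtype == CREATE:
            continue
        if name not in nodemaps:
            raise KeyError(f"update for unknown nodemap {name!r}")
        nodemaps[name].update(payload)

    return nodemaps


# Records in an arbitrary (for example, hash-sorted) order: updates first.
hash_ordered = [
    (UPDATE, "default", {"squash_uid": 99}),
    (UPDATE, "27295_7", {"trusted": True}),
    (CREATE, "27295_7", {}),
    (CREATE, "default", {}),
]
config = load_config_two_pass(hash_ordered)
```

A single-pass loader would hit the update for `default` before the nodemap exists, which is the ordering problem described in the comment. The cost of the two-pass variant is a second walk over the records.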
| Comment by Nathaniel Clark [ 12/Nov/16 ] |
|
Kit, Awesome find. EXCEPTing test_9 just delays the ASSERTION to test_15: I think getting a real fix is necessary for sanity-sec to pass with ZFS. |
| Comment by Peter Jones [ 14/Nov/16 ] |
|
Kit This is indeed good news. How are things progressing on making the changes necessary with the error handling? Peter |
| Comment by Kit Westneat [ 15/Nov/16 ] |
|
Hi Peter, I can get a patch up for the error handling tonight or tomorrow. Fixing the config loading and unloading will take a bit longer, but I'll try to get a patch up by the end of the week.
|
| Comment by Gerrit Updater [ 16/Nov/16 ] |
|
Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23778 |
| Comment by Peter Jones [ 16/Nov/16 ] |
|
Thanks Kit! This is encouraging news |
| Comment by Gerrit Updater [ 18/Nov/16 ] |
|
Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23849 |
| Comment by Gerrit Updater [ 19/Nov/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23778/ |
| Comment by Gerrit Updater [ 23/Nov/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23849/ |
| Comment by Peter Jones [ 23/Nov/16 ] |
|
Landed for 2.9 |