[LU-8824] sanity-sec test_9: ASSERTION( config->nmc_default_nodemap ) Created: 10/Nov/16  Updated: 06/Jul/21  Resolved: 23/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Kit Westneat
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-8825 Lustre MGT can not been re-mount ater... Resolved
is duplicated by LU-8850 Set nodemap and add node range will c... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2.

The sub-test test_9 failed with the following error:

trevis-34vm4:LBUG/LASSERT detected
02:04:11:[15785.994958] Lustre: DEBUG MARKER: == sanity-sec test 9: nodemap range add ============================================================== 02:02:49 (1478656969)
02:04:11:[15792.826885] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) adding nodemap '27295_7' to config without default nodemap
02:04:11:[15792.830823] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) Skipped 3 previous similar messages
02:04:11:[15800.705743] Lustre: 10421:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC10.9.5.176@tcp: error processing nodemap log nodemap: rc = -2
02:04:11:[15800.709914] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed: 
02:04:11:[15800.714076] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) LBUG
02:04:11:[15800.716317] Pid: 10421, comm: ll_cfg_requeue
02:04:11:[15800.718308] 
02:04:11:[15800.718308] Call Trace:
02:04:11:[15800.721741]  [<ffffffffa09387d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
02:04:11:[15800.723818]  [<ffffffffa0938d75>] lbug_with_loc+0x45/0xc0 [libcfs]
02:04:11:[15800.725837]  [<ffffffffa0d34a17>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
02:04:11:[15800.727873]  [<ffffffffa0d3d908>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
02:04:11:[15800.729985]  [<ffffffffa0ca28f0>] ? ptlrpc_request_cache_free+0x90/0x1d0 [ptlrpc]
02:04:11:[15800.732071]  [<ffffffffa0ca35d5>] ? __ptlrpc_req_finished+0x475/0x690 [ptlrpc]
02:04:11:[15800.734162]  [<ffffffffa0c43e6b>] mgc_process_recover_nodemap_log+0x34b/0xe10 [mgc]
02:04:11:[15800.736195]  [<ffffffffa0c46894>] mgc_process_log+0x754/0x880 [mgc]
02:04:11:[15800.738132]  [<ffffffff816399cd>] ? schedule_timeout+0x17d/0x2d0
02:04:11:[15800.740126]  [<ffffffffa09439d7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
02:04:11:[15800.742013]  [<ffffffffa0c48908>] mgc_requeue_thread+0x2b8/0x880 [mgc]
02:04:11:[15800.744113]  [<ffffffff810b8940>] ? default_wake_function+0x0/0x20
02:04:11:[15800.746313]  [<ffffffffa0c48650>] ? mgc_requeue_thread+0x0/0x880 [mgc]
02:04:11:[15800.748437]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
02:04:11:[15800.750331]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
02:04:11:[15800.752203]  [<ffffffff81646c98>] ret_from_fork+0x58/0x90
02:04:11:[15800.754097]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
02:04:11:[15800.755897] 

Please provide additional information about the failure here.

Info required for matching: sanity-sec 9



 Comments   
Comment by Nathaniel Clark [ 10/Nov/16 ]

https://testing.hpdd.intel.com/test_sets/b06c944a-9a63-11e6-a5e5-5254006e85c2

Comment by Nathaniel Clark [ 10/Nov/16 ]

Uncaught failures in the past 4 weeks (console log of the OST VM shows the LBUG):
https://testing.hpdd.intel.com/test_sets/969f1d60-9a60-11e6-a546-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/d0606342-93df-11e6-91aa-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/ff975cc6-9ed7-11e6-b8c4-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/b3b42986-9f06-11e6-a747-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/619feb44-a63b-11e6-bf77-5254006e85c2

Comment by Gerrit Updater [ 10/Nov/16 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/23706
Subject: LU-8824 test: EXCEPT test_9 till fixed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a469c81df1a8fea2798c8fe7866456c53a53a00d

Comment by Peter Jones [ 10/Nov/16 ]

Kit

Could you please advise on this issue and whether we could live with this in 2.9?

Peter

Comment by Kit Westneat [ 11/Nov/16 ]

Thanks for gathering the logs on this, Nathaniel.

It looks like there is an error handling issue in mgc_process_recover_nodemap_log. I can fix up the error handling for the nodemap portion, but the handling for the recovery log part is also missing, and I don't feel confident adding it there. Is there someone who can look at that portion?

I'm not sure what the root cause is yet, though it looks like something to do with the default nodemap not getting transferred correctly - does this LBUG happen on all ZFS full group tests or is it more random?

Fixing up the error handling should be enough for 2.9, though it means that nodemap will be only partially functional on ZFS systems.
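
As a rough illustration only of what such error handling could look like (the toy_* names are hypothetical and this is not the patch that later landed), the idea is to turn the missing-default-nodemap case from an LASSERT into an error that the MGC requeue thread can retry:

/*
 * Illustrative sketch, not the actual Lustre change: if the freshly
 * loaded config has no default nodemap (e.g. the log arrived partially
 * or out of order), return an error so the caller keeps the old config
 * and retries later, instead of tripping an assertion and LBUGging.
 */
#include <errno.h>
#include <stdio.h>

struct toy_nodemap;

struct toy_nodemap_config {
	struct toy_nodemap *nmc_default_nodemap;
};

static int toy_config_set_active(struct toy_nodemap_config *config)
{
	/* Previously an assertion; a missing default nodemap becomes an error. */
	if (config->nmc_default_nodemap == NULL) {
		fprintf(stderr, "no default nodemap in config: rc = %d\n",
			-EINVAL);
		return -EINVAL;	/* caller keeps the old config and retries */
	}

	/* ... swap in the new active config under the appropriate locks ... */
	return 0;
}

int main(void)
{
	struct toy_nodemap_config incomplete = { .nmc_default_nodemap = NULL };

	/* With the check above this prints an error instead of crashing. */
	return toy_config_set_active(&incomplete) ? 1 : 0;
}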

Comment by Nathaniel Clark [ 11/Nov/16 ]

It seems to happen anytime sanity-sec is run on ZFS, but not when run on ldiskfs.

Comment by Kit Westneat [ 11/Nov/16 ]

I think I've figured out what's going on. The config load code expects the index file to return the key/values in key-sorted order, which the ldiskfs index files do. The ZFS index files, however, appear to return the keys in hash-sorted order, at least according to this comment:
/*
 * XXX: implement support for fixed-size keys sorted with natural
 *      numerical way (not using internal hash value)
 */

We currently embed the config record type in the key so that create records are processed before update records, so receiving the records out of key order breaks this.
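
A minimal sketch of why that matters, using a hypothetical key layout rather than the real on-disk format: if the record type is packed into the key, key-ordered iteration always yields the create record for a nodemap before its range/idmap records, while hash-ordered iteration gives no such guarantee.

/*
 * Hypothetical key packing for illustration only. With the type in the
 * low bits, the "cluster/create" record for a nodemap sorts before its
 * range and idmap records, so a key-ordered index delivers the create
 * first. A hash-ordered index can deliver an idmap or range record
 * before the nodemap it belongs to has been created, which is what the
 * loader trips over on ZFS.
 */
#include <stdint.h>
#include <stdio.h>

enum toy_rec_type {		/* hypothetical record types */
	TOY_CLUSTER_REC = 0,	/* "create the nodemap" */
	TOY_RANGE_REC   = 1,	/* NID range for an existing nodemap */
	TOY_IDMAP_REC   = 2,	/* uid/gid mapping for an existing nodemap */
};

static uint64_t toy_key(uint32_t nodemap_id, enum toy_rec_type type)
{
	return ((uint64_t)nodemap_id << 8) | (uint64_t)type;
}

int main(void)
{
	/* In key order, the create record (type 0) for nodemap 7 comes first. */
	printf("sorted: %llx < %llx < %llx\n",
	       (unsigned long long)toy_key(7, TOY_CLUSTER_REC),
	       (unsigned long long)toy_key(7, TOY_RANGE_REC),
	       (unsigned long long)toy_key(7, TOY_IDMAP_REC));
	return 0;
}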

I'm going to investigate how easy it would be to modify the config load/send operation to do a two-pass load, where the create records are loaded first and the remaining records are loaded afterwards.
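
A minimal sketch of that two-pass idea, with hypothetical helper names rather than the code from the patch that later landed: apply only the create records on the first walk, then everything else on the second, so the result no longer depends on the order the index returns keys in.

/*
 * Two-pass load sketch (hypothetical toy_* helpers). In the real code
 * the "create" pass would instantiate nodemaps and the second pass
 * would attach ranges, idmaps and other updates to them.
 */
#include <stddef.h>

enum toy_rec_type { TOY_CLUSTER_REC, TOY_RANGE_REC, TOY_IDMAP_REC };

struct toy_rec {
	enum toy_rec_type r_type;
	/* ... record payload ... */
};

static int toy_apply_create(const struct toy_rec *rec) { (void)rec; return 0; }
static int toy_apply_update(const struct toy_rec *rec) { (void)rec; return 0; }

static int toy_load_config(const struct toy_rec *recs, size_t nrecs)
{
	size_t i;
	int rc;

	/* Pass 1: nodemap definitions (create records) only. */
	for (i = 0; i < nrecs; i++) {
		if (recs[i].r_type != TOY_CLUSTER_REC)
			continue;
		rc = toy_apply_create(&recs[i]);
		if (rc != 0)
			return rc;
	}

	/* Pass 2: ranges, idmaps and other updates; every nodemap they
	 * reference now exists regardless of iteration order. */
	for (i = 0; i < nrecs; i++) {
		if (recs[i].r_type == TOY_CLUSTER_REC)
			continue;
		rc = toy_apply_update(&recs[i]);
		if (rc != 0)
			return rc;
	}

	return 0;
}

int main(void)
{
	const struct toy_rec recs[] = {
		{ TOY_IDMAP_REC }, { TOY_CLUSTER_REC }, { TOY_RANGE_REC },
	};

	return toy_load_config(recs, sizeof(recs) / sizeof(recs[0]));
}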

Comment by Nathaniel Clark [ 12/Nov/16 ]

Kit,

Awesome find.

EXCEPTing test_9 just delays the ASSERTION to test_15:
https://testing.hpdd.intel.com/sub_tests/aaedadbe-a888-11e6-b6bd-5254006e85c2

I think getting a real fix is necessary for sanity-sec to pass with ZFS.

Comment by Peter Jones [ 14/Nov/16 ]

Kit

This is indeed good news. How are things progressing with the necessary error-handling changes?

Peter

Comment by Kit Westneat [ 15/Nov/16 ]

Hi Peter,

I can get a patch up for the error handling tonight or tomorrow. Fixing the config loading and unloading will take a bit longer, but I'll try to get a patch up by the end of the week.

- Kit

Comment by Gerrit Updater [ 16/Nov/16 ]

Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23778
Subject: LU-8824 nodemap: properly handle errors loading nodemap conf
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0ae8e3db5cd16acc4f3bde47a896b05a01383c9b

Comment by Peter Jones [ 16/Nov/16 ]

Thanks Kit! This is encouraging news

Comment by Gerrit Updater [ 18/Nov/16 ]

Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23849
Subject: LU-8824 nodemap: load nodemap definitions first
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7d5800455161e0d2fca47a1754b7fc734d4a2999

Comment by Gerrit Updater [ 19/Nov/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23778/
Subject: LU-8824 nodemap: properly handle errors loading nodemap conf
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9be888d56caf73184f72a4ad782196d255331ee2

Comment by Gerrit Updater [ 23/Nov/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23849/
Subject: LU-8824 nodemap: load nodemap definitions first
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 89ce9d5b125762f39339916f14c01242107739ed

Comment by Peter Jones [ 23/Nov/16 ]

Landed for 2.9
