[LU-8287] sanity-sec test_16: mgs and c0 idmap mismatch, 10 attempts Created: 15/Jun/16  Updated: 18/Nov/16  Resolved: 11/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Kit Westneat
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3291 IU UID/GID Mapping Feature Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c7f684c6-2a8e-11e6-acf3-5254006e85c2.

This has failed quite a number of times, but was overshadowed by LU-8279, which was also causing a large number of sanity-sec failures.

The sub-test test_16 failed with the following error:

mgs and c0 idmap mismatch, 10 attempts

It looks like the OST rebooted for some reason, since the modules are not loaded and the dmesg log is empty:

trevis-54vm8: opening /dev/lnet failed: No such device
trevis-54vm8: hint: the kernel modules may not be loaded
trevis-54vm8: IOC_LIBCFS_GET_NI error 19: No such device
/usr/lib64/lustre/tests/sanity-sec.sh: line 981: [: ==: unary operator expected

Info required for matching: sanity-sec 16
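
The "[: ==: unary operator expected" message from sanity-sec.sh is the usual symptom of an unquoted variable expanding to nothing inside a [ ] test, which fits a node whose modules are gone and whose lctl output is therefore empty. A minimal sketch of the pattern and the quoting fix (the variable and parameter below are illustrative, not taken from line 981 of the script):

# Failing pattern: if the remote lctl call returns nothing, $val is empty
# and the test collapses to "[ == expected ]", which bash rejects with
# "unary operator expected".
val=$($LCTL get_param -n some.param 2>/dev/null)   # illustrative parameter
if [ $val == expected ]; then
    echo "matched"
fi

# Safer: quote the expansion so an empty result still compares cleanly.
if [ "$val" == "expected" ]; then
    echo "matched"
fi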



 Comments   
Comment by Andreas Dilger [ 15/Jun/16 ]

This might be caused by a failure during test_15:

trevis-19vm4: error: get_param: param_path 'nodemap/25503_3/id': No such file or directory
CMD: trevis-19vm4 /usr/sbin/lctl nodemap_del 25503_4
CMD: trevis-19vm4 /usr/sbin/lctl get_param nodemap.25503_4.id

Unfortunately, I can't find console logs for any of these failures.
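
For context, test_15 creates a batch of temporary nodemaps and later deletes them, checking after each deletion that the nodemap's parameters are gone; a rough, illustrative sketch of that delete-and-verify pattern (the helper names follow the test framework, but the loop is not copied from the script):

# Rough sketch: delete each temporary nodemap on the MGS and verify its
# proc entry has disappeared; the get_param here is expected to fail once
# the nodemap is gone.
for num in $(seq 0 9); do
    nm="${NM_PREFIX}_${num}"    # hypothetical name; the run above used 25503_N
    do_facet mgs $LCTL nodemap_del $nm || error "nodemap_del $nm failed"
    do_facet mgs $LCTL get_param nodemap.${nm}.id &&
        error "nodemap $nm still present after delete"
done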

Comment by Andreas Dilger [ 20/Jun/16 ]

This is now the leading cause of autotest patch review test failures.

Comment by Kit Westneat [ 20/Jun/16 ]

There seem to be two different problems, although they may be related. The first is kernel panics on ZFS OSSes, and the second is a problem syncing the config with a second MDS.

zfs example - https://testing.hpdd.intel.com/test_sets/c7f684c6-2a8e-11e6-acf3-5254006e85c2
ldiskfs example - https://testing.hpdd.intel.com/test_sets/0daa6758-3557-11e6-acf3-5254006e85c2

I wonder if there is some issue with deleting and recreating the index file on ZFS that is causing the OSS to kernel panic. It looks like sanity-sec hasn't passed on a ZFS system in a long time.

Do you know why there is no console log on these test cases? I thought that there used to be a console log on all tests. It's going to be difficult to figure out what is going on without the kernel panic trace. On the other hand, if the maloo cases are any indication, it should be easy to reproduce.

For the second problem, there isn't enough info to figure out what is going on. I'll try to reproduce it locally, and write a debug patch to get more information. Is there a list somewhere of the default maloo debug flags?

Thanks,
Kit
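
Regarding the debug-flag question above: the defaults used in autotest come from the test configuration (the PTLDEBUG and SUBSYSTEM settings in the cfg files), and extra debugging can be turned on at run time with the standard lctl controls. The flag below is only an example, not a recommendation specific to this ticket:

# Example only: inspect and widen the Lustre debug mask, then dump the
# kernel debug buffer after reproducing the failure.
lctl get_param debug                     # show the current debug mask
lctl set_param debug=+info               # add a flag (example)
lctl set_param debug_mb=256              # enlarge the debug buffer
# ... reproduce the failure ...
lctl debug_kernel /tmp/lustre-debug.txt  # dump the buffer to a file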

Comment by Andreas Dilger [ 20/Jun/16 ]

I've filed an internal ticket regarding the console logs.

Comment by Gerrit Updater [ 21/Jun/16 ]

Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/20896
Subject: LU-8287 nodemap: add more debug messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4c46cdf8acf877f6a4a617099be397013fe6902

Comment by Gerrit Updater [ 23/Jun/16 ]

Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/20954
Subject: LU-8287 nodemap: don't stop config lock when target stops
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 59939efdad721ef3e48a7104e020b627383ffa88

Comment by Jian Yu [ 08/Jul/16 ]

More failure instance on master branch: https://testing.hpdd.intel.com/test_sets/d6c0079e-403b-11e6-acf3-5254006e85c2

Comment by Jian Yu [ 08/Jul/16 ]

This is affecting patch review testing on master branch.

Comment by Jian Yu [ 22/Jul/16 ]

One more failure instance on master branch: https://testing.hpdd.intel.com/test_sets/6b55672c-4fa6-11e6-bf87-5254006e85c2

Comment by nasf (Inactive) [ 27/Jul/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/aa77a6ac-53c2-11e6-a39e-5254006e85c2

Comment by Jian Yu [ 03/Aug/16 ]

One more failure instance on master: https://testing.hpdd.intel.com/test_sets/7d5b308e-594f-11e6-b5b1-5254006e85c2

Comment by Gerrit Updater [ 11/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20954/
Subject: LU-8287 nodemap: don't stop config lock when target stops
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 45ce1044cd7b94621e1161cd23c600f8e1c18317

Comment by Peter Jones [ 11/Aug/16 ]

Landed for 2.9

Comment by Saurabh Tandan (Inactive) [ 08/Nov/16 ]

This issue is still seen on master, build #3468, for Full EL7.2 Server/EL7.2 Client - ZFS, causing the following test set to fail:
https://testing.hpdd.intel.com/test_sets/e2e0de84-a258-11e6-bf05-5254006e85c2

Comment by Kit Westneat [ 08/Nov/16 ]

The root cause in that case is the OSS kernel panicking; you can see that the Lustre modules are no longer loaded:
onyx-35vm4: opening /dev/lnet failed: No such device
onyx-35vm4: hint: the kernel modules may not be loaded
onyx-35vm4: IOC_LIBCFS_GET_NI error 19: No such device

Unfortunately, the console logs don't appear to have been saved. Is there a way to make sure they are saved?

Comment by Saurabh Tandan (Inactive) [ 09/Nov/16 ]

Reopening as this issue is still seen on master.
https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2

Comment by Nathaniel Clark [ 10/Nov/16 ]

For each ZFS run of this test, it looks as though the OST VM was rebooted during a previous test run, because all of the dmesg logs are from a clean system that hasn't had any autotest tests run on it at all (no lctl mark statements).

All ZFS runs in the last 4 weeks (all failures):
https://testing.hpdd.intel.com/test_sets/d0448e1a-93df-11e6-91aa-5254006e85c2 <-- LU-8824 (uncaught)
https://testing.hpdd.intel.com/test_sets/969f1d60-9a60-11e6-a546-5254006e85c2 <-- LU-8824 (uncaught)
https://testing.hpdd.intel.com/test_sets/b06c944a-9a63-11e6-a5e5-5254006e85c2 <-- LU-8824
https://testing.hpdd.intel.com/test_sets/b3a81182-9f06-11e6-a747-5254006e85c2 <-- LU-8824 (uncaught)
https://testing.hpdd.intel.com/test_sets/e2e0de84-a258-11e6-bf05-5254006e85c2 - no OST logs for test_9
https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2 <-- LU-8824
https://testing.hpdd.intel.com/test_sets/6196c4ec-a63b-11e6-bf77-5254006e85c2 <-- LU-8824 (uncaught)

LU-8824 is an LBUG in test 9 that doesn't cause a TIMEOUT; the run is just marked as a test failure. (Is it possible the same LBUG went uncaught in the other cases?)
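
For anyone triaging similar runs, a quick way to distinguish the two cases above is to check the node's kernel log for the markers that the test framework writes and for an LBUG backtrace; the exact marker text can vary, so treat the strings below as illustrative:

# A node that rebooted mid-suite has no test markers in dmesg, while a
# node that hit LU-8824 should show the LBUG/LustreError backtrace.
dmesg | grep -c "DEBUG MARKER"             # 0 suggests the node was rebooted
dmesg | grep -E "LBUG|LustreError" | head  # look for the crash itself
# The markers come from the test framework running, e.g.:
#   lctl mark "== sanity-sec test 16 =="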

Comment by Jian Yu [ 17/Nov/16 ]

While testing patch http://review.whamcloud.com/23778 for LU-8824 on master branch, sanity-sec test 16 failed:

nodemap.c0.idmap= [ { idtype: uid, client_id: 60003, fs_id: 60000 }, { idtype: uid, client_id: 60004, fs_id: 60002 }, { idtype: gid, client_id: 60003, fs_id: 60000 }, { idtype: gid, client_id: 60004, fs_id: 60002 } ]
OTHER - IP: 10.2.4.167

 sanity-sec test_16: @@@@@@ FAIL: mgs and c0 idmap mismatch, 10 attempts 

Is this a regression from that patch, or was the failure not resolved on the master branch?
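
For context, the failing check in test_16 repeatedly compares the idmap reported by the MGS with the copy cached on the other servers, giving up after a number of attempts; a rough sketch of that retry loop (the helper usage is illustrative and not copied from sanity-sec.sh):

# Rough sketch of the sync check behind "mgs and c0 idmap mismatch,
# 10 attempts": poll until the server's cached nodemap.c0.idmap matches
# the MGS copy, or the retries run out.
mgs_idmap=$(do_facet mgs $LCTL get_param -n nodemap.c0.idmap)
for i in $(seq 1 10); do
    srv_idmap=$(do_facet ost1 $LCTL get_param -n nodemap.c0.idmap)
    [ "$mgs_idmap" == "$srv_idmap" ] && break
    sleep 1
done
[ "$mgs_idmap" == "$srv_idmap" ] ||
    error "mgs and c0 idmap mismatch, 10 attempts"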

Comment by Kit Westneat [ 18/Nov/16 ]

Hi Jian,

Patch 23778 fixes an error-handling bug, but the original error on ZFS still exists. The nodemap-on-ZFS bug that caused this test failure is fixed in change 23849:
http://review.whamcloud.com/#/c/23849/
