[LU-8287] sanity-sec test_16: mgs and c0 idmap mismatch, 10 attempts Created: 15/Jun/16 Updated: 18/Nov/16 Resolved: 11/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Kit Westneat |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c7f684c6-2a8e-11e6-acf3-5254006e85c2

This has failed quite a number of times, but was overshadowed by other failures.

The sub-test test_16 failed with the following error:

    mgs and c0 idmap mismatch, 10 attempts

It looks like the OST rebooted for some reason, since the modules are not loaded and the dmesg log is empty:

    trevis-54vm8: opening /dev/lnet failed: No such device
    trevis-54vm8: hint: the kernel modules may not be loaded
    trevis-54vm8: IOC_LIBCFS_GET_NI error 19: No such device
    /usr/lib64/lustre/tests/sanity-sec.sh: line 981: [: ==: unary operator expected

Info required for matching: sanity-sec 16 |
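The final log line above is a bash error rather than a Lustre one: "[: ==: unary operator expected" is what bash prints when an unquoted, empty variable ends up on one side of a "[ ... == ... ]" test. A minimal sketch of how that can happen (the node name and parameter are taken from the log; the script itself is illustrative, not the actual sanity-sec.sh code):

    # The remote command produces no output if the node rebooted and the
    # Lustre modules are gone, so $result ends up empty.
    result=$(ssh trevis-54vm8 "lctl get_param -n nodemap.active" 2>/dev/null)

    # With $result empty this expands to "[ == 1 ]", and bash reports
    # "[: ==: unary operator expected".
    if [ $result == 1 ]; then
        echo "nodemap is active"
    fi

    # Quoting the variable (or using the [[ ]] builtin) avoids the error:
    if [ "$result" == "1" ]; then
        echo "nodemap is active"
    fi

Quoting the variable keeps the comparison well-formed even when the remote command returns nothing because the node has rebooted.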
| Comments |
| Comment by Andreas Dilger [ 15/Jun/16 ] |
|
This might be caused by a failure during test_15:

    trevis-19vm4: error: get_param: param_path 'nodemap/25503_3/id': No such file or directory
    CMD: trevis-19vm4 /usr/sbin/lctl nodemap_del 25503_4
    CMD: trevis-19vm4 /usr/sbin/lctl get_param nodemap.25503_4.id

Unfortunately, I can't find console logs for any of these failures. |
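The CMD lines suggest the test deletes a nodemap and then immediately reads its id parameter back. A hedged sketch of a retry loop that would tolerate a delay between the two steps (an assumption about the check pattern, not the actual test_15 code; the nodemap name is taken from the log above):

    # Delete the nodemap, then poll instead of checking once: stop as soon as
    # the parameter is gone, give up after ~10 seconds.
    lctl nodemap_del 25503_4
    for i in $(seq 1 10); do
        lctl get_param -n nodemap.25503_4.id >/dev/null 2>&1 || break
        sleep 1
    done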
| Comment by Andreas Dilger [ 20/Jun/16 ] |
|
This is now the leading cause of autotest patch review test failures. |
| Comment by Kit Westneat [ 20/Jun/16 ] |
|
There seem to be two different problems, although they may be related. The first is kernel panics on ZFS OSSes, and the second is a problem syncing the config with a second MDS.

ZFS example: https://testing.hpdd.intel.com/test_sets/c7f684c6-2a8e-11e6-acf3-5254006e85c2

I wonder if there is some issue with deleting and recreating the index file on ZFS that is causing the OSS to kernel panic. It looks like sanity-sec hasn't passed on a ZFS system in a long time.

Do you know why there is no console log on these test cases? I thought that there used to be a console log on all tests. It's going to be difficult to figure out what is going on without the kernel panic trace. On the other hand, if the maloo cases are any indication, it should be easy to reproduce.

For the second problem, there isn't enough info to figure out what is going on. I'll try to reproduce it locally and write a debug patch to get more information. Is there a list somewhere of the default maloo debug flags?

Thanks, |
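For a local reproduction attempt, one way to capture more detail is to raise the Lustre debug level before the run and dump the kernel debug buffer afterwards. This is only a sketch using standard lctl commands; the flag and buffer-size values are suggestions, not the autotest defaults being asked about:

    lctl set_param debug=-1            # enable all Lustre debug flags
    lctl set_param debug_mb=512        # enlarge the kernel debug buffer
    # ... run the failing test here, e.g. ONLY=16 ./sanity-sec.sh ...
    lctl dk > /tmp/sanity-sec-16-debug.log   # dump and clear the kernel debug log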
| Comment by Andreas Dilger [ 20/Jun/16 ] |
|
I've filed an internal ticket regarding the console logs. |
| Comment by Gerrit Updater [ 21/Jun/16 ] |
|
Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/20896 |
| Comment by Gerrit Updater [ 23/Jun/16 ] |
|
Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/20954 |
| Comment by Jian Yu [ 08/Jul/16 ] |
|
Another failure instance on the master branch: https://testing.hpdd.intel.com/test_sets/d6c0079e-403b-11e6-acf3-5254006e85c2 |
| Comment by Jian Yu [ 08/Jul/16 ] |
|
This is affecting patch review testing on the master branch. |
| Comment by Jian Yu [ 22/Jul/16 ] |
|
One more failure instance on the master branch: https://testing.hpdd.intel.com/test_sets/6b55672c-4fa6-11e6-bf87-5254006e85c2 |
| Comment by nasf (Inactive) [ 27/Jul/16 ] |
|
Another failure instance on master: |
| Comment by Jian Yu [ 03/Aug/16 ] |
|
One more failure instance on master: https://testing.hpdd.intel.com/test_sets/7d5b308e-594f-11e6-b5b1-5254006e85c2 |
| Comment by Gerrit Updater [ 11/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20954/ |
| Comment by Peter Jones [ 11/Aug/16 ] |
|
Landed for 2.9 |
| Comment by Saurabh Tandan (Inactive) [ 08/Nov/16 ] |
|
This issue is still seen on master, build #3468, for Full EL7.2 Server/EL7.2 Client - ZFS, causing the following tests to fail. |
| Comment by Kit Westneat [ 08/Nov/16 ] |
|
The root cause in that case is the OSS kernel panicking; you can see that the Lustre modules are no longer loaded. Unfortunately the console logs don't appear to have been saved. Is there a way to make sure these are saved? |
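Without a console log, a couple of quick checks on the OSS can still distinguish a reboot or panic from a test logic problem. A short sketch; the node name is only an example:

    ssh trevis-54vm8 'lsmod | grep -c lustre'   # 0 => Lustre modules are not loaded
    ssh trevis-54vm8 'uptime'                   # a short uptime points to a recent reboot/panic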
| Comment by Saurabh Tandan (Inactive) [ 09/Nov/16 ] |
|
Reopening as this issue is still seen on master. |
| Comment by Nathaniel Clark [ 10/Nov/16 ] |
|
For each ZFS run of this test, it looks as though the OST VM was rebooted in a previous test run, because all of the dmesg logs are from a clean system that hasn't had any autotest tests run on it at all (no lctl mark statements; a quick check for these markers is sketched below). All ZFS runs in the last 4 weeks (all failures):
|
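For reference, the test framework brackets each test with lctl mark calls, which normally show up in the kernel log as "Lustre: DEBUG MARKER:" lines, so a dmesg containing none of them suggests the VM booted after the earlier tests ran. A one-line check, with the node name as an example:

    ssh trevis-54vm8 "dmesg | grep -c 'DEBUG MARKER'"   # 0 => no tests have run since boot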
| Comment by Jian Yu [ 17/Nov/16 ] |
|
While testing patch http://review.whamcloud.com/23778, sanity-sec test_16 failed again with the following output:

    nodemap.c0.idmap=
    [
     { idtype: uid, client_id: 60003, fs_id: 60000 },
     { idtype: uid, client_id: 60004, fs_id: 60002 },
     { idtype: gid, client_id: 60003, fs_id: 60000 },
     { idtype: gid, client_id: 60004, fs_id: 60002 }
    ]
    OTHER - IP: 10.2.4.167
    sanity-sec test_16: @@@@@@ FAIL: mgs and c0 idmap mismatch, 10 attempts

Is this a regression from that patch, or was the failure never resolved on the master branch? |
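For context, the failure message appears to come from test_16 repeatedly comparing the MGS's copy of the c0 nodemap idmap against another server's copy and giving up after 10 attempts. A hedged sketch of that comparison (MGS_NODE and SERVER_NODE are illustrative placeholders, not the framework's facet helpers; the parameter name and retry count come from the output above):

    for i in $(seq 1 10); do
        mgs_idmap=$(ssh $MGS_NODE    "lctl get_param -n nodemap.c0.idmap")
        srv_idmap=$(ssh $SERVER_NODE "lctl get_param -n nodemap.c0.idmap")
        [ "$mgs_idmap" == "$srv_idmap" ] && break
        sleep 5
    done
    [ "$mgs_idmap" == "$srv_idmap" ] ||
        echo "FAIL: mgs and c0 idmap mismatch, $i attempts"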
| Comment by Kit Westneat [ 18/Nov/16 ] |
|
Hi Jian,

Patch 23778 fixes an error handling bug, but the original error on ZFS still exists. The original nodemap-on-ZFS bug that caused this test failure is fixed in change 23849: |