[LU-7169] conf-sanity 84 restart mds1 failed Created: 16/Sep/15  Updated: 30/Nov/15  Resolved: 30/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Niu Yawei (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6895 sanity-lfsck test 4 hung: bad entry i... Resolved
is related to LU-7428 conf-sanity test_84, replay-dual 0a: ... Resolved
is related to LU-6789 Interop 2.5.3<->master conf-sanity te... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/e089506c-5bf0-11e5-9dac-5254006e85c2

LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Skipped 79 previous similar messages
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
Lustre: Skipped 26 previous similar messages
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
Lustre: DEBUG MARKER: sync; sync; sync
Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
LustreError: 14276:0:(osd_handler.c:1380:osd_ro()) *** setting lustre-MDT0000 read-only ***
Turning device dm-0 (0xfd00000) read-only
Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
Lustre: DEBUG MARKER: lctl set_param fail_loc=0x20000709 fail_val=5
Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
Lustre: DEBUG MARKER: umount -d /mnt/mds1
Lustre: Failing over lustre-MDT0000
Removing read-only on unknown block (0xfd00000)
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
Lustre: DEBUG MARKER: hostname
Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o recovery_time_hard=60,recovery_time_soft=60                                   /dev/lvm-Role_MDS/P1 /mnt/mds1
LDISKFS-fs (dm-0): recovery complete
LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs error (device dm-0): ldiskfs_lookup: deleted inode referenced: 75023
Aborting journal on device dm-0-8.
LDISKFS-fs (dm-0): Remounting filesystem read-only
LDISKFS-fs error (device dm-0): ldiskfs_put_super: Couldn't clean up the journal
LustreError: 14732:0:(obd_config.c:575:class_setup()) setup lustre-MDT0000-osd failed (-30)
LustreError: 14732:0:(obd_mount.c:203:lustre_start_simple()) lustre-MDT0000-osd setup error -30
LustreError: 14732:0:(obd_mount_server.c:1760:server_fill_super()) Unable to start osd on /dev/mapper/lvm--Role_MDS-P1: -30
LustreError: 14732:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-30)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_84: @@@@@@ FAIL: Restart of mds1 failed!

It looks like the filesystem was corrupted somehow.



 Comments   
Comment by Andreas Dilger [ 16/Sep/15 ]

The corruption here may be related to the problem seen in LU-6895.

Comment by James Nunez (Inactive) [ 16/Sep/15 ]

Another occurrence:
2015-09-15 14:42:13 - https://testing.hpdd.intel.com/test_sets/f3da22c2-5bf0-11e5-9dac-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 16/Sep/15 ]

Fan Yong,
Can you look at this issue and see if it is a dup of LU-6895?
Thanks.
Joe

Comment by Andreas Dilger [ 16/Sep/15 ]

The MDS console log reports:

18:38:45:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
18:38:45:LDISKFS-fs error (device dm-0): ldiskfs_lookup: deleted inode referenced: 75023
18:38:45:Aborting journal on device dm-0-8.
18:38:45:LDISKFS-fs (dm-0): Remounting filesystem read-only

On a later test when the MDS was remounted it crashed:

18:38:45:LDISKFS-fs error (device dm-0): ldiskfs_lookup: deleted inode referenced: 75023
18:38:45:Aborting journal on device dm-0-8.
18:38:45:LDISKFS-fs (dm-0): Remounting filesystem read-only
18:38:45:BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
18:38:45:IP: [<ffffffffa0b9833d>] osd_scrub_refresh_mapping+0x39d/0x410 [osd_ldiskfs]
18:38:45:Oops: 0000 [#1] SMP 
18:38:56:Pid: 15532, comm: mount.lustre Not tainted 2.6.32-573.3.1.el6_lustre.g3eb333f.x86_64 #1 Red Hat KVM
18:38:56:Call Trace:
18:38:56: [<ffffffffa0b9c616>] osd_scrub_setup+0xfb6/0x1170 [osd_ldiskfs]
18:38:56: [<ffffffffa0b707e9>] osd_device_alloc+0x6c9/0x960 [osd_ldiskfs]
18:38:56: [<ffffffffa05bf8df>] obd_setup+0x1bf/0x290 [obdclass]
18:38:56: [<ffffffffa05bfc0f>] class_setup+0x25f/0x940 [obdclass]
18:38:56: [<ffffffffa05c6df1>] class_process_config+0x1151/0x2840 [obdclass]
18:38:56: [<ffffffffa05d0abb>] do_lcfg+0x2cb/0x640 [obdclass]
18:38:56: [<ffffffffa05d0ec4>] lustre_start_simple+0x94/0x200 [obdclass]
18:38:56: [<ffffffffa0605ce1>] server_fill_super+0xfd1/0x1a70 [obdclass]
18:38:56: [<ffffffffa05d33e8>] lustre_fill_super+0x348/0x990 [obdclass]
18:38:56: [<ffffffff8119532f>] get_sb_nodev+0x5f/0xa0
18:38:56: [<ffffffffa05cb8a5>] lustre_get_sb+0x25/0x30 [obdclass]
18:38:56: [<ffffffff8119496b>] vfs_kern_mount+0x7b/0x1b0
18:38:56: [<ffffffff81194b12>] do_kern_mount+0x52/0x130
18:38:56: [<ffffffff811b697b>] do_mount+0x2fb/0x930
18:38:56: [<ffffffff811b7040>] sys_mount+0x90/0xe0
Comment by nasf (Inactive) [ 17/Sep/15 ]

The MDS crash at "osd_scrub_refresh_mapping+0x39d" is a duplicate of LU-7061.

Comment by Joseph Gmitter (Inactive) [ 17/Sep/15 ]

Fan Yong,
Can you have a look at the corruption issue as well?
Thanks.
Joe

Comment by nasf (Inactive) [ 19/Sep/15 ]

It is NOT the same as LU-6895. The direct cause of the failure is the following:

LDISKFS-fs error (device dm-0): ldiskfs_lookup: deleted inode referenced: 75023

ldiskfs_lookup() found a deleted inode; its error handler then set the filesystem read-only, which caused cascading failures (the subsequent -30 errors are -EROFS) in the follow-on operations.

In fact, the on-disk state was already corrupted before the ldiskfs_lookup() failure. Here is the MDS debug log:

00100000:00000001:0.0:1442342253.462893:0:14732:0:(osd_scrub.c:2423:osd_scrub_setup()) Process entered
00080000:00000001:0.0:1442342253.462936:0:14732:0:(osd_handler.c:2401:osd_ea_fid_set()) Process entered
00080000:00000001:0.0:1442342253.462944:0:14732:0:(osd_handler.c:2429:osd_ea_fid_set()) Process leaving (rc=0 : 0 : 0)
00100000:10000000:0.0:1442342253.463278:0:14732:0:(osd_scrub.c:277:osd_scrub_file_reset()) lustre-MDT0000: reset OI scrub file, old flags = 0x0, add flags = 0x2

Related source code is as following:

int osd_scrub_setup(const struct lu_env *env, struct osd_device *dev)
{
...
        } else {
                if (memcmp(sf->sf_uuid, es->s_uuid, 16) != 0) {
                        osd_scrub_file_reset(scrub, es->s_uuid, SF_INCONSISTENT);
                        dirty = 1;
...
}

The logs show that during the MDT0000 mount, osd_scrub_setup() found that the super block's UUID had changed. Usually such a case only happens after an MDT file-level backup/restore, but in our conf-sanity test cases there were no backup/restore operations. So the local filesystem must have been corrupted during the MDT failover. As for what caused the super block corruption, I have no idea yet.
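
For reference, the superblock UUID that osd_scrub_setup() compares can be inspected directly with stock e2fsprogs; a minimal sketch, using the MDT device path from the logs above:

        # inspect the on-disk superblock UUID of the MDT device
        dumpe2fs -h /dev/lvm-Role_MDS/P1 2>/dev/null | grep 'Filesystem UUID'

If the reported UUID differs from the one recorded in the OI scrub file, osd_scrub_setup() resets the scrub file with SF_INCONSISTENT, exactly as seen in the debug log above.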

Comment by Gerrit Updater [ 29/Sep/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16664
Subject: LU-7169 tests: check disk corruption during failover
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aa1a896b1fddadce8b95b2c9af6c4a8509535a19

Comment by Peter Jones [ 28/Oct/15 ]

Fan Yong

What are the next steps here? It looks like the debug patch did not trigger the failure. Should we land the debug patch, or does it just need a higher number of runs to improve the chances of hitting it?

Thanks

Peter

Comment by nasf (Inactive) [ 29/Oct/15 ]

The issue can no longer be reproduced with the debug patch applied, even after repeating the failed test case several times. So I suggest landing the patch on master and letting the normal Maloo runs have a chance at it. I will update the patch to drop "fortestonly" and ask for landing permission.

Comment by Peter Jones [ 29/Oct/15 ]

Excellent. Thanks Fan Yong.

Comment by nasf (Inactive) [ 10/Nov/15 ]

We hit the trouble again with the patch applied; the log shows that there really is some disk inconsistency:
https://testing.hpdd.intel.com/test_sets/e6f060ac-8707-11e5-bf92-5254006e85c2

CMD: onyx-44vm3 e2fsck -d -v -t -t -f -n /dev/lvm-Role_MDS/P1
onyx-44vm3: e2fsck 1.42.13.wc3 (28-Aug-2015)
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 292k/0k (87k/206k), time:  0.00/ 0.00/ 0.00
Pass 1: I/O read: 1MB, write: 0MB, rate: 553.40MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 292k/0k (98k/195k), time:  0.00/ 0.00/ 0.00
Pass 2: I/O read: 1MB, write: 0MB, rate: 655.74MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 292k/0k (98k/195k), time:  0.01/ 0.00/ 0.00
Pass 3: Memory used: 292k/0k (96k/197k), time:  0.00/ 0.00/ 0.00
Pass 3: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 292k/0k (62k/231k), time:  0.00/ 0.00/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Free blocks count wrong (33296, counted=32947).
Fix? no

Free inodes count wrong (99987, counted=99750).
Fix? no

Pass 5: Memory used: 292k/0k (62k/231k), time:  0.00/ 0.00/ 0.00
Pass 5: I/O read: 1MB, write: 0MB, rate: 333.78MB/s

          13 inodes used (0.01%, out of 100000)
           6 non-contiguous files (46.2%)
           0 non-contiguous directories (0.0%)
             # of inodes with ind/dind/tind blocks: 0/0/0
       16704 blocks used (33.41%, out of 50000)
           0 bad blocks
           1 large file

         125 regular files
         116 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
           0 symbolic links (0 fast symbolic links)
           0 sockets
------------
         241 files
Memory used: 292k/0k (61k/232k), time:  0.01/ 0.01/ 0.00
I/O read: 1MB, write: 0MB, rate: 92.42MB/s
reboot facets: mds1

Such inconsistency does not mean the super block was corrupted: without disk checksums, e2fsck cannot detect per-block data corruption. From the above logs alone we cannot say this is the root cause of the test_84 failure. I will update the patch with more debug information.
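
The debug patch's approach can be sketched as a test-framework helper that runs a read-only e2fsck on the facet's device between unmount and remount. This is only an illustration of the idea, not the patch itself; facet_device is assumed here to resolve a facet's backing device, and do_facet/error are the usual test-framework helpers:

        # sketch: read-only e2fsck on a facet's device, matching the
        # command line seen in the log above; fail loudly on corruption
        check_facet_fs() {
                local facet=$1
                local dev=$(facet_device $facet)   # assumed helper
                do_facet $facet "e2fsck -d -v -t -t -f -n $dev" ||
                        error "e2fsck found inconsistency on $dev"
        }

The -n flag keeps the check read-only (skipping journal recovery, as the log warns), so stale free block/inode counts are expected noise.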

Comment by Gerrit Updater [ 17/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16664/
Subject: LU-7169 tests: check disk corruption during failover
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f84e06eead85de5cd7832855bab5ff72a542e971

Comment by Joseph Gmitter (Inactive) [ 23/Nov/15 ]

Landed for 2.8

Comment by Andreas Dilger [ 27/Nov/15 ]

The landed patch was only for debugging. This issue is not resolved.

Comment by nasf (Inactive) [ 27/Nov/15 ]

Right, we are still waiting for more failure instances after landing the patch.

Comment by Andreas Dilger [ 27/Nov/15 ]

There are many, many failures of this test, but unfortunately they have all been assigned different bugs because the error messages are different.

In the tests I've seen, the e2fsck run is clean except for the superblock inode and block counts, which is expected.

I pushed a patch under LU-7428 that may fix the problem. I think it is caused by test_84() setting the MDS read-only right after mount, which discards some of the recently written data (e.g. superblock label, llog records, etc). Unfortunately, it will take a few days to be tested.
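
A sketch of that idea only (the actual change is in the LU-7428 patch referenced below; sync_all_data and replay_barrier are existing test-framework helpers, and the sleep value comes from the discussion further down): test_84 would flush everything before setting the device read-only:

        # flush client and server state before the replay barrier so the
        # read-only switch cannot discard freshly written metadata
        sync_all_data || true
        sleep 5                  # let the journal commit interval pass
        replay_barrier mds1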

Comment by nasf (Inactive) [ 28/Nov/15 ]

You are right: LU-7428 is a duplicate of this ticket. According to the latest test logs (marked as LU-7428):
https://testing.hpdd.intel.com/sub_tests/15602932-9477-11e5-a5ac-5254006e85c2

The preceding test_83 reformatted the filesystem when it finished; the suite then moved on to test_84, and after the replay_barrier() the e2fsck found some inconsistency:

Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 292k/0k (86k/207k), time:  0.01/ 0.00/ 0.00
Pass 1: I/O read: 1MB, write: 0MB, rate: 104.44MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 292k/0k (98k/195k), time:  0.02/ 0.00/ 0.01
Pass 2: I/O read: 1MB, write: 0MB, rate: 47.50MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 292k/0k (98k/195k), time:  0.04/ 0.01/ 0.01
Pass 3: Memory used: 292k/0k (96k/197k), time:  0.00/ 0.00/ 0.00
Pass 3: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 292k/0k (62k/231k), time:  0.00/ 0.00/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Free blocks count wrong (33296, counted=32960).
Fix? no

Free inodes count wrong (99987, counted=99754).
Fix? no

Pass 5: Memory used: 292k/0k (61k/232k), time:  0.01/ 0.01/ 0.00
...

It is reasonable to suspect that the replay_barrier() marked the system read-only too early, before the super block was synced to disk. But it seems replay_barrier() has already considered that:

replay_barrier() {
        local facet=$1
        do_facet $facet "sync; sync; sync"
        df $MOUNT
..

So it seems the "sync" does not work. As for your patch http://review.whamcloud.com/#/c/17371/, it should be helpful. But why not move the sync_all_data() inside replay_barrier() to replace the 'do_facet $facet "sync; sync; sync"'? Then it could help other tests as well.
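
Concretely, that suggestion would look something like this sketch against the replay_barrier() excerpt above (whether the extra flush is acceptable for all callers is exactly the question answered in the next comments):

        replay_barrier() {
                local facet=$1
                sync_all_data                      # proposed: flush client dirty data first
                do_facet $facet "sync; sync; sync"
                df $MOUNT
        ...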

Comment by nasf (Inactive) [ 28/Nov/15 ]

The patch from Andreas:
Gerrit Updater added a comment:
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17371
Subject: LU-7428 tests: write superblock in conf-sanity test_84
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d3e516e59697f7c48e5cb97054ee04cf97dc7132

Comment by Andreas Dilger [ 28/Nov/15 ]

The Free blocks/inodes count wrong errors do not indicate any kind of problem. These fields in the superblock are not updated during normal usage, so after replay_barrier() they will be out of date.
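
For illustration, the "Fix? no" prompts in the logs above show that e2fsck would repair exactly these counters if allowed to write; a read-write pass on the idle device would clear them:

        # preen mode silently fixes safe problems such as stale free counts
        e2fsck -f -p /dev/lvm-Role_MDS/P1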

I thought about putting the sync_all_data(); sleep 5; call inside replay_barrier(), but I didn't want to slow down the other tests that use this function and don't immediately follow a reformat of the filesystem.

We'll see if this patch actually fixes the problem.

Comment by Peter Jones [ 30/Nov/15 ]

believed to be a duplicate of LU-7428
