[LU-8364] during OSS failover test with quotas enabled, OSS node crashed on 2 of 4 failovers Created: 04/Jul/16  Updated: 24/May/17  Resolved: 24/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.10.0

Type: Bug
Priority: Major
Reporter: Lokesh Nagappa Jaliminche (Inactive)
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Console logs:
===========
Jan 10 14:50:47 snx11000n004 XYRAID(snx11000n004_md1-jnlr)[11815]: INFO: snx11000n004_md1-jnlr stop exit : 0
Jan 10 14:50:48 snx11000n004 kernel: [340395.094979] __ratelimit: 1047 callbacks suppressed
Jan 10 14:50:48 snx11000n004 kernel: [340395.099992] Write to readonly device md139 (0x90008b) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff8802fede3678
Jan 10 14:50:48 snx11000n004 kernel: [340395.114672] Write to readonly device md139 (0x90008b) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff8802fede3748
Jan 10 14:50:48 snx11000n004 kernel: [340395.129363] Write to readonly device md139 (0x90008b) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff8802fede3748
Jan 10 14:50:48 snx11000n004 kernel: [340395.144056] Write to readonly device md5 (0x900005) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff88049c2fb610
Jan 10 14:50:48 snx11000n004 kernel: [340395.176736] LDISKFS-fs error (device md5): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88066edde7b8] [phy 16646156] [logic 267] [len 117] [free 115] [error 0] [inode 117886005] [freed 117]
Jan 10 14:50:48 snx11000n004 kernel: [340395.194885] Aborting journal on device md139.
Jan 10 14:50:48 snx11000n004 kernel: [340395.199450] Write to readonly device md139 (0x90008b) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff880594188540
Jan 10 14:50:48 snx11000n004 kernel: [340395.214136] LDISKFS-fs (md5): Remounting filesystem read-only
Jan 10 14:50:48 snx11000n004 kernel: [340395.220182] Write to readonly device md5 (0x900005) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff880596d40a88
Jan 10 14:50:48 snx11000n004 kernel: [340395.234671] LDISKFS-fs error (device md5): ldiskfs_mb_release_inode_pa: free 117, pa_free 115
Jan 10 14:50:48 snx11000n004 kernel: [340395.243558] ----------[ cut here ]----------
Jan 10 14:50:48 snx11000n004 kernel: [340395.248362] kernel BUG at /builddir/build/BUILD/lustre-ldiskfs-3.3.0.x2/ldiskfs/mballoc.c:3799!
Jan 10 14:50:49 snx11000n004 kernel: [340395.256674] Write to readonly device md143 (0x90008f) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff8802fede3bc0
Jan 10 14:50:49 snx11000n004 kernel: [340395.256679] Write to readonly device md143 (0x90008f) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff8802fede30c8
Jan 10 14:50:49 snx11000n004 kernel: [340395.256695] Write to readonly device md143 (0x90008f) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff8802fede30c8
Jan 10 14:50:49 snx11000n004 kernel: [340395.256712] Write to readonly device md7 (0x900007) bi_flags: f000000000000001, bi_vcnt: 1, bi_idx: 0, bi->size: 4096, bi_cnt: 2, bi_private: ffff880483f645a8
Jan 10 14:50:49 snx11000n004 kernel: [340395.317126] invalid opcode: 0000 [#1] SMP
Jan 10 14:50:49 snx11000n004 kernel: [340395.321485] last sysfs file: /sys/devices/virtual/block/md131/uevent
Jan 10 14:50:49 snx11000n004 kernel: [340395.328034] CPU 0
Jan 10 14:50:49 snx11000n004 kernel: [340395.424861]
Jan 10 14:50:49 snx11000n004 kernel: [340395.520504] Pid: 11724, comm: umount Tainted: P W ---------------- 2.6.32-131.21.1.el6.lustre.3021.x86_64 #1 CS6000AC
Jan 10 14:50:49 snx11000n004 kernel: [340395.532218] RIP: 0010:[<ffffffffa0921ab6>] [<ffffffffa0921ab6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs]
Jan 10 14:50:49 snx11000n004 kernel: [340395.542875] RSP: 0018:ffff8805de375a58 EFLAGS: 00010202
Jan 10 14:50:49 snx11000n004 kernel: [340395.548364] RAX: 0000000000000073 RBX: 0000000000000075 RCX: ffff8807d3b0bc00
Jan 10 14:50:49 snx11000n004 kernel: [340395.555743] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff8806ebf95f00
Jan 10 14:50:49 snx11000n004 kernel: [340395.563142] RBP: ffff8805de375b08 R08: 0000000000000000 R09: 0000000000000080
Jan 10 14:50:49 snx11000n004 kernel: [340395.570521] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880324a63490
Jan 10 14:50:49 snx11000n004 kernel: [340395.577903] R13: ffff880596f74408 R14: 0000000000000082 R15: ffff88066edde7b8
Jan 10 14:50:49 snx11000n004 kernel: [340395.585286] FS: 00007f58836fe740(0000) GS:ffff880044600000(0000) knlGS:0000000000000000
Jan 10 14:50:49 snx11000n004 kernel: [340395.593619] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 10 14:50:49 snx11000n004 kernel: [340395.599539] CR2: 00007f6553fd90a0 CR3: 0000000779cc0000 CR4: 00000000000406f0
Jan 10 14:50:49 snx11000n004 kernel: [340395.606922] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 10 14:50:49 snx11000n004 kernel: [340395.614299] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 10 14:50:49 snx11000n004 kernel: [340395.621678] Process umount (pid: 11724, threadinfo ffff8805de374000, task ffff8806c96e80c0)
Jan 10 14:50:49 snx11000n004 kernel: [340395.630271] Stack:
Jan 10 14:50:49 snx11000n004 kernel: [340395.632464] ffff880500000075 0000000000000073 ffff880500000000 000000000706cc35
Jan 10 14:50:49 snx11000n004 kernel: [340395.639970] <0> 0000000000000075 000000000000007c ffff8805de375a98 ffffffff811a4066
Jan 10 14:50:49 snx11000n004 kernel: [340395.648046] <0> ffff8807d3b0bc00 ffff8807a0f1a800 ffff88066edde7b8 0000000000fe0000
Jan 10 14:50:49 snx11000n004 kernel: [340395.656371] Call Trace:
Jan 10 14:50:49 snx11000n004 kernel: [340395.659000] [<ffffffff811a4066>] ? __wait_on_buffer+0x26/0x30
Jan 10 14:50:49 snx11000n004 kernel: [340395.665024] [<ffffffffa092556e>] ldiskfs_discard_preallocations+0x1fe/0x490 [ldiskfs]
Jan 10 14:50:49 snx11000n004 kernel: [340395.673193] [<ffffffffa093e1c6>] ldiskfs_clear_inode+0x16/0x50 [ldiskfs]
Jan 10 14:50:49 snx11000n004 kernel: [340395.680168] [<ffffffff8118ceaf>] clear_inode+0x8f/0x110
Jan 10 14:50:49 snx11000n004 kernel: [340395.685655] [<ffffffff8118cf70>] dispose_list+0x40/0x120
Jan 10 14:50:49 snx11000n004 kernel: [340395.691236] [<ffffffff8118d41a>] invalidate_inodes+0xea/0x190
Jan 10 14:50:49 snx11000n004 kernel: [340395.697249] [<ffffffff81174f2c>] generic_shutdown_super+0x4c/0xe0
Jan 10 14:50:49 snx11000n004 kernel: [340395.703603] [<ffffffff81174ff1>] kill_block_super+0x31/0x50
Jan 10 14:50:49 snx11000n004 kernel: [340395.709455] [<ffffffff811760a0>] deactivate_super+0x70/0x90
Jan 10 14:50:49 snx11000n004 kernel: [340395.715291] [<ffffffff811915af>] mntput_no_expire+0xbf/0x110
Jan 10 14:50:49 snx11000n004 kernel: [340395.721253] [<ffffffffa10eb9c4>] unlock_mntput+0x64/0x70 [obdclass]
Jan 10 14:50:49 snx11000n004 kernel: [340395.727818] [<ffffffffa10f3ae3>] server_put_super+0x433/0x13e0 [obdclass]
Jan 10 14:50:49 snx11000n004 kernel: [340395.734875] [<ffffffff8108e120>] ? autoremove_wake_function+0x0/0x40
Jan 10 14:50:49 snx11000n004 kernel: [340395.741494] [<ffffffff8118d426>] ? invalidate_inodes+0xf6/0x190
Jan 10 14:50:49 snx11000n004 kernel: [340395.747672] [<ffffffff81174f3b>] generic_shutdown_super+0x5b/0xe0
Jan 10 14:50:49 snx11000n004 kernel: [340395.754054] [<ffffffff81175026>] kill_anon_super+0x16/0x60
Jan 10 14:50:49 snx11000n004 kernel: [340395.759856] [<ffffffffa10ea166>] lustre_kill_super+0x36/0x60 [obdclass]
Jan 10 14:50:49 snx11000n004 kernel: [340395.766760] [<ffffffff811760a0>] deactivate_super+0x70/0x90
Jan 10 14:50:49 snx11000n004 kernel: [340395.772612] [<ffffffff811915af>] mntput_no_expire+0xbf/0x110
Jan 10 14:50:49 snx11000n004 kernel: [340395.778555] [<ffffffff811919db>] sys_umount+0x7b/0x3a0
Jan 10 14:50:49 snx11000n004 kernel: [340395.783971] [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
The same crash occurred on 2 of 4 failover attempts. Logs are attached (kern, message, conman); the crash dump will be uploaded to the FTP server.
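
The numbers in the console log above can be cross-checked with a short script. The sketch below is illustrative only: the helper names `parse_pa_mismatch` and `split_dev` are hypothetical, and it assumes the hex value printed after each device name is a kernel-internal dev_t (major in the upper bits, 20 minor bits). It extracts the fields of the "pa free mismatch" line and shows that RAX (0x73) and RBX (0x75) in the register dump match the reported pa_free of 115 and the counted free of 117; that these registers hold those two variables is an inference from the matching values, not something the log states.

```python
import re

# Two excerpts copied from the console log above.
MISMATCH = (
    "LDISKFS-fs error (device md5): ldiskfs_mb_release_inode_pa: "
    "pa free mismatch: [pa ffff88066edde7b8] [phy 16646156] [logic 267] "
    "[len 117] [free 115] [error 0] [inode 117886005] [freed 117]"
)
READONLY = "Write to readonly device md139 (0x90008b) bi_flags: f000000000000001"


def parse_pa_mismatch(line):
    """Return the bracketed key/value fields of a 'pa free mismatch' line.

    The 'pa' field is a kernel pointer and is kept as a hex string;
    every other field is parsed as a decimal integer.
    """
    fields = {}
    for key, value in re.findall(r"\[(\w+) ([0-9a-f]+)\]", line):
        fields[key] = value if key == "pa" else int(value)
    return fields


def split_dev(dev):
    """Split a device number into (major, minor).

    Assumption: the value is a kernel-internal dev_t with 20 minor bits.
    """
    return dev >> 20, dev & 0xFFFFF


fields = parse_pa_mismatch(MISMATCH)

# Blocks recorded as free in the preallocation vs. blocks actually counted.
print("pa_free:", fields["free"], "counted:", fields["freed"])

# Register values from the oops, converted from hex.
print("RAX:", int("73", 16), "RBX:", int("75", 16))

# Device number from the "Write to readonly device" message.
dev_hex = re.search(r"\((0x[0-9a-f]+)\)", READONLY).group(1)
print("major, minor:", split_dev(int(dev_hex, 16)))
```

Under that assumption, 0x90008b decodes to major 9 (the md driver) and minor 139, which agrees with the device name md139 printed on the same line.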



 Comments   
Comment by Gerrit Updater [ 04/Jul/16 ]

lokesh.jaliminche (lokesh.jaliminche@seagate.com) uploaded a new patch: http://review.whamcloud.com/21141
Subject: LU-8364 ldiskfs: fixes for failover mode.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fa8baf792498fc74b369cb28b62a28e6db1b192e

Comment by Gerrit Updater [ 17/Dec/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21141/
Subject: LU-8364 ldiskfs: fixes for failover mode.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a70b020e5b2f1bbe3b759232852beaac4f0852b5

Comment by Peter Jones [ 17/Dec/16 ]

Landed for 2.10

Comment by Bob Glossman (Inactive) [ 17/Dec/16 ]

This landed fix looks incomplete to me.
I see patches added for el7.2, but not el7.3
Are no fixes needed there?
And what about SLES distros?

Comment by Peter Jones [ 19/Dec/16 ]

Yang Sheng

Could you please check into this?

Thanks

Peter

Comment by Gerrit Updater [ 27/Apr/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/26854
Subject: LU-8364 ldiskfs: fixes for failover mode
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e605bdce32f082aac0012d227aaf7e345ff4fd38

Comment by Bob Glossman (Inactive) [ 04/May/17 ]

I note that this latest mod adds patches for the SLES patch series. The older mod made similar changes for rhel7.2. I see nothing in either mod for el7.3. Are no changes needed there?

Comment by Gerrit Updater [ 09/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26854/
Subject: LU-8364 ldiskfs: fixes for failover mode
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 945fd61b80f22a4148c4c0953ddc4dfcd75337de

Comment by Peter Jones [ 09/May/17 ]

Yang Sheng

Could you please advise about whether patches are needed for RHEL 7.3?

Peter

Comment by Yang Sheng [ 10/May/17 ]

Yes, this patch should also be landed for RHEL 7.3. I am sorry that I missed it in the previous patch.

Comment by Yang Sheng [ 10/May/17 ]

This patch will conflict with the LU-9384 patch, so I'll push it after LU-9384 has landed.

Thanks,
YangSheng

Comment by Gerrit Updater [ 11/May/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/27077
Subject: LU-8364 ldiskfs: port check-ro patch to RHEL7.3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: efc8e0150d59a9af6a2b0cfd9460ee1b2c39d5fe

Comment by Gerrit Updater [ 24/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27077/
Subject: LU-8364 ldiskfs: fixes for failover mode for RHEL7.3
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3e29710665c7e41f7c36ec2012f3679d656ba7e2

Comment by Peter Jones [ 24/May/17 ]

Landed for 2.10
