[LU-2775] Interop 2.1.4<->2.4 failure on test suite lustre-initialization-1: ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed Created: 07/Feb/13 Updated: 08/Apr/13 Resolved: 28/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.4, Lustre 2.1.5 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mq313 | ||
| Environment: |
2.1.4 client vs 2.4 server |
||
| Severity: | 3 |
| Rank (Obsolete): | 6728 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/fce1d212-6f52-11e2-a955-52540035b04c.
03:44:53:Lustre: DEBUG MARKER: -----============= acceptance-small: runtests ============----- Sat Feb 2 03:44:51 PST 2013
03:44:53:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/sbin:/usr/sbin:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre
03:44:53:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
03:44:53:mpts=$(mount | grep -c /mnt/lustre' ');
03:44:53:if [ $running -ne $mpts ]; then
03:44:53: echo $(hostname) env are INSANE!;
03:44:53: exit 1;
03:44:53:fi
03:44:53:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
03:44:53:mpts=$(mount | grep -c /mnt/lustre2' ');
03:44:53:if [ $running -ne $mpts ]; then
03:44:53: echo $(hostname) env are INSANE!;
03:44:53: exit 1;
03:44:53:fi
03:45:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
03:45:04:Lustre: DEBUG MARKER: Using TIMEOUT=20
03:45:05:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
03:45:05:Lustre: 14464:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
03:45:05:Lustre: 14464:0:(debug.c:326:libcfs_debug_str2mask()) Skipped 3 previous similar messages
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark touching \/mnt\/lustre at Sat Feb 2 03:44:58 PST 2013
03:45:05:Lustre: DEBUG MARKER: touching /mnt/lustre at Sat Feb 2 03:44:58 PST 2013
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark create an empty file \/mnt\/lustre\/hosts.12756
03:45:05:Lustre: DEBUG MARKER: create an empty file /mnt/lustre/hosts.12756
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark copying \/etc\/hosts to \/mnt\/lustre\/hosts.12756
03:45:05:Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.12756
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing \/etc\/hosts and \/mnt\/lustre\/hosts.12756
03:45:05:Lustre: DEBUG MARKER: comparing /etc/hosts and /mnt/lustre/hosts.12756
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark renaming \/mnt\/lustre\/hosts.12756 to \/mnt\/lustre\/hosts.12756.ren
03:45:05:Lustre: DEBUG MARKER: renaming /mnt/lustre/hosts.12756 to /mnt/lustre/hosts.12756.ren
03:45:05:LustreError: 15228:0:(lov_obd.c:1827:lov_find_cbdata()) ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed:
03:45:05:LustreError: 15228:0:(lov_obd.c:1827:lov_find_cbdata()) LBUG
03:45:05:Pid: 15228, comm: mv
03:45:05:
03:45:05:Call Trace:
03:45:05: [<ffffffffa04357f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
03:45:05: [<ffffffffa0435e07>] lbug_with_loc+0x47/0xb0 [libcfs]
03:45:05: [<ffffffffa095eaba>] lov_find_cbdata+0x63a/0x720 [lov]
03:45:05: [<ffffffffa0a3c000>] ? return_if_equal+0x0/0x30 [lustre]
03:45:05: [<ffffffffa08dd361>] ? lmv_find_cbdata+0x1c1/0x5e0 [lmv]
03:45:05: [<ffffffffa0a3dec2>] find_cbdata+0x212/0x940 [lustre]
03:45:05: [<ffffffffa0a3e641>] ll_ddelete+0x51/0x2b0 [lustre]
03:45:05: [<ffffffff81272095>] ? _atomic_dec_and_lock+0x55/0x80
03:45:05: [<ffffffff81193eea>] dput+0xca/0x150
03:45:05: [<ffffffff8118b56a>] sys_renameat+0x1fa/0x260
03:45:05: [<ffffffff811808b4>] ? cp_new_stat+0xe4/0x100
03:45:05: [<ffffffff81180b8e>] ? vfs_lstat+0x1e/0x20
03:45:05: [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0
03:45:05: [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0
03:45:05: [<ffffffff8118b5eb>] sys_rename+0x1b/0x20
03:45:05: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
03:45:05:
03:46:27:BUG: soft lockup - CPU#0 stuck for 67s! [khelper:15230]
03:46:27:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
03:46:27:CPU 0
03:46:27:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
03:46:27:
03:46:27:Pid: 15230, comm: khelper Not tainted 2.6.32-279.14.1.el6.x86_64 #1 Red Hat KVM
03:46:27:RIP: 0010:[<ffffffff8150098e>] [<ffffffff8150098e>] _spin_lock+0x1e/0x30
03:46:27:RSP: 0018:ffff88007a011aa0 EFLAGS: 00000206
03:46:27:RAX: 0000000000000001 RBX: ffff88007a011aa0 RCX: 0000000000000000
03:46:27:RDX: 0000000000000000 RSI: ffffffff81a83fc0 RDI: ffffffff81a83fc0
03:46:27:RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
03:46:27:R10: ffff8800738f9200 R11: 00000000000000c0 R12: ffffffffa010a5d0
03:46:27:R13: 0000000000000018 R14: ffff88007a011a18 R15: 0000000000010282
03:46:27:FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
03:46:27:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
03:46:27:CR2: 000000000107e228 CR3: 000000007c3cc000 CR4: 00000000000006f0
03:46:27:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
03:46:27:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
03:46:27:Process khelper (pid: 15230, threadinfo ffff88007a010000, task ffff88007c1bd540)
03:46:27:Stack:
03:46:27: ffff88007a011ad0 ffffffff81272095 ffff88007a011b10 ffffffff81a83fc0
03:46:27:<d> 00007fa32298b000 ffff88007a9e2380 ffff88007a011af0 ffffffff81193eba
03:46:27:<d> ffff88007a011b90 ffff88007a011ca0 ffff88007a011b10 ffffffff81188b35
03:46:27:Call Trace: |
| Comments |
| Comment by Di Wang [ 07/Feb/13 ] |
|
It seems 2.1 and 2.4 have different definitions of fid_seq_is_mdt().
2.1:
static inline int fid_seq_is_mdt(const __u64 seq)
{
return seq == FID_SEQ_OST_MDT0 ||
(seq >= FID_SEQ_OST_MDT1 && seq <= FID_SEQ_OST_MAX);
};
master:
static inline int fid_seq_is_mdt(const __u64 seq)
{
return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
};
Clearly the 2.1 client cannot recognize the new FID here. Also, the client still uses the OST index inside loi to locate the OST, instead of an fld lookup, so a 2.1 client should be able to live with the new normal seq except for this assertion. Probably we just need to fix the LASSERT in 2.1? |
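The mismatch described above can be sketched as a small standalone comparison. This is only an illustration: the two predicates are copied from the comment but renamed (fid_seq_is_mdt_2_1 and fid_seq_is_mdt_master are hypothetical names), and the numeric constants are assumptions modeled on lustre_idl.h that should be checked against the actual header.

```c
#include <stdint.h>

/* Assumed, illustrative values; verify against lustre_idl.h. */
#define FID_SEQ_OST_MDT0 0x0ULL
#define FID_SEQ_OST_MDT1 0x3ULL
#define FID_SEQ_OST_MAX  0x9ULL
#define FID_SEQ_NORMAL   0x200000400ULL

/* Check as quoted for 2.1 (renamed for side-by-side comparison). */
static inline int fid_seq_is_mdt_2_1(uint64_t seq)
{
	return seq == FID_SEQ_OST_MDT0 ||
	       (seq >= FID_SEQ_OST_MDT1 && seq <= FID_SEQ_OST_MAX);
}

/* Check as quoted for master (renamed for side-by-side comparison). */
static inline int fid_seq_is_mdt_master(uint64_t seq)
{
	return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
}
```

Under these assumed values, a sequence at or above FID_SEQ_NORMAL satisfies the master predicate but not the 2.1 one, which is consistent with the 2.1 client hitting the assertion against a 2.4 server.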
| Comment by Andreas Dilger [ 07/Feb/13 ] |
|
Di, I thought FID-on-OST was only enabled when multiple MDTs are enabled? That would avoid this problem for now, because we don't have interop between 2.1 clients and multiple MDTs, though there is still a need to fix the LASSERT() on the client. It would actually be better NOT to LASSERT() on bad data from the network, and instead just return an error. |
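The suggestion to return an error instead of asserting on network-supplied data could look roughly like the sketch below. It is not the landed patch: oi_seq_validate is a hypothetical helper, -EINVAL is an arbitrary choice of error code, and the constant values are assumptions.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed, illustrative values; verify against lustre_idl.h. */
#define FID_SEQ_OST_MDT0 0x0ULL
#define FID_SEQ_NORMAL   0x200000400ULL

/* Predicate as quoted for master in the comment above. */
static inline int fid_seq_is_mdt(uint64_t seq)
{
	return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
}

/*
 * Hypothetical helper: validate a sequence received from the network.
 * Returns 0 when the sequence is acceptable and a negative errno
 * otherwise, so the caller can fail the operation instead of hitting
 * an LBUG on unexpected data.
 */
static inline int oi_seq_validate(uint64_t seq)
{
	if (!fid_seq_is_mdt(seq))
		return -EINVAL;
	return 0;
}
```

A caller would propagate the negative return value up the stack, so a malformed or unrecognized sequence fails one request rather than taking down the client.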
| Comment by Di Wang [ 07/Feb/13 ] |
|
Hmm, I thought FID-on-OST should be enabled once both MDS and OST are 2.4, though we still keep using IDIF if there are existing files. Denying 2.1 clients access to other MDTs will be done by returning EIO (or another error) when they try to access a remote directory. Thanks. |
| Comment by Andreas Dilger [ 07/Feb/13 ] |
|
http://review.whamcloud.com/5304 for b2_1 |
| Comment by Di Wang [ 08/Feb/13 ] |
|
http://review.whamcloud.com/5307 for master |
| Comment by Jodi Levi (Inactive) [ 21/Feb/13 ] |
|
Landed to master and b2_1 |
| Comment by Andreas Dilger [ 25/Feb/13 ] |
|
There is still a patch for master to land to remove the LASSERT() on data passed from the network: |
| Comment by Niu Yawei (Inactive) [ 03/Mar/13 ] |
|
conf-sanity 32b still fails with -ENOSPC for the 1.8 -> 2.4 test: https://maloo.whamcloud.com/test_sets/6efb0284-81d1-11e2-8564-52540035b04c
dd: writing `/tmp/t32/mnt/lustre/tmp_file': No space left on device
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0667586 s, 0.0 kB/s
conf-sanity test_32b: @@@@@@ FAIL: dd failed
It's similar to |
| Comment by Niu Yawei (Inactive) [ 04/Mar/13 ] |
|
It was because my 1.8 image disk size was wrong, sorry for the noise. |