[LU-2775] Interop 2.1.4<->2.4 failure on test suite lustre-initialization-1: ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed
Created: 07/Feb/13  Updated: 08/Apr/13  Resolved: 28/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4, Lustre 2.1.5
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Di Wang
Resolution: Fixed
Votes: 0
Labels: mq313
Environment: 2.1.4 client vs 2.4 server

Severity: 3
Rank (Obsolete): 6728

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/fce1d212-6f52-11e2-a955-52540035b04c.

03:44:53:Lustre: DEBUG MARKER: -----============= acceptance-small: runtests ============----- Sat Feb 2 03:44:51 PST 2013
03:44:53:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/sbin:/usr/sbin:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre
03:44:53:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
03:44:53:mpts=$(mount | grep -c /mnt/lustre' ');
03:44:53:if [ $running -ne $mpts ]; then
03:44:53:    echo $(hostname) env are INSANE!;
03:44:53:    exit 1;
03:44:53:fi
03:44:53:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
03:44:53:mpts=$(mount | grep -c /mnt/lustre2' ');
03:44:53:if [ $running -ne $mpts ]; then
03:44:53:    echo $(hostname) env are INSANE!;
03:44:53:    exit 1;
03:44:53:fi
03:45:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
03:45:04:Lustre: DEBUG MARKER: Using TIMEOUT=20
03:45:05:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
03:45:05:Lustre: 14464:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
03:45:05:Lustre: 14464:0:(debug.c:326:libcfs_debug_str2mask()) Skipped 3 previous similar messages
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark touching \/mnt\/lustre at Sat Feb  2 03:44:58 PST 2013
03:45:05:Lustre: DEBUG MARKER: touching /mnt/lustre at Sat Feb 2 03:44:58 PST 2013
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark create an empty file \/mnt\/lustre\/hosts.12756
03:45:05:Lustre: DEBUG MARKER: create an empty file /mnt/lustre/hosts.12756
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark copying \/etc\/hosts to \/mnt\/lustre\/hosts.12756
03:45:05:Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.12756
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing \/etc\/hosts and \/mnt\/lustre\/hosts.12756
03:45:05:Lustre: DEBUG MARKER: comparing /etc/hosts and /mnt/lustre/hosts.12756
03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark renaming \/mnt\/lustre\/hosts.12756 to \/mnt\/lustre\/hosts.12756.ren
03:45:05:Lustre: DEBUG MARKER: renaming /mnt/lustre/hosts.12756 to /mnt/lustre/hosts.12756.ren
03:45:05:LustreError: 15228:0:(lov_obd.c:1827:lov_find_cbdata()) ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed: 
03:45:05:LustreError: 15228:0:(lov_obd.c:1827:lov_find_cbdata()) LBUG
03:45:05:Pid: 15228, comm: mv
03:45:05:
03:45:05:Call Trace:
03:45:05: [<ffffffffa04357f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
03:45:05: [<ffffffffa0435e07>] lbug_with_loc+0x47/0xb0 [libcfs]
03:45:05: [<ffffffffa095eaba>] lov_find_cbdata+0x63a/0x720 [lov]
03:45:05: [<ffffffffa0a3c000>] ? return_if_equal+0x0/0x30 [lustre]
03:45:05: [<ffffffffa08dd361>] ? lmv_find_cbdata+0x1c1/0x5e0 [lmv]
03:45:05: [<ffffffffa0a3dec2>] find_cbdata+0x212/0x940 [lustre]
03:45:05: [<ffffffffa0a3e641>] ll_ddelete+0x51/0x2b0 [lustre]
03:45:05: [<ffffffff81272095>] ? _atomic_dec_and_lock+0x55/0x80
03:45:05: [<ffffffff81193eea>] dput+0xca/0x150
03:45:05: [<ffffffff8118b56a>] sys_renameat+0x1fa/0x260
03:45:05: [<ffffffff811808b4>] ? cp_new_stat+0xe4/0x100
03:45:05: [<ffffffff81180b8e>] ? vfs_lstat+0x1e/0x20
03:45:05: [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0
03:45:05: [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0
03:45:05: [<ffffffff8118b5eb>] sys_rename+0x1b/0x20
03:45:05: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
03:45:05:
03:46:27:BUG: soft lockup - CPU#0 stuck for 67s! [khelper:15230]
03:46:27:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
03:46:27:CPU 0 
03:46:27:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
03:46:27:
03:46:27:Pid: 15230, comm: khelper Not tainted 2.6.32-279.14.1.el6.x86_64 #1 Red Hat KVM
03:46:27:RIP: 0010:[<ffffffff8150098e>]  [<ffffffff8150098e>] _spin_lock+0x1e/0x30
03:46:27:RSP: 0018:ffff88007a011aa0  EFLAGS: 00000206
03:46:27:RAX: 0000000000000001 RBX: ffff88007a011aa0 RCX: 0000000000000000
03:46:27:RDX: 0000000000000000 RSI: ffffffff81a83fc0 RDI: ffffffff81a83fc0
03:46:27:RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
03:46:27:R10: ffff8800738f9200 R11: 00000000000000c0 R12: ffffffffa010a5d0
03:46:27:R13: 0000000000000018 R14: ffff88007a011a18 R15: 0000000000010282
03:46:27:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
03:46:27:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
03:46:27:CR2: 000000000107e228 CR3: 000000007c3cc000 CR4: 00000000000006f0
03:46:27:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
03:46:27:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
03:46:27:Process khelper (pid: 15230, threadinfo ffff88007a010000, task ffff88007c1bd540)
03:46:27:Stack:
03:46:27: ffff88007a011ad0 ffffffff81272095 ffff88007a011b10 ffffffff81a83fc0
03:46:27:<d> 00007fa32298b000 ffff88007a9e2380 ffff88007a011af0 ffffffff81193eba
03:46:27:<d> ffff88007a011b90 ffff88007a011ca0 ffff88007a011b10 ffffffff81188b35
03:46:27:Call Trace:


 Comments   
Comment by Di Wang [ 07/Feb/13 ]

It seems 2.1 and 2.4 have different definitions of fid_seq_is_mdt():

2.1

static inline int fid_seq_is_mdt(const __u64 seq)
{
        return seq == FID_SEQ_OST_MDT0 ||
               (seq >= FID_SEQ_OST_MDT1 && seq <= FID_SEQ_OST_MAX);
}

master

static inline int fid_seq_is_mdt(const __u64 seq)
{
        return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
}

Clearly the 2.1 client cannot recognize the new FID sequence here. We also still use the OST index inside loi to locate the OST on the client side, instead of an FLD lookup, so a 2.1 client should be able to live with the new normal sequences apart from this assertion. Probably we just need to fix the LASSERT in 2.1?
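
The mismatch between the two predicates can be checked in isolation. The following is only a sketch: the numeric values of the FID_SEQ_* constants are assumptions recalled from lustre_idl.h and should be verified against the tree in use, and the _v21/_master suffixes and seq_trips_v21_assert() are hypothetical names introduced for the comparison.

```c
#include <stdint.h>

/* Assumed values (recalled from lustre_idl.h); verify against the actual tree. */
#define FID_SEQ_OST_MDT0 0ULL
#define FID_SEQ_OST_MDT1 3ULL
#define FID_SEQ_OST_MAX  9ULL
#define FID_SEQ_NORMAL   0x200000400ULL

/* Predicate as quoted for 2.1, renamed for side-by-side comparison. */
static inline int fid_seq_is_mdt_v21(uint64_t seq)
{
        return seq == FID_SEQ_OST_MDT0 ||
               (seq >= FID_SEQ_OST_MDT1 && seq <= FID_SEQ_OST_MAX);
}

/* Predicate as quoted for master, renamed for side-by-side comparison. */
static inline int fid_seq_is_mdt_master(uint64_t seq)
{
        return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
}

/*
 * Hypothetical helper: nonzero when a sequence that master accepts
 * would fail the 2.1 check, i.e. would trip the client-side assertion.
 */
static inline int seq_trips_v21_assert(uint64_t seq)
{
        return fid_seq_is_mdt_master(seq) && !fid_seq_is_mdt_v21(seq);
}
```

Under these assumed constants, any sequence at or above FID_SEQ_NORMAL passes the master check and fails the 2.1 check, which would be consistent with the LBUG in lov_find_cbdata() above.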

Comment by Andreas Dilger [ 07/Feb/13 ]

Di, I thought FID-on-OST was only enabled when multiple MDTs are enabled? That would avoid this problem for now, because we don't have interop between 2.1 clients and multiple MDTs, though there is still a need to fix the LASSERT() on the client.

It would actually be better NOT to LASSERT() on bad data from the network, and instead just return an error.
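
A minimal sketch of that idea, not the code from the patches referenced in this ticket: check_wire_seq() is a hypothetical helper, the constant values are assumptions recalled from lustre_idl.h, and the choice of -EINVAL as the error code is also an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values (recalled from lustre_idl.h); verify against the actual tree. */
#define FID_SEQ_OST_MDT0 0ULL
#define FID_SEQ_NORMAL   0x200000400ULL

/* Predicate as quoted for master earlier in this ticket. */
static inline int fid_seq_is_mdt(uint64_t seq)
{
        return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
}

/*
 * Hypothetical helper: validate a sequence received from the network.
 * Instead of asserting (and taking the node down on bad remote data),
 * report the problem to the caller as an error code.
 */
static int check_wire_seq(uint64_t seq)
{
        if (!fid_seq_is_mdt(seq))
                return -EINVAL; /* assumed errno; the real patch may differ */
        return 0;
}
```

A caller would then propagate the negative return value up the stack rather than hitting an LBUG, so a peer sending unexpected data fails one operation instead of hanging the client.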

Comment by Di Wang [ 07/Feb/13 ]

Hmm, I thought FID-on-OST should be enabled once both the MDS and OST are 2.4, though we still keep using IDIF if there are existing files. Denying a 2.1 client access to other MDTs will be done by returning EIO (or another error) when it tries to access a remote directory. Thanks.

Comment by Andreas Dilger [ 07/Feb/13 ]

http://review.whamcloud.com/5304 for b2_1

Comment by Di Wang [ 08/Feb/13 ]

http://review.whamcloud.com/5307 for master

Comment by Jodi Levi (Inactive) [ 21/Feb/13 ]

Landed to master and b2_1

Comment by Andreas Dilger [ 25/Feb/13 ]

There is still a patch for master to land to remove the LASSERT() on data passed from the network:
http://review.whamcloud.com/5456

Comment by Niu Yawei (Inactive) [ 03/Mar/13 ]

conf-sanity test_32b still fails with -ENOSPC in the 1.8 -> 2.4 test: https://maloo.whamcloud.com/test_sets/6efb0284-81d1-11e2-8564-52540035b04c

dd: writing `/tmp/t32/mnt/lustre/tmp_file': No space left on device
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0667586 s, 0.0 kB/s
 conf-sanity test_32b: @@@@@@ FAIL: dd failed 

It's similar to LU-2768, but not exactly the same.

Comment by Niu Yawei (Inactive) [ 04/Mar/13 ]

That was because the disk size of my 1.8 image was wrong. Sorry for the noise.
