[LU-4515] Test failure sanity-quota test_34: Used space for user 60000 is 0, expected 2048 Created: 20/Jan/14 Updated: 02/Oct/14 Resolved: 04/Jul/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 12354 | ||||||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: The sub-test test_34 failed with the following error:
Info required for matching: sanity-quota 34 |
| Comments |
| Comment by Jodi Levi (Inactive) [ 23/Jan/14 ] |
|
Niu, |
| Comment by Andreas Dilger [ 23/Jan/14 ] |
|
I would suspect this is a regression in a patch that landed recently. The problem only appeared on 2014-01-16, so failing tests should be checked for the earliest parent that is common to all of the test failures (probably landed within a day of the start of tests failing). I also wrote a small fixup to the sanity-quota.sh test script because it was spewing a lot of garbage into the logs: http://review.whamcloud.com/8978 Note that this is NOT a fix to the problem reported here. |
| Comment by Andreas Dilger [ 23/Jan/14 ] |
|
I did some digging into the test failures, and the earliest failures were hit with parent commit 035f3e4bf7532839dd88a4ae330fd67542e17cdd (http://review.whamcloud.com/8528) but I don't think that patch is the source of the problem. Nothing similar was hit in the two weeks prior to this. |
| Comment by Andreas Dilger [ 23/Jan/14 ] |
|
I was looking at the debug logs for https://maloo.whamcloud.com/test_sets/0c69b0d8-7e9d-11e3-b6d1-52540035b04c and it appears that the "chown" RPC ("MDS_REINT") is sent to the MDS, but no matching "OST_SETATTR" RPC is sent to the OST to change the object ownership. I also see in the logs: 1389835717.361895:0:4338:0:(osd_quota.c:98:osd_acct_index_lookup()) lustre-MDT0000: id ea60 not found in DMU accounting ZAP which indicates to me (a non quota expert) that this user (0xea60=60000) doesn't have any files in quota accounting. |
| Comment by nasf (Inactive) [ 24/Jan/14 ] |
|
Niu: I have submitted a patch (http://review.whamcloud.com/#/c/8981/) to disable sanity-quota |
| Comment by Johann Lombardi (Inactive) [ 24/Jan/14 ] |
|
The lack of OST_SETATTR requests is weird. The test relies on wait_delete_completed to wait for setattr RPCs to be sent, is it possible that this function does not work any more for setattr? |
| Comment by Zhenyu Xu [ 24/Jan/14 ] |
|
from client ll_setattr_raw(), after it set attr on MDS, client only set ost attr for size & time change, won't do anything about uid change on OST. if (attr->ia_valid & (ATTR_SIZE | ATTR_ATIME | ATTR_ATIME_SET | ATTR_MTIME | ATTR_MTIME_SET)) { /* For truncate and utimes sending attributes to OSTs, setting * mtime/atime to the past will be performed under PW [0:EOF] * extent lock (new_size:EOF for truncate). It may seem * excessive to send mtime/atime updates to OSTs when not * setting times to past, but it is necessary due to possible * time de-synchronization between MDT inode and OST objects */ rc = ll_setattr_ost(inode, attr); } I verified this behavior on ldiskfs backend, and the test passed for ldiskfs backend. From what's been observed, it likely relates to the quota implementation with zfs backend. |
| Comment by Johann Lombardi (Inactive) [ 24/Jan/14 ] |
|
Bobijam, since 2.4, this is the MDT which is responsible for sending the OST_SETATTR RPCs to change ownership of objects on the OSTs. This is done through OSP and that's why the test calls wait_delete_completed to wait for OSP to flush all pending requests. |
| Comment by Bob Glossman (Inactive) [ 24/Jan/14 ] |
|
As Johann said the test depends on wait_delete_completed to operate correctly. wait_delete_completed relies on lproc stats maintained in osp_sync.c. I see http://review.whamcloud.com/7158 landed that touches osp code just about the time these errors started up. I don't see anything wrong in that mod, but it does make changes in the close neighborhood and the coincidence in time seems suspicious. |
| Comment by Jodi Levi (Inactive) [ 27/Jan/14 ] |
|
Test has been disabled. Will re-enable once fix is in place. |
| Comment by Nathaniel Clark [ 27/Jan/14 ] |
|
Jodi, how has this test been disabled? Could this be related to issues with TEI-1403? |
| Comment by Niu Yawei (Inactive) [ 19/Feb/14 ] |
|
I saw this on the mds log before s-q test_34: Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl lustre-mdt1/mdt1 /mnt/mds1 ------------[ cut here ]------------ WARNING: at fs/proc/generic.c:590 proc_register+0x129/0x220() (Tainted: P W --------------- ) Hardware name: KVM proc_dir_entry 'lustre/osc' already registered Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) nodemap(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] Pid: 26088, comm: llog_process_th Tainted: P W --------------- 2.6.32-358.23.2.el6_lustre.ge80a1ca.x86_64 #1 Call Trace: [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50 [<ffffffff811f0299>] ? proc_register+0x129/0x220 [<ffffffff811f05c2>] ? proc_mkdir_mode+0x42/0x60 [<ffffffff811f05f6>] ? proc_mkdir+0x16/0x20 [<ffffffffa0718c00>] ? lprocfs_seq_register+0x20/0x80 [obdclass] [<ffffffffa070a203>] ? class_search_type+0x53/0xa0 [obdclass] [<ffffffffa1098f0a>] ? osp_lprocfs_init+0x30a/0x530 [osp] [<ffffffffa1085c3a>] ? osp_device_alloc+0xfaa/0x29e0 [osp] [<ffffffffa0746263>] ? lu_context_init+0xa3/0x240 [obdclass] [<ffffffffa072ce2f>] ? obd_setup+0x1bf/0x290 [obdclass] [<ffffffffa072d108>] ? class_setup+0x208/0x870 [obdclass] [<ffffffffa073567c>] ? class_process_config+0xc6c/0x1ad0 [obdclass] [<ffffffffa072e263>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa07375fb>] ? class_config_llog_handler+0xa2b/0x1880 [obdclass] [<ffffffff8150faee>] ? mutex_lock+0x1e/0x50 [<ffffffffa06f3d5b>] ? llog_process_thread+0x8ab/0xd60 [obdclass] [<ffffffffa0741e4f>] ? keys_fill+0x6f/0x190 [obdclass] [<ffffffffa06f4255>] ? llog_process_thread_daemonize+0x45/0x70 [obdclass] [<ffffffffa06f4210>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass] [<ffffffff81096a36>] ? kthread+0x96/0xa0 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 [<ffffffff810969a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 ---[ end trace f9bd04d09a084708 ]--- Looks there is something wrong with the osp proc files, which makes wait_delete_completed() out of work. This should be introduced by |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
The revert helped but a module loading race condition still exist when both client and server are on the same node. A patch exist to fix that issue at http://review.whamcloud.com/#/c/9038. |
| Comment by Andreas Dilger [ 24/Apr/14 ] |
|
Niu, we still need a test to remove test_34 from sanity-quota.sh ALWAYS_EXCEPT. Please just delete that ALWAYS_EXCEPT line, since these exceptions should be added at the top of the function where ALWAYS_EXCEPT is first defined. |
| Comment by Jodi Levi (Inactive) [ 08/May/14 ] |
|
Patch landed to Master. Please reopen ticket if more work is needed. |
| Comment by Niu Yawei (Inactive) [ 03/Jul/14 ] |
| Comment by Niu Yawei (Inactive) [ 03/Jul/14 ] |
|
The test_34 need be re-enabled. |
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
We can't enable sanity-quota test_34 until the test is fixed by |
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
Closing this bug again. I've moved the re-enabling of test_34 into http://review.whamcloud.com/10989 so that this patch is being verified as fixed. |