[LU-3326] recovery-mds-scale test_failover_ost: tar: Cannot open: No space left on device Created: 13/May/13  Updated: 26/Feb/15  Resolved: 20/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: mn1, mn4
Environment:

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1486
Distro/Arch: RHEL6.4/x86_64
Test Group: failover


Issue Links:
Duplicate
is duplicated by LU-5235 Failure on test suite recovery-mds-sc... Resolved
Related
is related to LU-5526 recovery-mds-scale test failover_mds:... Resolved
Severity: 3
Rank (Obsolete): 8208

 Description   

After running recovery-mds-scale test_failover_ost for 1.5 hours (OST failed over 6 times), client load on one of the clients failed as follows:

<snip>
tar: etc/mail/submit.cf: Cannot open: No space left on device
tar: etc/mail/trusted-users: Cannot open: No space left on device
tar: etc/mail/virtusertable: Cannot open: No space left on device
tar: etc/mail/access: Cannot open: No space left on device
tar: etc/mail/aliasesdb-stamp: Cannot open: No space left on device
tar: etc/gssapi_mech.conf: Cannot open: No space left on device
tar: Exiting with failure status due to previous errors

Console log on the client (client-32vm6) showed the following:

19:40:31:INFO: task tar:2790 blocked for more than 120 seconds.
19:40:31:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
19:40:31:tar           D 0000000000000000     0  2790   2788 0x00000080
19:40:31: ffff88004eb73a28 0000000000000082 ffff88004eb739d8 ffff88007c24fe50
19:40:31: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
19:40:31: ffff88007bcb3ab8 ffff88004eb73fd8 000000000000fb88 ffff88007bcb3ab8
19:40:31:Call Trace:
19:40:31: [<ffffffffa03d775a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
19:40:31: [<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0
19:40:31: [<ffffffffa068517c>] ? ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
19:40:31: [<ffffffffa069a770>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
19:40:31: [<ffffffff8150e683>] wait_for_common+0x123/0x180
19:40:31: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
19:40:31: [<ffffffff8150e79d>] wait_for_completion+0x1d/0x20
19:40:31: [<ffffffffa08cbf6c>] osc_io_setattr_end+0xbc/0x190 [osc]
19:40:31: [<ffffffffa095cde0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
19:40:31: [<ffffffffa055cf30>] cl_io_end+0x60/0x150 [obdclass]
19:40:31: [<ffffffffa055d7e0>] ? cl_io_start+0x0/0x140 [obdclass]
19:40:31: [<ffffffffa095ced1>] lov_io_end_wrapper+0xf1/0x100 [lov]
19:40:31: [<ffffffffa095c86e>] lov_io_call+0x8e/0x130 [lov]
19:40:31: [<ffffffffa095e3bc>] lov_io_end+0x4c/0xf0 [lov]
19:40:31: [<ffffffffa055cf30>] cl_io_end+0x60/0x150 [obdclass]
19:40:31: [<ffffffffa0561f92>] cl_io_loop+0xc2/0x1b0 [obdclass]
19:40:31: [<ffffffffa0a2aa08>] cl_setattr_ost+0x208/0x2c0 [lustre]
19:40:31: [<ffffffffa09f8b0e>] ll_setattr_raw+0x9ce/0x1000 [lustre]
19:40:31: [<ffffffffa09f919b>] ll_setattr+0x5b/0xf0 [lustre]
19:40:31: [<ffffffff8119e708>] notify_change+0x168/0x340
19:40:31: [<ffffffff811b284c>] utimes_common+0xdc/0x1b0
19:40:31: [<ffffffff811828d1>] ? __fput+0x1a1/0x210
19:40:31: [<ffffffff811b29fe>] do_utimes+0xde/0xf0
19:40:31: [<ffffffff811b2b12>] sys_utimensat+0x32/0x90
19:40:31: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/053120d2-bb19-11e2-8824-52540035b04c



 Comments   
Comment by Jian Yu [ 13/May/13 ]

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1481
Distro/Arch: RHEL6.4/x86_64
Test Group: failover

After running recovery-mds-scale test_failover_ost for more than 1 hour (OST failed over 5 times), dd operation on one of the clients failed as follows:

2013-05-08 23:10:39: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
+ cd /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
+ FREE_SPACE=4050128
+ BLKS=911278
+ echoerr 'Free disk space is 4050128, 4k blocks to dd is 911278'
+ echo 'Free disk space is 4050128, 4k blocks to dd is 911278'
Free disk space is 4050128, 4k blocks to dd is 911278
+ load_pid=27789
+ wait 27789
+ dd bs=4k count=911278 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com/dd-file
dd: writing `/mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com/dd-file': No space left on device
409345+0 records in
409344+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2013-05-08 23:23:03: dd failed'

Console log on the client (client-26vm5) showed the following:

23:23:08:LustreError: 27789:0:(vvp_io.c:1086:vvp_io_commit_write()) Write page 409344 of inode ffff880037eda1b8 failed -28

Maloo report: https://maloo.whamcloud.com/test_sets/52c5d99a-b8f9-11e2-891d-52540035b04c

Comment by Jian Yu [ 13/May/13 ]

Hi Hongchao,

It seems the above out-of-space issue on the dd operation is related to LU-657. Could you please check whether the original out-of-space issue on the tar operation in this ticket has the same cause? Thanks.

Comment by Andreas Dilger [ 13/May/13 ]

Yu Jian,
is there an expectation that the OST would not run out of space after writing to it for an hour? Are the files written earlier in the test being deleted, or will the test fill up all of the OST space? Does it check the free space on the OST that will be written on, or is this the free space for the whole filesystem and the one file may run out of space?

Comment by Jian Yu [ 14/May/13 ]

is there an expectation that the OST would not run out of space after writing to it for an hour?

Yes, the OSTs should not run out of space during the whole test run, because in run_{dd,tar,dbench,iozone}.sh each application keeps running inside its own $TESTDIR, which is removed and recreated on each successful loop.
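For reference, each run_*.sh load follows roughly this structure (a simplified sketch only, not the exact script source; run_the_load stands in for the actual dd/tar/dbench/iozone invocation):

while [ ! -e "$END_RUN_FILE" ] && $CONTINUE; do
        mkdir -p $TESTDIR
        cd $TESTDIR || break
        run_the_load                            # dd, tar, dbench or iozone
        if [ $? -eq 0 ]; then
                echoerr "$(date '+%F %H:%M:%S'): load succeeded"
                cd /tmp
                rm -rf $TESTDIR                 # data from a successful loop is removed
        else
                echoerr "$(date '+%F %H:%M:%S'): load failed"
                break                           # on failure $TESTDIR and its files are left behind
        fi
done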

Are the files written earlier in the test being deleted, or will the test fill up all of the OST space?

If the previous failover_mds test passed, the applications on the clients are simply terminated without deleting $TESTDIR and the corresponding files. If the test failed, the failed application also leaves $TESTDIR and its files behind. We need to fix run_*.sh to delete the test dirs in both situations.

However, it seems the above issue is not the cause of this ticket. Let's analyze the first Maloo report in this ticket: the failover_mds test failed with LU-3325 and left the /mnt/lustre/d0.tar-client-32vm6.lab.whamcloud.com dir on client-32vm6. Then the failover_ost test started, and the tar application was again run inside /mnt/lustre/d0.tar-client-32vm6.lab.whamcloud.com on client-32vm6. Since there were several successful loops, that dir was removed and recreated several times, so the dir and files left by failover_mds did not affect failover_ost.

Does it check the free space on the OST that will be written on, or is this the free space for the whole filesystem and the one file may run out of space?

Currently, only run_dd.sh checks the free space before performing the dd operation. We need to improve the other run_*.sh scripts to add this check. Running out of space on the OSTs is not expected in the failover_ost test.

Besides improving the run_*.sh scripts, we still need to figure out the real cause of this ticket.

Comment by Jodi Levi (Inactive) [ 14/May/13 ]

Reducing this from blocker priority, but we want to continue digging into this to find the cause.

Comment by Andreas Dilger [ 14/May/13 ]

One comment is that the previous data was in /mnt/lustre/d0.tar-client-32vm6.lab.whamcloud.com but the failed test was writing to /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com/dd-file (i.e. dd vs. tar), so is it possible that the old directory was not cleaned up?

It might be useful to log lfs df output in the test script before each "dd" so that we know what the space distribution is.
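For example (an illustrative addition only, reusing the $LFS/$TESTDIR/echoerr names already present in run_dd.sh):

# dump the per-OST space distribution right before each dd iteration
echoerr "== lfs df before dd =="
$LFS df $TESTDIR 1>&2        # per-OST block usage
$LFS df -i $TESTDIR 1>&2     # per-OST inode usage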

Comment by Jian Yu [ 15/May/13 ]

One comment is that the previous data was in /mnt/lustre/d0.tar-client-32vm6.lab.whamcloud.com but the failed test was writing to /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com/dd-file (i.e. dd vs. tar), so is it possible that the old directory was not cleaned up?

For the first Maloo report (in the Description section above), both the failover_mds and failover_ost tests failed at the tar operation on client-32vm6; the dd operation on client-32vm5 did not fail. Per the analysis in the comment above, the dir and files left by failover_mds did not affect failover_ost.

For the second Maloo report above, failover_mds failed at the tar operation on client-26vm6 and left data in /mnt/lustre/d0.tar-client-26vm6.lab.whamcloud.com, while the dd operation kept succeeding on client-26vm5 and cleaned up its dir before failover_mds stopped. Then failover_ost started: the tar operation still ran on client-26vm6 in /mnt/lustre/d0.tar-client-26vm6.lab.whamcloud.com and did not fail, while the dd operation still ran on client-26vm5 in /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com but failed. Again, since there was one successful dd loop before it failed, the /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com dir had been removed and recreated, so the dir and files left by failover_mds did not affect failover_ost either.

It might be useful to log lfs df output in the test script before each "dd" so that we know what the space distribution is.

We can see this information in the "run_dd_debug" log in the Maloo reports:

2013-05-08 23:10:39: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
+ cd /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com
+ FREE_SPACE=4050128
+ BLKS=911278
+ echoerr 'Free disk space is 4050128, 4k blocks to dd is 911278'
+ echo 'Free disk space is 4050128, 4k blocks to dd is 911278'
Free disk space is 4050128, 4k blocks to dd is 911278
+ load_pid=27789
+ wait 27789
+ dd bs=4k count=911278 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com/dd-file
dd: writing `/mnt/lustre/d0.dd-client-26vm5.lab.whamcloud.com/dd-file': No space left on device
409345+0 records in
409344+0 records out

The corresponding code in run_dd.sh:

while [ ! -e "$END_RUN_FILE" ] && $CONTINUE; do
        # ......
        FREE_SPACE=$($LFS df $TESTDIR|awk '/filesystem summary:/ {print $5}')
        BLKS=$((FREE_SPACE * 9 / 40))
        echoerr "Free disk space is $FREE_SPACE, 4k blocks to dd is $BLKS"
        # ......
done
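As a quick sanity check of the numbers in the run_dd_debug log above (assuming lfs df reports free space in 1 KB blocks, so BLKS is a count of 4 KB blocks):

FREE_SPACE=4050128                       # KB free, from "lfs df"
BLKS=$((FREE_SPACE * 9 / 40))            # = 911278 blocks of 4 KB each
echo "$((BLKS * 4)) KB to be written"    # = 3645112 KB, i.e. ~90% of the free space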

It looks like the dd operation keeps claiming too much of the free space (about 90%, as checked above), which is likely to make the operations on the other clients run out of space, or to make dd itself run out of space, since all of the operations compete for the free space concurrently.

Since the goal of the recovery-*-scale tests is not to fill up the space but to keep the applications running on different clients, I'd suggest reducing the dd size (e.g. use only 50% of the free space, or dynamically calculate the percentage according to the number of clients) and leaving enough free space for the other applications.
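As an illustration only (not an actual patch; NUM_CLIENTS is a hypothetical variable set by the test harness), the calculation in run_dd.sh could be changed along these lines:

# use only ~50% of the free space, split across the clients running loads
NUM_CLIENTS=${NUM_CLIENTS:-4}                  # hypothetical: number of client loads
FREE_SPACE=$($LFS df $TESTDIR | awk '/filesystem summary:/ {print $5}')
BLKS=$((FREE_SPACE / 2 / 4 / NUM_CLIENTS))     # 4 KB blocks, half the free space shared
echoerr "Free disk space is $FREE_SPACE, 4k blocks to dd is $BLKS"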

Comment by Hongchao Zhang [ 17/May/13 ]

the patch is tracked at http://review.whamcloud.com/#change,6376

Comment by Jian Yu [ 28/May/13 ]

Lustre Tag: v2_4_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/12/
Distro/Arch: RHEL6.4/x86_64

The issue occurred on recovery-mds-scale test_failover_ost:
https://maloo.whamcloud.com/test_sets/2e35d81c-c6c5-11e2-ae4e-52540035b04c

Comment by Jian Yu [ 12/Aug/13 ]

This is blocking the recovery-mds-scale test running on Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/ed4a088e-02db-11e3-a4b4-52540035b04c
https://maloo.whamcloud.com/test_sets/edec7172-fd14-11e2-b90c-52540035b04c

Comment by Hongchao Zhang [ 19/Aug/13 ]

the patch has been merged into master

Comment by Jian Yu [ 19/Aug/13 ]

The patch was also cherry-picked to Lustre b2_4 branch.

Comment by Jian Yu [ 09/Sep/13 ]

Lustre Tag: v2_4_1_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/45/
Distro/Arch: RHEL6.4/x86_64

The issue still occurred on recovery-mds-scale test_failover_ost:

tar: etc/yum.repos.d/cobbler-config.repo: Cannot open: No space left on device
tar: etc/yum.repos.d/lustre-build.repo: Cannot open: No space left on device
tar: Exiting with failure status due to previous errors

Console log on the MDS showed the following:

11:46:04:Lustre: lustre-OST0004-osc-MDT0000: slow creates, last=[0x100040000:0xbba1:0x0], next=[0x100040000:0xbba1:0x0], reserved=0, syn_changes=54, syn_rpc_in_progress=36, status=-28
11:46:26:LNet: Service thread pid 2233 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
11:46:26:Pid: 2233, comm: mdt00_001
11:46:27:
11:46:27:Call Trace:
11:46:27: [<ffffffff810810cc>] ? lock_timer_base+0x3c/0x70
11:46:28: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
11:46:28: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
11:46:28: [<ffffffffa04a56d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
11:46:28: [<ffffffffa0fb2174>] osp_precreate_reserve+0x5b4/0x1ed0 [osp]
11:46:28: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
11:46:28: [<ffffffffa0fabc75>] osp_declare_object_create+0x155/0x4f0 [osp]
11:46:29: [<ffffffffa0f6b6ed>] lod_qos_declare_object_on+0xed/0x480 [lod]
11:46:29: [<ffffffffa0f6c569>] lod_alloc_qos.clone.0+0xae9/0x1180 [lod]
11:46:30: [<ffffffffa0cbff1f>] ? qsd_op_begin+0x5f/0xb40 [lquota]
11:46:30: [<ffffffffa0f6e65a>] lod_qos_prep_create+0x74a/0x1b14 [lod]
11:46:30: [<ffffffffa097ead2>] ? fld_server_lookup+0x72/0x3d0 [fld]
11:46:30: [<ffffffffa0f6902b>] lod_declare_striped_object+0x14b/0x880 [lod]
11:46:31: [<ffffffff81096d8f>] ? wake_up_bit+0x2f/0x40
11:46:31: [<ffffffffa0f69c71>] lod_declare_object_create+0x511/0x7a0 [lod]
11:46:31: [<ffffffffa0c198cf>] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd]
11:46:31: [<ffffffffa0c28fee>] mdd_declare_create+0x4e/0x870 [mdd]
11:46:31: [<ffffffffa0c277ef>] ? mdd_linkea_prepare+0x23f/0x430 [mdd]
11:46:31: [<ffffffffa0c29fd5>] mdd_create+0x7c5/0x1790 [mdd]
11:46:32: [<ffffffffa0d9b927>] ? osd_xattr_get+0x97/0x2d0 [osd_ldiskfs]
11:46:35: [<ffffffffa0ec808f>] mdt_reint_open+0x135f/0x20c0 [mdt]
11:46:35: [<ffffffffa04c182e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
11:46:35: [<ffffffffa07abdcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
11:46:36: [<ffffffffa063ff50>] ? lu_ucred+0x20/0x30 [obdclass]
11:46:36: [<ffffffffa0eb2911>] mdt_reint_rec+0x41/0xe0 [mdt]
11:46:36: [<ffffffffa0e97ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
11:46:36: [<ffffffffa0e9806d>] mdt_intent_reint+0x1ed/0x520 [mdt]
11:46:38: [<ffffffffa0e95f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
11:46:38: [<ffffffffa0763831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
11:46:38: [<ffffffffa078a1ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
11:46:39: [<ffffffffa0e963a6>] mdt_enqueue+0x46/0xe0 [mdt]
11:46:39: [<ffffffffa0e9ca97>] mdt_handle_common+0x647/0x16d0 [mdt]
11:46:39: [<ffffffffa0ed63f5>] mds_regular_handle+0x15/0x20 [mdt]
11:46:41: [<ffffffffa07bc3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
11:46:42: [<ffffffffa04a55de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
11:46:42: [<ffffffffa04b6d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
11:46:43: [<ffffffffa07b3729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
11:46:44: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
11:46:44: [<ffffffffa07bd75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
11:46:44: [<ffffffffa07bcc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
11:46:44: [<ffffffff8100c0ca>] child_rip+0xa/0x20
11:46:45: [<ffffffffa07bcc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
11:46:45: [<ffffffffa07bcc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
11:46:45: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
11:46:45:
11:46:46:LustreError: dumping log to /tmp/lustre-log.1378579577.2233
11:46:46:Lustre: lustre-OST0005-osc-MDT0000: slow creates, last=[0x100050000:0xbc41:0x0], next=[0x100050000:0xbc41:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=10, status=-28
11:46:48:LNet: Service thread pid 2233 completed after 40.02s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

Maloo report: https://maloo.whamcloud.com/test_sets/cc263682-194c-11e3-bb73-52540035b04c

Comment by Jian Yu [ 09/Sep/13 ]

More instances on Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/76525a9c-14fb-11e3-ac48-52540035b04c
https://maloo.whamcloud.com/test_sets/89a20084-14fb-11e3-ac48-52540035b04c
https://maloo.whamcloud.com/test_sets/da5b6604-142f-11e3-80c8-52540035b04c
https://maloo.whamcloud.com/test_sets/ec68c8a0-142f-11e3-80c8-52540035b04c

Comment by Hongchao Zhang [ 12/Sep/13 ]

The issue is reproduced locally. It should be caused by the deletion of the large file created by "dd", which can involve many journal transactions and prevents the freed disk space from being reused by the following write operations.
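The failure mode can be pictured along these lines (an illustrative sequence only, not the actual local reproducer; paths and sizes are made up):

# nearly fill the OSTs with one large file
dd if=/dev/zero of=/mnt/lustre/bigfile bs=4k count=900000
# the unlink returns quickly...
rm -f /mnt/lustre/bigfile
# ...but the OST objects are destroyed asynchronously, in many journal
# transactions, so an immediate rewrite can still hit -ENOSPC before the
# freed space becomes reusable
dd if=/dev/zero of=/mnt/lustre/bigfile2 bs=4k count=900000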

The patch is under creation and test.

Comment by Hongchao Zhang [ 16/Sep/13 ]

the patch is tracked at http://review.whamcloud.com/#/c/7662/

The local reproducer fails after about 10 minutes without the patch, and runs successfully for more than 6 hours with the patch applied.

Comment by Jodi Levi (Inactive) [ 24/Sep/13 ]

Please complete the patch for Master before completing b2_4.

Comment by Andreas Dilger [ 24/Sep/13 ]

Please submit the patch for master first.

Comment by Hongchao Zhang [ 25/Oct/13 ]

the patch for master is tracked at http://review.whamcloud.com/#/c/8071/

Comment by Hongchao Zhang [ 02/Dec/13 ]

There are two options to fix the issue:
1. Modify the test script run_dd.sh to wait for the destroy of the objects created by "dd" to complete (preferred; a sketch follows below).
2. Modify Lustre itself to wait for the destroy of the objects when it encounters -ENOSPC during writing.
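A minimal sketch of option 1 (assuming a client-side "osc.*.destroys_in_flight" lctl parameter; the helper name and the 120-second limit are illustrative):

wait_dd_destroy_complete() {
        # poll until no OST object destroys are pending on this client,
        # or give up after $1 seconds
        local max=${1:-120}
        local waited=0
        while [ $waited -lt $max ]; do
                local pending=$(lctl get_param -n "osc.*.destroys_in_flight" 2>/dev/null |
                                awk '{sum += $1} END {print sum + 0}')
                [ "$pending" -eq 0 ] && return 0
                sleep 1
                waited=$((waited + 1))
        done
        return 1
}

run_dd.sh could call such a helper after removing $TESTDIR and before computing the free space for the next loop.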

Hi Yujian,
Could you please take care of it while I am away on vacation? Thanks!

Comment by Jian Yu [ 02/Dec/13 ]

Could you please take care of it while I am away on vacation? Thanks!

Sure, I just updated the patch with the first method.

Patch for Lustre b2_4 branch is in http://review.whamcloud.com/8447.

Comment by Andreas Dilger [ 05/Dec/13 ]

I don't see how fixing the test script avoids problems that users might also hit?

I recall that there are some fixes in patch 8071 which are useful to fix the OST-side lock cancellation and client-side page discard, though I haven't looked at the latest patch version. It would be useful to verify whether this correctly causes unwritten pages to be discarded on the client.

Comment by Jian Yu [ 05/Dec/13 ]

I recall that there are some fixes in patch 8071 which are useful to fix the OST-side lock cancellation and client-side page discard

The non-script part of the changes in http://review.whamcloud.com/8071 is in lustre/include/lustre_dlm_flags.h:

-#define LDLM_FL_AST_MASK		0x0000000080008000ULL
+#define LDLM_FL_AST_MASK		0x0000000000018000ULL

Should I retain this change and drop the script changes?

Comment by Jian Yu [ 05/Dec/13 ]

In addition, lustre/include/lustre_dlm_flags.h was added by http://review.whamcloud.com/5312, which does not exist on the Lustre b2_4 branch.

Comment by Andreas Dilger [ 06/Dec/13 ]

Yu Jian, I think the code change in 8071 is potentially in the right area, but not necessarily the right fix. I think this will change the wire protocol to use a different flag (LDLM_FL_DISCARD_DATA vs. LDLM_FL_AST_DISCARD_DATA), but I'm not positive. It would be great to test this properly: write a large file in the background, delete it, and then check the kernel debug logs to see whether more pages are being written or whether they are being discarded.

Comment by Jian Yu [ 06/Dec/13 ]

I applied the LDLM_FL_AST_MASK change from http://review.whamcloud.com/8071 to the latest master branch and created http://review.whamcloud.com/8495 to get a new build.

The following recovery-mds-scale test_failover_mds was performed on that build:
https://maloo.whamcloud.com/sub_tests/9752d892-5e8a-11e3-a925-52540035b04c

The "run_dd_debug" log on Client 2 (wtm-68) showed that:

Total free disk space is 10589812, 4k blocks to dd is 2382707
+ pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin; lctl set_param debug=-1'
+ pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin; lctl set_param debug_mb=150'
+ pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin; lctl dk > /dev/null'
+ load_pid=42589
+ wait 42589
+ dd bs=4k count=2382707 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-wtm-68/dd-file
2382707+0 records in
2382707+0 records out
+ '[' 0 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2013-12-06 06:19:18: dd succeeded'
+ echo '2013-12-06 06:19:18: dd succeeded'
2013-12-06 06:19:18: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-wtm-68
+ pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin;
			lctl dk > /scratch/jianyu/test_logs/2013-12-06/060035/recovery-mds-scale.test_failover_mds.run_dd_dk.$(hostname -s)_$(date +%s).log'

The lctl debug logs are in "run_dd_dk_1386339559" and "run_dd_dk_1386339667" (there are two separate dd runs) in the above Maloo report.

I found ofd_destroy_by_fid() in the OSS (wtm-72) lctl debug logs. I need to look into the logs more deeply.

Comment by Jian Yu [ 10/Dec/13 ]

In "run_dd_dk_1386339667" debug logs on OSS (wtm-72), I found that:

00010000:00010000:18.0:1386339564.644518:0:33665:0:(ldlm_lockd.c:848:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0003_UUID lock: ffff880416972cc0/0xe8fc4c150d5421b0 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x50000000010020 nid: 10.10.18.127@tcp remote: 0xe0b36e7f4f2635f3 expref: 7 pid: 38018 timeout: 0 lvb_type: 0

A blocking AST RPC was sent to the client.

And on the client (wtm-68), in "run_dd_dk_1386339667":

00000020:00000001:14.0:1386339585.987516:0:40723:0:(cl_page.c:420:cl_page_put()) Process entered
00000020:00000001:14.0:1386339585.987517:0:40723:0:(cl_page.c:422:cl_page_put()) page@ffff880746ae8600[1 ffff8805e1f6ac30:397170 ^(null)_(null) 4 0 1 (null) (null) 0x0]
00000020:00000001:14.0:1386339585.987518:0:40723:0:(cl_page.c:422:cl_page_put()) 1
00000020:00000001:14.0:1386339585.987518:0:40723:0:(cl_page.c:160:cl_page_free()) Process entered
00000020:00000001:14.0:1386339585.987519:0:40723:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=788528898 : 788528898 : 2effff02)
00000020:00000010:14.0:1386339585.987519:0:40723:0:(cl_page.c:174:cl_page_free()) kfreed 'page': 440 at ffff880746ae8600.
00000020:00000001:14.0:1386339585.987520:0:40723:0:(cl_page.c:175:cl_page_free()) Process leaving
00000020:00000001:14.0:1386339585.987520:0:40723:0:(cl_page.c:437:cl_page_put()) Process leaving
00020000:00000001:14.0:1386339585.987520:0:40723:0:(lov_page.c:82:lov_page_fini()) Process leaving
00000020:00000001:14.0:1386339585.987521:0:40723:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4194307 : 4194307 : 400003)
00000020:00000010:14.0:1386339585.987521:0:40723:0:(cl_page.c:174:cl_page_free()) kfreed 'page': 304 at ffff880746ae8800.
00000020:00000001:14.0:1386339585.987522:0:40723:0:(cl_page.c:175:cl_page_free()) Process leaving
00000020:00000001:14.0:1386339585.987522:0:40723:0:(cl_page.c:437:cl_page_put()) Process leaving
00000008:00000001:14.0:1386339585.987523:0:40723:0:(osc_cache.c:3137:osc_page_gang_lookup()) Process leaving (rc=0 : 0 : 0)

00000008:00000001:14.0:1386339585.987534:0:40723:0:(osc_cache.c:3246:osc_lock_discard_pages()) Process leaving (rc=0 : 0 : 0)
00000020:00001000:14.0:1386339585.987535:0:40723:0:(cl_object.c:971:cl_env_put()) 2@ffff88081d244a68
00000008:00000001:14.0:1386339585.987536:0:40723:0:(osc_lock.c:1376:osc_lock_flush()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:14.0:1386339585.987538:0:40723:0:(ldlm_request.c:1353:ldlm_cli_cancel()) Process entered
00010000:00000001:14.0:1386339585.987543:0:40723:0:(ldlm_request.c:1122:ldlm_cli_cancel_local()) Process entered
00010000:00010000:14.0:1386339585.987543:0:40723:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff880435738000 lock: ffff88083821d980/0xe0b36e7f4f2635f3 lrc: 4/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x428400010000 nid: local remote: 0xe8fc4c150d5421b0 expref: -99 pid: 42589 timeout: 0 lvb_type: 1

osc_lock_discard_pages() was called from osc_lock_flush(); however, the following code path was not covered in the debug logs:

                if (descr->cld_mode >= CLM_WRITE) {
                        result = osc_cache_writeback_range(env, obj,
                                        descr->cld_start, descr->cld_end,
                                        1, discard);
                        LDLM_DEBUG(ols->ols_lock,
                                "lock %p: %d pages were %s.\n", lock, result,
                                discard ? "discarded" : "written");
                        if (result > 0)
                                result = 0;
                }

I have just started the testing again to repeat the "write a large file in the background and delete it" operations more times.

Comment by Jian Yu [ 11/Dec/13 ]

I performed the test of dd'ing a large file in the background and then deleting the file from another client. Unfortunately, the debug logs did not show any "pages were discarded" info on the clients, and there was also no "server preparing blocking AST" info on the OSS node. Still digging.

Comment by Jian Yu [ 23/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2 RC2)
https://maloo.whamcloud.com/test_sets/34a4c608-6be4-11e3-a73e-52540035b04c

Comment by Hongchao Zhang [ 29/Dec/13 ]

In a local test with two nodes, where the server node runs master and the client runs b2_1, b2_3, b2_4 or master, the pages were not discarded after the lock's blocking AST was received; the pages are discarded after adding "LDLM_FL_DISCARD_DATA" instead of "LDLM_FL_AST_DISCARD_DATA" to "LDLM_FL_AST_MASK".

LDLM_AST_DISCARD_DATA (renamed to LDLM_FL_AST_DISCARD_DATA in master) is used to tell LDLM to send the blocking AST with the LDLM_FL_DISCARD_DATA flag, but it is itself not sent on the wire; the LDLM_FL_DISCARD_DATA flag is then lost because it is filtered out by "LDLM_FL_AST_MASK".

Code snippet in master:

void ldlm_add_bl_work_item(struct ldlm_lock *lock, struct ldlm_lock *new,
                           cfs_list_t *work_list)
{
        if (!ldlm_is_ast_sent(lock)) {
                LDLM_DEBUG(lock, "lock incompatible; sending blocking AST.");
                ldlm_set_ast_sent(lock);
                /* If the enqueuing client said so, tell the AST recipient to
                 * discard dirty data, rather than writing back. */
                if (ldlm_is_ast_discard_data(new))         <---- check LDLM_FL_AST_DISCARD_DATA
                        ldlm_set_discard_data(lock);       <---- set LDLM_FL_DISCARD_DATA
                LASSERT(cfs_list_empty(&lock->l_bl_ast));
                cfs_list_add(&lock->l_bl_ast, work_list);
                LDLM_LOCK_GET(lock);
                LASSERT(lock->l_blocking_lock == NULL);
                lock->l_blocking_lock = LDLM_LOCK_GET(new);
        }
}

the corresponding patch is tracked at http://review.whamcloud.com/#/c/8671/

Comment by Jian Yu [ 04/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
Test Group: failover

The same failure occurred:
https://maloo.whamcloud.com/test_sets/c86198c4-7505-11e3-95ae-52540035b04c

Comment by Jian Yu [ 07/Jan/14 ]

More instances on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/fcbcabd0-770e-11e3-b181-52540035b04c
https://maloo.whamcloud.com/test_sets/e669c6a8-8643-11e3-9f3f-52540035b04c
https://maloo.whamcloud.com/test_sets/fc5c9556-8505-11e3-8da9-52540035b04c
https://maloo.whamcloud.com/test_sets/f16b8720-9922-11e3-83d7-52540035b04c
https://maloo.whamcloud.com/test_sets/7d4fc910-956b-11e3-936f-52540035b04c

Comment by Jian Yu [ 09/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
Distro/Arch: RHEL6.5/x86_64
Test Group: failover

The same failure occurred:
https://maloo.whamcloud.com/test_sets/36c8c4fe-a657-11e3-a191-52540035b04c

Many sub-tests in replay-single in the failover test group also hit the "No space left on device" failure:
https://maloo.whamcloud.com/test_sets/c55e55ee-a657-11e3-a191-52540035b04c

Comment by Jian Yu [ 01/Apr/14 ]

This is blocking recovery-mds-scale testing.

Comment by Jian Yu [ 19/May/14 ]

Hi Hongchao,

Could you please drive the patches http://review.whamcloud.com/8671 (which still needs to address the comment from Andreas) and http://review.whamcloud.com/8071 to landing? Thanks. LU-3326 has been blocking MDS/OST hard failover testing in the Lustre release testing cycles.

Comment by Hongchao Zhang [ 12/Jun/14 ]

Updated the two patches of this ticket:

(1) http://review.whamcloud.com/8671
As per the comment in the Lustre source, LDLM_FL_AST_DISCARD_DATA is also intended to add the "FL_DISCARD" flag to the blocking ASTs:

/**
 * These are flags that are mapped into the flags and ASTs of blocking
 * locks Add FL_DISCARD to blocking ASTs */
#define LDLM_FL_AST_DISCARD_DATA        0x0000000080000000ULL // bit  31

(2) http://review.whamcloud.com/8071
Wait for the deletion of the objects to complete before running the various tests on the client node.

Comment by Jian Yu [ 12/Aug/14 ]

The back-ported patch for http://review.whamcloud.com/8071 on Lustre b2_5 branch is in http://review.whamcloud.com/11425.

Comment by Peter Jones [ 20/Aug/14 ]

Test fixes landed for 2.5.3 and 2.7. Residual patch landing will be tracked under a new ticket.

Comment by Jian Yu [ 20/Aug/14 ]

The new ticket is LU-5526.
