[LU-4292] Lustre data loss when client hang up Created: 22/Nov/13  Updated: 26/Apr/17  Resolved: 26/Apr/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Alexander Assignee: James Nunez (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: server
Environment:

OS: RHEL 6.4, 6.3. Lustre versions: 2.1.6, 2.3, 2.4.1, 2.5. The error was observed on different hardware.


Severity: 3
Rank (Obsolete): 11775

 Description   

The error is observed on Lustre 2.1.6 and later, on RHEL 6.x, and results in data loss. If a file was opened or modified within approximately 30 seconds before the client failed, all data in that file is erased. Has anyone seen a similar error?



 Comments   
Comment by Andreas Dilger [ 23/Nov/13 ]

Could you please explain your testing process further? Does this affect existing files that are opened, or only newly created files? If the server fails, but the client is still running, and then the server is restarted, is there data loss?

Comment by Alexander [ 25/Nov/13 ]

There is a test setup consisting of two servers. One server holds both the metadata and the object data (MDS/OSS); the second server is used as the client. Suppose the file system already contains a file file_1.txt with the contents "a, b, c". If the client appends to this file with "echo d >> /mnt/file_1.txt", then /proc/fs/lustre/osc/lustre-OST-*/cur_dirty_bytes changes from 0 to 4096 (bytes). The data stays in the dirty cache for about 30 seconds and is then presumably written to disk. So, if the client machine fails before those 30 seconds expire, the file /mnt/file_1.txt loses not only the data that was just written (via the "echo d >> /mnt/file_1.txt" command) but also the data that was already there, i.e. the file ends up empty.
While reading Lustre-x.x/lustre/ChangeLog I found a similar error, referenced as bugzilla 12181 and 12203, but it is dated 2007-04-01.
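
A minimal sketch of the steps described above, assuming the mount point and target names from the example (the exact osc directory name varies per setup):

echo d >> /mnt/file_1.txt
cat /proc/fs/lustre/osc/lustre-OST*/cur_dirty_bytes   # now shows 4096 instead of 0
# hard-fail the client (hang or forced power-off) within ~30 seconds,
# before the dirty pages are flushed; /mnt/file_1.txt then ends up empty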

Comment by Peter Jones [ 27/Nov/13 ]

James

Could you please see if you can reproduce this reported behavior?

Thanks

Peter

Comment by Alexander [ 09/Jan/14 ]

I would like to know whether you have managed to reproduce this situation.

Comment by James Nunez (Inactive) [ 09/Jan/14 ]

Alexander,

I'm having trouble recreating this problem.

I have a single server with MDS/MGS and OSS and a single client. I write to a file using "echo" and I can see the cur_dirty_bytes stay at 4096 for a while and then change to 0, but I can't get any data loss or corruption. To fail the client I've used reboot and shutdown immediately after the echo command was issued.
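
For reference, a minimal sketch of watching the counter during such a test, assuming the proc path used elsewhere in this ticket:

echo "a, b, c" >> /lustre/scratch/test_file
watch -n 1 'cat /proc/fs/lustre/osc/*/cur_dirty_bytes'   # stays at 4096 for a while, then drops to 0 after the flush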

So, I may be doing things differently from you. How did you fail the client?

Comment by Alexander [ 10/Jan/14 ]

James,

With a clean reboot or poweroff, no data loss is observed. The data loss occurs when the operating system hangs or, for example, when power is forcibly cut from the client. In other words, if the client hangs or is forcibly disconnected from its power supply at a moment when the cur_dirty_bytes file contains the value 4096, data will be lost.
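
For reference, a hard client failure (as opposed to a clean reboot/poweroff) can be simulated with the kernel SysRq trigger; this is a general Linux mechanism, not a step from the original report:

echo 1 > /proc/sys/kernel/sysrq     # make sure SysRq is enabled
echo c > /proc/sysrq-trigger        # immediate kernel crash, no cache writeback
# or: echo b > /proc/sysrq-trigger  # immediate reboot without sync/unmount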

Comment by James Nunez (Inactive) [ 13/Jan/14 ]

Alexander,

I've moved my testing to using VMs so that I can shut off machines to mimic a node going down. Here is what I can reproduce, using the b2_3 branch of Lustre:

I have a single node MDS and OSS and two clients. One client writes data to a file called test_file and the write completes normally. That same client writes data to the existing file, I print cur_dirty_bytes which is 4096 and then I power off the VM, which does not flush the cache. I go to the second client and try and list the contents of the Lustre file system and it hangs. If I control C the listing, I get something like:

[root@client2 ~]# ls -l /lustre/scratch/
^Cls: cannot access /lustre/scratch/test_file: Interrupted system call
total 0
-????????? ? ? ? ?            ? test_file

If I wait for a couple of minutes, the original file/data is restored, but the data in flight never makes it to the file:

[root@client2 ~]# ls /lustre/scratch/
client_2_touch  test_file
[root@client2 ~]# more /lustre/scratch/test_file 
a, b, c
d, e, f
g, h, i

Does this match your experience? If you wait a few minutes, does the file that your client was writing to, /mnt/file_1 in your example, come back and is the original data "a, b, c" there?

Comment by Alexander [ 14/Jan/14 ]

James,

The experiment you ran matches what I observed except for one point. I ran my experiment not on VMs but on real servers, and on real servers the cache was cleared (dropped).
Probably test_file was restored only because it was recovered from the cache of the VM you had powered off and then booted again. Is that the case?
Please try to repeat your experiment in such a way that, on a forced VM reset/power-off, its cache is dropped (cleared).
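
For reference, a sketch of dropping the page cache on a node (a general Linux mechanism, not a step from the original experiments), so that a later read of test_file comes from disk rather than from cached pages:

sync
echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes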

Comment by Jinshan Xiong (Inactive) [ 15/Jan/14 ]

Hi Alexander, when you saw the empty file, did you notice what the file size was?

You can try to write something into the file, and then make sure the content has been written to OST, and then reproduce this problem. For example:

echo "hello world" > test_file;
sync; sync; sync
run your test to append the file

After you've reproduced the problem, please check what the file size is.
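
For reference, a quick way to check the size and the backing object after reproducing (standard tools; test_file is the name from the example above):

stat -c %s test_file      # file size in bytes as seen by the client
lfs getstripe test_file   # OST object(s) backing the file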

Thanks

Comment by Alexander [ 16/Jan/14 ]

Hi Jinshan,

I wrote a similar procedure:

#!/bin/bash

# Poll the per-OST dirty byte counters and force a sync whenever dirty data is found.
path=/proc/fs/lustre/osc
ost_num=test-OST000
name=cur_dirty_bytes

while true; do
  for j in 0 1; do
    c=$(cat ${path}/${ost_num}${j}*/${name})
    if [ "${c}" -ne 0 ]; then
      sync
    fi
  done
  sleep 0.3
done

This script forces dirty data (if any) to be flushed from the memory cache to disk at an interval of approximately one second.

But this only reduces the likelihood of the situation. By default, dirty data (as counted in cur_dirty_bytes) is kept for about 30 seconds before it is automatically written from the memory cache to disk. So the possibility of data loss remains if the client hangs at a moment when data is still sitting in the cache and the sync command has not yet run.
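
For reference, the amount of dirty data a client is allowed to cache per OSC can be inspected and lowered with standard Lustre tunables (a general mitigation sketch, not verified to avoid this problem):

lctl get_param osc.*.cur_dirty_bytes
lctl get_param osc.*.max_dirty_mb
lctl set_param osc.*.max_dirty_mb=1   # e.g. cap dirty cache at 1 MB per OSC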

Comment by James Nunez (Inactive) [ 16/Jan/14 ]

Alexander,

Can you reproduce the data loss on your cluster/nodes? In other words, is this a persistent problem, does it happen occasionally or did it stop happening?

Thanks

Comment by Alexander [ 17/Jan/14 ]

James,

Yes, the problem occurs consistently for me under the conditions described in my previous messages. I can also reproduce this error on real hardware.

Comment by Alexander [ 03/Mar/14 ]

James,

Hello, is there any news on this subject?

Comment by James Nunez (Inactive) [ 12/Mar/14 ]

Alexander,

Nothing new to report; I am not able to reproduce this error. Would you please let me know what your mount options are when you experience the data loss? Also, do you have logs from these nodes when the data loss happens and would you make those available to me?
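
For reference, a sketch of how the requested information could be gathered (general commands, not specific instructions from this ticket):

mount -t lustre                             # list Lustre mounts and their options
lctl dk /tmp/lustre-debug-$(hostname).log   # dump the Lustre kernel debug buffer
# plus /var/log/messages from the client and MDS/OSS around the failure time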

Thanks,

Comment by Alexandr [ 14/Apr/14 ]

James,

I reproduced this situation on the following test setup:
Two virtual machines, the first (VM1) used as MDS/OSS, the second (VM2) as the client. x86_64 RHEL 6.4 operating system, Lustre 2.4.1.

The sequence of actions was as follows:

On VM2 write data to /mnt/file:
[root@n2 ~]# date && dmesg > /mnt/file && ll /mnt/file && date
Thu Apr 10 02:58:16 EDT 2014
-rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
Thu Apr 10 02:58:16 EDT 2014
[root@n2 ~]#

On VM1 do:
[root@n1 common]# date && ll /mnt/file && date
Thu Apr 10 02:58:22 EDT 2014
-rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
Thu Apr 10 02:58:22 EDT 2014

The file at /mnt has size 79650.

Force poweroff VM2.

On VM1 do:
[root@n1 common]# date && ll /mnt/file && date
Thu Apr 10 02:58:33 EDT 2014
-rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
Thu Apr 10 03:02:07 EDT 2014

After waiting about four minutes, the command completed and the file appeared not to be empty, but on repeated execution of the same command we get:

[root@n1 common]# date && ll /mnt/file && date
Thu Apr 10 03:02:13 EDT 2014
-rw-r--r-- 1 root root 0 Apr 10 02:58 /mnt/file
Thu Apr 10 03:02:13 EDT 2014
[root@n1 common]#

The file is empty. VM2 remained switched off the whole time.

The output from /var/log/messages on VM1, starting from the moment power was removed from VM2, is given below:

Apr 10 02:58:40 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113113/real 1397113113] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113120 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Apr 10 02:58:48 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113120/real 1397113120] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113127 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:58:55 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113127/real 1397113127] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113134 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:02 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113134/real 1397113134] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113141 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:09 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113141/real 1397113142] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113148 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:23 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113155/real 1397113157] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113162 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:23 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Apr 10 02:59:38 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113178/real 1397113178] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113185 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:38 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 495793 previous similar messages
Apr 10 03:00:11 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113210/real 1397113210] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113217 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 03:00:11 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 4198137 previous similar messages
Apr 10 03:01:14 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113274/real 1397113274] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113281 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 03:01:14 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 9459123 previous similar messages
Apr 10 03:01:53 n1 kernel: LNet: Service thread pid 3118 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 10 03:01:53 n1 kernel: Pid: 3118, comm: ll_ost00_002
Apr 10 03:01:53 n1 kernel:
Apr 10 03:01:53 n1 kernel: Call Trace:
Apr 10 03:01:53 n1 kernel: [<ffffffffa037c075>] ? libcfs_nid2str+0x155/0x160 [libcfs]
Apr 10 03:01:53 n1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Apr 10 03:01:53 n1 kernel: [<ffffffffa03ffc86>] ? lnet_send+0x6d6/0xb60 [lnet]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0400c33>] ? LNetPut+0x313/0x860 [lnet]
Apr 10 03:01:53 n1 kernel: [<ffffffffa06319ca>] ? ptl_send_buf+0x12a/0x550 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa03f8007>] ? LNetMDAttach+0x427/0x5a0 [lnet]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0634f05>] ? ptl_send_rpc+0x565/0xc40 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa062c702>] ? ptlrpc_expire_one_request+0xa2/0x440 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa062d956>] ? ptlrpc_check_set+0xd26/0x1b20 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
Apr 10 03:01:53 n1 kernel: [<ffffffffa062e8b5>] ? ptlrpc_set_wait+0x165/0x8c0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Apr 10 03:01:53 n1 kernel: [<ffffffffa05ee0a0>] ? ldlm_work_gl_ast_lock+0x0/0x290 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa05f0f7b>] ? ldlm_run_ast_work+0x1db/0x490 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa061298b>] ? ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0ea2096>] ? ofd_intent_policy+0x516/0x7d0 [ofd]
Apr 10 03:01:53 n1 kernel: [<ffffffffa05f0831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa06171ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0617e16>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0617e20>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0e54300>] ? ost_blocking_ast+0x0/0x10f0 [ost]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0614830>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0e5d318>] ? ost_handle+0x1db8/0x48e0 [ost]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0640beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa06493c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa03705de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0381d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0640729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Apr 10 03:01:53 n1 kernel: [<ffffffffa064a75e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Apr 10 03:01:53 n1 kernel:
Apr 10 03:01:53 n1 kernel: LustreError: dumping log to /tmp/lustre-log.1397113313.3118
Apr 10 03:02:07 n1 kernel: Lustre: lustre-OST0000: haven't heard from client 7511411c-dd05-5423-aa8a-098f78d9739a (at 11.0.0.32@tcp) in 231 seconds. I think it's dead, and I am evicting it. exp ffff88007c62b400, cur 1397113327 expire 1397113177 last 1397113096
Apr 10 03:02:07 n1 kernel: LustreError: 3118:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113334 ref 1 fl Rpc:X/2/ffffffff rc -5/-1
Apr 10 03:02:07 n1 kernel: LustreError: 3118:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 11.0.0.32@tcp) returned 0 from glimpse AST ns: filter-lustre-OST0000_UUID lock: ffff88005dd47000/0x1151412baa60ebe9 lrc: 2/0,0 mode: -/PW res: [0x8ba2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0>81919) flags: 0x801000000000 nid: 11.0.0.32@tcp remote: 0xdcdb6436fc134adb expref: 2 pid: 3118 timeout: 0 lvb_type: 0
Apr 10 03:02:07 n1 kernel: LNet: Service thread pid 3118 completed after 213.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 10 03:02:43 n1 kernel: Lustre: MGS: haven't heard from client 6abc5e98-73d8-60a9-2c7a-c9503f5979d6 (at 11.0.0.32@tcp) in 272 seconds. I think it's dead, and I am evicting it. exp ffff88005dc6e800, cur 1397113363 expire 1397113213 last 1397113091

Comment by James Nunez (Inactive) [ 10/Mar/17 ]

Alexander,
Are you still experiencing this data corruption? If you are, what version of Lustre are you using?

Thanks

Comment by James Nunez (Inactive) [ 26/Apr/17 ]

Please open a new ticket if you are still experiencing this problem.
