Details
-
Bug
-
Resolution: Cannot Reproduce
-
Critical
-
None
-
Lustre 2.4.1
-
OS: RHEL 6.4, 6.3; Lustre versions: 2.1.6, 2.3, 2.4.1, 2.5. The error was observed on different hardware.
-
3
-
11775
Description
The error has been observed on Lustre 2.1.6 and later, on RHEL 6.x, and results in data loss. If a file was open or modified within approximately 30 seconds before the server failed, all data in that file is erased. Has anybody seen a similar error?
Attachments
Activity
Resolution | New: Cannot Reproduce
Status | Original: Open | New: Closed
Alexander,
Are you still experiencing this data corruption? If you are, what version of Lustre are you using?
Thanks
James,
I reproduced this situation on the following test setup:
Two virtual machines: the first (VM1) is used as the MDS/OSS, the second (VM2) as the client. x86_64 RHEL 6.4 operating system, Lustre 2.4.1.
The sequence of actions was as follows:
On VM2 write data to /mnt/file:
[root@n2 ~]# date && dmesg > /mnt/file && ll /mnt/file && date
Thu Apr 10 02:58:16 EDT 2014
-rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
Thu Apr 10 02:58:16 EDT 2014
[root@n2 ~]#
On VM1 do:
[root@n1 common]# date && ll /mnt/file && date
Thu Apr 10 02:58:22 EDT 2014
-rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
Thu Apr 10 02:58:22 EDT 2014
The file at /mnt has size 79650.
Force poweroff VM2.
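(For example, assuming the guest is managed by libvirt and named n2 — an assumption, not taken from the report — a forced power-off can be done as follows.)
virsh destroy n2    # immediate, ungraceful power-off of the guest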
On VM1 do:
[root@n1 common]# date && ll /mnt/file && date
Thu Apr 10 02:58:33 EDT 2014
-rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
Thu Apr 10 03:02:07 EDT 2014
After about four minutes of waiting the command completed, and the file still appeared to be non-empty; but when the same command is executed again we get:
[root@n1 common]# date && ll /mnt/file && date
Thu Apr 10 03:02:13 EDT 2014
-rw-r--r-- 1 root root 0 Apr 10 02:58 /mnt/file
Thu Apr 10 03:02:13 EDT 2014
[root@n1 common]#
The file is empty. VM2 remained powered off the whole time.
The /var/log/messages output from VM1, starting from the moment power was removed from VM2, is given below:
Apr 10 02:58:40 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113113/real 1397113113] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113120 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Apr 10 02:58:48 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113120/real 1397113120] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113127 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:58:55 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113127/real 1397113127] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113134 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:02 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113134/real 1397113134] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113141 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:09 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113141/real 1397113142] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113148 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:23 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113155/real 1397113157] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113162 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:23 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Apr 10 02:59:38 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113178/real 1397113178] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113185 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 02:59:38 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 495793 previous similar messages
Apr 10 03:00:11 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113210/real 1397113210] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113217 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 03:00:11 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 4198137 previous similar messages
Apr 10 03:01:14 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113274/real 1397113274] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113281 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 10 03:01:14 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 9459123 previous similar messages
Apr 10 03:01:53 n1 kernel: LNet: Service thread pid 3118 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 10 03:01:53 n1 kernel: Pid: 3118, comm: ll_ost00_002
Apr 10 03:01:53 n1 kernel:
Apr 10 03:01:53 n1 kernel: Call Trace:
Apr 10 03:01:53 n1 kernel: [<ffffffffa037c075>] ? libcfs_nid2str+0x155/0x160 [libcfs]
Apr 10 03:01:53 n1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Apr 10 03:01:53 n1 kernel: [<ffffffffa03ffc86>] ? lnet_send+0x6d6/0xb60 [lnet]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0400c33>] ? LNetPut+0x313/0x860 [lnet]
Apr 10 03:01:53 n1 kernel: [<ffffffffa06319ca>] ? ptl_send_buf+0x12a/0x550 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa03f8007>] ? LNetMDAttach+0x427/0x5a0 [lnet]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0634f05>] ? ptl_send_rpc+0x565/0xc40 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa062c702>] ? ptlrpc_expire_one_request+0xa2/0x440 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa062d956>] ? ptlrpc_check_set+0xd26/0x1b20 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
Apr 10 03:01:53 n1 kernel: [<ffffffffa062e8b5>] ? ptlrpc_set_wait+0x165/0x8c0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Apr 10 03:01:53 n1 kernel: [<ffffffffa05ee0a0>] ? ldlm_work_gl_ast_lock+0x0/0x290 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa05f0f7b>] ? ldlm_run_ast_work+0x1db/0x490 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa061298b>] ? ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0ea2096>] ? ofd_intent_policy+0x516/0x7d0 [ofd]
Apr 10 03:01:53 n1 kernel: [<ffffffffa05f0831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa06171ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0617e16>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0617e20>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0e54300>] ? ost_blocking_ast+0x0/0x10f0 [ost]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0614830>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0e5d318>] ? ost_handle+0x1db8/0x48e0 [ost]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0640beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa06493c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa03705de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0381d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0640729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Apr 10 03:01:53 n1 kernel: [<ffffffffa064a75e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Apr 10 03:01:53 n1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Apr 10 03:01:53 n1 kernel:
Apr 10 03:01:53 n1 kernel: LustreError: dumping log to /tmp/lustre-log.1397113313.3118
Apr 10 03:02:07 n1 kernel: Lustre: lustre-OST0000: haven't heard from client 7511411c-dd05-5423-aa8a-098f78d9739a (at 11.0.0.32@tcp) in 231 seconds. I think it's dead, and I am evicting it. exp ffff88007c62b400, cur 1397113327 expire 1397113177 last 1397113096
Apr 10 03:02:07 n1 kernel: LustreError: 3118:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113334 ref 1 fl Rpc:X/2/ffffffff rc -5/-1
Apr 10 03:02:07 n1 kernel: LustreError: 3118:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 11.0.0.32@tcp) returned 0 from glimpse AST ns: filter-lustre-OST0000_UUID lock: ffff88005dd47000/0x1151412baa60ebe9 lrc: 2/0,0 mode: -/PW res: [0x8ba2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0>81919) flags: 0x801000000000 nid: 11.0.0.32@tcp remote: 0xdcdb6436fc134adb expref: 2 pid: 3118 timeout: 0 lvb_type: 0
Apr 10 03:02:07 n1 kernel: LNet: Service thread pid 3118 completed after 213.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 10 03:02:43 n1 kernel: Lustre: MGS: haven't heard from client 6abc5e98-73d8-60a9-2c7a-c9503f5979d6 (at 11.0.0.32@tcp) in 272 seconds. I think it's dead, and I am evicting it. exp ffff88005dc6e800, cur 1397113363 expire 1397113213 last 1397113091
Alexander,
Nothing new to report; I am not able to reproduce this error. Would you please let me know what your mount options are when you experience the data loss? Also, do you have logs from these nodes when the data loss happens and would you make those available to me?
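(As an illustration only, something like the following run on the affected client could capture that information; the output filename is just a placeholder.)
mount -t lustre                      # list Lustre mounts and their mount options
lctl dk /tmp/lustre-debug-dump.log   # dump the Lustre kernel debug log to a file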
Thanks,
James,
Hello, is there any news on this subject?
James,
Yes, this problem occurs reliably under the conditions described in my previous messages. I can also reproduce this error on our hardware.
Alexander,
Can you reproduce the data loss on your cluster/nodes? In other words, is this a persistent problem, does it happen occasionally or did it stop happening?
Thanks
Hi Jinshan,
I wrote a similar procedure:
#!/bin/bash
# Flush the client cache whenever any OSC still holds dirty bytes.
path=/proc/fs/lustre/osc
ost_num=test-OST000
name=cur_dirty_bytes
while true; do
    for j in 0 1; do
        c=$(cat ${path}/${ost_num}${j}*/${name})
        if [ "${c}" -ne 0 ]; then
            sync
        fi
    done
    sleep 0.3
done
This script forces dirty data to be flushed from the memory cache (if there is any) to disk, at an interval of approximately one second.
But it only reduces the probability of this situation. By default, dirty bytes (cur_dirty_bytes) stay in the client memory cache for about 30 seconds before they are automatically written to disk. So the possibility of data loss remains if the client hangs at a moment when there is still dirty data in the cache and the sync has not yet completed.
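(For reference, a rough way to watch the same counters and the related dirty-cache limit with lctl, assuming standard client tooling; the parameter patterns are illustrative.)
lctl get_param osc.*.cur_dirty_bytes   # dirty bytes currently cached per OSC
lctl get_param osc.*.max_dirty_mb      # per-OSC dirty cache limit, in MB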
Hi Alexander, when you saw the empty file, did you notice what the file size was?
You can try to write something into the file, make sure the content has been written to the OST, and then reproduce this problem. For example:
echo "hello world" > test_file;
sync; sync; sync
run your test to append the file
After you've reproduced the problem, please check what the file size is.
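(A possible way to check, as a sketch; the file name follows the example above.)
stat -c %s test_file       # print the file size in bytes
lfs getstripe test_file    # show which OST object(s) back the file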
Thanks
Please open a new ticket if you are still experiencing this problem.