Lustre / LU-4292

Lustre data loss when client hangs up

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.1
    • OS: RHEL 6.4, 6.3; Lustre versions 2.1.6, 2.3, 2.4.1, 2.5. The error was observed on different hardware.
    • 3
    • 11775

    Description

      The error is observed on Lustre 2.1.6 and later on RHEL 6.x and results in data loss. If a file was open or modified within approximately 30 seconds before the server failed, all data in the file is erased. Has anyone seen a similar error?

      Attachments

        Activity

          [LU-4292] Lustre data loss when client hangs up
          jamesanunez James Nunez (Inactive) made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: Open [ 1 ] New: Closed [ 6 ]

          jamesanunez James Nunez (Inactive) added a comment -

          Please open a new ticket if you are still experiencing this problem.

          jamesanunez James Nunez (Inactive) added a comment -

          Alexander,
          Are you still experiencing this data corruption? If you are, what version of Lustre are you using?

          Thanks
          sasha111 Alexandr added a comment -

          James,

          I reproduced this situation on the following test setup:
          Two virtual machines: the first (VM1) is used as the MDS/OSS, the second (VM2) as the client. x86_64 RHEL 6.4 operating system, Lustre 2.4.1.

          The sequence of actions was as follows:

          On VM2 write data to /mnt/file:
          [root@n2 ~]# date && dmesg > /mnt/file && ll /mnt/file && date
          Thu Apr 10 02:58:16 EDT 2014
          -rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
          Thu Apr 10 02:58:16 EDT 2014
          [root@n2 ~]#

          On VM1 do:
          [root@n1 common]# date && ll /mnt/file && date
          Thu Apr 10 02:58:22 EDT 2014
          -rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
          Thu Apr 10 02:58:22 EDT 2014

          The file at /mnt has size 79650 bytes.

          Force poweroff VM2.

          On VM1 do:
          [root@n1 common]# date && ll /mnt/file && date
          Thu Apr 10 02:58:33 EDT 2014
          -rw-r--r-- 1 root root 79650 Apr 10 02:58 /mnt/file
          Thu Apr 10 03:02:07 EDT 2014

          After about four minutes of waiting the command finally completed and the file still appeared to be non-empty, but repeating the same command gives:

          [root@n1 common]# date && ll /mnt/file && date
          Thu Apr 10 03:02:13 EDT 2014
          -rw-r--r-- 1 root root 0 Apr 10 02:58 /mnt/file
          Thu Apr 10 03:02:13 EDT 2014
          [root@n1 common]#

          The file is now empty. VM2 remained powered off the whole time.
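
          For reference, the same sequence expressed as commands run from a controlling host; this is only a sketch, and the use of ssh from a third host and of virsh destroy for the forced poweroff are assumptions, not part of the original report:

          #!/bin/bash
          # Sketch of the reproduction: write through the client, confirm the size
          # from the server's mount, hard-poweroff the client VM, wait past the
          # eviction timeout, then re-check the size.
          ssh n2 'dmesg > /mnt/file && ls -l /mnt/file'   # write on the client (VM2)
          ssh n1 'ls -l /mnt/file'                        # size 79650 visible from VM1
          virsh destroy vm2                               # forced poweroff of the client VM (assumed libvirt guest name)
          sleep 300                                       # wait until VM1 evicts the client
          ssh n1 'ls -l /mnt/file'                        # the file now shows size 0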

          The output of /var/log/messages on the VM1 server, from the moment power was removed from VM2, is given below:

          Apr 10 02:58:40 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113113/real 1397113113] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113120 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
          Apr 10 02:58:48 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113120/real 1397113120] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113127 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 02:58:55 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113127/real 1397113127] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113134 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 02:59:02 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113134/real 1397113134] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113141 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 02:59:09 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113141/real 1397113142] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113148 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 02:59:23 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1397113155/real 1397113157] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113162 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 02:59:23 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message
          Apr 10 02:59:38 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113178/real 1397113178] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113185 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 02:59:38 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 495793 previous similar messages
          Apr 10 03:00:11 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113210/real 1397113210] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113217 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 03:00:11 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 4198137 previous similar messages
          Apr 10 03:01:14 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397113274/real 1397113274] req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113281 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 10 03:01:14 n1 kernel: Lustre: 3118:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 9459123 previous similar messages
          Apr 10 03:01:53 n1 kernel: LNet: Service thread pid 3118 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
          Apr 10 03:01:53 n1 kernel: Pid: 3118, comm: ll_ost00_002
          Apr 10 03:01:53 n1 kernel:
          Apr 10 03:01:53 n1 kernel: Call Trace:
          Apr 10 03:01:53 n1 kernel: [<ffffffffa037c075>] ? libcfs_nid2str+0x155/0x160 [libcfs]
          Apr 10 03:01:53 n1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
          Apr 10 03:01:53 n1 kernel: [<ffffffffa03ffc86>] ? lnet_send+0x6d6/0xb60 [lnet]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0400c33>] ? LNetPut+0x313/0x860 [lnet]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa06319ca>] ? ptl_send_buf+0x12a/0x550 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa03f8007>] ? LNetMDAttach+0x427/0x5a0 [lnet]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0634f05>] ? ptl_send_rpc+0x565/0xc40 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa062c702>] ? ptlrpc_expire_one_request+0xa2/0x440 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa062d956>] ? ptlrpc_check_set+0xd26/0x1b20 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
          Apr 10 03:01:53 n1 kernel: [<ffffffffa062e8b5>] ? ptlrpc_set_wait+0x165/0x8c0 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
          Apr 10 03:01:53 n1 kernel: [<ffffffffa05ee0a0>] ? ldlm_work_gl_ast_lock+0x0/0x290 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa05f0f7b>] ? ldlm_run_ast_work+0x1db/0x490 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa061298b>] ? ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0ea2096>] ? ofd_intent_policy+0x516/0x7d0 [ofd]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa05f0831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa06171ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0617e16>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0617e20>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0e54300>] ? ost_blocking_ast+0x0/0x10f0 [ost]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0614830>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0e5d318>] ? ost_handle+0x1db8/0x48e0 [ost]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0640beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa06493c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa03705de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0381d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0640729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
          Apr 10 03:01:53 n1 kernel: [<ffffffffa064a75e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffffa0649c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
          Apr 10 03:01:53 n1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
          Apr 10 03:01:53 n1 kernel:
          Apr 10 03:01:53 n1 kernel: LustreError: dumping log to /tmp/lustre-log.1397113313.3118
          Apr 10 03:02:07 n1 kernel: Lustre: lustre-OST0000: haven't heard from client 7511411c-dd05-5423-aa8a-098f78d9739a (at 11.0.0.32@tcp) in 231 seconds. I think it's dead, and I am evicting it. exp ffff88007c62b400, cur 1397113327 expire 1397113177 last 1397113096
          Apr 10 03:02:07 n1 kernel: LustreError: 3118:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88005dc90400 x1464979110756852/t0(0) o106->lustre-OST0000@11.0.0.32@tcp:15/16 lens 296/280 e 0 to 1 dl 1397113334 ref 1 fl Rpc:X/2/ffffffff rc -5/-1
          Apr 10 03:02:07 n1 kernel: LustreError: 3118:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 11.0.0.32@tcp) returned 0 from glimpse AST ns: filter-lustre-OST0000_UUID lock: ffff88005dd47000/0x1151412baa60ebe9 lrc: 2/0,0 mode: -/PW res: [0x8ba2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0>81919) flags: 0x801000000000 nid: 11.0.0.32@tcp remote: 0xdcdb6436fc134adb expref: 2 pid: 3118 timeout: 0 lvb_type: 0
          Apr 10 03:02:07 n1 kernel: LNet: Service thread pid 3118 completed after 213.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
          Apr 10 03:02:43 n1 kernel: Lustre: MGS: haven't heard from client 6abc5e98-73d8-60a9-2c7a-c9503f5979d6 (at 11.0.0.32@tcp) in 272 seconds. I think it's dead, and I am evicting it. exp ffff88005dc6e800, cur 1397113363 expire 1397113213 last 1397113091


          jamesanunez James Nunez (Inactive) added a comment -

          Alexander,

          Nothing new to report; I am not able to reproduce this error. Would you please let me know what your mount options are when you experience the data loss? Also, do you have logs from these nodes when the data loss happens and would you make those available to me?

          Thanks,

          alexander Alexander (Inactive) added a comment -

          James,

          Hello, is there any news on this subject?
          alexander Alexander (Inactive) added a comment - - edited

          James,

          Yes, this problem occurs reliably for me under the conditions described in my previous messages. I can also reproduce this error on our hardware.


          jamesanunez James Nunez (Inactive) added a comment -

          Alexander,

          Can you reproduce the data loss on your cluster/nodes? In other words, is this a persistent problem, does it happen occasionally, or did it stop happening?

          Thanks
          alexander Alexander (Inactive) added a comment - - edited

          Hi Jinshan,

          I wrote a similar procedure:

          #!/bin/bash
          # Poll the per-OST dirty byte counters under /proc and call sync
          # whenever any of them is non-zero.

          path=/proc/fs/lustre/osc
          ost_num=test-OST000
          name=cur_dirty_bytes

          while true; do
            for j in 0 1; do
              c=$(cat ${path}/${ost_num}${j}*/${name})
              if [ "$c" -ne 0 ]; then
                sync
              fi
            done
            sleep 0.3
          done

          This script forces dirty data to be flushed from the memory cache to disk (if there is any), at an interval of roughly one second.

          But this only reduces the likelihood of such a situation. By default, dirty data (cur_dirty_bytes) stays in the client's memory cache for about 30 seconds before it is automatically written to disk, so the possibility of data loss remains if the client hangs at a moment when dirty data is still in the cache and the sync has not yet completed.
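
          For reference, the same counters can also be read with lctl rather than the proc paths; a minimal sketch of the same polling loop under that assumption:

          #!/bin/bash
          # Poll the per-OSC dirty byte counters on the client and force a flush
          # whenever any of them is non-zero (same idea as the proc-based script above).
          while true; do
              dirty=$(lctl get_param -n osc.*.cur_dirty_bytes 2>/dev/null | awk '{s+=$1} END {print s+0}')
              if [ "$dirty" -ne 0 ]; then
                  sync
              fi
              sleep 1
          done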


          jay Jinshan Xiong (Inactive) added a comment -

          Hi Alexander, when you saw the empty file, did you notice what the file size was?

          You can try to write something into the file, and then make sure the content has been written to OST, and then reproduce this problem. For example:

          echo "hello world" > test_file;
          sync; sync; sync
          run your test to append the file

          After you've reproduced the problem, please check what the file size is.
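
          A minimal sketch of such a test; the mount point, file name, and append loop below are illustrative assumptions, not from this ticket:

          #!/bin/bash
          # Write a known prefix and force it to the OST, then keep appending until
          # the client is powered off; check the surviving size from another node.
          echo "hello world" > /mnt/test_file
          sync; sync; sync                      # make sure the prefix reaches the OST
          while true; do
              date >> /mnt/test_file            # appended data may still be dirty in the client cache
              sleep 1
          done
          # After reproducing the crash, on a surviving node:
          #   stat -c %s /mnt/test_file         # expect at least 12 bytes ("hello world\n")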

          Thanks


          People

            jamesanunez James Nunez (Inactive)
            alexander Alexander (Inactive)
            Votes: 0
            Watchers: 6
