LU-16361: Lustre client crashed when I used WBC


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major

    Description

      Overview

      I'm interested in the WBC (Writeback Cache) feature [1], so I'm testing the in-development WBC.

      WBC mostly works well, but my Lustre client machine crashed during some tests.

      [1] https://jira.whamcloud.com/browse/LU-10938

      Details of the crash

      During some tests, error messages appeared, after which I could no longer access the Lustre-mounted directory and could not unmount it. Sometimes my machine crashed after that.

      The machine crashed at the following times:

      • Immediately after an error occurred.
      • A few hours after an error occurred.
      • During shutdown: the machine's shutdown process hung after the error occurred.

      There were several patterns of error messages.

      Pattern 1: Request sent has timed out for sent delay

      Nov  9 18:49:20 anila kernel: [ 3667.591798] Lustre: 6549:0:(client.c:2284:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1667987353/real 0]     req@0000000002a4f44e x1749008903039168/t0(0) o36->exafs-MDT0000-mdc-ffff97086cd1d800@192.168.10.43@tcp:12/10 lens 520/576 e 0 to 1 dl 1667987360 ref 3 fl Rpc:Xr/0/ffffffff rc 0/-1 job:''
      Nov  9 18:49:20 anila kernel: [ 3667.591810] Lustre: exafs-MDT0000-mdc-ffff97086cd1d800: Connection to exafs-MDT0000 (at 192.168.10.43@tcp) was lost; in progress operations using this service     will wait for recovery to complete
      Nov  9 18:49:24 anila kernel: [ 3671.431810] Lustre: 4874:0:(client.c:2284:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1667987355/real 0]     req@000000003297c839 x1749008903114560/t0(0) o4->exafs-OST0000-osc-ffff97086cd1d800@192.168.10.43@tcp:6/4 lens 488/448 e 0 to 1 dl 1667987364 ref 3 fl Rpc:Xr/0/ffffffff rc 0/-1 job:''
      Nov  9 18:49:24 anila kernel: [ 3671.431814] Lustre: 4874:0:(client.c:2284:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Nov  9 18:49:24 anila kernel: [ 3671.431833] Lustre: exafs-OST0000-osc-ffff97086cd1d800: Connection to exafs-OST0000 (at 192.168.10.43@tcp) was lost; in progress operations using this service     will wait for recovery to complete
      Nov  9 18:49:25 anila kernel: [ 3672.395923] Lustre: exafs-MDT0000-mdc-ffff97086cd1d800: Connection restored to 192.168.10.43@tcp (at 192.168.10.43@tcp)
      Nov  9 18:49:33 anila kernel: [ 3680.231831] Lustre: 4901:0:(client.c:2284:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667987365/real 1667987371]    req@00000000be7eddcb x1749008903114624/t0(0) o4->exafs-OST0000-osc-ffff97086cd1d800@192.168.10.43@tcp:6/4 lens 488/448 e 0 to 1 dl 1667987373 ref 2 fl Rpc:XQr/2/ffffffff rc 0/-1 job:''
      Nov  9 18:49:33 anila kernel: [ 3680.231835] Lustre: 4901:0:(client.c:2284:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Nov  9 18:49:33 anila kernel: [ 3680.231853] Lustre: exafs-OST0000-osc-ffff97086cd1d800: Connection to exafs-OST0000 (at 192.168.10.43@tcp) was lost; in progress operations using this service     will wait for recovery to complete
      Nov  9 18:49:41 anila kernel: [ 3688.839825] Lustre: 4877:0:(client.c:2284:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667987369/real 1667987371]    req@00000000eedcb0b0 x1749008903115136/t0(0) o4->exafs-OST0000-osc-ffff97086cd1d800@192.168.10.43@tcp:6/4 lens 488/448 e 0 to 1 dl 1667987381 ref 2 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      Nov  9 18:49:41 anila kernel: [ 3688.839829] Lustre: 4877:0:(client.c:2284:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Nov  9 18:49:43 anila kernel: [ 3690.631834] Lustre: exafs-MDT0000-mdc-ffff97086cd1d800: Connection to exafs-MDT0000 (at 192.168.10.43@tcp) was lost; in progress operations using this service     will wait for recovery to complete
      Nov  9 18:49:46 anila kernel: [ 3693.991839] Lustre: 4892:0:(client.c:2284:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667987365/real 1667987371]    req@000000006d74b528 x1749008903117952/t0(0) o4->exafs-OST0000-osc-ffff97086cd1d800@192.168.10.43@tcp:6/4 lens 488/448 e 0 to 1 dl 1667987386 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
      Nov  9 18:49:46 anila kernel: [ 3693.991843] Lustre: 4892:0:(client.c:2284:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      Nov  9 18:49:56 anila kernel: [ 3703.943822] Lustre: 4872:0:(client.c:2284:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1667987388/real 0]     req@0000000017b47a96 x1749008903197504/t0(0) o400->exafs-OST0002-osc-ffff97086cd1d800@192.168.10.44@tcp:28/4 lens 224/224 e 0 to 1 dl 1667987396 ref 2 fl Rpc:XNr/0/ffffffff rc 0/-1 job:''
      Nov  9 18:49:56 anila kernel: [ 3703.943826] Lustre: 4872:0:(client.c:2284:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Nov  9 18:49:56 anila kernel: [ 3703.943840] Lustre: exafs-OST0002-osc-ffff97086cd1d800: Connection to exafs-OST0002 (at 192.168.10.44@tcp) was lost; in progress operations using this service     will wait for recovery to complete
      Nov  9 18:49:56 anila kernel: [ 3703.943846] LustreError: 166-1: MGC192.168.10.43@tcp: Connection to MGS (at 192.168.10.43@tcp) was lost; in progress operations using this service will fail
      Nov  9 18:49:58 anila kernel: [ 3705.255880] LustreError: 4849:0:(events.c:211:client_bulk_callback()) event type 1, status -5, desc 000000005fad487b
      Nov  9 18:50:10 anila kernel: [ 3717.255825] Lustre: exafs-MDT0001-mdc-ffff97086cd1d800: Connection to exafs-MDT0001 (at 192.168.10.44@tcp) was lost; in progress operations using this service     will wait for recovery to complete
      Nov  9 18:50:10 anila kernel: [ 3717.953515] Lustre: exafs-MDT0001-mdc-ffff97086cd1d800: Connection restored to 192.168.10.44@tcp (at 192.168.10.44@tcp)
      Nov  9 18:50:16 anila kernel: [ 3723.655861] LustreError: 4841:0:(events.c:211:client_bulk_callback()) event type 1, status -113, desc 00000000eea9e18b
      Nov  9 18:50:16 anila kernel: [ 3723.655955] LustreError: 4841:0:(events.c:211:client_bulk_callback()) Skipped 7 previous similar messages

      Pattern 2: revalidate FID error (during the machine shutdown process)

      LNetError: 4850:0:(lib-move.c:2250:lnet_handle_find_routed_path()) peer 192.168.10.43@tcp has no available nets
      LNetError: 4860:0:(lib-move.c:2250:lnet_handle_find_routed_path()) Skipped 27 previous similar messages
      systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 12341.
      LustreError: 9158:0:(file.c:5124:ll_inode_revalidate_fini()) exafs: revalidate FID [0x200000007:0x1:0x0] error: rc = -4
      

      Pattern 3: Bad RIP value

      Nov 15 18:19:34 anila kernel: [ 2660.738158] INFO: task sync:5644 blocked for more than 120 seconds.
      Nov 15 18:19:34 anila kernel: [ 2660.738223] Tainted: P OE 5.4.0-131-generic #147-Ubuntu
      Nov 15 18:19:34 anila kernel: [ 2660.738274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Nov 15 18:19:34 anila kernel: [ 2660.738333] sync D 0 5644 5172 0x00004000
      Nov 15 18:19:34 anila kernel: [ 2660.738337] Call Trace:
      Nov 15 18:19:34 anila kernel: [ 2660.738350] __schedule+0x2e3/0x740
      Nov 15 18:19:34 anila kernel: [ 2660.738355] schedule+0x42/0xb0
      Nov 15 18:19:34 anila kernel: [ 2660.738364] wb_wait_for_completion+0x56/0x90
      Nov 15 18:19:34 anila kernel: [ 2660.738371] ? __wake_up_pollfree+0x40/0x40
      Nov 15 18:19:34 anila kernel: [ 2660.738375] sync_inodes_sb+0xd8/0x2a0
      Nov 15 18:19:34 anila kernel: [ 2660.738382] sync_inodes_one_sb+0x15/0x20
      Nov 15 18:19:34 anila kernel: [ 2660.738389] iterate_supers+0xa3/0x100
      Nov 15 18:19:34 anila kernel: [ 2660.738393] ? default_file_splice_write+0x30/0x30
      Nov 15 18:19:34 anila kernel: [ 2660.738398] ksys_sync+0x42/0xb0
      Nov 15 18:19:34 anila kernel: [ 2660.738402] __ia32_sys_sync+0xe/0x20
      Nov 15 18:19:34 anila kernel: [ 2660.738409] do_syscall_64+0x57/0x190
      Nov 15 18:19:34 anila kernel: [ 2660.738415] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Nov 15 18:19:34 anila kernel: [ 2660.738419] RIP: 0033:0x7f34a384022b
      Nov 15 18:19:34 anila kernel: [ 2660.738428] Code: Bad RIP value.
      Nov 15 18:19:34 anila kernel: [ 2660.738431] RSP: 002b:00007ffd325c6158 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
      Nov 15 18:19:34 anila kernel: [ 2660.738434] RAX: ffffffffffffffda RBX: 00007ffd325c6298 RCX: 00007f34a384022b
      Nov 15 18:19:34 anila kernel: [ 2660.738436] RDX: 00007f34a391a501 RSI: 0000000000000000 RDI: 00007f34a38e0bc0
      Nov 15 18:19:34 anila kernel: [ 2660.738437] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
      Nov 15 18:19:34 anila kernel: [ 2660.738439] R10: 00007f34a3914640 R11: 0000000000000246 R12: 0000000000000000
      Nov 15 18:19:34 anila kernel: [ 2660.738440] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      

      Pattern 4: Out of memory

      The kernel logged an out-of-memory message.

      Tests that caused the error

      Write a 20GB file

      1. I added 100 ms of latency to the network used for Lustre, set with the tc command (a sketch of such a setup follows this list). The purpose of the latency is to simulate a client that is far away from the Lustre servers.
      2. I ran the following command to write a 20GB file.
       # ./ior --dataPacketType=timestamp -C -Q 1 -g -G 1836258435 -k -e -o /mnt/lustre/io500/test/ior_file_easy -t 1m -F -w -D 300 -O stoneWallingWearOut=1 -a POSIX -b 20480m
      
      • fsync was executed because the -e option was specified.
      • The error did not occur when sync was not executed.
      • I think that syncing a 20GB file is an ordinary operation.
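
      For reference, a minimal sketch of how such a delay can be injected with tc netem; eth0 is a placeholder for whichever interface carries the Lustre traffic:

       # tc qdisc add dev eth0 root netem delay 100ms   # add 100 ms of egress delay
       # tc qdisc show dev eth0                         # confirm the netem qdisc is active
       # tc qdisc del dev eth0 root                     # remove the delay after the test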

      Execute IOR and MDTest continuously

      1. I did not set any latency.
      2. I executed IOR and MDTest continuously, without fsync (the drop_caches writes between the runs are explained after the list below).
       # mpiexec -np 4 ./ior --dataPacketType=timestamp -g -G 1836258167 -k -o /mnt/lustre/io500/test/ior-easy/ior_file_easy -t 1m -b 25600m -F -w -D 300 -O stoneWallingWearOut=1 -a POSIX
       # sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches"
       # mpiexec -np 4 ./mdtest --dataPacketType=timestamp -n 1000000 -u -L -F -P -G -1572531335 -N 0 -d /mnt/lustre/io500/test/mdtest-easy -C -W 300 -a POSIX
       # sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches"
       # mpiexec -np 4 ./ior --dataPacketType=timestamp -g -G=1836268855 -k -o /mnt/lustre/io500/test/ior-hard/file -t 47008 -b 47008 -s 500000 -w -D 300 -a POSIX -O stoneWallingWearOut=1
       # sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches"
       # mpiexec -np 4 ./mdtest --dataPacketType=timestamp -n 1000000 -t -w 3901 -e 3901 -P -G=-1572520647 -N 0 -F -d /mnt/lustre/io500/test/mdtest-hard -C -W 300 -a POSIX
      
      • IOR-easy write: 100 GB
      • MDTest-easy write: 4,000,000 files
      • IOR-hard write: 500,000 segments (about 85 GB)
      • MDTest-hard write: 4,000,000 files
      • The error did not occur when I did not execute fsync.
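
      For reference, the drop_caches values used above are documented kernel behavior. Note that drop_caches only discards clean cached objects and never writes dirty data back, which is why dirty data still has to be flushed by sync or fsync:

       # echo 1 > /proc/sys/vm/drop_caches   # drop the page cache only
       # echo 2 > /proc/sys/vm/drop_caches   # drop dentries and inodes only
       # echo 3 > /proc/sys/vm/drop_caches   # drop both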

      Sync a lot of files

      1. I did not set any latency.
      2. I ran MDTest without fsync, so a lot of files were cached in the WBC.
      3. I ran sync.
       # mpiexec -np 4 ./mdtest --dataPacketType=timestamp -n 1000000 -u -L -F -P -G -1572531335 -N 0 -d /mnt/lustre/test/mdtest-easy -C -W 300 -a POSIX
       # sync
      

      My prediction

      Releasing (flushing) the WBC takes a long time when latency is set or when a lot of files must be synced at once. As a result, requests time out.

      No error occurred when I ran IOR/MDTest with the fsync option in the no-latency environment, so I think that syncing in small steps inhibits this error (a sketch of such incremental syncing follows).
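
      A minimal sketch of what such incremental syncing could look like as a workaround: a background loop that runs a per-filesystem sync (sync -f) every 10 seconds, so dirty data is written back in small batches instead of in one huge sync at the end. This assumes /mnt/lustre is the Lustre mount point and is hypothetical, not tested with WBC:

       # ( while true; do sync -f /mnt/lustre; sleep 10; done ) &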


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Honjo Rikimaru (honjo.rikimaru)