Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15776

2.15 RC3: lost writes during server fofb by forced panics

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.15.0
    • Lustre 2.15.0
    • 3
    • 9223372036854775807

    Description

      data miscompare seen during server failover by forced panics for an IOR job (ill formed IO).  

      Error Output :

       

      [13] Expected: 0x0000000000029ff8

      [13] Actual:   0x0000000000000000

      [13] FAILED comparison of buffer containing 8-byte ints:

      [13]   File name = /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021

      [13]   In transfer 152, 21504 errors between buffer indices 0 and 21503.

      [13]   File byte offset = 37576704:

      [13]     Expected: 0x000000156261b63a 0000000000000008 000000156261b63a 0000000000000018

      [13]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000

        • error **

      ERROR in IOR.c (line 448): data check error, aborting execution.

      ERROR: No such file or directory

        • exiting **

       

      Client involved during this corruption:

       

      task 13 on nid00097

      task 21 on nid00099

       

      saturn-p2:/ # lfs getstripe -r -y /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021

        lcm_layout_gen:    7

        lcm_mirror_count:  1

        lcm_entry_count:   5

        component0:

          lcme_id:             1

          lcme_mirror_id:      0

          lcme_flags:          init

          lcme_extent.e_start: 0

          lcme_extent.e_end:   1048576

          sub_layout:

            lmm_stripe_count:  0

            lmm_stripe_size:   1048576

            lmm_pattern:       mdt

            lmm_layout_gen:    0

            lmm_stripe_offset: 0

       

        component1:

          lcme_id:             2

          lcme_mirror_id:      0

          lcme_flags:          init

          lcme_extent.e_start: 1048576

          lcme_extent.e_end:   269484032

          sub_layout:

            lmm_stripe_count:  1

            lmm_stripe_size:   1048576

            lmm_pattern:       raid0

            lmm_layout_gen:    0

            lmm_stripe_offset: 3

            lmm_pool:          ost2

            lmm_objects:

            - l_ost_idx: 3

              l_fid:     0x100030000:0x83fae34:0x0

       

        component2:

          lcme_id:             3

          lcme_mirror_id:      0

          lcme_flags:          extension

          lcme_extent.e_start: 269484032

          lcme_extent.e_end:   536870912

          sub_layout:

            lmm_stripe_count:  0

            lmm_extension_size: 134217728

            lmm_pattern:       raid0

            lmm_layout_gen:    0

            lmm_stripe_offset: -1

            lmm_pool:          ost2

       

        component3:

          lcme_id:             4

          lcme_mirror_id:      0

          lcme_flags:          0

          lcme_extent.e_start: 536870912

          lcme_extent.e_end:   536870912

          sub_layout:

            lmm_stripe_count:  -1

            lmm_stripe_size:   1048576

            lmm_pattern:       raid0

            lmm_layout_gen:    0

            lmm_stripe_offset: -1

            lmm_pool:          ost2

       

        component4:

          lcme_id:             5

          lcme_mirror_id:      0

          lcme_flags:          extension

          lcme_extent.e_start: 536870912

          lcme_extent.e_end:   EOF

          sub_layout:

            lmm_stripe_count:  0

            lmm_extension_size: 268435456

            lmm_pattern:       raid0

            lmm_layout_gen:    0

            lmm_stripe_offset: -1

            lmm_pool:          ost2

       

       

       

      Console Logs:

       

      2022-04-21T14:55:08.086885-05:00 c0-0c1s8n0 Lustre: 9811:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887]  req@000000007bfb28c1 x1730745432703552/t0(0) o400->snx11922-OST0003-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

      2022-04-21T14:55:08.086967-05:00 c0-0c1s8n0 Lustre: snx11922-OST0003-osc-ffff88878818e800: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

      2022-04-21T14:55:08.086988-05:00 c0-0c1s8n0 Lustre: 9809:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887]  req@000000008ee911d3 x1730745432703488/t0(0) o400->snx11922-OST0002-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

      2022-04-21T14:55:08.087004-05:00 c0-0c1s8n0 Lustre: snx11922-OST0002-osc-ffff88878818e800: Connection to snx11922-OST0002 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

      2022-04-21T14:55:08.087023-05:00 c0-0c1s8n2 Lustre: 9808:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887]  req@00000000e7ca3f51 x1730745432704128/t0(0) o400->snx11922-OST0003-osc-ffff8887854c2000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

      2022-04-21T14:55:08.087040-05:00 c0-0c1s8n2 Lustre: snx11922-OST0003-osc-ffff8887854c2000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

      2022-04-21T14:55:08.087076-05:00 c0-0c1s8n1 Lustre: 9797:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889]  req@0000000039643501 x1730745432705728/t0(0) o400->snx11922-OST0003-osc-ffff88838195c000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

      2022-04-21T14:55:08.087086-05:00 c0-0c1s8n1 Lustre: snx11922-OST0003-osc-ffff88838195c000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

      2022-04-21T14:55:08.087114-05:00 c0-0c1s8n3 Lustre: 9786:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889]  req@000000008cb8442c x1730745432700352/t0(0) o400->snx11922-OST0002-osc-ffff888386f33000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

      2022-04-21T14:55:08.087129-05:00 c0-0c1s8n3 Lustre: snx11922-OST0003-osc-ffff888386f33000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

       

      Kern logs :

      Apr 21 14:55:25 snx11922n004 kernel: md: md3: resync interrupted.

      Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): recovery complete

      Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=panic,errors=panic,journal_checksum,no_mbcache,nodelalloc

      Apr 21 14:55:27 snx11922n003 kernel: Lustre: snx11922-MDT0001: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4

      Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Imperative Recovery enabled, recovery window shrunk from 900-2700 down to 450-2700

      Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: in recovery but waiting for the first client to connect

      Apr 21 14:55:27 snx11922n004 kernel: md: resync of RAID array md3

      Apr 21 14:55:27 snx11922n004 kernel: [204833]:md.c:10632:md_do_sync(): md: md3: resync range: [0, 30316376064)

      Apr 21 14:55:27 snx11922n002 kernel: Lustre: snx11922-MDT0000: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4

      Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Will be in recovery for at least 7:30, or until 16 clients reconnect

      Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Timed out tx for 10.12.0.53@o2ib4: 2 seconds

      Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Skipped 1 previous similar message

      Apr 21 14:55:48 snx11922n003 kernel: Lustre: snx11922-OST0002-osc-MDT0001: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4)

      Apr 21 14:55:48 snx11922n001 kernel: Lustre: snx11922-OST0002-osc-ffff998fa4d8d000: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4)

      Apr 21 14:55:48 snx11922n004 kernel: Lustre: snx11922-OST0002: Recovery over after 0:32, of 16 clients 16 recovered and 0 were evicted.

       

      Attachments

        Issue Links

          Activity

            People

              shadow Alexey Lyashkov
              m,prabu Prabu Manoharan
              Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: