Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14598

Too many FIDs to precreate OST replaced. Wrong IDIF logic.

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.12.8, Lustre 2.15.0
    • Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0
    • 3
    • 9223372036854775807

    Description

      During testing, job failure was observed

      V-1: Entering mdtest_stat...
      12/01/2020 04:46:28: Process 30(nid00514): FAILED in mdtest_stat, unable to stat file: No such file or directory
      Rank 30 [Tue Dec  1 04:46:28 2020] [c0-1c2s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
      12/01/2020 04:46:29: Process 3(nid00514): FAILED in mdtest_stat, unable to stat file: No such file or directory
      Rank 3 [Tue Dec  1 04:46:29 2020] [c0-1c2s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
      _pmiu_daemon(SIGCHLD): [NID 00514] [c0-1c2s0n2] [Tue Dec  1 04:46:29 2020] PE RANK 3 exit signal Aborted
      [NID 00514] 2020-12-01 04:46:29 Apid 7421368: initiated application termination
      Application 7421368 exit codes: 134
      Application 7421368 resources: utime ~0s, stime ~8s, Rss ~9528, inblocks ~0, outblocks ~0
      Job Script: command stopped at Tue Dec 1 04:46:36 CST 2020
      Job Script: command runtime was 240 seconds
      drwxr-xr-x 3 vers tsttool 4096 Dec  1 04:42 /lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo
      ls: cannot access '/lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo/#test-dir.0/mdtest_tree.0.0/file.mdtest.0.5023': No such file or directory
      ls: cannot access '/lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo/#test-dir.0/mdtest_tree.0.0/file.mdtest.0.4854': No such file or directory
      ...
      Dec  1 04:45:12 snx11281n002 kernel: Lustre: 15664:0:(client.c:2188:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606819417/real 1606819417]  req@ffff94b0eb7d9680 x1684868787957760/t0(0) o5->snx11281-OST0001-osc-MDT0000@10.10.100.6@o2ib3:28/4 lens 432/432 e 0 to 1 dl 1606819512 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 job:''
      Dec  1 04:45:12 snx11281n002 kernel: Lustre: snx11281-OST0001-osc-MDT0000: Connection to snx11281-OST0001 (at 10.10.100.6@o2ib3) was lost; in progress operations using this service will wait for recovery to complete
      Dec  1 04:45:12 snx11281n002 kernel: LustreError: 15664:0:(osp_precreate.c:687:osp_precreate_send()) snx11281-OST0001-osc-MDT0000: can't precreate: rc = -11
      Dec  1 04:45:12 snx11281n002 kernel: LustreError: 15664:0:(osp_precreate.c:1358:osp_precreate_thread()) snx11281-OST0001-osc-MDT0000: cannot precreate objects: rc = -11
      

      With a debug patch we have catch vmcore with additional info.

      [ 1401.294579] Lustre: format at ofd_dev.c:1651:ofd_create_hdl doesn't end in newline
      [ 1401.294587] Lustre: snx11281-OST0004: Too many FIDs to precreate OST replaced or reformatted: LFSCK will clean up
      [ 1401.294587] LustreError: 19714:0:(ofd_dev.c:1665:ofd_create_hdl()) ASSERTION( diff <= 5 * OST_MAX_PRECREATE ) failed: snx11281-OST0004: precreate FID 0x0:4395085801 is over 100000 larger than the LAST_ID 0x0:4294967297 diff 100118504
      [ 1401.294589] LustreError: 19714:0:(ofd_dev.c:1665:ofd_create_hdl()) LBUG
      

      Here is a root cause.
      ofd_preprw_write() uses a wrong logic for getting sequence from a fid.
      During recovery ofd_preprw_write took a wrong sequence 0x100040001 from fid, and loaded it for ofd device. After that OST got creation request from MDT0, ofd_create_hdl found loaded IDIF sequence and used it. Since sequence was a new it has a large oid difference with MDT0.

      Attachments

        Issue Links

          Activity

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: