Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14598

Too many FIDs to precreate OST replaced. Wrong IDIF logic.

Details

    • 3
    • 9223372036854775807

    Description

      During testing, job failure was observed

      V-1: Entering mdtest_stat...
      12/01/2020 04:46:28: Process 30(nid00514): FAILED in mdtest_stat, unable to stat file: No such file or directory
      Rank 30 [Tue Dec  1 04:46:28 2020] [c0-1c2s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
      12/01/2020 04:46:29: Process 3(nid00514): FAILED in mdtest_stat, unable to stat file: No such file or directory
      Rank 3 [Tue Dec  1 04:46:29 2020] [c0-1c2s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
      _pmiu_daemon(SIGCHLD): [NID 00514] [c0-1c2s0n2] [Tue Dec  1 04:46:29 2020] PE RANK 3 exit signal Aborted
      [NID 00514] 2020-12-01 04:46:29 Apid 7421368: initiated application termination
      Application 7421368 exit codes: 134
      Application 7421368 resources: utime ~0s, stime ~8s, Rss ~9528, inblocks ~0, outblocks ~0
      Job Script: command stopped at Tue Dec 1 04:46:36 CST 2020
      Job Script: command runtime was 240 seconds
      drwxr-xr-x 3 vers tsttool 4096 Dec  1 04:42 /lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo
      ls: cannot access '/lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo/#test-dir.0/mdtest_tree.0.0/file.mdtest.0.5023': No such file or directory
      ls: cannot access '/lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo/#test-dir.0/mdtest_tree.0.0/file.mdtest.0.4854': No such file or directory
      ...
      Dec  1 04:45:12 snx11281n002 kernel: Lustre: 15664:0:(client.c:2188:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606819417/real 1606819417]  req@ffff94b0eb7d9680 x1684868787957760/t0(0) o5->snx11281-OST0001-osc-MDT0000@10.10.100.6@o2ib3:28/4 lens 432/432 e 0 to 1 dl 1606819512 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 job:''
      Dec  1 04:45:12 snx11281n002 kernel: Lustre: snx11281-OST0001-osc-MDT0000: Connection to snx11281-OST0001 (at 10.10.100.6@o2ib3) was lost; in progress operations using this service will wait for recovery to complete
      Dec  1 04:45:12 snx11281n002 kernel: LustreError: 15664:0:(osp_precreate.c:687:osp_precreate_send()) snx11281-OST0001-osc-MDT0000: can't precreate: rc = -11
      Dec  1 04:45:12 snx11281n002 kernel: LustreError: 15664:0:(osp_precreate.c:1358:osp_precreate_thread()) snx11281-OST0001-osc-MDT0000: cannot precreate objects: rc = -11
      

      With a debug patch we have catch vmcore with additional info.

      [ 1401.294579] Lustre: format at ofd_dev.c:1651:ofd_create_hdl doesn't end in newline
      [ 1401.294587] Lustre: snx11281-OST0004: Too many FIDs to precreate OST replaced or reformatted: LFSCK will clean up
      [ 1401.294587] LustreError: 19714:0:(ofd_dev.c:1665:ofd_create_hdl()) ASSERTION( diff <= 5 * OST_MAX_PRECREATE ) failed: snx11281-OST0004: precreate FID 0x0:4395085801 is over 100000 larger than the LAST_ID 0x0:4294967297 diff 100118504
      [ 1401.294589] LustreError: 19714:0:(ofd_dev.c:1665:ofd_create_hdl()) LBUG
      

      Here is a root cause.
      ofd_preprw_write() uses a wrong logic for getting sequence from a fid.
      During recovery ofd_preprw_write took a wrong sequence 0x100040001 from fid, and loaded it for ofd device. After that OST got creation request from MDT0, ofd_create_hdl found loaded IDIF sequence and used it. Since sequence was a new it has a large oid difference with MDT0.

      Attachments

        Issue Links

          Activity

            [LU-14598] Too many FIDs to precreate OST replaced. Wrong IDIF logic.
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49586/
            Subject: LU-14598 tests: skip conf-sanity test_122b in interop
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 22d3efff0df2365812ce23ea22c79ce13f9379ab

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49586/ Subject: LU-14598 tests: skip conf-sanity test_122b in interop Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 22d3efff0df2365812ce23ea22c79ce13f9379ab

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49583/
            Subject: LU-14598 tests: skip conf-sanity test_122b in interop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a9f83af896f2ce20e9ee430ad371b707c0c140cc

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49583/ Subject: LU-14598 tests: skip conf-sanity test_122b in interop Project: fs/lustre-release Branch: master Current Patch Set: Commit: a9f83af896f2ce20e9ee430ad371b707c0c140cc

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49586
            Subject: LU-14598 tests: skip conf-sanity test_122b in interop
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 988f86ee62c39488e47dfad7392f43e7fad02d17

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49586 Subject: LU-14598 tests: skip conf-sanity test_122b in interop Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 988f86ee62c39488e47dfad7392f43e7fad02d17
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16456 [ LU-16456 ]

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49583
            Subject: LU-14598 tests: skip conf-sanity test_122b in interop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8e15572e836a75bd52e6d249b59b76f340b97de3

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49583 Subject: LU-14598 tests: skip conf-sanity test_122b in interop Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8e15572e836a75bd52e6d249b59b76f340b97de3
            sarah Sarah Liu made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 29939 ]
            sarah Sarah Liu made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 29795 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.12.8 [ 15093 ]

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43541/
            Subject: LU-14598 ofd: fix for IDIF sequence at ofd_preprw_write
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 4a0290c6be47e3ecd8108f78961afb2e54d6fddf

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43541/ Subject: LU-14598 ofd: fix for IDIF sequence at ofd_preprw_write Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 4a0290c6be47e3ecd8108f78961afb2e54d6fddf

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: