Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-20421

Corrupted entries caused by a distributed transaction failure: top_trans_stop()... write updates failed: rc = -116

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • None
    • Lustre 2.15.8
    • Lustre 2.15.8.ocean1
    • 2
    • 9223372036854775807

      Context:

      The CEA was performing an online update on the backend storage (SFAOS).
      For MDTs, they moved each target one-by-one on the failover controller to shutdown the VMs and update the controller. They waited the recovery for each failover.

      A Robinhood trash policy was running in the background. This policy move cold entries from user directories to a replicate trash path (and then chown root:root). The trash directories are created using DNEv3 so the rename create a distributed transaction between 2 MDTs (the source parent+inode and the target parent are not on the same MDT).

      Files:

      [root@gscluster990 ~]# ll /ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-modif/SH8/Out/
      ls: cannot access '/ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-modif/SH8/Out/create_ts.20101231.3D.stack.3746263': No such file or directory
      total 0
      -????????? ? ? ? ? ? create_ts.20101231.3D.stack.3746263
      [root@gscluster990 ~]# lfs getstripe -m /ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-modif/SH8/Out/
      5
      [root@gscluster990 ~]# lfs path2fid /ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-mo
      dif/SH8/Out/
      [30000fd00:1d7cb:0]
      
      debugfs: dump oi.16.0 /tmp/oi.16.0
      [root@gscluster402 ~]# .../check_iam -r /tmp/oi.16.0 | grep 30000fd00:1d7cb:0
      [30000fd00:1d7cb:0] 782762811/2181992872
      
      debugfs: ls -lD <782762811>
       782762811 40755 (2) 0 0 36864 3-May-2026 11:04 .
       782762810 40755 (18) 0 0 4096 3-May-2026 06:54 fid:[30000fd00:1d7ca:0] ..
            0 0 (17) 0 0 0 fid:[28000edc1:185a2:0] pack_restart.18561231.out
            0 0 (17) 0 0 0 fid:[3c000dc2e:6263:0] SH8_18960101_18961231_out_execution
            0 0 (17) 0 0 0 fid:[28000eddd:11c87:0] pack_output.18771231.out
            0 0 (17) 0 0 0 fid:[28000eddd:16a51:0] create_ts.18701231.3D.stack.2378519
       782776477 100000 (17) 0 0 0 1-Jan-1970 01:00 fid:[28000ee60:a06a:0] create_ts.20101231.3D.stack.3746263
            0 0 (17) 0 0 0 fid:[28000ee3c:3193:0] monitoring.19401231.out
            0 0 (17) 0 0 0 fid:[28000ee4a:11cee:0] create_se.19891231.stack.1888794
            0 0 (17) 0 0 0 fid:[28000edc1:185ad:0] pack_restart.18561231.stack.3847579
            0 0 (17) 0 0 0 fid:[28000eddd:e60:0] monitoring.18561231.stack.622619o
      
      debugfs: stat <782776477>
      Inode: 782776477 Type: regular Mode: 0000 Flags: 0
      Generation: 3046618564 Version: 00000000:00000000
      User: 0 Group: 0 Project: 0 Size: 0
      File ACL: 0
      Links: 1 Blockcount: 0
      Fragment: Address: 0 Number: 0 Size: 0
       ctime: 69ba72c2:257fc840 -- Wed Mar 18 10:39:14 2026
       atime: 00000000:fffffff8 -- Thu Jan 1 01:00:00 1970
       mtime: 00000000:fffffff8 -- Thu Jan 1 01:00:00 1970
      crtime: 69ba72c2:257fc840 -- Wed Mar 18 10:39:14 2026
      Size of extra inode fields: 32
      Extended attributes:
        lma: fid=[28000ee60:a06a:0] compat=0 incompat=2
      BLOCKS:
      
      28000ee60 -> MDT0004
      

      Changelogs (robinhood output for 0x28000ee60:0xa06a:0x0):

      2026/01/17 10:30:28 [1651060/6] scratch-MDT0004: 3760865736 11CLOSE 1768642221.567979397 442 t=[28000ee60:a06a:0] J=ppi.b3321114.11758.33227
      2026/01/17 10:30:28 [1651060/6] scratch-MDT0004: 3760865737 11CLOSE 1768642221.587415319 442 t=[28000ee60:a06a:0] J=ppi.b3321114.11758.33227
      2026/03/18 10:50:59 [505948/6] scratch-MDT0004: 4511524511 12LYOUT 1773826753.850764555 0 t=[28000ee60:a06a:0] J=glc51.0.u_NOLOGIN
      2026/03/18 10:50:59 [505948/6] scratch-MDT0004: 4511524512 11CLOSE 1773826753.851419095 3 t=[28000ee60:a06a:0] J=glc51.0.u_NOLOGIN
      2026/03/18 10:42:08 [505948/7] scratch-MDT0005: 3605569674 08RENME 1773826754.157766613 0 t=[0:0:0] p=[30000fd00:1d7cb:0] create_ts.20101231.3D.stack.3746263 s=[28000ee60:a06a:0] sp=[28000db7d:1c08:0] create_ts.20101231.3D.stack.3746263 J=glc51.0.u_NOLOGIN <--------------
      2026/03/18 10:50:59 [505948/6] scratch-MDT0004: 4511524556 14SATTR 1773826754.503069056 e4 t=[28000ee60:a06a:0] J=glc51.0.u_NOLOGIN
      2026/04/02 03:26:38 [715108/6] scratch-MDT0004: 4817815561 15XATTR 1775093192.518982693 0 t=[28000ee60:a06a:0] J=glc490.s0.robinhood
      2026/05/02 04:25:24 [12714/6] scratch-MDT0004: 6054439024 06UNLNK 1777688708.755896028 1 t=[28000ee60:a06a:0] p=[28000db7d:1c08:0]  create_ts.20101231.3D.stack.3746263 J=glc490.s0.robinhood        <----------
      

      The last unlink (CLF_UNLINK_LAST -> linkcount=0) is done with the old parent before the rename: the parent for the last unlink should be p=[0x30000fd00:0x1d7cb:0x0] instead of [0x28000db7d:0x1c08:0x0]).
      The rename adds a new link on the target parent without removing the old link on the source parent.

      Logs
      HA actions:

      Mar 18 10:36:55 gscluster400 ccc_root[23170]: admin[5499] yyyy 51330 422: pcs status
      Mar 18 10:37:38 gscluster400 ccc_root[24804]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0003 gscluster400
      Mar 18 10:39:16 gscluster400 ccc_root[29332]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0004 gscluster401
      Mar 18 10:42:24 gscluster400 ccc_root[35890]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0007 gscluster403
      Mar 18 10:45:48 gscluster400 ccc_root[42623]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0007 gscluster402
      Mar 18 10:48:03 gscluster400 ccc_root[48965]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0008 gscluster403
      Mar 18 10:51:14 gscluster400 ccc_root[55862]: admin[5499] yyyy 51330 422: pcs node standby gscluster404 gscluster405 gscluster406 gscluster407
      

      gscluster405 (MDT0004 failover due to "pcs ressource move"):

      2026-03-18 10:39:17 [3716382.250929] Lustre: Failing over scratch-MDT0004
      

      gscluster402 (MDT0005 rename transaction master for the rename op):

      Mar 18 10:39:14 gscluster402 kernel: LustreError: 9260:0:(update_trans.c:1062:top_trans_stop()) scratch-MDT0004-osp-MDT0005: write updates failed: rc = -116 
      

      gscluster401 (MDT0004 remount on gscluster401):

      Mar 18 10:39:52 gscluster401 kernel: LDISKFS-fs (sde): mounted filesystem with ordered data mode. Opts: user_xattr,errors=panic,journal_checksum,no_mbcache,nodelalloc
      Mar 18 10:41:57 gscluster401 kernel: Lustre: scratch-MDT0004: Recovery over after 1:34, of 2418 clients 2418 recovered and 0 were evicted.
      Mar 18 10:42:00 gscluster401 ccc_root[22850]: admin[18816] yyyy 56222 422: grep . /proc/fs/lustre/*/*/recovery_status
      Mar 18 10:42:24 gscluster401 kernel: LustreError: 11-0: scratch-MDT0007-osp-MDT0002: operation out_update to node xxxx@o2ib10 failed: rc = -19
      Mar 18 10:42:24 gscluster401 kernel: Lustre: scratch-MDT0007-osp-MDT0002: Connection to scratch-MDT0007 (at xxxx@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      Mar 18 10:45:47 gscluster401 kernel: Lustre: mdt06_004: service thread pid 8228 was inactive for 201.746 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Mar 18 10:45:47 gscluster401 kernel: Lustre: mdt07_045: service thread pid 8503 was inactive for 202.172 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Mar 18 10:45:47 gscluster401 kernel: Lustre: Skipped 2 previous similar messages
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_completion_ast+7b0/900 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue_fini+a51/f50 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue+621/a70 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] osp_md_object_lock+1d1/2c0 [osp]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] lod_object_lock+132/800 [lod]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock_try+186/3c0 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock+1f/30 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_rename_lock+26f/3b0 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_rename+c77/1d80 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_rec+11f/270 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_internal+4d0/7e0 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint+5d/110 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] tgt_request_handle+ccd/1b10 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ptlrpc_server_handle_request+23f/c60 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ptlrpc_main+c1d/1580 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] kthread+134/150
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ret_from_fork+1f/40
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_completion_ast+7b0/900 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue_fini+a51/f50 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue+621/a70 [ptlrpc]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] osp_md_object_lock+1d1/2c0 [osp]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] lod_object_lock+132/800 [lod]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock_try+186/3c0 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock+1f/30 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_rename_lock+26f/3b0 [mdt]
      Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_rename+c77/1d80 [mdt]
      ...
      Mar 18 10:47:29 gscluster401 kernel: LustreError: 8296:0:(ldlm_request.c:140:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1773826945, 304s ago), entering recovery for scratch-MDT0000_UUID@xxxx@o2ib10 ns: scratch-MDT0000-osp-MDT0002 lock: 000000005744851a/31e6cbc9ad6ddbf3 lrc: 4/0,1 mode: --/EX res: [200000004:1:0].0 bits 2/0 rrc: 19 type: IBT gid 0 flags: 1000001000000 nid: local remote: 5a4bc09b47ef2ea1 expref: -99 pid: 8296 timeout: 0 lvb_type: 0
      

      Lustre Big Fs Lock fid: LUSTRE_BFL_FID = 0x200000004:0x1:0x0

      Analyse:

      The rename changelog time matches the transaction failure:

      2026/03/18 10:42:08 [505948/7] scratch-MDT0005: 3605569674 08RENME 1773826754.157766613 0 t=[0:0:0] p=[30000fd00:1d7cb:0] create_ts.20101231.3D.stack.3746263 s=[28000ee60:a06a:0] sp=[28000db7d:1c08:0] create_ts.20101231.3D.stack.3746263 J=glc51.0.u_NOLOGIN
      
      $ date -d@1773826754.157766613
      Wed Mar 18 10:39:14 AM CET 2026
      
      Mar 18 10:39:14 gscluster402 kernel: LustreError: 9260:0:(update_trans.c:1062:top_trans_stop()) scratch-MDT0004-osp-MDT0005: write updates failed: rc = -116 
      

      After verification (error injection in top_trans_stop()), the master mds is not able to abort/cancel the local sub transaction (osd_trans_stop()) if another remote sub transaction failed. The local change is committed even if a remote sub transaction failed -> corruption.

      I’m not certain whether the failover of remote MDT0004 is related, as it appears to occur after the transaction fails ("top_trans_stop() rc = -116" -> 10:39:14 ", "MDT0004 failover" -> 10:39:17).

      I have tried to reproduce by injecting some delays in top_trans_stop() but if a remote target failover, top_trans_stop() waits for the missing remote MDT.
      I have reproduced this by manually evicting+failover the remote MDT during top_trans_stop().

            eaujames Etienne Aujames
            eaujames Etienne Aujames
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated: