Loading...

XML

Word

Printable

Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.15.8
Labels:
- corruption
- dne3
Environment:
Lustre 2.15.8.ocean1

Epic/Theme:
- dne
Severity:
2
Rank (Obsolete):
9223372036854775807

Context:

The CEA was performing an online update on the backend storage (SFAOS).
For MDTs, they moved each target one-by-one on the failover controller to shutdown the VMs and update the controller. They waited the recovery for each failover.

A Robinhood trash policy was running in the background. This policy move cold entries from user directories to a replicate trash path (and then chown root:root). The trash directories are created using DNEv3 so the rename create a distributed transaction between 2 MDTs (the source parent+inode and the target parent are not on the same MDT).

Files:

[root@gscluster990 ~]# ll /ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-modif/SH8/Out/
ls: cannot access '/ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-modif/SH8/Out/create_ts.20101231.3D.stack.3746263': No such file or directory
total 0
-????????? ? ? ? ? ? create_ts.20101231.3D.stack.3746263
[root@gscluster990 ~]# lfs getstripe -m /ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-modif/SH8/Out/
5
[root@gscluster990 ~]# lfs path2fid /ccc/scratch/.cccdump/trash/.../IGCM_OUT/OL2/PROD/NMIP3-mo
dif/SH8/Out/
[30000fd00:1d7cb:0]

debugfs: dump oi.16.0 /tmp/oi.16.0
[root@gscluster402 ~]# .../check_iam -r /tmp/oi.16.0 | grep 30000fd00:1d7cb:0
[30000fd00:1d7cb:0] 782762811/2181992872

debugfs: ls -lD <782762811>
 782762811 40755 (2) 0 0 36864 3-May-2026 11:04 .
 782762810 40755 (18) 0 0 4096 3-May-2026 06:54 fid:[30000fd00:1d7ca:0] ..
      0 0 (17) 0 0 0 fid:[28000edc1:185a2:0] pack_restart.18561231.out
      0 0 (17) 0 0 0 fid:[3c000dc2e:6263:0] SH8_18960101_18961231_out_execution
      0 0 (17) 0 0 0 fid:[28000eddd:11c87:0] pack_output.18771231.out
      0 0 (17) 0 0 0 fid:[28000eddd:16a51:0] create_ts.18701231.3D.stack.2378519
 782776477 100000 (17) 0 0 0 1-Jan-1970 01:00 fid:[28000ee60:a06a:0] create_ts.20101231.3D.stack.3746263
      0 0 (17) 0 0 0 fid:[28000ee3c:3193:0] monitoring.19401231.out
      0 0 (17) 0 0 0 fid:[28000ee4a:11cee:0] create_se.19891231.stack.1888794
      0 0 (17) 0 0 0 fid:[28000edc1:185ad:0] pack_restart.18561231.stack.3847579
      0 0 (17) 0 0 0 fid:[28000eddd:e60:0] monitoring.18561231.stack.622619o

debugfs: stat <782776477>
Inode: 782776477 Type: regular Mode: 0000 Flags: 0
Generation: 3046618564 Version: 00000000:00000000
User: 0 Group: 0 Project: 0 Size: 0
File ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
 ctime: 69ba72c2:257fc840 -- Wed Mar 18 10:39:14 2026
 atime: 00000000:fffffff8 -- Thu Jan 1 01:00:00 1970
 mtime: 00000000:fffffff8 -- Thu Jan 1 01:00:00 1970
crtime: 69ba72c2:257fc840 -- Wed Mar 18 10:39:14 2026
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[28000ee60:a06a:0] compat=0 incompat=2
BLOCKS:

28000ee60 -> MDT0004

Changelogs (robinhood output for 0x28000ee60:0xa06a:0x0):

2026/01/17 10:30:28 [1651060/6] scratch-MDT0004: 3760865736 11CLOSE 1768642221.567979397 442 t=[28000ee60:a06a:0] J=ppi.b3321114.11758.33227
2026/01/17 10:30:28 [1651060/6] scratch-MDT0004: 3760865737 11CLOSE 1768642221.587415319 442 t=[28000ee60:a06a:0] J=ppi.b3321114.11758.33227
2026/03/18 10:50:59 [505948/6] scratch-MDT0004: 4511524511 12LYOUT 1773826753.850764555 0 t=[28000ee60:a06a:0] J=glc51.0.u_NOLOGIN
2026/03/18 10:50:59 [505948/6] scratch-MDT0004: 4511524512 11CLOSE 1773826753.851419095 3 t=[28000ee60:a06a:0] J=glc51.0.u_NOLOGIN
2026/03/18 10:42:08 [505948/7] scratch-MDT0005: 3605569674 08RENME 1773826754.157766613 0 t=[0:0:0] p=[30000fd00:1d7cb:0] create_ts.20101231.3D.stack.3746263 s=[28000ee60:a06a:0] sp=[28000db7d:1c08:0] create_ts.20101231.3D.stack.3746263 J=glc51.0.u_NOLOGIN <--------------
2026/03/18 10:50:59 [505948/6] scratch-MDT0004: 4511524556 14SATTR 1773826754.503069056 e4 t=[28000ee60:a06a:0] J=glc51.0.u_NOLOGIN
2026/04/02 03:26:38 [715108/6] scratch-MDT0004: 4817815561 15XATTR 1775093192.518982693 0 t=[28000ee60:a06a:0] J=glc490.s0.robinhood
2026/05/02 04:25:24 [12714/6] scratch-MDT0004: 6054439024 06UNLNK 1777688708.755896028 1 t=[28000ee60:a06a:0] p=[28000db7d:1c08:0]  create_ts.20101231.3D.stack.3746263 J=glc490.s0.robinhood        <----------

The last unlink (CLF_UNLINK_LAST -> linkcount=0) is done with the old parent before the rename: the parent for the last unlink should be p=[0x30000fd00:0x1d7cb:0x0] instead of [0x28000db7d:0x1c08:0x0]).
The rename adds a new link on the target parent without removing the old link on the source parent.

Logs
HA actions:

Mar 18 10:36:55 gscluster400 ccc_root[23170]: admin[5499] yyyy 51330 422: pcs status
Mar 18 10:37:38 gscluster400 ccc_root[24804]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0003 gscluster400
Mar 18 10:39:16 gscluster400 ccc_root[29332]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0004 gscluster401
Mar 18 10:42:24 gscluster400 ccc_root[35890]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0007 gscluster403
Mar 18 10:45:48 gscluster400 ccc_root[42623]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0007 gscluster402
Mar 18 10:48:03 gscluster400 ccc_root[48965]: admin[5499] yyyy 51330 422: pcs resource move fs-scratch-MDT0008 gscluster403
Mar 18 10:51:14 gscluster400 ccc_root[55862]: admin[5499] yyyy 51330 422: pcs node standby gscluster404 gscluster405 gscluster406 gscluster407

gscluster405 (MDT0004 failover due to "pcs ressource move"):

2026-03-18 10:39:17 [3716382.250929] Lustre: Failing over scratch-MDT0004

gscluster402 (MDT0005 rename transaction master for the rename op):

Mar 18 10:39:14 gscluster402 kernel: LustreError: 9260:0:(update_trans.c:1062:top_trans_stop()) scratch-MDT0004-osp-MDT0005: write updates failed: rc = -116

gscluster401 (MDT0004 remount on gscluster401):

Mar 18 10:39:52 gscluster401 kernel: LDISKFS-fs (sde): mounted filesystem with ordered data mode. Opts: user_xattr,errors=panic,journal_checksum,no_mbcache,nodelalloc
Mar 18 10:41:57 gscluster401 kernel: Lustre: scratch-MDT0004: Recovery over after 1:34, of 2418 clients 2418 recovered and 0 were evicted.
Mar 18 10:42:00 gscluster401 ccc_root[22850]: admin[18816] yyyy 56222 422: grep . /proc/fs/lustre/*/*/recovery_status
Mar 18 10:42:24 gscluster401 kernel: LustreError: 11-0: scratch-MDT0007-osp-MDT0002: operation out_update to node xxxx@o2ib10 failed: rc = -19
Mar 18 10:42:24 gscluster401 kernel: Lustre: scratch-MDT0007-osp-MDT0002: Connection to scratch-MDT0007 (at xxxx@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Mar 18 10:45:47 gscluster401 kernel: Lustre: mdt06_004: service thread pid 8228 was inactive for 201.746 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 18 10:45:47 gscluster401 kernel: Lustre: mdt07_045: service thread pid 8503 was inactive for 202.172 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Mar 18 10:45:47 gscluster401 kernel: Lustre: Skipped 2 previous similar messages
Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_completion_ast+7b0/900 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue_fini+a51/f50 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue+621/a70 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] osp_md_object_lock+1d1/2c0 [osp]
Mar 18 10:45:47 gscluster401 kernel: [<0>] lod_object_lock+132/800 [lod]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock_try+186/3c0 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock+1f/30 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_rename_lock+26f/3b0 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_rename+c77/1d80 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_rec+11f/270 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_internal+4d0/7e0 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint+5d/110 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] tgt_request_handle+ccd/1b10 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] ptlrpc_server_handle_request+23f/c60 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] ptlrpc_main+c1d/1580 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] kthread+134/150
Mar 18 10:45:47 gscluster401 kernel: [<0>] ret_from_fork+1f/40
Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_completion_ast+7b0/900 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue_fini+a51/f50 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] ldlm_cli_enqueue+621/a70 [ptlrpc]
Mar 18 10:45:47 gscluster401 kernel: [<0>] osp_md_object_lock+1d1/2c0 [osp]
Mar 18 10:45:47 gscluster401 kernel: [<0>] lod_object_lock+132/800 [lod]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock_try+186/3c0 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_remote_object_lock+1f/30 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_rename_lock+26f/3b0 [mdt]
Mar 18 10:45:47 gscluster401 kernel: [<0>] mdt_reint_rename+c77/1d80 [mdt]
...
Mar 18 10:47:29 gscluster401 kernel: LustreError: 8296:0:(ldlm_request.c:140:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1773826945, 304s ago), entering recovery for scratch-MDT0000_UUID@xxxx@o2ib10 ns: scratch-MDT0000-osp-MDT0002 lock: 000000005744851a/31e6cbc9ad6ddbf3 lrc: 4/0,1 mode: --/EX res: [200000004:1:0].0 bits 2/0 rrc: 19 type: IBT gid 0 flags: 1000001000000 nid: local remote: 5a4bc09b47ef2ea1 expref: -99 pid: 8296 timeout: 0 lvb_type: 0

Lustre Big Fs Lock fid: LUSTRE_BFL_FID = 0x200000004:0x1:0x0

Analyse:

The rename changelog time matches the transaction failure:

2026/03/18 10:42:08 [505948/7] scratch-MDT0005: 3605569674 08RENME 1773826754.157766613 0 t=[0:0:0] p=[30000fd00:1d7cb:0] create_ts.20101231.3D.stack.3746263 s=[28000ee60:a06a:0] sp=[28000db7d:1c08:0] create_ts.20101231.3D.stack.3746263 J=glc51.0.u_NOLOGIN

$ date -d@1773826754.157766613
Wed Mar 18 10:39:14 AM CET 2026

Mar 18 10:39:14 gscluster402 kernel: LustreError: 9260:0:(update_trans.c:1062:top_trans_stop()) scratch-MDT0004-osp-MDT0005: write updates failed: rc = -116

After verification (error injection in top_trans_stop()), the master mds is not able to abort/cancel the local sub transaction (osd_trans_stop()) if another remote sub transaction failed. The local change is committed even if a remote sub transaction failed -> corruption.

I’m not certain whether the failover of remote MDT0004 is related, as it appears to occur after the transaction fails ("top_trans_stop() rc = -116" -> 10:39:14 ", "MDT0004 failover" -> 10:39:17).

I have tried to reproduce by injecting some delays in top_trans_stop() but if a remote target failover, top_trans_stop() waits for the missing remote MDT.
I have reproduced this by manually evicting+failover the remote MDT during top_trans_stop().

Assignee:: Etienne Aujames

Reporter:: Etienne Aujames

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Created:: 23/Jun/26 3:16 PM

Updated:: 24/Jun/26 9:56 AM

Details

Description

Attachments

Activity

People

Dates