Details
Type: Bug
Resolution: Fixed
Priority: Critical
None
3
Description
On our DNE testbed, every file in one of our sharded (striped) directories appears to be broken. Both servers and clients are currently running 2.8.0_0.0.llnlpreview.40 (see the lustre-release-fe-llnl repo).
We can get a directory listing, but none of the listed entries is accessible. Here is an excerpt from running ls -l:
# pwd
/p/lquake/casses1/opal-jet/simul_2
# ls -l
ls: cannot access simul_link.2243: No such file or directory
ls: cannot access simul_link.3161: No such file or directory
ls: cannot access simul_link.3129: No such file or directory
ls: cannot access simul_link.3893: No such file or directory
ls: cannot access simul_link.691: No such file or directory
ls: cannot access simul_link.3233: No such file or directory
ls: cannot access simul_link.235: No such file or directory
ls: cannot access simul_link.1653: No such file or directory
ls: cannot access simul_link.3167: No such file or directory
ls: cannot access simul_link.681: No such file or directory
ls: cannot access simul_link.835: No such file or directory
ls: cannot access simul_link.3857: No such file or directory
ls: cannot access simul_link.1591: No such file or directory
ls: cannot access simul_link.1175: No such file or directory
[cut]
-????????? ? ? ? ? ? simul_link.937
-????????? ? ? ? ? ? simul_link.94
-????????? ? ? ? ? ? simul_link.940
-????????? ? ? ? ? ? simul_link.941
-????????? ? ? ? ? ? simul_link.942
-????????? ? ? ? ? ? simul_link.943
-????????? ? ? ? ? ? simul_link.944
-????????? ? ? ? ? ? simul_link.947
[cut]
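The symptom above is that readdir() returns a name, but the follow-up stat of that name fails with ENOENT. This is the "ls: cannot access" error and the "?" fields. The following is a minimal sketch for counting how many entries in the directory are affected. The helper name find_inaccessible_entries is hypothetical and is not part of any Lustre tool. Entries that are removed concurrently between the listing and the stat would also be reported, so run it on a quiet directory.

```python
import errno
import os


def find_inaccessible_entries(path):
    """Return names that readdir() lists but lstat() reports as missing (ENOENT).

    This mirrors what `ls -l` does: list the directory, then stat each entry.
    Hypothetical diagnostic helper, not a Lustre utility.
    """
    missing = []
    for name in sorted(os.listdir(path)):
        try:
            # lstat() so that symlinks (e.g. simul_link.*) are not followed.
            os.lstat(os.path.join(path, name))
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                missing.append(name)
            else:
                raise  # other errors (EIO, EACCES, ...) are not this symptom
    return missing
```

For example, find_inaccessible_entries("/p/lquake/casses1/opal-jet/simul_2") would return the simul_link.* names that ls cannot access.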
Here is the striping information:
# lfs getdirstripe .
.
lmv_stripe_count: 16 lmv_stripe_offset: 12
mdtidx   FID[seq:oid:ver]
    12   [0x50000996c:0x14fed:0x0]
    13   [0x54000919d:0x14fed:0x0]
    14   [0x58000a086:0x14fed:0x0]
    15   [0x5c000996b:0x14fed:0x0]
     0   [0x200006b03:0x14fed:0x0]
     1   [0x3000089cc:0x14fed:0x0]
     2   [0x38000996d:0x14fed:0x0]
     3   [0x4c000b0df:0x14fed:0x0]
     4   [0x2c000a142:0xec09:0x0]
     5   [0x3c000b8b2:0xec09:0x0]
     6   [0x34000a143:0xec09:0x0]
     7   [0x40000a143:0xec09:0x0]
     8   [0x44000a142:0xec09:0x0]
     9   [0x24000a143:0xec09:0x0]
    10   [0x2800091a4:0xec09:0x0]
    11   [0x4800091a3:0xec09:0x0]
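In the layout above, the stripe list starts at MDT 12 (lmv_stripe_offset) and wraps around through MDT 11, giving 16 stripes in all. When debugging a specific broken name, it helps to know which stripe, and therefore which MDT, should hold its directory entry. The following is a minimal sketch of that mapping. It assumes the directory uses the FNV-1a 64-bit name hash taken modulo the stripe count, with the result indexing the stripe list in the order shown. We believe this is the default hash in this release, but the hash type is not shown in the output above, so treat it as an assumption. The function names are hypothetical.

```python
# Standard FNV-1a 64-bit parameters.
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF

# Stripe table copied from the `lfs getdirstripe` output:
# list position = stripe index, value = MDT index.
STRIPE_MDTS = [12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


def fnv1a_64(data: bytes) -> int:
    """Plain FNV-1a 64-bit hash: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


def stripe_for_name(name: str, stripe_count: int = len(STRIPE_MDTS)) -> int:
    """ASSUMPTION: stripe index = FNV-1a-64(name) mod stripe_count."""
    return fnv1a_64(name.encode()) % stripe_count


def mdt_for_name(name: str) -> int:
    """Map a file name to the MDT expected to hold its entry (under the assumption above)."""
    return STRIPE_MDTS[stripe_for_name(name)]
```

If the assumption holds, running mdt_for_name over the inaccessible names would show whether they cluster on particular MDTs, for example on the stripes whose FIDs have oid 0xec09 versus those with 0x14fed.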
I ran LFSCK on all services (at least those started by the "--all" option), but it did not fix the problem.
The problem files cannot be unlinked:
# rm simul_link.999
rm: cannot remove 'simul_link.999': No such file or directory