[LU-7217] sanity-lfsck test_23b hangs on START_NAMESPACE Created: 28/Sep/15  Updated: 29/Sep/15  Resolved: 29/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: parinay v kondekar (Inactive) Assignee: nasf (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: File 23c.lctl.tgz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Configuration : 4 node setup : 1 MDS/ 1 OSS/ 2 clients
Release
Server 2.7.60
Client 2.7.60

stdout.log
Total allocated inode limit: 0, total allocated block limit: 0

== sanity-lfsck test 23c: LFSCK can repair dangling name entry (3) == 08:02:38 (1443427358)
#####
The objectA has multiple hard links, one of them corresponding
to the name entry_B. But there is something wrong for the name
entry_B and cause entry_B to references non-exist object_C.
In the first-stage scanning, the LFSCK will think the entry_B
as dangling, and re-create the lost object_C. And then others
modified the re-created object_C. When the LFSCK comes to the
second-stage scanning, it will find that the former re-creating
object_C maybe wrong and try to replace the object_C with the
real object_A. But because object_C has been modified, so the
LFSCK cannot replace it.
#####
Inject failure stub on MDT0 to simulate dangling name entry
fail_loc=0x1621
fail_loc=0
'ls' should fail because of dangling name entry
fail_val=10
fail_loc=0x1602
Trigger namespace LFSCK to find out dangling name entry


 Comments   
Comment by Joseph Gmitter (Inactive) [ 28/Sep/15 ]

Hi Fan Yong,
Can you look at this issue?
Thanks.
Joe

Comment by nasf (Inactive) [ 29/Sep/15 ]

According to the given log, I found that the unfinished test is test_23c, not test_23b. In the test_23c, we inject some stub to delay the LFSCK scanning:

est_23c() {
        echo "#####"
...

        #define OBD_FAIL_LFSCK_DELAY3           0x1602
        do_facet $SINGLEMDS $LCTL set_param fail_val=10 fail_loc=0x1602

        echo "Trigger namespace LFSCK to find out dangling name entry"
        $START_NAMESPACE -r -C ||
}

Means that when hit the stub "0x1602" (for every item scanned during the 2nd phase), the LFSCK engine will sleep 10 seconds, then go ahead. So the whole LFSCK scanning looks very slow. But it is not hung there.

Here is the MDS log:

00100000:10000000:0.0:1443429130.295422:0:7460:0:(lfsck_engine.c:1733:lfsck_assistant_engine()) lustre-MDT0000-osd: LFSCK assistant phase2 scan start, synced: rc = 0
00100000:10000000:0.0:1443429130.295430:0:7460:0:(lfsck_namespace.c:6121:lfsck_namespace_assistant_handler_p2()) lustre-MDT0000-osd: namespace LFSCK phase2 scan start
00100000:10000000:0.0:1443429130.295473:0:7460:0:(lfsck_namespace.c:5727:lfsck_namespace_scan_local_lpf()) lustre-MDT0000-osd: start to scan backend /lost+found
00000001:00020000:0.0:1443429130.295483:0:7460:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
...
00000100:00100000:1.0:1443429133.621277:0:7029:0:(service.c:2121:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0000:26a6aa1f-ae32-a150-7927-f6f66a7ef0c1+9:6277:x1513543135402544:12345-192.168.112.11@tcp:400 Request procesed in 35us (90us total) trans 0 rc 0/0
00000100:00100000:1.0:1443429133.621281:0:7029:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-192.168.112.11@tcp, seq: 1561
Debug log: 18556 lines, 18556 kept, 0 dropped, 0 bad.

As we can see, the LFSCK engine hit the "0x1602" at time "1443429130.295483", then went to sleep. As scheduled, it should be re-scheduled at the time "1443429140.295483". But before such time, the log has been dumped ("1443429133.xxxx"), means the sleep time has not finished yet. So it is not a bug.

Generated at Sat Feb 10 02:07:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.