Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.8
- Environment: TOSS 3.7-19 based, RHEL kernel 3.10.0-1160.59.1, lustre 2.12.8_6.llnl
- Severity: 3
Description
We are having significant LNet issues that have forced us to disable Lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.
When the routers for catalyst are on, we see many errors and have connectivity problems on multiple clusters.
This ticket may be useful for explaining our LNet setup: https://jira.whamcloud.com/browse/LU-15234
UPDATE: The initial issues have been resolved; our clusters and file systems are working, and we no longer have to turn off clusters and/or routers. This ticket is now focused on the stack trace containing LNetMDUnlink() as a possible root cause. The OS update and underlying network issues we had seem to have been confounders.
Related to https://jira.whamcloud.com/browse/LU-11895
The stack traces from the logs look very similar to those from https://jira.whamcloud.com/browse/LU-11100 and several duplicate linked bugs.
There was a comment in LU-11100 asking for additional information about who was holding the lock. According to the NMI watchdog, the holder appears to be executing the list_for_each_entry loop in LNetMDUnlink() -> lnet_handle2md() -> lnet_res_lh_lookup(). We observed this in the stack traces from multiple clients which hit the issue.
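For reference, below is a minimal, self-contained sketch of the lookup pattern involved, assuming the 2.12-style cookie-hash design: the cookie selects a hash bucket, and that bucket's chain is walked while the resource lock is held. The names, constants, and the pthread lock are illustrative stand-ins, not the actual LNet definitions; the sketch is only meant to show where a very long (or looping) chain would keep the lock holder busy in the loop.

{code:c}
/*
 * Illustrative model of the lookup pattern in lnet_res_lh_lookup():
 * hash the cookie to a bucket, then walk the bucket's chain under the
 * resource lock until the cookie matches. All names and constants here
 * are simplified stand-ins, not the real Lustre/LNet definitions.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LH_HASH_SIZE 64			/* stand-in for the real hash size */
#define LH_HASH_MASK (LH_HASH_SIZE - 1)

struct libhandle {
	uint64_t lh_cookie;		/* unique handle cookie */
	struct libhandle *lh_next;	/* chain within one hash bucket */
};

static struct libhandle *lh_hash[LH_HASH_SIZE];
static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for lnet_res_lock */

/* Walk one bucket's chain looking for the cookie; caller holds res_lock.
 * If the chain is very long (or corrupted into a cycle), this loop is
 * where the lock holder spends its time, which matches the NMI-watchdog
 * backtraces pointing at list_for_each_entry. */
static struct libhandle *lh_lookup_locked(uint64_t cookie)
{
	struct libhandle *lh;

	for (lh = lh_hash[cookie & LH_HASH_MASK]; lh != NULL; lh = lh->lh_next)
		if (lh->lh_cookie == cookie)
			return lh;

	return NULL;
}

static void lh_insert(struct libhandle *lh)
{
	struct libhandle **head = &lh_hash[lh->lh_cookie & LH_HASH_MASK];

	lh->lh_next = *head;
	*head = lh;
}

int main(void)
{
	/* Populate some handles, then look one up by cookie, roughly as
	 * LNetMDUnlink() does via lnet_handle2md() on the MD's cookie. */
	for (uint64_t c = 1; c <= 1000; c++) {
		struct libhandle *lh = calloc(1, sizeof(*lh));

		lh->lh_cookie = c;
		lh_insert(lh);
	}

	pthread_mutex_lock(&res_lock);
	struct libhandle *found = lh_lookup_locked(777);
	pthread_mutex_unlock(&res_lock);

	printf("cookie 777 %s\n", found ? "found" : "not found");
	return 0;
}
{code}

The point of the sketch is that the time spent under the lock grows with the chain length, which would be consistent with the watchdog firing while other CPUs spin waiting on lnet_res_lock.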