Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

app_rpt: "fast hangup" crashing with latest dev. #459

Open
mkmer opened this issue Jan 20, 2025 · 7 comments · May be fixed by #460
Open

app_rpt: "fast hangup" crashing with latest dev. #459

mkmer opened this issue Jan 20, 2025 · 7 comments · May be fixed by #460
Labels
bug Something isn't working

Comments

@mkmer
Copy link
Collaborator

mkmer commented Jan 20, 2025

There is quite a bit of additional code beyond what I had tested earlier in PR #451

Now when executing the "fast disconnect" test I receive this crash:

[2025-01-19 19:24:50.540] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
 == Spawn extension (radio-secure, 287893, 1) exited non-zero on 'IAX2/149.154.11.243:24357-1639'
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: channel.c:2630 ast_hangup: Hard hangup called by thread LWP 420715 on IAX2/149.154.11.243:24357-1639, while blocked by thread LWP 420617 in procedure ast_waitfor_nandfds!  Expect a failure

core-asterisk-2025-01-20T00-24-50Z-full.txt
core-asterisk-2025-01-20T00-24-50Z-locks.txt
core-asterisk-2025-01-20T00-24-50Z-brief.txt
core-asterisk-2025-01-20T00-24-50Z-info.txt

@mkmer
Copy link
Collaborator Author

mkmer commented Jan 20, 2025

This is the offending code:

app_rpt/apps/app_rpt.c

Lines 6848 to 6852 in e4dee90

if (!l->chan || ast_check_hangup(chan)) {
/* This connection is already toast, just return -1 as normal and let the core kill the channel off */
ast_debug(3, "Channel %s is a dead link\n", ast_channel_name(chan));
return -1;
}

Commenting it out eliminates the crash but we end up with a Max retries error indefinitely:

[2025-01-20 08:07:23.356] WARNING[541574]: chan_iax2.c:3619 __attempt_transmit: Max retries exceeded to host 149.154.11.243 on IAX2/149.154.11.243:24305-8161 (type = 6, subclass = 2, ts=251998, seqno=37)

This is only occurring when the "fast" disconnect situation is triggered.

@mkmer
Copy link
Collaborator Author

mkmer commented Jan 20, 2025

Not really sure this is the right way, but return 0; plus elminating ast_check_hangup(chan) eliminates the crash AND eliminates the retry error...

This is my current "test" code:

		if (!l->chan) {
			/* This connection is already toast, just return -1 as normal and let the core kill the channel off */
			ast_debug(3, "Channel %s is a dead link\n", ast_channel_name(chan));
			return 0;
		}

@InterLinked1
Copy link
Member

There is quite a bit of additional code beyond what I had tested earlier in PR #451

Now when executing the "fast disconnect" test I receive this crash:

[2025-01-19 19:24:50.540] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
 == Spawn extension (radio-secure, 287893, 1) exited non-zero on 'IAX2/149.154.11.243:24357-1639'
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: channel.c:2630 ast_hangup: Hard hangup called by thread LWP 420715 on IAX2/149.154.11.243:24357-1639, while blocked by thread LWP 420617 in procedure ast_waitfor_nandfds!  Expect a failure

core-asterisk-2025-01-20T00-24-50Z-full.txt core-asterisk-2025-01-20T00-24-50Z-locks.txt core-asterisk-2025-01-20T00-24-50Z-brief.txt core-asterisk-2025-01-20T00-24-50Z-info.txt

Are you sure this is the right coredump? This appears to be caused by something different:

Thread 1 (Thread 0x7fba0a6396c0 (LWP 420659)):
#0  0x00007fba8823331a in ast_sockaddr_isnull (addr=<optimized out>) at /usr/src/asterisk-20.11.1/include/asterisk/netsock2.h:129
No locals.
#1  function_iaxpeer (chan=0x7fba680015d0, cmd=<optimized out>, data=0x7fba0a6295c8 "CURRENTCHANNEL", buf=0x7fba0a62b700 "", len=4096) at chan_iax2.c:14566

Perhaps this is incidentally caused by the hard hangup issue.

https://github.com/asterisk/asterisk/blob/ff0fb401ae8372312070e4f911d5a46ffc42d137/include/asterisk/netsock2.h#L129

The backtrace says SIGSEGV, so somehow there is an issue here. My guess is some kind of invalid memory access since the crashing function already does a NULL check.

I have a suspicion about how the problem might be resolvable. Are you able to easily or at least periodically reproduce this, to test a patch?

Ultimately, this seems to be caused by rpt_manager_do_xstat in stack frame 6 (app_rpt/rpt_manager.c:227).

@InterLinked1
Copy link
Member

Okay, I see the log messages in a thread, so it is right.

Unfortunately, LWP 420715 isn't present in the core dump, so whatever thread triggered the issue had exited by the time the backtrace was generated.

To force a crash while the thread exists, can you do the following:

Go to channel.c line 2630.
Add an ast_assert(0) after that.

Assuming you are compiled with dev mode, that should force a crash there. If you could upload the backtrace from that, that would be great. I'd like to confirm if the offending thread is an app_rpt thread or something else.

@mkmer
Copy link
Collaborator Author

mkmer commented Jan 20, 2025

@InterLinked1 - your looking to add an assert in asterisk base code channel.c (right?) There is already and assert(0) in that location.

There are 2 scenarios for crashes (I made sure to grab each one after the specific crash - so they should be right this time):

PR without modifications:

[2025-01-20 11:24:06.322] WARNING[692197][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24501-5692
  == Spawn extension (radio-secure, 287893, 1) exited non-zero on 'IAX2/149.154.11.243:24501-5692'
[2025-01-20 11:24:06.322] WARNING[692197][C-00000024]: channel.c:2630 ast_hangup: Hard hangup called by thread LWP 692197 on IAX2/149.154.11.243:24501-5692, while blocked by thread LWP 692009 in procedure ast_waitfor_nandfds!  Expect a failure
    -- Hungup 'IAX2/149.154.11.243:24501-5692'

core-asterisk-2025-01-20T16-24-06Z-full.txt
core-asterisk-2025-01-20T16-24-06Z-info.txt
core-asterisk-2025-01-20T16-24-06Z-locks.txt
core-asterisk-2025-01-20T16-24-06Z-thread1.txt
core-asterisk-2025-01-20T16-24-06Z-brief.txt

Scenario #2: changing to return 0; but leaving ast_check_hangup(chan) during the fast disconnect gives a bunch of these and a core dump:

[2025-01-20 11:17:41.460] ERROR[678530]: channel.c:3055 ast_waitfor_nandfds: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f4694006cd0 (0)
[2025-01-20 11:17:41.460] ERROR[678530]:   Got 7 backtrace records
# 0: /usr/sbin/asterisk(__ao2_unlock+0x13a) [0x55e4b279d96a]
# 1: /usr/sbin/asterisk(ast_waitfor_nandfds+0x167) [0x55e4b27de317]
# 2: /usr/sbin/asterisk(ast_waitfor_n+0x14) [0x55e4b27deb74]
# 3: /usr/lib/asterisk/modules/app_rpt.so(+0x1fa4e) [0x7f46cb21fa4e]
# 4: /usr/sbin/asterisk(+0x1bf26c) [0x55e4b28f026c]
# 5: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7f46db18f1c4]
# 6: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7f46db20f85c]

core-asterisk-2025-01-20T16-17-41Z-brief.txt
core-asterisk-2025-01-20T16-17-41Z-full.txt
core-asterisk-2025-01-20T16-17-41Z-info.txt
core-asterisk-2025-01-20T16-17-41Z-locks.txt
core-asterisk-2025-01-20T16-17-41Z-thread1.txt

IF I change return -1; to return 0; AND remove the ast_check_hangup(chan) there are no crashes (or errors/warnings that I see)

Also, I'm thinking we have a problem with how link pointers are freed up -> it looks like we remove the link from the repeater linked list, and free the memory based on l->chan == NULL here:

app_rpt/apps/app_rpt.c

Lines 3270 to 3291 in e4dee90

if ((!l->chan) && (!l->disctime) && (!l->outbound)) {
ast_debug(1, "LINKDISC AA\n");
/* remove from queue */
rpt_link_remove(myrpt, l);
if (myrpt->links.next == &myrpt->links)
channel_revert(myrpt);
if (!strcmp(myrpt->cmdnode, l->name))
myrpt->cmdnode[0] = 0;
rpt_mutex_unlock(&myrpt->lock);
if (l->name[0] != '0') {
rpt_telemetry(myrpt, REMDISC, l);
}
rpt_update_links(myrpt);
if (myrpt->p.archivedir) {
donodelog_fmt(myrpt, "LINKDISC,%s", l->name);
}
dodispgm(myrpt, l->name);
/* hang-up on call to device */
ast_hangup(l->pchan);
ast_free(l);
rpt_mutex_lock(&myrpt->lock);
break;

We probably also need to be even more careful looking at link pointers than may have been freed already (basically everywhere?)

@InterLinked1
Copy link
Member

Okay, I see. Could you run core show threads immediately prior to crashing and provide its output? Perhaps it's in the thread list there. If not, you could grep the logs on that LWP for a match, there should be something there at least.

@InterLinked1
Copy link
Member

Never mind, I see it's the same thread that's logging it in fact.
I don't actually think this will work in that case, but curious what happens if you try with this patch: #460

@mkmer mkmer added the bug Something isn't working label Jan 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants