Summary: ASTERISK-25796: res_pjsip: DOS/Crash when TCP/TLS sockets exceed pjproject PJ_IOQUEUE_MAX_HANDLES
Reporter: George Joseph (gjoseph)
Labels: Security
Date Opened: 2016-02-15 10:22:09.000-0600
Date Closed: 2016-04-14 15:21:18
Priority: Major
Regression?
Status: Closed/Complete
Components: Resources/res_pjsip
Versions: SVN 13.7.2
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) bt_full.txt
( 1) options.xml
( 2) transport_management.diff
Description: pjproject's default PJ_IOQUEUE_MAX_HANDLES is 64. If an attempt is made to open more sockets than that (actually MAX_HANDLES - 4) and pjproject was compiled without NDEBUG, pjproject will assert with "../src/pj/ioqueue_select.c:352: pj_ioqueue_register_sock2: Assertion `!pj_list_empty(&ioqueue->free_list)' failed." and Asterisk will die. If pjproject WAS compiled with NDEBUG, then an attacker can easily keep 60 sockets open and prevent Asterisk from performing any new TCP/TLS transactions. You do NOT need to be authenticated to trigger either scenario.

To reproduce the crash...

Compile pjproject without NDEBUG.
Create a TCP transport, endpoint and aor with default settings.
Using the attached options.xml, run 2 instances of sipp. You have to run 2 and start them quickly because sipp terminates when the remote end closes the listener.

$ sipp -sf options.xml <server> -s <endpoint> -t tn -m 61 -r 30 -max_socket 200 -bg
$ sipp -sf options.xml <server> -s <endpoint> -t tn -m 61 -r 30 -max_socket 200 -bg

To reproduce the DOS...
Compile pjproject with or without NDEBUG.
Create a TCP transport, endpoint and aor with default settings.

$ sipp -sf options.xml <server> -s <endpoint> -t tn -m 60 -r 30 -max_socket 200

You will not be able to initiate any new transactions.
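
The handle exhaustion described above can be pictured with a small model. This is only a sketch, not pjproject code: `model_ioqueue`, `model_register_sock` and the other names are hypothetical, the default of 64 and the 4 handles already in use are taken from the description, and returning `false` stands in for whatever an NDEBUG build does when no key is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values taken from the report; everything else here is hypothetical. */
#define MODEL_MAX_HANDLES 64   /* pjproject's default PJ_IOQUEUE_MAX_HANDLES */
#define MODEL_RESERVED     4   /* handles already in use before any client connects */
#define MODEL_USABLE      (MODEL_MAX_HANDLES - MODEL_RESERVED)

/* Toy ioqueue: all keys are created up front and handed out from a free list. */
struct model_ioqueue {
    size_t free_keys;
};

static void model_ioqueue_init(struct model_ioqueue *q)
{
    assert(q != NULL);
    q->free_keys = MODEL_USABLE;
}

/* Take one key for a new socket. false means the free list was empty. */
static bool model_register_sock(struct model_ioqueue *q)
{
    assert(q != NULL);
    if (q->free_keys == 0) {
        return false;
    }
    q->free_keys--;
    return true;
}

/* Return a key to the free list when a socket is closed. */
static void model_unregister_sock(struct model_ioqueue *q)
{
    assert(q != NULL);
    if (q->free_keys < MODEL_USABLE) {
        q->free_keys++;
    }
}

/* Open `attempts` connections and report how many were accepted. */
static size_t model_flood(struct model_ioqueue *q, size_t attempts)
{
    size_t accepted = 0;
    for (size_t i = 0; i < attempts; i++) {
        if (model_register_sock(q)) {
            accepted++;
        }
    }
    return accepted;
}
```

In this model a flood of 61 attempts is granted 60 keys, which is consistent with the 60-socket figure in the description. Nothing is freed until a connection is closed, so an attacker who simply holds the sockets open keeps the pool empty.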
Comments:
By: Asterisk Team (asteriskteam) 2016-02-15 10:22:10.852-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: George Joseph (gjoseph) 2016-02-15 13:00:18.655-0600

I've been looking for a way to mitigate this, but Asterisk never even sees the connection attempts. I also just realized something else: all you have to do is open the socket. You don't even have to send anything.


By: George Joseph (gjoseph) 2016-02-22 15:15:11.782-0600

Been doing some more playing around... PJ_IOQUEUE_MAX_HANDLES sets the size of the array of FDs that is passed to select(), so there are 2 options.
First, we can recommend setting PJ_IOQUEUE_MAX_HANDLES to 1024 (the max). Second, we can also recommend compiling pjproject with --enable-epoll so that, at least on Linux, we can get around the limit altogether. Unfortunately, the --enable-epoll option isn't actually functional in pjproject right now, so I've submitted a patch to Teluu to re-enable it. We'll see what happens.
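
For packagers, the first recommendation amounts to a compile-time override. A minimal sketch, assuming the override is placed in pjproject's `pjlib/include/pj/config_site.h` and that the select() backend is in use; the value is the one suggested above, not a tested recommendation:

```c
/* pjlib/include/pj/config_site.h (sketch) */

/* Raise the ioqueue handle limit from the default of 64.
 * 1024 is the value suggested in this comment for select()-based builds. */
#define PJ_IOQUEUE_MAX_HANDLES 1024
```

Building with `-DNDEBUG` in CFLAGS compiles out the assertion, but as the description notes, that alone only turns the crash into the DOS.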

I had an email chat with Nanang and Riza at Teluu regarding the whole NDEBUG and IOQUEUE thing and they confirmed that their focus is on embeddable libraries for client apps so changing the defaults isn't going to happen.

This brings us back to the packagers. Should they compile pjproject for its intended use (low volume client apps), or for our use? Another plug for pjproject_static. :)




By: Mark Michelson (mmichelson) 2016-02-26 13:48:48.200-0600

Thanks for the efforts you've put in on this. I just need to get a few pieces of information straight so that when it comes to a security advisory, we have the details correct.

When susceptible to the DOS, how long does it take before Asterisk can start accepting TCP requests again? Are the sockets tied up until Asterisk is restarted, or are we talking more like 30 seconds?

With the --enable-epoll patch to PJProject, that means that the limit on TCP connections is no longer determined by the ioqueue max handles, correct? If that's the case, then we're talking about not being DOS-able just due to PJProject's limits, but it still means we're potentially DOS-able by making the OS run out of file descriptors. In chan_sip, we have an option, tcpauthlimit, that helps to prevent this by automatically closing TCP sockets if the number of "unauthed" sessions reaches a certain limit. "Unauthed" in this case refers to connections on which we have not sent a 200 OK response. I wonder if something like that could be used in res_pjsip in order to automatically close connections when a certain number are currently pending.

By: George Joseph (gjoseph) 2016-02-26 15:07:05.708-0600

>>Thanks for the efforts you've put in on this. I just need to get a few pieces of information straight so that when it comes to a security advisory, we have the details correct.

>>When susceptible to the DOS, how long does it take before Asterisk can start accepting TCP requests again? Are the sockets tied up until Asterisk is restarted, or are we talking more like 30 seconds?

There is no timeout.  select() can only listen for data on PJ_IOQUEUE_MAX_HANDLES sockets.  Once that array is full, select can't listen on any more until some are closed.  Since all an attacker has to do is open a connection, that array will stay full until the attacker closes a connection or Asterisk restarts.

>>With the --enable-epoll patch to PJProject, that means that the limit on TCP connections is no longer determined by the ioqueue max handles, correct?

Not quite. PJ_IOQUEUE_MAX_HANDLES is still used for some things (I meant to track down exactly where but I forgot), but you can safely set it to something like 5000. It'd be nice to be able to auto-set it based on ulimit or something.

>>If that's the case, then we're talking about not being DOS-able just due to PJProject's limits, but it still means we're potentially DOS-able by making the OS run out of file descriptors. In chan_sip, we have an option, tcpauthlimit, that helps to prevent this by automatically closing TCP sockets if the number of "unauthed" sessions reaches a certain limit. "Unauthed" in this case refers to connections on which we have not sent a 200 OK response. I wonder if something like that could be used in res_pjsip in order to automatically close connections when a certain number are currently pending.

I think tcpauthlimit is a good idea but in this scenario, Asterisk never even knows about the connection.  To be fair, every tcp service on the planet is vulnerable to OS resource exhaustion. :)  


By: Mark Michelson (mmichelson) 2016-02-26 15:44:33.699-0600

{quote}
Not quite, PJ_IOQUEUE_MAX_HANDLES is still used for some things (I meant to track down exactly where but I forgot) but you can safely set it at something like 5000. It'd be nice to be able to auto-set it based on ulimit or something.
{quote}

When you say "some things" does that still imply it's used for TCP file descriptors in some circumstances? If the IOQUEUE_MAX_HANDLES are used for other things, it may not be so bad.

{quote}
I think tcpauthlimit is a good idea but in this scenario, Asterisk never even knows about the connection. To be fair, every tcp service on the planet is vulnerable to OS resource exhaustion.
{quote}

Usually we'd like to be able to provide some sort of information that says that something fishy is going on. In this case, if Asterisk doesn't see any SIP messages, then it's not possible for us to send security events either. I'm not sure how administrators can be notified that someone is attempting an attack until it's too late.

By: George Joseph (gjoseph) 2016-02-26 16:40:30.984-0600

Now I remember...   Not for file descriptors.  PJSIP_MAX_TRANSPORTS is set to PJ_IOQUEUE_MAX_HANDLES and that controls the number of pj_ioqueue_keys that are precreated.  I think the keys are about 64 bytes each.  That's why you need to still set PJ_IOQUEUE_MAX_HANDLES to the largest number of sockets you expect to be open.  It's just memory though.
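
The memory cost is easy to estimate. A rough sketch, assuming the 64-byte key size guessed above (the real size of a key depends on the platform and build options); `key_pool_bytes` is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approximate per-key size quoted in the comment above; an assumption. */
#define APPROX_KEY_BYTES ((size_t)64)

/* Memory needed to precreate `max_handles` ioqueue keys. */
static size_t key_pool_bytes(size_t max_handles)
{
    /* Guard against overflow in the multiplication. */
    assert(max_handles <= SIZE_MAX / APPROX_KEY_BYTES);
    return max_handles * APPROX_KEY_BYTES;
}
```

With these assumptions, `key_pool_bytes(5000)` is 320000 bytes, so a limit of 5000 costs roughly 320 kB compared with 4 kB for the default of 64.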

I don't think there's a good way for Asterisk to notify anyone in this situation.


By: Mark Michelson (mmichelson) 2016-03-02 15:01:21.574-0600

So, I've had talks with Matt Jordan and Josh Colp about all of this, and here's what we've come up with as far as how to address this problem:

1) Get packagers to use appropriate #defines for the PJ_IOQUEUE_MAX_HANDLES. Encourage them to use --enable-epoll when configuring Linux packages. You've actually already taken care of this for us :)
2) Get the static PJProject patch pushed through review
3) This is the new part: I've written a module that may mitigate some of the TCP connection issues, and I'll be attaching it here for you to have a look at. I'll explain it a bit more when I attach it.

By: Mark Michelson (mmichelson) 2016-03-02 15:07:25.935-0600

I'm uploading res_pjsip_keepadead.c

Yes, it's a stupid name, but I based it off the code in res_pjsip_keepalive.c and have just given it a placeholder name for the time being.

This keeps track of all incoming TCP/TLS connections that Asterisk accepts. It stores the transport in a container and schedules a task to run in SIP Timer D seconds. If that time expires and we never received an incoming SIP request on that connection, then we'll assume someone is doing something malicious and close the connection.
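
The behaviour described here can be sketched as a small state check. This is a hedged model, not the attached module: the struct and function names are hypothetical, time is a plain seconds counter, and the 32-second value for Timer D is an assumption based on the usual default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed timeout in seconds; the module is described as using SIP Timer D. */
#define MODEL_TIMER_D_SECS 32

struct monitored_conn {
    long accepted_at;   /* when the TCP/TLS connection was accepted */
    bool sip_received;  /* set once any SIP request arrives on it */
    bool open;
};

/* Start tracking a freshly accepted connection. */
static void conn_on_accept(struct monitored_conn *c, long now)
{
    assert(c != NULL);
    c->accepted_at = now;
    c->sip_received = false;
    c->open = true;
}

/* Record that a SIP request was received on the connection. */
static void conn_on_request(struct monitored_conn *c)
{
    assert(c != NULL);
    c->sip_received = true;
}

/* Body of the scheduled task. Closes the connection and returns true if the
 * timeout has elapsed without a SIP request; otherwise leaves it alone. */
static bool conn_check(struct monitored_conn *c, long now)
{
    assert(c != NULL);
    if (c->open && !c->sip_received &&
        now - c->accepted_at >= MODEL_TIMER_D_SECS) {
        c->open = false;
        return true;
    }
    return false;
}
```

A connection that stays silent is closed at the first check on or after the deadline, while one that has sent any request is never closed by this task.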

I've tested this to make sure that it does what is expected and that it does not cause problems by closing the connection when we would expect it to be left open. So far so good.

This module will protect against an attacker opening TCP connections and either sending junk or sending nothing. However, it does not protect against the situation you presented, of an attacker sending OPTIONS to us and then leaving the connection open. And unfortunately, I don't think that can be fixed without potentially causing problems instead.

The module is in an incomplete state. There are two things I wanted to add to it:
1) The keepadead->sip_received flag needs mutual exclusion. Right now, it's possible for there to be a race between us receiving a request and us attempting to shut down the transport because we think there has been no request received.
2) When we shut down the transport, it may be prudent to send a security event. I say "may" because I'm not sure if that doesn't just open us up to an amplification attack of some sort.

I figured I'd post the patch as it is because it "works for me" and I figured you might want to have a look at it and tell me what you think.

By: George Joseph (gjoseph) 2016-03-02 17:37:40.272-0600

Works great!
I'd say 'no' to the security event but I wouldn't complain if it were sent.

Since keepalive already has most of the guts, how about combining them?  res_pjsip_transport_monitor or something?

For OPTIONS, I wonder if there's a way we can get notified of a socket being idle for too long.





By: Mark Michelson (mmichelson) 2016-03-03 10:12:13.396-0600

I had thought about the idle socket idea, but there are a few problems there:

1) Defining "too long" is difficult. It seems possible for us to close sockets early when it legitimately should be open longer. I suppose that this could be configurable, but...
2) An attacker could still thwart this by occupying the socket and sending OPTIONS packets at regular intervals.

I'm totally cool with adding the auto-closing behavior to res_pjsip_keepalive, as well as renaming the file. I'll go ahead and do that along with the mutual exclusion for another version of the patch.

By: George Joseph (gjoseph) 2016-03-03 10:41:53.790-0600

>>  2) An attacker could still thwart this by occupying the socket and sending OPTIONS packets at regular intervals.

Yeah I was thinking about that last night.  A periodic OPTIONS would look totally normal.  Unless...we kept track of the source ip addresses and only allowed a configurable number of sockets from the same address.  max_connections_per_client on transport maybe?  Or maybe this is just a firewall thing.
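
A per-source limit along those lines could look roughly like the sketch below. Everything in it is hypothetical: there is no such option in res_pjsip as far as this thread shows, the table size is arbitrary, and only IPv4 addresses are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACKED_CLIENTS 256   /* hypothetical table size */

struct client_slot {
    uint32_t addr;    /* IPv4 source address */
    unsigned count;   /* open connections from addr; 0 marks a free slot */
};

struct client_table {
    struct client_slot slots[TRACKED_CLIENTS];
    unsigned limit;   /* hypothetical max_connections_per_client */
};

static void table_init(struct client_table *t, unsigned limit)
{
    assert(t != NULL);
    memset(t->slots, 0, sizeof(t->slots));
    t->limit = limit;
}

/* Decide whether a new connection from addr may be accepted. */
static bool table_accept(struct client_table *t, uint32_t addr)
{
    struct client_slot *free_slot = NULL;

    assert(t != NULL);
    for (size_t i = 0; i < TRACKED_CLIENTS; i++) {
        struct client_slot *s = &t->slots[i];
        if (s->count > 0 && s->addr == addr) {
            if (s->count >= t->limit) {
                return false;          /* this source is at its limit */
            }
            s->count++;
            return true;
        }
        if (s->count == 0 && free_slot == NULL) {
            free_slot = s;             /* remember the first free slot */
        }
    }
    if (free_slot == NULL || t->limit == 0) {
        return false;                  /* table full or limit of zero: refuse */
    }
    free_slot->addr = addr;
    free_slot->count = 1;
    return true;
}

/* Account for a closed connection from addr. */
static void table_release(struct client_table *t, uint32_t addr)
{
    assert(t != NULL);
    for (size_t i = 0; i < TRACKED_CLIENTS; i++) {
        if (t->slots[i].count > 0 && t->slots[i].addr == addr) {
            t->slots[i].count--;
            return;
        }
    }
}
```

A limit like this only helps against a single source address. An attacker spread across many addresses would still get through, which is one reason this may be better left to a firewall.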

>> I'm totally cool with adding the auto-closing behavior to res_pjsip_keepalive, as well as renaming the file. I'll go ahead and do that along with the mutual exclusion for another version of the patch.

Cool!


By: Mark Michelson (mmichelson) 2016-03-03 15:56:01.242-0600

I'm attaching transport_management.diff. Since this is not just a new module, but a patch to the existing res_pjsip_keepalive, this is an actual patch this time.

I've made the "sip_received" flag atomically set and fetched now so that hopefully there is no conflict between receiving a request and shutting down the transport. I've opted not to produce a security event when we shut the transport down.
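
One way to close that kind of race is to make "a request arrived" and "the timeout fired" compete for a single atomic state, so that exactly one of them wins. The sketch below uses C11 `<stdatomic.h>` compare-and-swap, which is not necessarily the primitive the patch uses; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { CONN_PENDING = 0, CONN_ACTIVE = 1, CONN_CLOSING = 2 };

struct monitored_transport {
    atomic_int state;
};

static void transport_init(struct monitored_transport *t)
{
    assert(t != NULL);
    atomic_init(&t->state, CONN_PENDING);
}

/* Called when a SIP request arrives. Returns false if the timeout task
 * has already claimed the transport for shutdown. */
static bool transport_mark_active(struct monitored_transport *t)
{
    int expected = CONN_PENDING;

    assert(t != NULL);
    if (atomic_compare_exchange_strong(&t->state, &expected, CONN_ACTIVE)) {
        return true;
    }
    /* On failure, expected holds the current state. */
    return expected == CONN_ACTIVE;
}

/* Called by the timeout task. Returns true only if no request got in first,
 * in which case the caller may shut the transport down. */
static bool transport_claim_for_close(struct monitored_transport *t)
{
    int expected = CONN_PENDING;

    assert(t != NULL);
    return atomic_compare_exchange_strong(&t->state, &expected, CONN_CLOSING);
}
```

Because both paths start from `CONN_PENDING`, the transport is either marked active or claimed for shutdown, never both.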

By: George Joseph (gjoseph) 2016-03-04 09:44:14.383-0600

Might want to change
AST_MODULE_INFO(ASTERISK_GPL_KEY, AST_MODFLAG_LOAD_ORDER, "PJSIP Stateful Connection Keepalive Support"
to
AST_MODULE_INFO(ASTERISK_GPL_KEY, AST_MODFLAG_LOAD_ORDER, "PJSIP Stateful Connection Monitoring"

Looks good otherwise.


By: Jørgen H (jorgen) 2016-11-18 04:32:34.269-0600

Hello,

This isn't fixed in pjsip 2.5.5 or in trunk.
With more than ~60 TCP connections, the server crashes on the assertion at line 363 of pjlib/src/pj/ioqueue_select.c.
Is this perhaps the wrong place to report pjsip issues?
{noformat}
#0  0x00007f4e8a59d578 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4e8a59e9fa in __GI_abort () at abort.c:89
#2  0x00007f4e8a596427 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7f4e1d9cd938 "!pj_list_empty(&ioqueue->free_list)",
   file=file@entry=0x7f4e1d9cd660 "../src/pj/ioqueue_select.c", line=line@entry=363,
   function=function@entry=0x7f4e1d9cd9f0 <__PRETTY_FUNCTION__.6416> "pj_ioqueue_register_sock2") at assert.c:92
#3  0x00007f4e8a5964d2 in __GI___assert_fail (assertion=0x7f4e1d9cd938 "!pj_list_empty(&ioqueue->free_list)", file=0x7f4e1d9cd660 "../src/pj/ioqueue_select.c", line=363,
   function=0x7f4e1d9cd9f0 <__PRETTY_FUNCTION__.6416> "pj_ioqueue_register_sock2") at assert.c:101
#4  0x00007f4e1d9bb526 in pj_ioqueue_register_sock2 () from /usr/app/asterisk/14.1.1/lib/libpj.so.2
#5  0x00007f4e1d9bf6bc in pj_activesock_create () from /usr/app/asterisk/14.1.1/lib/libpj.so.2
#6  0x00007f4e1f9a8a7c in tcp_create.constprop () from /usr/app/asterisk/14.1.1/lib/libpjsip.so.2
#7  0x00007f4e1f9a8fb7 in on_accept_complete () from /usr/app/asterisk/14.1.1/lib/libpjsip.so.2
#8  0x00007f4e1d9bf06f in ioqueue_on_accept_complete () from /usr/app/asterisk/14.1.1/lib/libpj.so.2
#9  0x00007f4e1d9ba123 in ioqueue_dispatch_read_event () from /usr/app/asterisk/14.1.1/lib/libpj.so.2
#10 0x00007f4e1d9bba3f in pj_ioqueue_poll () from /usr/app/asterisk/14.1.1/lib/libpj.so.2
#11 0x00007f4e1f99d9cb in pjsip_endpt_handle_events2 () from /usr/app/asterisk/14.1.1/lib/libpjsip.so.2
#12 0x00007f4e1c59fe68 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4017
#13 0x00007f4e1d9bcc0a in thread_main () from /usr/app/asterisk/14.1.1/lib/libpj.so.2
#14 0x00007f4e8aeac434 in start_thread (arg=0x7f4e1691b700) at pthread_create.c:334
#15 0x00007f4e8a6530ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
{noformat}

By: Joshua C. Colp (jcolp) 2016-11-18 05:13:47.190-0600

[~jorgen] Please open a new issue and clarify how you have built PJSIP. If you are not using the bundled version, it is entirely possible that your PJSIP build does not follow the recommendations; a build that does follow them would not have this problem.