Pear Skidding

The last few months have been a bit of a roller-coaster ride for mysipswitch. The ride started on Mar 25th when Ruby dial plans were introduced. Initially the new dial plan format did not gain a lot of traction but then a combination of people wanting to do more complicated things and the default dial plan for new users being set to Ruby saw their use increase considerably.

Around the start of June the roller-coaster, which had been on a nice gentle climb, encountered a little dip, just enough of a dip to cause some worry about what was around the next corner. This corresponded to mysipswitch having 3 outages over the space of three or four weeks. Up until this point outages had been virtually non-existent.

The outages were tracked down to abnormally high memory usage on the mysipswitch process followed by it becoming unresponsive to registrations and calls. Restarting the process fixed the problem but then the memory usage would start creeping up and reach a dangerous level somewhere between 2 and 7 days later.

Now memory leaks are not uncommon in software and, after buffer overflows, are probably the second most common software defect. However, mysipswitch is written in managed C# and runs within a runtime that takes care of memory management, which makes it difficult, but not impossible, for memory leaks to manifest. Initially we thought that we must have introduced a leak within the SIP stack in one of the lists that track registrations or calls. However, after a couple of weeks of poring over the code and purchasing some profiling tools, nothing was discovered and the memory leaks were still occurring.

Our attention then turned to the IronRuby assemblies. Straight away we found that if a Ruby dial plan generated an exception then memory utilisation would shoot up by as much as 2MB. Some re-jigging of the way the scripts were loaded and executed improved things a lot but there was still a slow leak which meant a restart of the mysipswitch process was needed about twice a week.

Restarting twice a week was not that problematic but it was not ideal either. As such it was decided to get started on another initiative, which was to split out the mysipswitch agents into individual processes. That would have the advantage of verifying that the leak was definitely in the process hosting the Ruby scripts, and it would also mean that only that process would need to be restarted while the other agents kept running. The mysipswitch agents are:

  • SIP Proxy/Application Server, this is the process that is the first point of call for all SIP traffic into and out of mysipswitch and handles dial plan processing and Ruby scripts,
  • SIP Registrar, registers user agents and records their contact details for incoming calls,
  • SIP Registration Agent, registers 3rd party SIP accounts with external SIP Providers,
  • Monitor, the telnet shell that allows some rudimentary monitoring of the other mysipswitch agents to aid in troubleshooting and diagnostics.

The splitting out of the agents was always going to involve a bit of pain because it involved significant changes to both the code and the mechanisms used. As one example the SIP Proxy could no longer ask the SIP Registrar for a list of bindings for a particular account and instead would have to retrieve them from the database. The splitting out of the agents took two weeks or so and actually probably caused less pain than anticipated.
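To make the bindings example concrete, here is a minimal sketch of the idea: one process (the Registrar) writes bindings to a shared store and another (the Proxy) reads them back, rather than asking the Registrar directly. It is written in Python with an in-memory SQLite database purely for illustration. mysipswitch itself is C#, and the table name, column names and function names below are all assumptions made up for the example, not the real schema or code.

```python
import sqlite3
import time


def create_store(path=":memory:"):
    """Open the shared store. The 'bindings' table layout is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS bindings (
               account     TEXT NOT NULL,
               contact_uri TEXT NOT NULL,
               expires_at  REAL NOT NULL,
               PRIMARY KEY (account, contact_uri))"""
    )
    return db


def register_binding(db, account, contact_uri, expiry_seconds, now=None):
    """Registrar side: record (or refresh) a contact for an account."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO bindings VALUES (?, ?, ?)",
        (account, contact_uri, now + expiry_seconds),
    )
    db.commit()


def lookup_bindings(db, account, now=None):
    """Proxy side: fetch the unexpired contacts for an account."""
    now = time.time() if now is None else now
    rows = db.execute(
        "SELECT contact_uri FROM bindings "
        "WHERE account = ? AND expires_at > ? ORDER BY contact_uri",
        (account, now),
    ).fetchall()
    return [row[0] for row in rows]
```

The trade-off is the one described above: the Proxy no longer depends on the Registrar being in the same process, at the cost of a database round trip for each lookup.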

After a week or so of running with the individual agents, the memory leak was verified to be confined to the SIP Proxy process. It was deemed manageable enough there to keep things going while the IronRuby code base continues to evolve (it’s still only on alpha 3 so gremlins have to be expected) and eventually eliminates whatever is causing the memory leak.

So up to this point the roller-coaster had been fairly sedate, or at least not bad enough for anyone to lose their lunch. However, trouble was just around the corner. Since the agents were all operating smoothly and not having any integration problems it was decided to get back to the feature requests. The most in-demand feature was the Callback application. Now the Callback application is a lot trickier than it seems and requires some serious SIP Kung Fu! In theory what the application does should be straightforward enough using transfers (SIP REFER requests) but in the real world most providers don’t support REFER and some clever hacks using things such as raw sockets for IP impersonation are required.

A new version of the Callback application was introduced around last weekend (2nd of August) and in hindsight that’s when things went Pear Shaped and the roller-coaster started the stomach-churning descent (excuse the metaphors, they make it easier to keep writing dubious quality material). The problem was that after the update was introduced strange things started happening with the Registration Agent and the SIP Registrar. Initially there was no obvious link between the happenings and the SIP Proxy but on Friday, after 3 or 4 days of pulling our hair out, the incident was finally observed and the strange happenings occurred at the same time as the Proxy utilisation spiked.

After identifying there was a link between the agents it took three more solid days of investigation to identify and then fix the issue (or at this point it may be better to say “hopefully” fixed the issue).

So after all that we are now hopefully at a point where the previous stability of mysipswitch is just around the corner. Despite the last 3 months of intensive effort not a lot has happened in the way of new features, which is frustrating on one level, but since most of the effort (the last week excepted) has been a consequence of introducing Ruby dial plans it’s undoubtedly been well worth it.

Another benefit of the recent work is that the mysipswitch architecture is a lot more flexible, which means it will be possible to change the modus operandi slightly. Up until now new versions of the software have been rapidly released to the live mysipswitch service in line with its purpose as an experimental tool. There is no desire to take a step back from that rapid release approach but since most of the new development work is confined to the dial plan and dial plan applications it will now be possible to isolate the areas subject to rapid development and even give users the choice of using the “cutting edge” version or the “stable” version.

The diagram below illustrates where we are aiming to get to with the mysipswitch architecture. Previously all the mysipswitch agents were encapsulated in a single process. Now they are separate and two new ones are being introduced. The first is the Stateless SIP Proxy, which will take over the Stateful Proxy’s role of being the first point of contact for all external SIP traffic and will act like a traffic cop for all the other agents. The second is an “Edge” application server. This server will fulfil the same role as the current Stateful Proxy, or SIP Application Server as it could now be more correctly termed, but will be used for cutting edge releases of new code. The new server will likely be given a name like edge.mysipswitch.com and users will be able to connect directly to it for their calls or alternatively will be able to stay connected to the stable server and choose to forward only specific calls to the edge server from within their dial plan.
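As a rough illustration of the “traffic cop” role, the sketch below picks a destination agent for an incoming SIP request by looking only at its request line, which is all a stateless proxy needs to do for this kind of decision. It is Python rather than the C# mysipswitch is written in, and the routing rules are assumptions for the example: REGISTER requests go to the Registrar, requests addressed to the edge host name go to the Edge server, and everything else goes to the stable application server. The internal addresses and function names are hypothetical, and responses (which would be routed back using their Via headers) are not handled.

```python
# Hypothetical internal addresses for the separate agent processes.
AGENTS = {
    "registrar": ("127.0.0.1", 5061),
    "stable_app_server": ("127.0.0.1", 5062),
    "edge_app_server": ("127.0.0.1", 5063),
}

EDGE_HOST = "edge.mysipswitch.com"  # name suggested in the post, not final


def parse_request_line(message):
    """Return (method, request_uri) from the first line of a SIP request."""
    first_line = message.split("\r\n", 1)[0]
    parts = first_line.split(" ")
    if len(parts) != 3 or not parts[2].startswith("SIP/"):
        raise ValueError("not a SIP request line: %r" % first_line)
    return parts[0].upper(), parts[1]


def host_of(uri):
    """Extract the host from a sip: or sips: URI (IPv6 literals not handled)."""
    rest = uri.split(":", 1)[1] if ":" in uri else uri
    rest = rest.split(";", 1)[0]          # drop URI parameters
    hostport = rest.rsplit("@", 1)[-1]    # drop the user part, if any
    return hostport.split(":", 1)[0].lower()


def choose_agent(message):
    """Pick which agent should receive this request (assumed rules)."""
    method, uri = parse_request_line(message)
    if method == "REGISTER":
        return "registrar"
    if host_of(uri) == EDGE_HOST:
        return "edge_app_server"
    return "stable_app_server"
```

For example, `choose_agent("INVITE sip:100@edge.mysipswitch.com SIP/2.0\r\n\r\n")` selects the edge server, and `AGENTS[...]` then gives the address to forward the untouched message to.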

Hopefully that gives people some idea of where our efforts have been going of late and provides a bit of a roadmap of where we are heading. It hopefully also explains why I have been grumpier than normal! Even though mysipswitch is a free service we do still take pride in keeping it running smoothly and it gets surprisingly stressful when that’s not the case.

Regards,

Aaron
