September 11 - 2023 - SmartLink Outage
We want to inform you that we are currently addressing a SmartLink issue that could potentially affect your ability to access your remote radios. Please be assured that our dedicated team at FlexRadio is actively working to resolve this matter. We will keep you updated on our progress as we work towards a solution.
Please keep an eye on this posting for any updated status.
Thank you for your patience and understanding.
Comments
-
We understand the inconvenience caused by the ongoing service outage, now entering its second day. We sincerely apologize for the disruption to your services, and we want to provide you with an update on our efforts to resolve this situation.
Our dedicated team has been working around the clock to identify and rectify the underlying issue. While we had hoped to have services restored by now, the situation has proven to be more complex than initially anticipated. Rest assured that we are making every possible effort to expedite the resolution process.
Changes we made last night to the SmartLink authentication servers did have a brief positive effect, and some of you may have been able to access your radio. However, the core problem has returned.
We understand that your reliance on our services is critical, and we want to assure you that we are fully committed to getting your systems back up and running as soon as possible. We are continuously monitoring the situation and working diligently to minimize the impact on your operations.
We will provide regular updates to keep you informed of our progress. In the meantime, if you have any immediate concerns or require assistance, please do not hesitate to contact our support team at [support email/phone].
Once again, we apologize for the inconvenience this outage has caused and appreciate your patience as we work to restore normalcy.
10 -
Update
We have identified the root cause, which is a major step towards the final fix.
It is a two-part fix. Part 1 has been completed.
Once we have a timeline on Part 2, we will let you know.
Again, thanks for your patience.
10 -
We want to assure you that we're actively working on resolving the current issue.
To get things back on track, we're developing a new version of SmartSDR, which is currently in progress.
Once completed, it will undergo thorough regression testing to ensure it is functional. Rest assured, we're prioritizing this to minimize any further inconvenience.
We understand that this solution will require a radio update, and we want to be transparent about the process.
The update can't be executed via SmartLink and must be performed from a local PC that is not connected through SmartLink. This is the typical SmartSDR update process that you would normally follow when moving to a new SmartSDR version.
We genuinely appreciate your patience and understand that this situation may pose challenges. Please know that we're doing our utmost to expedite this process and have your radios back in optimal working condition.
We value your support and understanding as we work diligently to resolve this matter.
12 -
Wednesday AM Update:
The new v2 and v3 builds of SmartSDR, which include the required changes, are performing well in regression and Alpha testing. The fixes in these builds will allow the use of SmartLink again.
Hopefully, those will be released early this morning as v2.10.1 and v3.5.9.
Again, thanks for your patience.
9 -
Wednesday afternoon update.
The software is ready to go. We are waiting for the final sign off and the required documentation to be created and published.
Hopefully, it won't be too much longer.
9 -
Friday AM - Update. This was also emailed to all of our user install base.
Our servers are still working at 100% to resolve the required TLS calculations. More and more users are returning to being online.
Regarding our SmartLink Outage
On 9/11 around noon CDT, we began to receive reports of trouble using SmartLink. This outage was caused by a Root CA certificate expiring that our SmartLink SSL cert depended on. In layman’s terms, this meant that a secure connection to the server could not be made.
The discovery of the root problem occurred around 3AM CDT 9/12, roughly 15 hours after the initial reports. Fortunately, we were already in a position to update our cert and the supporting root CA cert because of lessons learned from prior years. Actions taken earlier in the morning on 9/11 helped accelerate this work. The new certificates were in place on the SmartLink server by 4AM CDT 9/12.
Unfortunately, there was one more step required as radios were still not communicating with the server due to the new root CA not being in our cert bundle in the radio. SmartSDR v3.5.9 and v2.10.1 were developed to address that issue and pushed through internal and Alpha testing. With their release around 3PM CDT 9/13, SmartLink became functional again, though the server was extremely sluggish.
With the SmartLink service functional, our attention has turned to the sluggish response of the SmartLink server. This has caused a fair bit of confusion for folks that updated their software and still experienced timeout errors. Through some investigation, it has become clear that radios running earlier versions of SmartSDR attempting to connect to the SmartLink server are driving most of the server load.
This isn’t a terrible surprise as the TLS connection negotiation is a CPU intensive process and the radio will continue to attempt to connect upon a failure. We will continue to investigate ways to mitigate this problem on the server side, but the best way that you can help is to update your software (and radio) to the latest version.
Rest assured that steps are being taken to avoid problems like this from happening in the future. We are committed to providing the best radio experience possible and we understand the frustration such an outage causes to our customers and apologize for the inconvenience.
**Action Required:**
If you have never used SmartLink or created a SmartLink login, there is no action required on your behalf.
In order to help, we are asking you to upgrade to the latest version of SmartSDR.
This can be either v2.10.1 or v3.5.9.
If you have already taken the time to update SmartSDR on your PC and also upgraded the firmware on the radio, we thank you very much.
For those SmartLink users, you will be required to be at v2.10.1 or v3.5.9 in order to continue to use SmartLink.
If you are a Mac or iOS user, make sure that SmartSDR for Mac and SmartSDR for iOS have also been updated.
- **Visit our Downloads Page:** Go to our official downloads page at https://www.flexradio.com/ssdr/
- **Select Your Software Version:** Choose either SmartSDR v2.10.1 or v3.5.9, depending on your current version in use.
- **Install SmartSDR:** Download and install the selected software version.
If you need a review on how to do so, this video should help: https://youtu.be/yr1CWw9NgPM?si=rgII1fmfFqXxtSwS
At FlexRadio, we are committed to your security and satisfaction. By keeping your software up to date, you not only enhance your device's security but also ensure that you have access to the latest features and improvements.
0 -
The latest update from Engineering:
Flexers,
As we head into the weekend, we wanted to give you an update on the SmartLink Server situation. As a recap for those who aren’t in the loop, earlier this week we encountered a SmartLink outage as a result of a Root CA certificate expiration. Side note for some geek speak explanation: For those not familiar with SSL certificates, these are used to prove that our server is indeed the real FlexRadio SmartLink server by being verified through a chain of other certificates that are trusted from a Certificate Authority (CA). We addressed the certificate issue both on the server and in our radios (via SmartSDR v3.5.9 and v2.10.1), but have since been dealing with sluggish response from our SmartLink server. Thank you to everyone that updated to the latest versions as that will make a difference.
We have scaled up the server resources to help meet the demand in several steps. The first of these steps appeared to be working well until just before midnight CDT last night. With more configuration tweaking today, things appear to be stabilizing. The response time of the server is still slower than we would like, but it looks like connections are being negotiated successfully in our testing over the last few hours.
We know that many of our customers are frustrated by the downtime and unpredictable SmartLink availability and we feel that frustration as well. This has led to some healthy discussions of how to improve our infrastructure and procedures to prevent issues like this from happening and to minimize recovery time. While we can’t change the past, we can learn from it as we move towards a better future. We will continue to monitor the server through the weekend and keep you posted on any developments on that front. Thank you for your patience and your trust in our team as we navigate these challenges.
24 -
Flexers,
In the 7AM CDT hour this morning, the major remaining issue causing SmartLink disruption was addressed. This boiled down to some problems in the SmartLink server code dealing with the additional scaling we had enabled to help address the load on the server. We have had a number of confirmations from the field that the system has been stable since our latest server side change.
We will continue to monitor things and let you know if there are additional developments, but I believe the immediate issue's root cause and several complicating factors have all been addressed. We have several key lessons learned and will be working hard to ensure that we don't face another outage like this moving forward.
I am both incredibly sorry for the frustration that our Flexers have experienced during this outage -AND- incredibly proud of our team's ability to stand in the face of trials and endeavor until things are made right. As several have pointed out, there are areas of improvement and defects that were discovered along the way. What makes it all worthwhile during the hard times are the fantastic people we get to work alongside and the customers that encourage us even while experiencing frustration.
Onward and upward.
20 -
Flexers,
At 4:22AM CDT this morning we observed the SmartLink server having lingering issues similar to those observed during the outage last week. We have addressed the immediate issue and performance was back at normal levels at 7:32AM CDT. We have had 2 additional rounds of such issues since then. We will continue to monitor the situation and take measures prioritizing uptime.
We are also working towards a better solution that will be more robust. We are stress testing this solution now and expect to be able to deliver it transparently to you soon (e.g. no client software update required). This transition will require planned downtime. Watch the Community for more info.
8 -
Hi All
We want to take a moment to address the ongoing issues with SmartLink and reassure you that we are actively working on improvements to enhance your user experience. Your feedback has been invaluable in helping us identify areas for enhancement, and we want you to know that we have not forgotten about your concerns.
We understand that some customers have been experiencing lingering issues with SmartLink, and we want to assure you that our engineering team is diligently working to create a more robust and reliable solution.
We are committed to resolving these issues. We appreciate your continued support and look forward to delivering a vastly improved SmartLink service in the near future.
As more details are available that I can share, I will do so on this posting.
73 and thanks for your patience and understanding
2 -
All
We would like to inform you of an important upgrade to our SmartLink servers that will enhance your FlexRadio experience.
On Tuesday, October 3rd, at 3 PM CDT, 2000Z, we will be performing maintenance on our SmartLink servers.
During this upgrade, SmartLink will be temporarily unavailable for up to 4 hours. Should things go well, we expect this time to be much shorter.
We understand the importance of SmartLink to our users, and we're excited to share that we are replacing our servers with a new, more robust system.
Should there be any change in the schedule or some other important news, we will communicate via: FlexRadio Community, Facebook Enthusiasts Group, and X (formerly known as Twitter) from @FlexRadioSystem.
Our engineering teams have been hard at work since the SmartLink outage was first reported, and this upgrade represents a significant step forward. Moving to a new server is a positive step forward as it ensures greater reliability and performance, providing our users with a smoother and more dependable SmartLink experience. This upgrade reflects our commitment to constantly improving our services to meet our customers' needs.
To ensure its reliability, we've been testing the new server intensely with our team internally and recently expanded the testing to our Alpha Team. This new server is optimized for faster response times and will seamlessly scale to accommodate user load, providing you with a better SmartLink experience.
To take advantage of these changes and ensure SmartLink's correct operation, update your FLEX-6000 radio to SmartSDR v3.5.9 (or v2.10.1 if licensed for v2).
It is highly recommended that you update to these latest versions even if you are not a SmartLink user to take advantage of many internal performance improvements.
If you have already upgraded your FlexRadio to v2.10.1 or v3.5.9, no further action is required on your part as the server transition will happen transparently during the scheduled maintenance.
We value your trust in FlexRadio, and we apologize for any inconvenience this maintenance window may cause.
Please mark the date and time in your calendar and we appreciate your understanding as we continue to enhance our services.
Thank you for being a part of our FlexRadio community. If you have any questions or need further assistance, please don't hesitate to reach out to us.
FlexRadio Systems
512-535-4713
2 -
Post Incident Review
Summary
On Monday, 2023-09-11 0000Z, users attempting to use SmartLink™ experienced timeout errors and no access to their radios.
The event was triggered by an expired Root CA certificate on which our smartlink.flexradio.com certificate depended. This was discovered on 2023-09-12 around 0800Z.
The event was reported both through internal support channels and through the FlexRadio HelpDesk starting around 1700Z.
This critical severity incident affected all SmartLink users and had secondary performance effects on internet connected radios that were being used on Local Area Networks (LAN).
The SmartLink service was restored on the server side 2023-09-12 0900Z by installing a new smartlink.flexradio.com certificate. This addressed SmartSDR client connection failures. However, the updated certificate depended on a new SSL.com Root CA that was not present in the FLEX-6000 radio firmware.
New SmartSDR software (v3.5.9 and v2.10.1) was released 2023-09-13 1930Z to add the new SSL.com Root CA to the FLEX-6000 radio firmware and restore full SmartLink functionality for users.
SmartLink server stability has been intermittent since the new software release due to load. Downtime has been mitigated through server optimization, but ultimately it was determined that a more robust infrastructure, one that will scale more appropriately in the future, would be required to handle the current load.
Improved SmartLink server infrastructure design began 2023-09-14, was internally pre-Alpha tested 2023-09-22, and was released to the Alpha team 2023-09-25.
Transition to roll out the new SmartLink server infrastructure through scheduled downtime occurred 2023-10-03 2000Z. SmartLink server stability and uptime improved dramatically.
1 -
Leadup
A previous SmartLink outage due to certificate expiration on 2022-09-07 led to a number of stopgap measures in an attempt to prevent another outage. These steps included monthly meetings to expose any public facing services with expirations (certificates, credit cards, etc.) that could cause downtime, along with automated expiration date reminders.
This led to purchasing an updated smartlink.flexradio.com certificate 30 days early in order to avoid an outage. While this did not ultimately prevent the outage, it did accelerate the server side fix once the root cause was identified.
Fault
While the smartlink.flexradio.com SSL certificate still had 27 days left prior to expiration, the SSL.com Root Certificate Authority (CA) certificate on which the smartlink cert depends expired 2023-09-11. Because that certificate is a necessary part of the certificate chain for smartlink.flexradio.com, its expiration resulted in authorization failures from both FLEX-6000 radios and SmartSDR clients. Failure to update the certificate and the related certificate chain on the server and in the radio firmware prior to the Root CA expiration is the root cause of the incident.
Impact
For over 2 days, all SmartLink users were unable to access their radios remotely. Once service was ostensibly restored, it was spotty (sometimes radios would not show up on the available list).
Response
When the outage came to our attention 2023-09-11 1700Z, engineering was in the middle of installing the updated smartlink.flexradio.com certificate. This work was stopped to begin investigation of the growing reports of SmartLink problems. The root cause was identified 2023-09-12 0800Z. Thanks to the earlier prep and acquisition of an updated smartlink.flexradio.com SSL certificate, the issue was addressed on the server side within an hour, by 2023-09-12 0900Z. This allowed client functionality to be restored with existing software.
Unfortunately, the newer certificate CA chain was not included in the FLEX-6000 radio firmware. A patch was generated and released for Alpha testing 2023-09-12 2300Z before being released publicly 2023-09-13 1930Z.
The SmartLink server stability was impacted by high load due to FLEX-6000 radios and FlexLib based clients (SmartSDR for Windows, SmartSDR for Maestro, SmartSDR for M models) running earlier software versions with the expired certificate. The combination of the expired certificate and poor connection handling patterns in both the radios and clients overwhelmed the server.
Server scaling was initiated 2023-09-15 1830Z that helped to address load issues, but introduced new issues due to race conditions resulting in successful connections to the SmartLink server, but no data being exchanged. While the timeout errors were mitigated due to handling the load, clients did not receive necessary information regarding available radios and thus SmartLink functionality was again not possible.
Optimization efforts on the SmartLink server side were initiated prior to the new software release. The largest improvements were in place by 2023-09-17 1300Z. However, performance was still unsatisfactory: excessive load caused occasional instability, resulting in timeout errors and longer latencies when logging in.
Additional monitoring was added to mitigate SmartLink server downtime 2023-09-20 1800Z. Since then, SmartLink server restarts/swaps have been initiated to mitigate server downtime. This means shorter outages continue to be experienced intermittently.
As it became clear that the existing infrastructure was insufficient for our current load (and future growth), work began 2023-09-14 1600Z on an improved infrastructure architecture and implementation.
Testing of the new SmartLink server infrastructure outside of development began in engineering test on 2023-09-18. The pool of testers was expanded outside of engineering 2023-09-22. Minor performance tweaks and bug fixes were applied before releasing the new SmartLink server for testing by the full Alpha team. The changes were rolled out publicly during scheduled maintenance on 2023-10-03.
0 -
Timeline
2023-09-11 0000Z: SSL.com Root CA certificate expired
2023-09-11 1700Z: Helpdesk and internal reports of issues accessing radios through SmartLink
2023-09-12 0800Z: Expired certificate discovered
2023-09-12 0900Z: smartlink.flexradio.com certificate (and supporting CA chain) updated on SmartLink server
2023-09-13 1930Z: SmartSDR software (v3.5.9 and v2.10.1) was released to add the new SSL.com Root CA to the FLEX-6000 radio firmware
2023-09-15 1830Z: Server scaling helps with load, but introduces new server-side issues
2023-09-17 1300Z: Diminishing returns on existing server optimization led to life support and monitoring while pouring more resources into New SmartLink Server project
New SmartLink Server
2023-09-14: Updated infrastructure design began
2023-09-22: Internal pre-Alpha test
2023-09-25: Release to Alpha team
2023-10-03 2000Z: Public rollout
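The earlier updates in this thread quote times in CDT while the review uses Zulu (UTC). As a reading aid, here is a minimal sketch for converting between the two. It assumes CDT is a fixed UTC-5 offset (true for every date in this timeline); the helper name `zulu_to_cdt` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# US Central Daylight Time as a fixed offset (assumption: all dates fall in DST).
CDT = timezone(timedelta(hours=-5), "CDT")


def zulu_to_cdt(stamp):
    """Convert a 'YYYY-MM-DD HHMMZ' timeline stamp to a CDT datetime."""
    utc = datetime.strptime(stamp, "%Y-%m-%d %H%MZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(CDT)


for stamp in ("2023-09-11 1700Z", "2023-09-12 0800Z", "2023-09-13 1930Z"):
    print(stamp, "->", zulu_to_cdt(stamp).strftime("%Y-%m-%d %H:%M %Z"))
```

For example, 2023-09-11 1700Z corresponds to noon CDT on 9/11, and 2023-09-12 0800Z to 3AM CDT on 9/12.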
Root Cause
The SmartLink server became overwhelmed with unsuccessful connection attempts when the SSL.com Root CA certificate on which our smartlink.flexradio.com certificate depended expired. This led to server instability and timeout errors that meant only a small fraction of users were being serviced, and those that were serviced had extremely long, unacceptable latency.
Lessons Learned
While it is not typical, Root CA expirations in SSL certificate chains should be checked in addition to the expiration date of the certificate itself. Replacing these certificates well ahead of the expiration (30 days) will ensure service outages due to certificate expirations do not occur.
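To illustrate the point about checking the whole chain, here is a minimal sketch, not our actual monitoring code. The `ChainCert` type and `certs_needing_renewal` function are hypothetical, and the dates are modeled on this incident (leaf with 27 days left when the Root CA expired). A real check would read the expiration dates from the certificates themselves.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChainCert:
    """One certificate in a chain; only the fields needed for this check."""
    subject: str         # hypothetical label, e.g. "leaf" or "root CA"
    not_after: datetime  # expiration timestamp (UTC)


def certs_needing_renewal(chain, now, lead_time=timedelta(days=30)):
    """Return every cert in the chain that expires within lead_time of now.

    The whole chain is inspected, not just the leaf, because the chain
    is only valid while every certificate in it is valid.
    """
    deadline = now + lead_time
    return [cert for cert in chain if cert.not_after <= deadline]


# Illustrative dates: the leaf outlives the root CA by 27 days.
root_expiry = datetime(2023, 9, 11, tzinfo=timezone.utc)
chain = [
    ChainCert("smartlink leaf", root_expiry + timedelta(days=27)),
    ChainCert("root CA", root_expiry),
]

# A check run 20 days before the root CA expires.
check_time = root_expiry - timedelta(days=20)
flagged = certs_needing_renewal(chain, check_time)
leaf_only = certs_needing_renewal(chain[:1], check_time)
```

In this example `flagged` contains only the root CA, while `leaf_only` is empty: a check that looks at the leaf alone reports nothing wrong even though the chain is 20 days from breaking.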
Keeping the list of Root CA certificates up to date in the radio firmware can be automated to ensure that certificate updates can be performed on the server side without requiring a corresponding software update. This has been implemented (SMART-9725) and all future SmartSDR builds will include a best known Root CA list.
The SmartLink server that had served us well since launch became overwhelmed. This was a combination of a relatively under-resourced server dealing with many thousands of radios and clients all attempting the most expensive part of the connection lifecycle (initial authentication with crypto validation). While the server had performed well in ideal circumstances, it struggled under the load in this less typical scenario. Having enough performance margin (or scalability) to overcome unusual circumstances is something to be considered when designing enterprise level systems.
Corrective Actions
Our new SmartLink infrastructure includes automatic certificate renewal and installation. Any changes such as expired Root CAs will generate certificate updates that will automatically be integrated into the server moving forward.
We also integrated a process in our build system to pull the latest Root CA list into our radio firmware on every build moving forward. This will mean that when server side certificates do have to change, they should work with recent software without having to release an emergency update.
We have separated the CPU intensive crypto functions to our cloud provider’s load balancer. This will reduce the load on the application server and provide better application throughput, increasing responsiveness. We can also scale the application containers' CPU and memory as necessary. We significantly rewrote the application server software to be more efficient and increase its capacity for concurrent connections.
While carefully analyzing code paths and network activity during the recent SmartLink outage, it was apparent that even radios that were not registered for SmartLink were making an initial connection to the SmartLink server. This was not intentional and is tracked as SMART-9754.
When the radio failed to connect to the SmartLink server (in this case, as a result of an expired certificate), this resulted in subsequent attempts every 5 seconds. This is a non-standard design where a backoff is more typical (try again in 5 seconds, then ~20, then ~120, etc). This would help keep the traffic of the connection process (which involves crypto that can be expensive computationally at scale) from becoming a burden to the server if a systemic connection issue is encountered. This is tracked as SMART-9744.
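The backoff pattern described above can be sketched as follows. This is an illustration only, not the radio firmware: the multiplier of 4, the 300 second cap, and the jitter spread are assumptions chosen to roughly follow the "5, then ~20" progression, and all function names are hypothetical.

```python
import itertools
import random


def backoff_delays(base=5.0, factor=4.0, cap=300.0):
    """Yield an endless series of retry delays in seconds, growing up to cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def with_jitter(delay, spread=0.25, rng=random):
    """Randomize a delay by +/- spread so many radios do not retry in lockstep."""
    return delay * rng.uniform(1.0 - spread, 1.0 + spread)


def attempts_within(window, delays):
    """Count how many retries fit inside a window of seconds."""
    elapsed = 0.0
    count = 0
    for delay in delays:
        elapsed += delay
        if elapsed > window:
            break
        count += 1
    return count


first_delays = list(itertools.islice(backoff_delays(), 5))
fixed_retries = attempts_within(600, itertools.repeat(5.0))
backoff_retries = attempts_within(600, backoff_delays())
```

With these assumed parameters, one radio retrying every 5 seconds makes 120 attempts in ten minutes, while the backoff schedule makes 4. Multiplied across many thousands of radios, that difference is what keeps the expensive connection setup from swamping the server.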
As noted by several users in our community, it was also found that the thread priority of the SmartLink connection thread in the radio was such that these failed attempts ended up pre-empting other core functions of the radio (audio being the most obvious). The slow response of the server on top of the failed connection compounded this issue (high priority thread waiting on a slow server response). This is a poor design and is tracked as SMART-9755.
We have a similar non-standard design observed in FlexLib (used in SmartSDR for Windows, SmartSDR for Maestro, and SmartSDR for M models). In this case, we have backoff logic when a connection doesn’t work out, but the design pattern falls apart with certain connection failures such as certificate authentication exceptions. This results in a disconnect call within a tight loop that initiates another connection immediately. The result is a barrage of connection attempts from these clients. This bug is tracked as SMART-9772.
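One way to avoid this class of bug is to route every failure type through the same delay path. The sketch below shows the idea in Python. It is not FlexLib code, and `connect_with_retry` and `fake_connect` are hypothetical names.

```python
import ssl


def connect_with_retry(connect, delays, sleep, max_attempts):
    """Call connect() until it succeeds or max_attempts is reached.

    Every failure, including certificate verification errors, waits for
    the next delay before retrying, so no error type can skip the backoff.
    """
    last_error = None
    for _attempt, delay in zip(range(max_attempts), delays):
        try:
            return connect()
        except (ssl.SSLError, OSError) as exc:
            last_error = exc
            sleep(delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error


# Simulated server: two certificate failures, then success.
outcomes = iter([
    ssl.SSLCertVerificationError("certificate has expired"),
    ssl.SSLCertVerificationError("certificate has expired"),
    "connected",
])


def fake_connect():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


slept = []  # records the delays instead of actually sleeping
result = connect_with_retry(fake_connect, [5, 20, 120], slept.append, max_attempts=3)
```

Here the two certificate failures each consume a delay (5, then 20 seconds) before the third attempt succeeds. In a real client, `sleep` would be `time.sleep` and `delays` would come from a backoff schedule.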