Publish Blameless Post-Mortems

With SmartLink service restored, it would be good for FlexRadio to publish a post-mortem that puts to rest community speculation (some clear-eyed, some wild-eyed) about the root causes and aggravating factors of the outage, and that describes the process changes FlexRadio has put in place both to restore service and to reduce the severity and frequency of future outages. See for example https://www.etsy.com/codeascraft/blameless-postmortems/

These post-mortems should be:

  • blameless (no personnel publicly identified as at fault or at risk),
  • specific (i.e. not "a cert expired" but "The intermediate CA cert responsible for authenticating radio clients and software clients to the SmartLink service expired before it was replaced"),
  • detailed in
    • the problem space (see above)
    • the restoration space ("A new intermediate CA was signed by our internal root CA and installed on the SmartLink servers, which restored service.") and
    • the preventative space ("To prevent this recurring, we have put in place two independent internal processes -- one which automatically renews the cert 60 days before the old cert's expiration, and a separate one which checks the deployed cert and alerts engineering if the cert will expire within 45 days, indicating a failure of the primary system. We have further added an automated process to generate a 'cert-update-only' point release for all current releases, so that customers can avoid service issues without changing the current functionality of their devices.")
  • comprehensive, in providing a full picture of the incident, the root and aggravating causes, and other related/confounding problems (e.g. addressing in a similar manner the SmartLink backoff timer challenges, implementing some kind of source-IP-based rate limiting at the SmartLink service front end, etc.)
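
The two-threshold scheme in the preventative example above (renew at 60 days, alert at 45 days) can be sketched as a small decision function. This is only an illustration, not FlexRadio's tooling: the name `cert_action`, the returned labels, and the choice to pass the expiry time in as a `datetime` are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the example text; everything else is hypothetical.
RENEW_BEFORE = timedelta(days=60)  # primary process renews at this lead time
ALERT_BEFORE = timedelta(days=45)  # independent check alerts at this lead time


def cert_action(not_after: datetime, now: datetime) -> str:
    """Classify a deployed cert by how much lifetime it has left.

    "alert" means the cert is inside the 45-day window, which implies the
    60-day renewal process did not do its job.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= ALERT_BEFORE:
        return "alert"
    if remaining <= RENEW_BEFORE:
        return "renew"
    return "ok"


if __name__ == "__main__":
    expiry = datetime(2030, 1, 1, tzinfo=timezone.utc)
    for days in (90, 50, 10):
        print(days, cert_action(expiry, expiry - timedelta(days=days)))
```

The point of the gap between the two thresholds is that a healthy renewal process never lets the deployed cert reach the "alert" state, so any alert is a signal that the primary system has failed.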

Having read a bunch of the Facebook group posts, I'm sure that, to put it concisely, haters gonna hate. However, I hope that doesn't dissuade Flex from speaking to customers who, having put in the hard work and busted knuckles to learn their respective professions and hobbies, and having made their own mistakes along the way, can respect Flex's transparency and efforts to improve their products and services.

16
18 votes

Completed · Last Updated

Comments

  • Alan
    Alan Member ✭✭✭✭

    Well laid out and well said.

    Constructive problem identification and resolution.

    Alan. WA9WUD

  • Pete La
    Pete La Member ✭✭

    I expect a company like FRS to be more proactive and less reactive. They have been caught with their pants down too many times recently.

    Pete K1OYQ

  • km8v
    km8v Member ✭✭
    I wish I could vote for this 100 times.
  • WK2X
    WK2X Member
    Yes, absolutely.
  • Geoff AB6BT
    Geoff AB6BT Member ✭✭✭

    This should also apply to folks who post a question with a problem on this forum.

    Instead of just saying that the problem is solved, and leaving us all hanging, tell us the solution.

    Sorry...one of my pet peeves.

  • Eric-KE5DTO
    Eric-KE5DTO Administrator, FlexRadio Employee admin

    This is such a great suggestion both in the way that it was posted (constructive) and with the excellent link that explains why this is a good practice (driving the right behaviors). I can't make any promises, but I'll do what I can from my end to make sure that we publish something like this. Thanks for the suggestion.

  • KD0RC
    KD0RC Member, Super Elmer Moderator

    Thanks for this Eric. It helps to know what happened and what is being done. I appreciate the effort that goes into a report like this (been there...).

  • Trucker
    Trucker Member ✭✭✭

    Eric, if I understand correctly from the information in the link you posted, even radios not set up for and using SmartLink were pinging the SmartLink servers? Why would that be the case? I run v3.x software but don't use SmartLink. I can understand my radio checking for firmware and software updates on startup (if an internet connection is available). But otherwise, my radio and others not using SmartLink should have no effect on the SmartLink system.

    I have monitored my radio with Wireshark in the past and have only seen the initial update ping and nothing else. Has something changed? I understand about the certificates needed for accessing the SmartLink Authentication servers. But, why would this even be needed for someone not using SmartLink?

    As an aside, I think that, as some have requested, there should be an explicit way to log off of the SmartLink system, with no pings or attempted connections between the radio and the SmartLink server until the user wants to use SmartLink again (just the firmware/software update check on startup if an internet connection is available).

    Just my thoughts.

    James

    WD5GWY

  • Eric-KE5DTO
    Eric-KE5DTO Administrator, FlexRadio Employee admin

    The intention is that if your radio isn't registered for SmartLink, it wouldn't contact the SmartLink server (unless initiating the registration process). That is the on/off switch as far as our intended design goes. As stated in the postmortem:

    ...radios that were not registered for SmartLink were making an initial connection to the SmartLink server.

    This is a bug that will be addressed.

  • Trucker
    Trucker Member ✭✭✭

    Eric, I think I understand better. One other question: you mentioned registering for SmartLink as the on/off method. Does that mean that once a SmartLink user is finished connecting to their remote PC, the user is no longer logged into SmartLink for authentication until the next session? And if they connect locally over their home network, is SmartLink not involved in connecting to the radio at all, with no attempt by SmartSDR to look for a connection to SmartLink?

    Thanks for the information. (Just trying to understand the process.)

    James

    WD5GWY

  • Eric-KE5DTO
    Eric-KE5DTO Administrator, FlexRadio Employee admin

    There are multiple pieces of the puzzle here. When you "log in", you are simply getting an authentication token. That token is then used when connecting to the SmartLink server. Once you are connected to the radio, the SmartLink server is no longer in the mix with regard to the connection to the radio.

    For local connections, the SmartLink server is not involved at all. If you are logged into SmartLink, the radio chooser may show SmartLink-connected radios if any are available outside of your local network. But connecting to a local radio doesn't involve SmartLink at all. As a broker, it isn't necessary on a LAN, as the radio is broadcasting "here I am" with discovery packets, which help the client know about the available radios.
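
A minimal sketch of the flow described above, assuming hypothetical names throughout (`Radio`, `list_radios`, `broker_lookup`); it is not FlexRadio's client code or API. Radios heard via LAN discovery are always listed, and the broker is consulted only when an authentication token is present.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Radio:
    serial: str
    source: str  # "lan" (heard via discovery) or "smartlink" (listed by broker)


def list_radios(
    lan_serials: Iterable[str],
    token: Optional[str],
    broker_lookup: Callable[[str], Iterable[str]],
) -> List[Radio]:
    """Build the radio chooser list.

    LAN radios never require the broker. The broker is queried only when
    the user holds a token, and a radio already seen on the LAN keeps its
    local entry.
    """
    radios = {serial: Radio(serial, "lan") for serial in lan_serials}
    if token is not None:
        for serial in broker_lookup(token):
            radios.setdefault(serial, Radio(serial, "smartlink"))
    return list(radios.values())


if __name__ == "__main__":
    def fake_broker(token: str) -> List[str]:
        # Stand-in for a real broker query; returns made-up serials.
        return ["A1", "B2"]

    print(list_radios(["A1"], None, fake_broker))       # not logged in
    print(list_radios(["A1"], "tok-123", fake_broker))  # logged in
```

With no token, `broker_lookup` is never called, which mirrors the intended "on/off switch" behaviour for users who are not logged in.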

  • Dan Trainor
    Dan Trainor Member ✭✭✭

    Cheers to Flex Team for this Transparency. Very good.

  • Trucker
    Trucker Member ✭✭✭

    Thank you, Eric, for clarifying how the system works. I have had people argue with me who thought the system had to phone home for every single setup, no matter if the user was using SmartLink or not. I always thought that was wrong, as I have used Wireshark several times in the past and never could verify any activity (outside of my LAN) beyond the check for new updates on startup.

    James

    WD5GWY

  • John Mikucki
    John Mikucki Member ✭✭

    @Eric-KE5DTO -- I want to take a moment to thank you and the team for the linked writeup. I for one found it very informative, and feel significantly more confident in the fixes with an understanding of what they are. Please excuse the lateness of my reply; it's in no way reflective of my interest in the topic or your reply. :)

    Good post-mortems can be hard to write, especially the first one. Kudos to you and the team for both doing the hard work and for taking the suggestion to heart. Nobody wants a service outage, but if I had to take one, this is the kind of thing I want to read after it happens.

    Thanks,

    --John KD3H
