Around the same time, Cloudflare’s chief technology officer Dane Knecht explained in an apologetic X post that a latent bug was responsible.

“In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack,” Knecht wrote, referring to a bug that went undetected in testing and had not previously caused a failure.

  • groet@feddit.org
    7 hours ago

    it shouldn’t crash the whole thing: if the bot detection module crashes, control it, fire an alert but accept the request until fixed.

    Fail open vs fail closed. Bot detection is a security feature. If the security feature fails, do you disable it and allow unchecked access to the client data? Or do you value Integrity over Availability?

    Imagine the opposite: they disable the feature and during that timeframe some customers get hacked. The hacks could have been prevented by the Bot detection (that the customer is paying for).

    Yes, bot detection is not the most critical security feature and probably not the reason someone gets hacked, but having “fail closed” as the default for all security features is absolutely a valid policy. Changing this policy should not be the lesson from this disaster.
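To make the fail-open versus fail-closed trade-off above concrete, here is a minimal Python sketch. It is not Cloudflare's design: `FailurePolicy`, `handle_request`, and `broken_detector` are hypothetical names invented for illustration, and the detector is just a callable that may raise.

```python
from enum import Enum
from typing import Callable


class FailurePolicy(Enum):
    FAIL_OPEN = "fail_open"      # favor availability: let traffic through unchecked
    FAIL_CLOSED = "fail_closed"  # favor integrity: reject traffic until fixed


def handle_request(
    request: dict,
    is_bot: Callable[[dict], bool],
    policy: FailurePolicy,
    alert: Callable[[str], None] = print,
) -> str:
    """Return "allow" or "deny" for a request, surviving detector failures."""
    try:
        return "deny" if is_bot(request) else "allow"
    except Exception as exc:
        # The detector itself failed: fire an alert, then apply the policy.
        alert(f"bot detection failed: {exc!r}; applying {policy.value}")
        return "allow" if policy is FailurePolicy.FAIL_OPEN else "deny"


def broken_detector(request: dict) -> bool:
    # Hypothetical stand-in for a detector that crashes on every request.
    raise RuntimeError("simulated detector crash")
```

With `FailurePolicy.FAIL_OPEN` the broken detector yields "allow" for every request; with `FailurePolicy.FAIL_CLOSED` it yields "deny". In both cases the alert fires, so the choice of policy is independent of whether operators are informed.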

    • Fushuan [he/him]@lemmy.blahaj.zone
      5 hours ago

      You don’t get hacking protection from bots, you get protection from DDoS attacks. Yeah some customers would have gone down, instead everyone went down… I said that instead of crashing the system they should have something that takes an intentional decision and informs properly about what’s happening. That decision might have been to clo

      You can keep the policy and inform everyone much better about what’s happening. Half a day is a wild amount of downtime if it were properly managed.

      Yes, bot detection is not the most critical…

      So you agree that if this were controlled instead of crashing everything, with them being able to make an informed decision about opening or closing things (with the suggestion of opening in the case of bot detection), that is the correct approach. What’s the point of your complaint if you do agree? C’mon.

      • groet@feddit.org
        5 hours ago

        You don’t get hacking protection from bots

        I disagree. I don’t know the details of Cloudflare’s bot detection, but there are many automated vulnerability scanners that this could protect against.

        I said that instead of crashing the system they should have something that takes an intentional decision and informs properly about what’s happening.

        I agree. Every crash is a failure by the designers. Instead it should be caught by the program and result in a useful error state. They probably have something like that, but it didn’t work because the crash was too severe.

        What’s the point of your complaint if you do agree?

        I am not complaining. I am informing you that you are missing an angle in your consideration. You can never prevent every crash ever. So when designing your product you have to consider what should happen if every safeguard fails and you get an uncontrolled crash. In that case you have to design for “fail open” or “fail closed”. Cloudflare fucked up. The crash should not have happened, and if it did, it should have been caught. It wasn’t. They fucked up. But I agree with the result of the fuck up causing a fail closed state.
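The "every safeguard fails" case above can be sketched in Python as follows. This is an illustration under stated assumptions, not how Cloudflare's stack works: the detector runs in a child process, `verdict_from_worker` is a hypothetical helper, and `os._exit(1)` stands in for a hard crash that no in-process handler gets to see. The caller's built-in default is then the only thing left to decide the outcome.

```python
import subprocess
import sys


def verdict_from_worker(
    worker_code: str, fail_closed: bool = True, timeout: float = 5.0
) -> str:
    """Run a (hypothetical) detector in a child process; map its fate to a verdict."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", worker_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        proc = None  # a hung worker counts as a failure

    if proc is not None and proc.returncode == 0:
        answer = proc.stdout.strip()
        if answer in ("allow", "deny"):
            return answer

    # The worker crashed, hung, or printed garbage. Nothing inside it could
    # handle this, so the default chosen at design time decides.
    return "deny" if fail_closed else "allow"


if __name__ == "__main__":
    crash = "import os; os._exit(1)"  # simulated uncontrolled crash
    print(verdict_from_worker("print('allow')"))
    print(verdict_from_worker(crash, fail_closed=True))
    print(verdict_from_worker(crash, fail_closed=False))
```

The default has to be picked before anything goes wrong, because at crash time there is no code left running in the worker to make the decision.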