Tuesday, June 15, 2010

Petroleum Engineers Should Learn from Network Engineers

I'm going to go out on a limb and say that an Internet disaster of the same magnitude as the BP oil spill is unlikely. To hedge my bet, I'm also going to say that if it does happen, the Internet will recover quickly, within hours or even minutes. The disaster won't linger for months like it appears poised to do on the Gulf Coast of the U.S.

I'll add one more caveat. The BP disaster is regional, mostly just affecting the Gulf Coast states. The Internet is global. I think a global Internet outage that lasts more than a few hours is unlikely. A regional outage is more likely, but if it happens, it won't affect the entire Southeast U.S., and recovery will be quick, within hours, especially in metropolitan areas.

What do network engineers do differently than petroleum engineers?
  • Provide extensive redundancy, with parallel links between all major components (see the network design from Cisco in the picture as an example)
  • Aim for 99.999% uptime (that's only five minutes of downtime per year)
  • Assume there will be hardware and software failures, and design for sub-second failover to a redundant path when a failure occurs
  • Assume there will be incessant security attacks from malware, viruses, Trojan horses, port scans, etc., and build systems to protect the network from these attacks
  • Build diversity into the system so that a software bug or virus breakout is contained to one vendor's equipment and doesn't affect the entire system
  • Build heterogeneity into the system so that the Internet is not owned or managed by a single company
  • When low cost is more important than redundancy, which is true for some situations, avoid using the network for applications that can fail in such a way that people, birds, and turtles die horrible deaths, and hundreds of workers lose their jobs
  • Design and build disaster recovery systems, and practice using them
  • Continually research better ways to design and build networks for high availability, security, scalability, performance, efficiency, and accuracy, using a top-down approach that puts users before technology
  • Move around bits and packets, not huge volumes of oil under tremendous pressure :-)
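To put a number on the "five nines" bullet above, here is a minimal Python sketch of the arithmetic. The function names are hypothetical, chosen just for this illustration, and the second function assumes that parallel links fail independently of one another, which real links that share a conduit, a power feed, or a software bug often do not.

```python
# Rough availability arithmetic; a sketch, not a capacity-planning tool.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(link_availability: float, links: int) -> float:
    """Availability of `links` redundant paths in parallel.

    Assumption: failures are independent, so the system is down only
    when every path is down at the same time.
    """
    return 1.0 - (1.0 - link_availability) ** links


if __name__ == "__main__":
    print(f"{downtime_minutes_per_year(0.99999):.2f} minutes per year")
    print(f"{parallel_availability(0.99, 2):.4f} with two 99% links")
```

For 99.999% uptime this works out to roughly 5.26 minutes of downtime per year, and under the independence assumption two parallel 99% links give about 99.99%. That is the basic argument for redundancy, and also the reason shared failure modes matter so much.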
I said that an Internet disaster of the same magnitude as the BP oil spill is unlikely, but I also said I was going out on a limb saying this. Am I too optimistic? Do we really have enough redundancy and fail-safes? Do we have enough diversity? Do we too often cut corners to save money, as it seems that BP did? If a major Internet outage occurs, will it be caused by a software bug, a hardware failure, or a security breach, and what steps will we take to recover quickly?

What do you think? Please comment. Thank you.

3 comments:

  1. It's way out of my area of expertise, but I doubt that an internet outage could be as catastrophic as a major oil spill. Network engineers have learned to build in lots of redundancy and security, whereas natural resource extraction companies have learned how to avoid safety regulations and responsibility for environmental damage. The recent BP spill is yet another example of irresponsible, profit-motive driven corporate decision making that discounted worker safety and environmental protection.

  2. It depends on the company. I've seen the ideal design that includes redundancy and other safety measures turned down because of cost. I bet there are engineers working for BP who had concerns and executives who ignored them. I actually think some sort of internet disaster isn't as unlikely as people think. I think that most of the internet infrastructure is located in one country, the US. Furthermore, the majority of it is controlled by one company, Verizon. I think the odds of some freak accident are still limited, but centralization is notoriously bad in networking.

  3. We plan for the obvious (and it's not even that obvious). They did not and probably do not to this day.
