Friday, November 13, 2009

A five-page short story that may be worth reading: "El Pozo" by Juan Carlos Onetti.

Tuesday, November 10, 2009

On data centers, part 2

I am the Director of Operations for our DC. When we give tours, I explain the following (in the rough order of the tour):

- Begin with the history of the building: when it was built (1995), why it was built (a result of Hurricane Andrew in 1992), and how it is constructed (twin-T, poured tilt wall).

Infrastructure:
- Take you through the generator room, show you that it is internal to the building, show you the roofing structure from the inside, and explain the N+1 redundancy, the hours on the generators, when they are due for maintenance, how they are maintained and by whom (the vendor), how the diesel is stored and supplied, and the duration of fuel at maximum and current loads (a back-of-the-envelope runtime sketch follows this list). Explain our conduct before a hurricane or lockdown, how we go off-grid 24 hours ahead of a storm, and mention our various contracts for post-storm refills and our straining/refill schedule.
- Take you to the switch gear room, explain the dual feeds from the power company, how the switch gear works, show you the three main bus breakers, show you the numerous other breakers for various sub panels, etc. Explain and show you the spare breakers we have in case replacement is needed.
- Take you to the cooling tower area, explain the piping, the amount of water flowing, the number of pumps, how many are needed, and the switching schedule. Explain the N+1 capacity and overall capability of the towers, explain maintenance, show you the replacement pumps in stock, and explain the concept of condenser-water cooling if needed.
- Take you through the UPS and battery rooms, explain the needed kW capacity, and what the UPSs back up and what they do not. Show the various distribution breakers out to the floor, their capacity, the static switches, and the bypass; explain the battery capacity, type of cells, number of cells, number of strings, the last time the jars were replaced, and how they are maintained. Explain maximum load capacity versus runtime. Answer questions about switching from utility->UPS->generator and back.
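
As a back-of-the-envelope illustration of the fuel-duration and battery-runtime questions that come up on this part of the tour, a minimal sketch; every figure below is invented for the example, not the facility's real numbers:

```python
# Illustrative runtime arithmetic -- all figures are made up for the example.
tank_gallons     = 10000.0   # usable diesel stored on site
burn_full_gph    = 200.0     # generator burn rate at maximum load, gal/hour
burn_current_gph = 120.0     # burn rate at the current load

print("Fuel at maximum load: %.0f hours" % (tank_gallons / burn_full_gph))
print("Fuel at current load: %.0f hours" % (tank_gallons / burn_current_gph))

# Battery ride-through: energy stored in the strings divided by the UPS load.
strings       = 4            # parallel battery strings
cells_per_str = 40           # jars per string
wh_per_cell   = 1200.0       # watt-hours per jar
ups_load_kw   = 300.0        # critical load carried by the UPS

stored_kwh = strings * cells_per_str * wh_per_cell / 1000.0
print("Battery ride-through at %.0f kW: about %.0f minutes"
      % (ups_load_kw, stored_kwh / ups_load_kw * 60))
```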

Raised floor:
- Take a walk on the raised floor, explain connectivity, vendors, the path diversity we have, and how the circuits are protected. Show them the network gear, dual everything, how we protect against a LAN or WAN outage, and the specific network devices we have for DDoS mitigation, load balancing, distribution, and aggregation. Explain how the telcos and others deliver DS0 to OC-12 capacity, and offer information on cross-connects for copper, fiber, and coax. Explain our offerings (from dedicated servers up to 5,000 sq ft cages) and ask what they are interested in.
- Explain what is below the floor: the height of the raise, that power and network are delivered underneath, what runs on the level-one and level-two trays, and the piping for cooling. Show the PDUs and how they relate to the breakers in the previous rooms. Show them the cooling panel and the leads out to the CRAC units, explain the cooling capacity, plans for future cooling, hot/cold-aisle fundamentals, and temperature goals. At this point there are usually more questions about vented tiles, the power types available, and overall floor density in watts/sq ft (a small density calculation follows this list).
- Explain the fire detection/mitigation system and the monitoring of PDUs, CRAC units, and the FM-200 system. Explain the maintenance of the fire system, and show them the fire marshal inspection logs and the panels that alert the police and fire departments (both on the floor and in our security office in front).
- While finishing the walk on the floor, show the cameras, explain the process for bringing in and removing equipment, tell them the video retention period, and explain the rounds the guards make and how the access list is updated and changed.
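
For the watts-per-square-foot question, a minimal sketch of the arithmetic; the floor size, load, and rack footprint are invented for the example:

```python
# Illustrative floor-density arithmetic -- the numbers are made up.
raised_floor_sqft = 20000.0   # usable raised-floor area
critical_load_kw  = 2000.0    # total IT load on that floor

watts_per_sqft = critical_load_kw * 1000.0 / raised_floor_sqft
print("Average density: %.0f W/sq ft" % watts_per_sqft)        # 100 W/sq ft

# Per-rack budget, assuming each rack "owns" 30 sq ft including aisle space.
rack_footprint_sqft = 30.0
print("Power budget per rack: %.1f kW"
      % (watts_per_sqft * rack_footprint_sqft / 1000.0))       # 3.0 kW
```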

NOC:
- At this point we're back at the front of the building. We go into the NOC, where I explain what we are monitoring (connectivity, weather, scheduled jobs, etc.) and introduce the NOC and security staff, explaining that they will always get a person if they call. I submit a test ticket from an e-mail on my phone so they can see the alerts light up and the NOC pager go off. The final steps are to introduce them to security, and then I lead the customer(s) to the conference room so they can continue the conversation with the sales associate.

The salesperson is normally with us. During the tour we explain our SAS certifications and disclose any other NDA information. I see two types of tours. The first is the discovery tour, when a company or government entity is on a fact-finding mission to see whether we are close to their needs; afterwards they talk with the SA. The other type (more common) is the tour taken after the agreement has been worked out, as the final "sales" step. Our facility really sells itself: once on a tour, most serious prospects sign up within 24 hours. I probably missed a few things, so if you want me to follow up I can. For me, everything I present is something the customer NEEDS to know before installing in my building.
On data centers:



- Raised floor is certainly important, and a given. Check
- Cable management above AND below the floor. This is not an either-or... Check
- Cooling capacity is hard to judge and should be scalable. Redundancy is often overlooked but is frequently even more important than capacity... Check
- Power quality: I've never seen a big data center without a Liebert, or at least a UPS in every rack. Power does not have to be conditioned except between the UPS and the machines/devices. A whole-data-center power conditioner is often more efficient, but unnecessary for the little guys. Either way, check.
- Age is irrelevant as long as it's under support; if it's not, replace it. Generators need to be run several times a year to validate their condition and to grease the innards... I've seen too many good generators get kicked on and then fail an hour later because the oil hadn't been changed in three years.
- Outages should be tracked by system, rack row, and power distro. When systems seem to be going down more frequently in one area, there's usually an underlying reason (a small grouping sketch follows this list)... As Google recently proved for us all, do not ASSUME all is well: routine diagnostics, including memory scans, should be performed on ALL hardware. Even ECC RAM deteriorates (rapidly) with age and needs to be part of a maintenance testing and replacement policy. Check.
- Fire suppression is usually part of your building codes, and a given, as are the routine checks (at least annually) required by law.
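
As a sketch of the outage tracking mentioned above, grouping incidents by rack row (or by distro) is what makes the "underlying reason" jump out. The incident records below are invented for the example:

```python
# Group outage records by rack row to spot locations that fail unusually often.
from collections import Counter

outages = [
    {"system": "web01", "row": "C", "distro": "PD-2"},
    {"system": "web07", "row": "C", "distro": "PD-2"},
    {"system": "db03",  "row": "A", "distro": "PD-1"},
    {"system": "app11", "row": "C", "distro": "PD-2"},
]

for row, count in Counter(o["row"] for o in outages).most_common():
    print("row %s: %d outages" % (row, count))
# Three of the four incidents land in row C on the same distro: that points to
# an underlying power or cooling problem in that row, not random hardware luck.
```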

In addition, we deploy:
- Man traps on all entrances to data centers. You go in one door, it closes, then you authenticate to a second door. A pressure plate ensures only one person goes in or out at a time (and if it's tripped, a security guard watching a screen has to override it).
- Full 24x7 video surveillance of the data centers.
- in/out logs for all equipment. Taking a device into or out of a data center requires that it be logged in a book (by a designated person). This applies to anything the size of a disk/tape and larger. All drive bays are audited nightly by security, and if drives go missing, security reviews the access logs and server room footage to see who might have taken them.
- clear and consistent labeling systems for racks, shelves, cables, and systems.
- pre-cable just about everything to redundant row-level switches, and allow no cabling from one server to another that does not pass through a rack/row switch first. Row switches connect to distribution switches. This keeps the cabling simple and predictable.
- Color-coded cabling: we use one color for redundant cabling (indicating there should be two of these connected to the server at all times, on separate cards in the backplane and to separate switches to boot), a separate color for generic gigabit connections, another color for DS View, another color for the out-of-band management network(s), another color for heartbeat cables, and yet another for non-Ethernet (T1/PRI/etc.). Other colors are used in some areas to designate 100m connections, special connectivity or security-enclave barriers, and non-fiber switch-to-switch connections. Every cable is labeled at both ends and every 6-8 feet in between.
- FULLY REDUNDANT POWER. It's not enough to have clean power, a good UPS, and a generator. In a large data center (more than a few rows, or anything truly mission critical), you should have two separate power companies, two separate generators, and two fully segregated power systems at the data center, room, row, and rack levels. In each data center we use two Liebert mains, each row has a separate distribution unit connected to a different main, and each rack has four PDUs (two to each distro). Every server is connected to two separate PDUs, run all the way back to two completely independent power grids (the dual-path idea is sketched just below). For a deployment of 50 servers or so this is big-time overkill; we have over 3,500 servers, so we need it. We cannot risk a PSU failure taking out racks at a time, each of which may serve dozens of other systems.
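
A quick sketch of that dual-path idea, checking that the loss of a single main strands no properly cabled server; the main, server, and PDU names are invented for the example:

```python
# Model each server by the set of power paths ("mains") its two cords reach,
# then check what the loss of either main would take down.
MAINS = ("A", "B")

servers = {
    "web01": {"A", "B"},   # properly dual-corded: one PSU per power path
    "web02": {"A", "B"},
    "db01":  {"A"},        # mis-cabled: both cords land on PDUs fed by main A
}

for failed in MAINS:
    down = sorted(name for name, paths in servers.items()
                  if not (paths - {failed}))
    print("Loss of main %s takes down: %s" % (failed, down or "nothing"))
```
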
--
There is no contest in life for which the unprepared have the advantage.

Monday, November 09, 2009

In this mail I wanted to summarize some of the capabilities and features OpenSolaris offers for networking, covering both physical and virtual networking.

- I'll start with Crossbow (http://hub.opensolaris.org/bin/view/Project+crossbow/WebHome): Crossbow is the basis for network virtualization and resource control in OpenSolaris. It allows you to create virtual NICs to be used either by a networking service (HTTP, FTP, and others) or by virtual machines (Zones, VirtualBox...). Each of those vNICs has its own priority and assigned bandwidth in order to guarantee QoS (via flows) and to prevent DoS by isolating the effect of any attack to just one vNIC rather than the physical device (a small scripted sketch follows this list). Crossbow also allows you to create a virtual switch connecting the vNICs, so you can build a complete virtual network inside OpenSolaris (called a vWire). A GUI to build and test the network-in-a-box has been released and, while it is alpha quality at the moment, it works nicely for demoing and testing: http://blogs.sun.com/observatory/entry/crossbow_virtual_wire_demo_tool . Also of interest: after the inclusion of Crossbow in OSol, similar solutions have been announced for inclusion in FreeBSD (http://itmanagement.earthweb.com/osrc/article.php/3835846/FreeBSD-to-Upgrade-Routing-Architecture.htm) by Blue Coat and in Linux (http://openvswitch.org/) by Citrix, showing how OpenSolaris is once again ahead of the competition.

- Recently, the Integrated Load Balancer (ILB: http://wikis.sun.com/display/OpenSolarisInfo/Integrated+Load+Balancer) was also added to OSol (http://www.c0t0d0s0.org/archives/6072-Loadbalancing-with-Opensolaris-or-PSARC-2008575.html), providing L3/L4 (network- and transport-layer) load balancing in the OS by default.

- Project Clearview (http://hub.opensolaris.org/bin/view/Project+clearview/WebHome) includes components such as redesigned IP tunneling and IP Multipathing (IPMP) for better behavior and observability of the networking devices. The final component of Clearview (the IP Tunneling DD) was integrated in Nevada build 125 (http://blogs.sun.com/seb/entry/clearview_ip_tunneling_in_opensolaris).

- Integrated Quagga (http://www.quagga.net/), IP Filter (http://coombs.anu.edu.au/~avalon/ip-filter.html, http://blogs.sun.com/tonyn/entry/firewall_configuration_in_opensolaris_2009), and IPsec (http://docs.sun.com/app/docs/doc/819-3000/ipsectm-1?l=en&q=mobile+ip&a=view) provide network routing, firewalling, and packet authentication and encryption (including VPN tunneling).

- And of course, DTrace can be used to debug networking problems (http://hub.opensolaris.org/bin/view/Community+Group+networking/dtrace_networking_cookbook).
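
Putting the Crossbow pieces together, here is a minimal scripted sketch; only the stock dladm(1M) and flowadm(1M) commands are assumed, while the etherstub, vNIC and flow names, the port, and the 100M cap are illustrative choices for the example:

```python
#!/usr/bin/env python
# Minimal Crossbow sketch: build a virtual switch with two vNICs and cap the
# HTTP traffic on one of them.  Run as root; names and limits are illustrative.
import subprocess

def run(cmd):
    """Echo and execute an administrative command, failing loudly on error."""
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

run(["dladm", "create-etherstub", "stub0"])            # the virtual switch
run(["dladm", "create-vnic", "-l", "stub0", "vnic0"])  # vNICs on the vWire
run(["dladm", "create-vnic", "-l", "stub0", "vnic1"])

# Guarantee QoS / contain DoS: cap HTTP traffic on vnic0 at 100 Mbit/s.
run(["flowadm", "add-flow", "-l", "vnic0",
     "-a", "transport=tcp,local_port=80",
     "-p", "maxbw=100M", "httpflow"])
```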

It's important to remember that all of these capabilities can be used simultaneously, according to our needs. One example of this is the Virtual Network Router appliance project (http://hub.opensolaris.org/bin/view/Project+vnm/VNRP), which combines Crossbow, Quagga, and Zones (all managed through Webmin: http://www.webmin.com/) to create an integrated edge router that separates intranet traffic from Internet traffic.
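
For the routing half of such an appliance, a minimal sketch of the step that turns the box into a router; only the stock routeadm(1M) command is assumed, and the zone, vNIC, and Quagga plumbing from the project page is left out:

```python
# Enable IPv4 forwarding and the routing daemons, then apply the change
# immediately -- the basic step behind an edge-router setup like VNRP.
import subprocess

for cmd in (["routeadm", "-e", "ipv4-forwarding"],
            ["routeadm", "-e", "ipv4-routing"],
            ["routeadm", "-u"]):
    subprocess.check_call(cmd)
```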

For further reading, you can visit these links:
http://hub.opensolaris.org/bin/view/Community+Group+networking/WebHome
http://www.opensolaris.com/learn/features/networking/networkall/
Also check the attached presentation, and see the Storage projects for information on the different connectivity options OSol offers (pNFS, NFS, FC, IB, and iSCSI, among others): http://hub.opensolaris.org/bin/view/Community+Group+storage/WebHome.
"Según el analista, al ser VMware el único jugador de ese mercado por bastante tiempo, el precio era muy alto para las pequeñas empresas" según Gartner http://www.datamation.com.ar/noticias/detalle_noticias.jsp?idContent=37390

They deserve it for using paid software instead of free and open-source alternatives like VirtualBox.
Pride is the biggest obstacle to science.
I like paying taxes. With them I buy civilization.
Innovation makes enemies of all those who prospered under the old regime... -- Machiavelli