Hadoop software is designed to orchestrate massively parallel processing on relatively low-cost servers that pack plenty of storage close to the processing power. All the power, reliability, redundancy, and fault tolerance is built into the software, which distributes the data and processing across tens, hundreds, or even thousands of “nodes” in a clustered server configuration.
Those nodes are “industry standard” x86 servers that cost $2,500 to $5,000 each — depending on CPU, RAM, and disk choices. They’re usually middle-of-the-road servers in terms of performance specs. A standard DataNode (a.k.a. Worker node) server, for example, is typically a 2U rack server with a two-socket Intel Sandy Bridge or Ivy Bridge CPU with a total of 12 processors. Each CPU is typically fitted with 64 GB to 128 GB of RAM, and there are usually a dozen 2 or 3 TB 3.5-inch hard drives in a JBOD (just a bunch of disks) configuration.
Companies seeking a bit more performance, for Spark in-memory analysis or Cloudera Impala, for example, might choose slightly higher clock speeds, 256 GB or more RAM per CPU, while those seeking maximum capacity are choosing 4 TB hard drives.
Management nodes running Hadoop’s NameNode (which coordinates data storage) and JobTracker (which coordinates data processing) require less storage but benefit from more reliable power supplies, enterprise-grade disks, RAID redundancy, and a bit more RAM. Connecting the nodes together is a job for redundant 10-Gigabit Ethernet or InfiniBand switches.