Aug 06 2019
The E8 storage technology architecture is the first to break completely with the storage architectures built around mechanical devices. The underlying architecture exploits native NVMe flash drives over a high performance network fabric in a uniquely scalable way. It maps well into the separation of compute and storage of hyper-converged systems, which underlies cloud providers in general and the Amazon IaaS architecture in particular. AWS can integrate the E8 storage architecture and network to provide a highly performant and scalable compute service second to none.
The fundamental bottleneck for modern real-time application performance is storage speed. Ultra fast storage will enhance AWS database sales. It will also support AWS enterprise and SaaS customers who need to develop modern systems of record combined with Artificial Intelligence (AI) inference code and real-time analytics. These may be applications using AWS resources, or hybrid cloud applications moving code to where the data is located. The data may be in remote AWS “true hybrid storage” devices such as Outposts, or in other “buckets” in the Wikibon Hybrid Cloud Taxonomy. These application types and other high data density applications require raw IO latency measured in the very low microseconds.
Serial Attached SCSI (SAS) is a point-to-point serial protocol originally designed to move data to and from computer-storage devices such as hard drives (HDDs) and tape drives. Data throughput and IOPS are constrained.
NVMe is redesigned for the much faster NAND storage, usually in an SSD form factor. The NVMe drives use the system-level PCIe fabric, which is much faster than SAS. The potential throughput with Gen 3 PCIe is about 5 times faster than SAS, and Gen 4 PCIe is about 10 times faster.
NVMe over Fabrics (NVMe-oF) extends NVMe over PCIe to additional fabrics. These include:
These additional fabrics allow high speed low-latency access to NVMe drives to extend across systems.
The benefits of NVMe over PCIe fabric or other fabrics include:
Figure 1 – Toshiba and JAE XFMEXPRESS NVMe NAND over PCIe Fabric Storage, for ultra-mobile PCs, IoT, etc.
Source: Toshiba at Flash Memory Summit 2019
Wikibon expects NAND vendors to focus future products on NVMe for all NAND form factors, driven by PCIe performance, ease of integration and volume. This will be true across the board, for embedded device storage, PCs, mobile devices, enterprise computing and cloud computing.
A new form factor called XFMExpress has been introduced by Toshiba & JAE to provide a thin (14mm x 18mm x 1.4mm), light, removable NVMe storage card. Figure 1 to the left shows the card and the socket, and the Wikibon posting “Toshiba XFMExpress brings Edge Devices Datacenter Storage Performance” gives more details.
Wikibon expects usage of SAS and SATA with NAND storage to drop dramatically over the next few years. They will be used almost exclusively for traditional HDDs.
This section goes into technical details of the E8 storage architecture. The detail shows how AWS can deploy this technology at scale. Readers can safely skip this section. It is also useful as a reference, as all the data previously provided by E8 has been deleted from the E8 site!
Figure 2 – E8 Storage Plane Architecture
Source: E8 Presentation November 2017
E8 was named after 10 to the power 8, representing the 100 million IOPS that can be achieved from a cluster of E8 nodes. NVMe drives provide 10 times the bandwidth and IOPS of traditional SSD drives. The classic two-controller architecture constrains the performance, latency, and potential of NVMe drives.
Figure 2 above shows an outline of the E8 Storage Architecture. Importantly, the E8 architecture separates the control plane from the data plane. The data plane requires simple logic, but heavy and frequent use of computation resources, particularly for modern RAID 6 erasure code implementations. The E8 storage data plane uses one or more cores from each application server solely to manage the NVMe data queue(s) for that server. The total number of NVMe queues is 64K (>65,000). This allows very efficient scaling, and allocates the storage compute resources to the servers and applications that are using the storage resources.
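To give a feel for the scaling headroom implied by the 64K queue figure, the short sketch below divides that total among application servers. The queues-per-server values are hypothetical; the source does not say how E8 actually allocates queues, so this is an illustration of the arithmetic only.

```python
# Illustrative arithmetic only: how many application servers a 64K queue
# limit could serve for a given (hypothetical) number of queues per server.
TOTAL_NVME_QUEUES = 2 ** 16  # "64K" in the text, i.e. 65,536


def max_servers(queues_per_server: int, total_queues: int = TOTAL_NVME_QUEUES) -> int:
    """Return how many servers fit if each one uses `queues_per_server` queues."""
    if queues_per_server < 1:
        raise ValueError("queues_per_server must be at least 1")
    return total_queues // queues_per_server


if __name__ == "__main__":
    # The per-server values below are assumptions, not E8 configuration data.
    for queues in (1, 4, 16):
        print(f"{queues:>2} queue(s) per server -> up to {max_servers(queues):,} servers")
```

Even with sixteen queues per server, the limit in this simple model is in the thousands of servers, which is consistent with the article's point that the queue count is not the scaling constraint.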
Figure 3 – E8 Storage and Network Architecture
The control plane and metadata management are handled by an E8 “pod”, which consists of two servers close to twenty-four (24) dual-ported SSDs. The complex but infrequent compute requirements for the control plane are handled by this pair of small servers. E8 uses a classic two-phase commit architecture between the controller and the SSDs to protect data integrity from the failure of any component. The data “chunks” are spread across the drives using 2+2, 4+2, 8+2 or 16+2 RAID 6 erasure coding protection. This coding can be used to stripe data across the SSDs, and increase the SSD data density that can be deployed.
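The trade-off between the four protection schemes can be made concrete with a little arithmetic. The sketch below computes, for each data+parity layout named above, the fraction of raw capacity left for user data. The function name and output format are illustrative and are not taken from E8 material.

```python
# Usable-capacity arithmetic for the k+2 erasure coding layouts named in the text.
from fractions import Fraction


def usable_fraction(data_chunks: int, parity_chunks: int = 2) -> Fraction:
    """Fraction of raw capacity that holds user data in a data+parity stripe."""
    if data_chunks < 1 or parity_chunks < 0:
        raise ValueError("need at least one data chunk and non-negative parity")
    return Fraction(data_chunks, data_chunks + parity_chunks)


if __name__ == "__main__":
    for k in (2, 4, 8, 16):  # the 2+2, 4+2, 8+2 and 16+2 schemes
        frac = usable_fraction(k)
        print(f"{k}+2: stripe width {k + 2:>2} drives, usable capacity {float(frac):.1%}")
```

Wider stripes spend proportionally less capacity on parity, which is one way to read the article's point that the coding increases the deployable SSD data density. The widest scheme, 16+2, needs 18 drives per stripe, which fits within the 24 SSDs of a pod.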
E8 uses NVMe over Fabrics (NVMe-oF) to connect the drives to the servers. This is shown in Figure 3 to the left. Mellanox RoCE (RDMA over Converged Ethernet) technology provides the NVMe-oF connection into a top-of-rack (ToR) switch with 100GbE bandwidth uplinks to the servers. Each E8 “pod” has a 25 GbE connection to the ToR. This allows a large number of SSDs attached to a large number of different processors to be aggregated with built-in multi-pathing into a logical pool.
The NVMe fabric must be ultra-low latency and lossless. The E8 is using the Mellanox NVMe SNAP architecture based on BlueField SmartNIC adapters. The SNAP acronym stands for Software-defined Network Accelerated Processing. Mellanox Ethernet adapter cards provide offloading mechanisms such as erasure coding, T10/DIF, TCP and UDP offloads, and overlay offloads. The use of RDMA over Converged Ethernet (RoCE) reduces the x86 CPU resources for the data plane.
The NVMe SNAP system on a chip integrates Mellanox ConnectX-5 network adapters and 16 ARM CPU cores in the same silicon layer, coupled with a PCIe Gen 4 switched NVMe fabric and acceleration engines for security, storage and application-specific use cases. This provides the offload capabilities discussed above, and is used to provide advanced storage services such as inline compression and snapshots.
This is a good example of a Hybrid Processor Architecture, which is discussed in the next section.
Moore’s law led the industry to expect that every new version of an x86 chip would be faster, cheaper, smaller and use less power. In general, the better business decision was to wait 18 months for new faster chips rather than invest in writing more efficient software. In addition, storage access was measured in milliseconds (ms), with 20ms being good. This left plenty of time to run inefficient code.
Wikibon observes that Moore’s Law is coming to an end. For example, the new 10th generation Intel Ice Lake 10 nm x86 chips are only slightly faster than the current 14 nm chips. Improved processor architecture (the Tock part of the traditional Tick-Tock chip progression) helps performance to a very limited extent.
In contrast, flash storage now delivers 100 µs (microsecond) IO performance, about 200 times faster than traditional storage architectures. NVMe and NVMe-oF have improved the path lengths and parallelism in storage, and IO performance as low as 9 µs has been claimed. The performance of system networking has also improved significantly to 100 Gb/second (100 GbE), with 200 and 400 Gb/second available soon. An SSD can deliver about 32 GB/second of bandwidth, which before NVMe-oF was severely constrained.
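The latency figures quoted here and in the discussion of millisecond-era storage can be checked with simple arithmetic. The sketch below uses only the numbers stated in the text (20 ms, 100 µs and the claimed 9 µs); the helper names are illustrative, and the serial IO rate assumes one outstanding IO at a time, which is a simplification rather than a measured result.

```python
# Latency comparison using the figures quoted in the text.
DISK_LATENCY_US = 20_000.0   # 20 ms, described as "good" for traditional storage
FLASH_LATENCY_US = 100.0     # flash IO latency quoted in the text
NVMEOF_LATENCY_US = 9.0      # best-case NVMe-oF latency that "has been claimed"


def speedup(slow_us: float, fast_us: float) -> float:
    """How many times lower the fast latency is than the slow one."""
    return slow_us / fast_us


def serial_ios_per_second(latency_us: float) -> float:
    """IOs per second with a single outstanding IO (queue depth 1)."""
    return 1_000_000.0 / latency_us


if __name__ == "__main__":
    print(f"flash vs disk:   {speedup(DISK_LATENCY_US, FLASH_LATENCY_US):.0f}x")
    print(f"NVMe-oF vs disk: {speedup(DISK_LATENCY_US, NVMEOF_LATENCY_US):.0f}x")
    for name, lat in (("disk", DISK_LATENCY_US), ("flash", FLASH_LATENCY_US),
                      ("NVMe-oF", NVMEOF_LATENCY_US)):
        print(f"{name:>8}: {serial_ios_per_second(lat):,.0f} serial IOs/second")
```

The first ratio reproduces the "about 200 times" figure. The same calculation shows why code that was acceptable at 50 serial IOs per second becomes the bottleneck once the storage can complete an IO in a few microseconds.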
These trends mean that processing for storage and networking has to be more distributed, more specialized and lower cost. The Mellanox Bluefield technology in the previous section is a good example of a hybrid processor architecture. This allows the connection of SSDs spread across a data center into a common pool of storage with advanced functions, without draining the x86 application servers.
Figure 4 – AWS Nitro Cards
Source: Wikibon derived from AWS re:Invent 2018, downloaded August 2nd 2019
AWS has invested heavily in Nitro since 2013. Nitro consists of the Nitro security chip, the Nitro hypervisor, and Nitro cards. Clearly AWS will enhance the E8 storage technology with Nitro security and the Nitro hypervisor. However, Figure 4 to the left shows an overlap between the Nitro cards and the E8 storage architecture, which uses NVMe-oF with Mellanox technology. A clash of technology philosophies could lead to demotivation and loss of E8 personnel who have moved to the AWS center in Israel. As a result, AWS could suffer significant delay and dilution of the E8 storage technology potential.
This is a very smart acquisition by AWS. The fundamental architecture of E8 shown in Figures 2 & 3 above indicates that it can be integrated well into the AWS EC2 offerings.
The integration of E8 storage into AWS will take time. Wikibon does not expect initial availability of AWS fast storage services before early 2020. Wikibon expects migration to all regions and to Outposts hybrid EC2 instances to take another year. This assumes that the overlap issues detailed in the previous section are overcome quickly.
Wikibon Bottom Line: NVMe-oF is a new concept, with benefits that are more easily understood by developers than by operations staff. It will be a challenge for AWS to train its sales and technical staff and partners on NVMe-oF. Wikibon believes AWS will be very successful in making NVMe-oF seamless, performant, and flexible. AWS will pick up many enterprise IT mission-critical workloads where IT has not put in place aggressive NVMe-oF migration plans.
Traditional storage vendors have added NVMe capabilities to their existing All Flash Arrays (AFA). These include Datrium, Dell, HPE, IBM, Infinidat, NetApp, Pure and many others. The key challenge for these vendors is moving the data plane processing from the storage controller to the application, and avoiding the controller bottleneck. Datrium’s fundamental architecture is a good fit for NVMe-oF. Dell has support for NVMe-oF in its VxBlock 1000 converged infrastructure system. IBM is using the Power chipset to increase the performance of the controllers and is shipping the IBM FlashSystem 9100 with support for end-to-end NVMe, including an NVMe-oF host connection based on InfiniBand. IBM has been aggressively adding NVMe and NVMe-oF support across its storage software portfolio. Infinidat has announced its intention to add NVMe-oF. NetApp has NVMe over Fabrics support for the E-Series, and is shipping NVMe-oF support for Fibre Channel fabrics in ONTAP 9.4. Pure Storage has announced and shipped DirectFlash Fabric, which offers front-end NVMe-oF connectivity using a RoCE fabric.
Wikibon Bottom-line: NVMe and NVMe-oF are a fast moving train with very significant benefits to enterprise IT. Wikibon believes that storage vendors that try to keep all the NVMe-oF processing inside the traditional storage arrays will be swept aside.
HCI Platform NVMe-oF Conclusions
Hyper-converged infrastructure platforms (HCI) with Server SAN storage implementations running on them include VMware with vSAN, Nutanix, Oracle Exadata, and Pivot3. All support NVMe drives. All own the complete stack. All in theory could move the data plane processing and queues to application VMs, and use NVMe-oF to create a simplified and very high-performance storage network within and between HCI clusters. Oracle Exadata is the most advanced, with a system-wide InfiniBand RDMA network and full support for NVMe drives. Maybe we will hear more about vSAN and NVMe-oF at VMworld 2019?
Wikibon Bottom-line: NVMe and NVMe-oF are a significant opportunity for HCI vendors, as they own the complete stack. However, there is also likely to be great institutional resistance within these vendors to re-architecting existing platforms. They should take into account that cloud providers such as AWS will be implementing robust NVMe-oF solutions in the next eighteen months. Wikibon believes that HCI platform vendors that do not commit 100% to end-to-end NVMe-oF architectures will find themselves just supporting a dwindling base, rather than being seen as a modern platform for new hybrid applications.
Other NVMe-oF Architectures Conclusions
Other companies have technologies with significant NVMe-oF end-to-end architectures, or that make significant contributions to the architecture. Wikibon has highlighted the following six (in alphabetical order):
NVMe and NVMe-oF are important technologies, but not the only emerging technologies that will improve storage performance.
Increasing memory size with hybrid memory NVDIMMs (Non-volatile DIMMs which include DRAM + Flash) is gaining traction, especially to support Database-in-memory implementations. Other memory technologies such as Intel & Micron 3D XPoint PCM technologies are being proposed as alternatives to flash – fast but very expensive.
Large caches of high-performance flash or other technologies are also a common solution to boost performance. These caches work best when the working set of data actively referenced is small, and remains fairly constant. Traditional workloads found on storage arrays often have a low percentage of active data. However, most modern advanced analytics, RPA (e.g., UiPath), or AI applications do not have small and stable working sets – just the opposite. NVMe & NVMe-oF are much better solutions for these types of workload.
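The sensitivity of a cache to the working set can be illustrated with the standard average-latency model, in which each IO is served either from the cache or from the slower backing store. The latencies and hit ratios below are hypothetical values chosen for illustration and are not measurements from the source.

```python
# Average IO latency for a cache in front of slower storage:
#   average = hit_ratio * cache_latency + (1 - hit_ratio) * backend_latency
# All numeric values used below are assumptions for illustration only.


def average_latency_us(hit_ratio: float, cache_us: float, backend_us: float) -> float:
    """Expected latency per IO for a given cache hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_us + (1.0 - hit_ratio) * backend_us


if __name__ == "__main__":
    CACHE_US = 10.0       # assumed cache latency
    BACKEND_US = 1000.0   # assumed latency of the storage behind the cache
    for label, hit in (("small, stable working set", 0.99),
                       ("large, shifting working set", 0.50)):
        avg = average_latency_us(hit, CACHE_US, BACKEND_US)
        print(f"{label}: hit ratio {hit:.0%} -> about {avg:.1f} microseconds per IO")
```

Under these assumed numbers, dropping from a 99% to a 50% hit ratio raises the average latency by more than an order of magnitude. Making all of the storage fast, rather than a cached subset, removes this dependence on the hit ratio.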
Other technologies move some or all of the data processing to the SSD itself. NGD Systems with its Newport SSDs and Eideticom are examples of this technology.
CTOs and senior IT managers should understand the full potential of NVMe drives, and NVMe-oF topologies. They should move to NVMe-oF NAND storage aggressively for new deployments to take advantage of the performance and access benefits. The cost difference between NVMe and other NAND drives is rapidly disappearing as NVMe shipment volumes increase. NVMe-oF will support high density NVMe drives with erasure coding striping.
The business benefits of NVMe are the potential to drive data-driven applications, with the ability to process data from multiple sources in multiple locations in real-time. In addition, Edge computing will be much, much faster with the introduction of Toshiba’s XFMExpress form factor, which allows native PCIe connection. CIOs should encourage and drive the use of these technologies to develop hybrid applications, which can move code to where the data is, and increase the amount of data available to applications across a hybrid cloud environment.
AWS will be a natural partner in developing and deploying these hybrid applications. CTOs should use the E8 Storage architecture as a comparison reference for all NVMe-oF solutions presented to them.
Wikibon believes E8 is currently the best-of-breed NVMe-oF architecture. However, there are a number of other NVMe-oF architecture solutions in the marketplace, and Wikibon expects many of them to be acquired by system and cloud vendors.