James Coomer, Sr. VP of Products at DDN, talks about IT challenges such as AI workloads and how efficient data infrastructure can help solve them.
Organizations use artificial intelligence (AI), deep learning (DL) and machine learning (ML) to analyze massive amounts of data. These tools provide insights that help organizations better understand their data and ultimately improve decision-making. But many businesses are realizing that AI-powered insights place unique demands on IT infrastructure, and that project success depends on gathering, storing and processing data as efficiently as possible.
Using AI, DL and ML to analyze large amounts of data presents challenges for enterprise IT teams in terms of computing workloads, data storage and management of the enterprise infrastructure.
IT infrastructure must scale rapidly and efficiently as workloads grow, and storage is a critical element of that infrastructure.
According to a survey from 451 Research, “A key to managing and aggregating volume of data lies in managing storage tiers. 63 percent of survey respondents want to improve storage efficiency. With the adoption of AI/ML workloads, the storage layer is expected to play a crucial role in the IT environment.”1 An intelligent storage infrastructure and the management of data are critical components of today’s data centers, especially as AI projects continue to gain traction in adoption and implementation.
Storage infrastructure challenges for IT and enterprise datacenters
Current IT datacenter infrastructures are designed for business application workloads and struggle with the demanding needs of AI and DL. On-premises datacenters often have compute and storage infrastructure designed to handle modest workloads and relatively small data volumes. Their networks may use poorly optimized filesystems, with processing performed on CPUs and storage done mostly on HDDs. Even storage systems with high-performance flash can be bottlenecked by inefficient protocols or network components. This infrastructure leads to processing pipeline congestion and IO bottlenecks, while the compute infrastructure sits idle waiting for IO to complete.
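The cost of compute waiting on IO can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the function name and the timings are hypothetical and not taken from the article or from any DDN product. It compares a pipeline where the GPU waits for each read to finish with one where the next batch is fetched while the current one is being processed.

```python
def gpu_utilization(compute_s: float, io_s: float, overlapped: bool) -> float:
    """Fraction of wall-clock time the GPU spends computing on one batch."""
    if compute_s <= 0 or io_s < 0:
        raise ValueError("compute_s must be positive and io_s non-negative")
    # Serial pipeline: the GPU waits for each read before it can compute.
    # Overlapped pipeline: the next batch is prefetched during compute, so
    # the slower of the two stages sets the pace.
    step_s = max(compute_s, io_s) if overlapped else compute_s + io_s
    return compute_s / step_s


# Hypothetical timings: 50 ms of GPU compute per batch, 150 ms to read it.
serial = gpu_utilization(0.05, 0.15, overlapped=False)
prefetched = gpu_utilization(0.05, 0.15, overlapped=True)
print(f"serial: {serial:.0%}, prefetched: {prefetched:.0%}")
```

Under these assumed numbers, prefetching helps, but the GPU still spends most of its time idle because reads take longer than compute. Only faster storage closes that gap.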
AI and DL run best on Graphics Processing Units (GPUs), which are more scalable and faster than CPUs for this type of work. Maximizing GPU-based processing requires optimization across filesystems and storage. Without an optimized parallel IO data storage platform, including solid state drive (SSD) flash arrays, a GPU-based computing platform can be rendered ineffective or burdened by overly complicated and unsupportable architectures.1
It is important for datacenters to consider the cost of storage options for analytic processing, whether that means storing data in memory, using GPUs and SSD flash arrays, or storing data in a public, private or hybrid cloud.
The 451 Research survey found that a more cost-effective approach is to ensure that the data, regardless of where it resides—memory, flash or spinning disk—is optimized for the storage media on which it resides, thereby reducing an enterprise’s overall storage costs.1 For AI at scale, a single shared resource that can provide the performance necessary to feed hundreds of GPUs while keeping both hardware and management costs in check is key.
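To get a feel for what "feeding hundreds of GPUs" implies for a shared storage resource, a rough sizing calculation can help. This is a minimal sketch under stated assumptions: the function names, the per-GPU ingest rate, the sample size and the per-unit throughput are all hypothetical placeholders, not figures from the survey or from any vendor.

```python
import math


def required_read_gbps(n_gpus: int, samples_per_s_per_gpu: float,
                       mb_per_sample: float) -> float:
    """Aggregate read throughput (GB/s) needed to keep every GPU fed."""
    return n_gpus * samples_per_s_per_gpu * mb_per_sample / 1000.0


def storage_units_needed(required_gbps: float, gbps_per_unit: float) -> int:
    """Smallest number of storage units whose combined throughput suffices."""
    return math.ceil(required_gbps / gbps_per_unit)


# Hypothetical workload: 200 GPUs, each consuming 1000 samples/s of 0.5 MB.
demand = required_read_gbps(200, 1000, 0.5)
# Hypothetical storage unit that sustains 40 GB/s of reads.
units = storage_units_needed(demand, 40.0)
print(f"demand: {demand:.0f} GB/s, units needed: {units}")
```

The point of the exercise is that demand grows linearly with GPU count, so the throughput of each storage unit directly determines how much hardware has to be bought and managed.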
IT system administration and maintenance challenges
Without storage systems optimized to run AI, DL or ML, enterprise datacenters may need to overprovision existing systems and hire additional staff or consultants, resulting in additional data storage costs. A common complaint among IT staff is the time spent on routine maintenance and system administration of their storage, and AI workloads only exacerbate the problem. An intelligent storage system capable of automating many of these tasks is needed to free up IT staff to support business operations or develop applications that meet specific enterprise needs.
A shift to a data-centric operation requires a completely different set of capabilities from the current datacenter infrastructure. The proper selection of an intelligent data storage platform and its efficient integration into the datacenter infrastructure are key to eliminating AI bottlenecks, accelerating time to insight, reducing costs, and freeing IT staff from mundane maintenance tasks.
An intelligent storage platform must include:
- Scalability: Infrastructure that is flexible and enables efficient handling of a wide breadth of data sizes and types as well as allowing for storage expandability as datasets grow
- Flexibility: Must be architected for all types of IO patterns and data layouts, handling any thread count, the toughest IO patterns and dynamic data placement
- High performance parallelism: Highly parallelized architecture that delivers data simultaneously to all processes running within the GPUs to eliminate waiting for data transactions
- Security: Storage system must provide high data availability, maximum system uptime and be integrated as a fully redundant system
- Low latency: The data path must be optimized to deliver data from disk to GPU with minimal overhead (latency)
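The parallelism requirement in the list above can be illustrated from the client side. The following is only a sketch: it uses Python's standard `concurrent.futures` thread pool to issue several reads at once against temporary files, and the shard names and sizes are made up for the demo. A real parallel storage platform spreads such requests across many storage servers, which this example does not attempt to model.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_shard(path: str) -> bytes:
    """Read one data shard in full."""
    with open(path, "rb") as f:
        return f.read()


def read_shards_parallel(paths, max_workers: int = 8):
    """Issue all shard reads concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_shard, paths))


def demo():
    # Create four small hypothetical shards in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            path = os.path.join(tmp, f"shard_{i}.bin")
            with open(path, "wb") as f:
                f.write(bytes([i]) * 1024)
            paths.append(path)
        return read_shards_parallel(paths)


shards = demo()
print([len(s) for s in shards])
```

Threads are used here because file reads release Python's global interpreter lock, so multiple requests can be in flight at the same time.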
Automated management in an intelligent storage platform
“Intelligent Infrastructure” is a popular term, with many companies claiming their infrastructure meets the criteria. An intelligent storage platform should simplify and automate management, improve insight and performance, and provision with precision. It should be flexible enough to move from POC to production seamlessly, no matter how much the data sets grow. It should deliver performance that maximizes the rest of the investment in AI and DL: IT resources, by maximizing utilization of expensive GPU systems, and human resources, by speeding up time to insight.