Evren Sirin discusses emerging trends in data management, focusing on viable solutions for businesses and on how data shapes a company's presence in its markets.
Enterprise machine learning deployments are limited by two consequences of outdated data management practices widely used today. The first is the protracted time-to-insight that stems from antiquated data replication approaches. The second is the lack of unified, contextualized data that spans the organization horizontally.
Excessive data replication and the resulting “second order effects” are creating enormous inefficiencies and waste in most organizations. According to IDC, over 60 zettabytes of data were produced last year, and this is forecast to increase at a CAGR of 23 percent until 2025. Worse, the ratio of unique to replicated data is 1:10, which implies that most organizations’ data management methods are based on copying data.
When creating machine learning models, firms usually section off relevant data by replicating it from different sources. Models are typically trained on 20 percent of this data, while the other 80 percent remains for testing. The rigors of data cleansing, feature engineering, and model evaluation can take six months or more, during which the data grows stale, delaying time-to-insight and compromising findings.
The second repercussion of traditional, outdated data management approaches is reduced quality of insights. This effect is not only attributed to building models with stale data, but also to the inadequate relationship awareness, disconnected vertical data silos, poor contextualization, and schema limitations of relational data management techniques.
Properly implementing knowledge graphs in a modern data fabric corrects these data management issues while increasing machine learning’s value. Deploying data virtualization within a knowledge graph empowered data fabric enables companies to bring machine learning to their data—instead of the opposite, which wastes time and resources.
Moreover, the inherent flexibility of graph models and their ability to leverage interconnected relationships make preparing data for machine learning much easier, as they provide capabilities like improved feature engineering, root cause analysis, and graph analytics. This functionality is also key to helping knowledge graphs become the dominant data management construct for the next 20 years as data management and AI converge. In short, knowledge graphs will help AI as much as AI will help knowledge graphs.
The Need for Strategic Data Management
The growing volumes and varieties of data that organizations are dealing with prolong machine learning deployments. Varying data formats, schemas, and terminologies across silos or data lakes delay machine learning initiatives requiring this training data. The lack of context and semantic annotations makes it difficult to understand data’s meaning and use for specific models. Even when data is sufficiently contextualized, this information rarely persists, so organizations must start over for subsequent projects. The months of training required when replicating this varied data are made even more difficult by fast-moving data, like information collected by IoT devices. Organizations are forced to deal with this obstacle by replicating fresh data again, restarting a time-consuming process that impairs models’ functionality.
A far better approach is to train models at the data fabric layer instead of replicating data into silos. Organizations can easily create training and testing datasets without moving data. They can even specify, for example, a randomized 20 percent sample of their data with a query that extracts features and delivers a training dataset via this data virtualization approach underpinned by knowledge graphs. This methodology illustrates the connection between data management and machine learning to accelerate time-to-insight with the added benefit of training models on more current data.
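To sketch this idea in miniature, the snippet below stands in for the virtualization layer with a toy triple store instead of a real SPARQL engine; the predicate and customer names are hypothetical. One "query" assembles feature vectors in place, and a randomized 20 percent sample becomes the training set, as described above:

```python
import random

# Toy triple store standing in for a virtualized knowledge graph.
# Subjects and predicates are hypothetical, for illustration only.
triples = []
for i in range(10):
    c = f"customer{i}"
    triples += [(c, "type", "Customer"),
                (c, "age", 20 + i),
                (c, "spend", 100 * i)]

def match(pred):
    """Return {subject: object} for one predicate -- a stand-in for a query pattern."""
    return {s: o for s, p, o in triples if p == pred}

# One 'query' assembles the feature vectors; no data is copied into a silo.
ages, spends = match("age"), match("spend")
features = [(c, ages[c], spends[c]) for c in match("type")]

# Randomized 20 percent training sample, 80 percent held out for testing.
random.seed(42)
random.shuffle(features)
split = max(1, len(features) // 5)
train, test = features[:split], features[split:]
print(len(train), len(test))  # -> 2 8
```

In a production data fabric the `match` calls would be a single federated query against the knowledge graph, so the sample always reflects current data rather than a months-old copy.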
Achieving Quality Machine Learning Insights
Knowledge graphs provide a richer, superior foundation for understanding enterprise data compared with relational or other approaches. They offer contextualized understanding and relationship detection between nodes and the edges that connect them, which is how graphs store data. This capability is significantly enhanced by semantic graph data models that standardize business-specific terminology as a hierarchical set of vocabularies or taxonomies. Thus, business users can innately understand data’s meaning and relation to any use case, such as machine learning. Semantic graph data models also align data at the schema level, provide intelligent inferences about concepts or business categories, and eschew conventional problems with terminology or synonyms while delivering a complete view of enterprise data.
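A minimal sketch of this kind of taxonomy-driven inference follows; the class names, hierarchy, and synonym table are hypothetical. A synonym is first resolved to a canonical concept, and broader categories are then inferred by walking up the hierarchy:

```python
# Hypothetical hierarchical vocabulary: each class maps to its broader class.
subclass_of = {
    "GoldCustomer": "ValuedCustomer",
    "ValuedCustomer": "Customer",
    "RiskyLoan": "Loan",
}

# Synonyms map department-specific terminology onto one canonical concept,
# sidestepping the terminology problems mentioned above.
synonyms = {"VIP": "GoldCustomer", "PremiumClient": "GoldCustomer"}

def categories(term):
    """Resolve synonyms, then walk up the hierarchy to infer all broader categories."""
    term = synonyms.get(term, term)
    result = [term]
    while term in subclass_of:
        term = subclass_of[term]
        result.append(term)
    return result

print(categories("VIP"))  # -> ['GoldCustomer', 'ValuedCustomer', 'Customer']
```

A production semantic model (e.g., in RDFS or OWL) performs this kind of subclass reasoning automatically; the point here is only that a record labeled "VIP" in one silo is inferable as a "ValuedCustomer" everywhere else.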
These characteristics are pivotal for decreasing the time required to prepare data for machine learning while producing highly nuanced, contextualized insights from the available data. Another benefit of this approach is the relevance of graph-specific algorithms for machine learning. They allow organizations to take advantage of specific techniques pertaining to clustering, dimensionality reduction, Principal Component Analysis (PCA), and unsupervised learning that are ideal for preparing training data for machine learning in graph settings. These techniques and others (like graph embedding) can accelerate the feature generation process or provide impact analysis for data preparation.
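As one concrete sketch of graph-derived features, the snippet below computes two classic per-node measures, degree and local clustering coefficient, from a toy undirected graph; these values can then feed a downstream model as engineered features. The graph itself is hypothetical:

```python
# Toy undirected graph as an adjacency dict (hypothetical data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def clustering(node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in graph[u])
    return links / (k * (k - 1) / 2)

# Per-node feature vectors: (degree, local clustering coefficient).
features = {n: (len(graph[n]), clustering(n)) for n in graph}
print(features)
```

Graph libraries provide these measures (plus embeddings, PageRank, and community detection) out of the box; the value in a knowledge graph setting is that such features are computed over connected, contextualized data rather than isolated tables.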
Fusing Data Management and Knowledge Management
The overarching utility of knowledge graphs for machine learning is demonstrative of the mutually reinforcing nature of data management and knowledge management. To paraphrase Peter Norvig, Google’s acclaimed Director of Research, with enough data, one doesn’t need a fancy algorithm. That’s just what merging data management and knowledge management within a uniform data fabric supported by knowledge graphs and data virtualization provides: richer, higher-quality data that enables organizations to optimize machine learning without a perfect algorithm.
With sufficient data about their purchasing habits, for example, one doesn’t need fancy algorithms to predict which customers would be interested in a new product offering. The convergence of data management and knowledge management maximizes AI by giving organizations trained models, and algorithmically augmented intelligence to inform decision-making.
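To make the purchasing-habits example concrete, the sketch below flags likely buyers of a new offering with nothing fancier than co-occurrence counting over purchase histories; customer and product names are hypothetical:

```python
from collections import Counter

# Hypothetical purchase histories: with enough of this data, simple
# counting is often predictive -- no elaborate algorithm required.
purchases = {
    "alice": {"coffee", "grinder", "kettle"},
    "bob": {"coffee", "grinder"},
    "carol": {"tea", "kettle"},
    "dave": {"coffee"},
}

# Products the (hypothetical) new offering complements.
related = {"coffee", "grinder"}

# Score each customer by overlap with the related basket -- just counting.
scores = Counter({c: len(items & related) for c, items in purchases.items()})
print(scores.most_common(2))  # alice and bob rank highest
```

The accuracy of a toy like this scales with the breadth of data behind it, which is exactly the argument for unifying data management and knowledge management rather than hunting for a cleverer algorithm.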
Leveraging a Knowledge Graph-Powered Data Fabric with Virtualization
The practical benefits of this implementation are readily apparent, especially when data virtualization (DV) is involved. DV capabilities let organizations train models in graph settings without replicating data, creating silos, or overlooking the vital context between different nodes and datasets. Also, by aligning the metadata for information assets with data virtualization and knowledge graphs, users can draw logical inferences on categorical data, which is invaluable since machine learning involves both categorical and non-categorical data; the latter includes numerical and textual data.
Categorical data is data whose meaning is fundamentally related to critical business categories. Examples include things like valued customers, insider threats, risky loans, or any other concept that has meaning within a specific business context. Some applications, like converting euros to dollars, involve both: a non-categorical mathematical computation and the categorical concept of the country or currency. The knowledge foundation supporting categorical data is critical for embedding business rules in machine learning models utilizing these types of data, which is only possible when the business logic, data models, and machine learning models are accessible from one place via the aforementioned data fabric.
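The currency example above can be sketched as follows; the exchange rates, country mappings, and the "risky loan" threshold are all hypothetical. A categorical lookup (country to currency) drives a non-categorical numeric computation, and a business rule then produces a categorical label:

```python
# Categorical data: country -> currency (hypothetical mapping).
country_currency = {"Germany": "EUR", "France": "EUR", "USA": "USD"}

# Non-categorical numeric data: USD exchange rates (hypothetical values).
usd_rate = {"EUR": 1.10, "USD": 1.00}

def loan_in_usd(amount, country):
    """Numeric conversion driven by a categorical lookup."""
    return amount * usd_rate[country_currency[country]]

def risky(amount, country, threshold_usd=500_000):
    """Business rule over the converted value, yielding a categorical label."""
    return loan_in_usd(amount, country) > threshold_usd

print(loan_in_usd(100, "Germany"))  # 100 EUR converted to USD
print(risky(600_000, "France"))     # -> True
```

Keeping the mapping, the rates, and the rule in one fabric is what lets the same business logic feed both reporting and model training without divergent copies.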
This approach vastly surpasses relational ones and their replication drawbacks of moving tables between locations. If users don’t copy supporting tables and the underlying schema with the latter method, context is lost, resulting in errors. Plus, machine learning model inputs are limited by relational schemas, in which all questions must be predefined upfront—which is impractical with the wealth of unstructured and semi-structured data inundating the enterprise today.
Unleashing Machine Learning
Archaic data management methods of replicating data and using relational technologies are hampering enterprise machine learning. Organizations can surmount these issues with a data fabric of knowledge graphs enriched by data virtualization. This method eliminates data replication while providing more powerful data understanding for machine learning models.
The greater significance of this approach is the coalescence of data management and knowledge management, which has significant ramifications for AI. Converging both disciplines fundamentally improves machine learning’s effectiveness since, with enough data, there’s no need for fancy algorithms.
For more such updates and perspectives around Digital Innovation, IoT, Data Infrastructure, AI & Cybersecurity, go to AI-Techpark.com.