Unpacking the evolution of DataOps and AI, from Data-Centric strategies to cutting-edge technologies, shaping the future of data-driven insights and machine learning.
Hello Guy. We are elated to have you at AI Tech Park. Could you please tell us how DataOps.live came to fruition and give us a brief about your role as CTO at DataOps.live?
I’m delighted to be here! The inspiration for DataOps.live came from our experience leading a tech-forward Systems Integration firm in England that developed advanced analytic solutions for large enterprises, mostly using cloud data platforms and associated tools and infrastructure. In building these data applications, we noticed there was no established methodology, nothing analogous to how you would use DevOps to build software applications. So we came up with the concept of DataOps, which is essentially DevOps for data. We further defined and established this concept by collaborating with a group of experts on a distinct website known as TrueDataOps.org, where we published the “7 Pillars of DataOps”. We also published the book DataOps for Dummies. And so DataOps.live was born!
As the term Data-Centric AI is on the rise could you elaborate what exactly Data-Centric approach to AI is?
Like machine learning, AI needs high-quality data that users can trust. In the world of generative AI, a wealth of world knowledge is required to apply it across many use cases. These range from chat-based experiences, to domain-specific models fine-tuned for a specific industry or use case (or both), to the emerging agentic mesh. This is where AI makes semi-automated decisions, for example in the supply chain, selecting a different supplier when a disruption occurs, such as a natural disaster temporarily halting transportation.
The data-centric approach focuses on providing correct training data to minimize false positives. Take image recognition as an example, where AI must distinguish between cats and dogs. When there is a single animal in the image, it will be correctly labeled as a cat or a dog. In a group picture, however, a cat can easily hide in a group of dogs and be incorrectly labeled as a dog. Thus, focusing on the outliers in your training set improves the model’s quality, even if you do not change the foundational model.
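The outlier-focused review described above can be sketched as a tiny loop: instead of changing the model, surface the training examples the model is least confident about, or where prediction and stored label disagree (the "cat hiding among dogs"), for a human to relabel. The record layout and threshold below are illustrative assumptions, not a real DataOps.live API.

```python
# A minimal sketch of a data-centric review loop: flag examples whose
# predicted label disagrees with the stored label, or whose prediction
# confidence is low, so a human can relabel them before retraining.

def flag_for_review(examples, confidence_threshold=0.7):
    """Return examples that deserve a human look."""
    return [
        ex for ex in examples
        if ex["predicted"] != ex["label"] or ex["confidence"] < confidence_threshold
    ]

training_set = [
    {"id": 1, "label": "dog", "predicted": "dog", "confidence": 0.98},
    {"id": 2, "label": "dog", "predicted": "cat", "confidence": 0.55},  # the hidden cat
    {"id": 3, "label": "cat", "predicted": "cat", "confidence": 0.91},
]

outliers = flag_for_review(training_set)
print([ex["id"] for ex in outliers])
```

The point is that the improvement lands in the dataset (the corrected labels), not in the model weights.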
In summary, the data-centric approach to AI is necessary to improve the trust, correctness, and reach of the AI models.
What is a Model-Centric approach? Put some light on the cons of the Model-centric approach and how it is falling back in the process of AI evolution.
The model-centric approach focuses on using the existing foundation models and improving their quality by continuously fine-tuning them, which results in smaller models that better fit the use case. This fine-tuning enhances the quality of the response for a given prompt, for example, in the area of data classification. Let’s say you want to classify the win-loss reasons for specific deals based on customer and sales representative interviews. If you prompt a foundation model, you will only sometimes get structured data as a response. Sometimes, you will get the loss reason buried in prose. And then again, sometimes you won’t get a loss reason back at all. In summary, you might have only 80% correctness.
To get to 90%+ correctness, you can fine-tune a much smaller foundational model, which typically helps with overall costs. So what could be wrong with higher quality and reduced costs?
Well, it’s a moment-in-time assessment, and it can go wrong. If the use case changes even slightly, you need to train again. If your training data no longer represents the real-world input, you train again. And if better foundational models come along, you train again.
Essentially, there is a hidden cost to continuous fine-tuning and a model-centric approach.
It is said that the con of the Data-Centric approach is that it’s not well documented and can lead to technical debt. Could you provide us your insight on the same and also explain what technical debt is?
Technical debt refers to the sum of technical artifacts in your solution that, from today’s perspective, are outdated and no longer represent best practices. Technical debt is typically created when you need to bring a solution to market quickly but later don’t take the time to harmonize and standardize it with the rest of the solution stack. The longer you wait, the more necessary knowledge leaves the company, and no one is able or willing to make a change. Combined with the lack of documentation, it becomes harder and harder to reduce your debt. And the more technical debt you accrue, the more your ability to innovate diminishes.
Applied to the data-centric approach, it is also easy to accrue technical debt. As you prepare, cleanse, and harmonize the data to train your models, it is easy to forget to document your choices as to why you cleaned up your data the way you did. Take the example of cats vs. dogs: did you note the human decision that made you relabel the cat in the group of dogs as a cat? If you didn’t, these rules could be forgotten in the next round of refreshing your training data, especially if there is a significant amount of time between the iterations. Like with other technical debt, eventually, your ability to innovate diminishes.
What technologies are driving the growth and success of Data-Centric AI?
Multiple technologies are leading to the growth of data-centric AI. Fundamentally, they all focus on avoiding the famous “garbage in, garbage out” rule. As a foundation, you will leverage data pipelines to automate the ingestion, preparation, and cleansing of data from multiple sources so that you can train your models confidently. This includes all structured data, semi-structured data, and unstructured data like images, video, and audio.
While pipelines can bring the data to a place where users can learn from it, you will also need data preparation as a cornerstone. That includes standardizing and harmonizing the incoming data such that you can join multiple datasets and pass the aggregate to model training and inference. For AI, you must also focus on data labeling to improve model correctness. Using the “cats vs. dogs” example again, labeling means identifying the cats, if needed, through supervised learning involving humans. Finally, you must deploy data augmentation to generate synthetic data variations to improve model robustness. The variations allow you to future-proof your models against the ever-changing data in the real world before you retrain the model.
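The data augmentation step mentioned above can be illustrated with a tiny sketch: generate label-preserving variations of each example so the model sees more of the real-world spread before retraining. The feature names and jitter range are illustrative assumptions only.

```python
# A minimal sketch of data augmentation: jitter a numeric feature
# (here, a notional image brightness) while keeping the label intact,
# producing synthetic variations to improve model robustness.
import random

random.seed(0)  # reproducible variations for the example

def augment(example, n=3, noise=5):
    """Produce n jittered copies of an example, preserving its label."""
    return [
        {
            "brightness": example["brightness"] + random.randint(-noise, noise),
            "label": example["label"],
        }
        for _ in range(n)
    ]

original = {"brightness": 128, "label": "cat"}
variants = augment(original)
print(len(variants))  # three synthetic variations, all still labeled "cat"
```

Real augmentation pipelines apply richer transformations (rotation, cropping, noise injection for images; paraphrasing for text), but the principle is the same: the label survives, the inputs vary.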
Next, you need continuous monitoring of your data quality metrics to identify and address potential issues early on—think “garbage in, garbage out”! Monitoring involves automated data tests for completeness, correctness, or consistency, including your labeling score. An end-to-end view will lead to success.
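The three test categories named above (completeness, correctness, consistency) can be sketched as simple checks that run before data reaches model training. The record layout and rules below are illustrative assumptions, not a real DataOps.live API.

```python
# A minimal sketch of automated data-quality tests: completeness
# (no missing labels), correctness (labels come from an allowed set),
# and consistency (all rows agree on a shared property).

records = [
    {"id": 1, "label": "cat", "width": 640},
    {"id": 2, "label": "dog", "width": 640},
    {"id": 3, "label": None,  "width": 640},  # incomplete row
]

def completeness(rows, field):
    """Fraction of rows where `field` is present."""
    return sum(r[field] is not None for r in rows) / len(rows)

def correctness(rows, field, allowed):
    """True if every non-null value of `field` is in the allowed set."""
    return all(r[field] in allowed for r in rows if r[field] is not None)

def consistency(rows, field):
    """True if all rows agree on `field` (e.g., image width)."""
    return len({r[field] for r in rows}) == 1

results = {
    "label_completeness": completeness(records, "label"),
    "label_correctness": correctness(records, "label", {"cat", "dog"}),
    "width_consistency": consistency(records, "width"),
}
print(results)
```

In a production pipeline these checks would gate promotion: data that fails a threshold never reaches training, which is exactly the "garbage in, garbage out" guard.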
Further, to industrialize your operations, you need to focus on data versioning to track changes to datasets over time. You can only empower data teams to collaborate on model evolution and track their performance by managing data and corresponding code in lockstep. Finally, let’s remember cloud computing. Only with this can you scale your data operations to terabytes of data.
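The idea of managing data and code in lockstep can be sketched by deriving one version identifier from both: any change to either the dataset or the transformation code produces a new version. This is purely illustrative; real platforms (git, DataOps.live) track versions with far richer metadata.

```python
# A minimal sketch of data/code versioning in lockstep: hash the dataset
# together with the transformation code, so the version changes whenever
# either one changes.
import hashlib
import json

def dataset_version(rows, code_text):
    """Derive a short, deterministic version id from data plus code."""
    payload = json.dumps(rows, sort_keys=True) + code_text
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

rows = [{"id": 1, "label": "dog"}, {"id": 2, "label": "dog"}]
code = "SELECT id, label FROM animals"

v1 = dataset_version(rows, code)
rows[1]["label"] = "cat"  # relabel the hidden cat
v2 = dataset_version(rows, code)

print(v1 != v2)  # the relabeling produced a new version
```

Because the identifier is deterministic, re-running the same data through the same code reproduces the same version, which is what lets teams track model performance against exact data snapshots.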
Could you highlight how DataOps.live’s journey progresses when these kinds of evolutions happen in the AI field?
DataOps.live’s DNA is rooted in DevOps for data. We fundamentally help our customers operate their modern data platform, which starts with data pipelines. Our platform orchestrates the fundamental AI capabilities just covered. Take data ingestion, preparation, and automated tests as an example. These capabilities are available with DataOps.
Data versioning is probably the most important feature we provide. First and foremost, every change request is developed in a separate environment for the data team’s members. Only once that change is tested and reviewed can it be merged to higher environments. Errors and mistakes are caught well before they land in production. The code, including your SQL, is versioned, and the corresponding data is kept in a separate environment too, until you promote it together with your code. This is DevOps for Data at its best.
In the world of Machine Learning, where everything needs to be done so meticulously, could you tell our readers why this approach is needed?
You probably have heard about Gen AI being prone to hallucinations. It is critical to improve the quality of the training data to reduce the hallucinations and, as a result, improve the accuracy of the response. To further mitigate hallucinations, it goes without saying that improving the models is necessary, too. Industry leaders like OpenAI, Anthropic, or Meta are on the frontier with their foundational models. But today, we also know that large language models (LLMs) may know the truth, yet they can mix in incorrect facts because the mixed answer is the more likely response from the machine’s point of view, an inherent property of Generative Pre-trained Transformers (GPTs).
So what is missing? The combined improvements in data-centric and model-centric AI will increase the accuracy of models. Yet we need to add automated reasoning to provide a feedback loop to the LLM response, allowing the LLM to correct the answer and return the truth it knows anyway.
How do you envision the advancements in the landscape of Machine Learning with the growing trends?
I expect the industry to see a fusion of traditional machine learning (ML) and Generative AI.
One area of synergies I envision is that GenAI will augment specific ML tasks like anomaly detection, time-series forecasting, and feature engineering. A second area will be generative AI streamlining ML workflows using AI agents. Focused on ML workflows, the agents will suggest optimized algorithms, recommend features, and enrich data pipelines. The ML toolchain will be simplified, and data scientists will become more productive.
As a third area, generative AI can revolutionize data quality. Consider cleaning, labeling, classification, and synthesizing data. Using classification as an example, even today, LLMs provide outstanding quality when classifying textual data, sometimes already exceeding the quality of traditional clustering ML models. More importantly, given the natural language interface, defining the business rules for classification becomes accessible to domain experts due to conversational interactions. You no longer need to involve the technical team, speeding up the overall process. Humans are in the loop, yet we are now way more efficient. Conversational AI applied to ML is a game-changer.
These areas of synergy will streamline the data lifecycle for ML practitioners and provide 10 times more productivity. Ultimately, generative AI has been a major disruptor in the last two years, yet traditional ML techniques are not being replaced. Instead, both evolve synergistically. As a result, the future of ML lies in leveraging the power of generative AI.
What are the ways that can help Machine Learning practitioners to get better at understanding and grasping Data-Centric AI? How is DataOps.live implementing that?
Tools like DataOps.live and the Snowflake Data Cloud help machine learning practitioners operationalize augmented data management tasks. By blending data management with AI and ML workflows powered by Snowflake with the automation and lifecycle management of data products powered by DataOps.live, the practitioners can manage all their AI and ML workloads. That includes the management of data pipelines, automated data tests, the versionability of data and code, and the federated governance and compliance provided by the combined platforms. Tools like DataOps.live further offer real-time data pipeline observability, providing visibility into operations, identifying issues quickly, and troubleshooting them proactively before they hit production. Doing so ensures that high-quality data is always available for training.
Practitioners can rely on our augmented data management capabilities. Using Snowflake native features like data profiling and anomaly detection, DataOps.live can enforce quality checks, augment the development experience with its gen AI-powered copilot Assist, and use it, among other things, for synthetic data generation to improve model robustness.
Finally, you can benefit from feedback loops to continuously learn and improve your AI solution as a data team. Snowflake’s real-time capabilities allow practitioners to incorporate feedback and adapt to changing data patterns. DataOps.live automates the training and retraining workflows by triggering updates when new data becomes available or model performance is subpar.
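The retraining trigger described above can be sketched as a simple predicate: retrain when enough new data has accumulated or when a monitored metric drops below a floor. The thresholds and metric names are illustrative assumptions, not DataOps.live's actual trigger logic.

```python
# A minimal sketch of an automated retraining trigger: fire on fresh
# data volume or on degraded model performance, whichever comes first.

def should_retrain(new_rows, accuracy, min_rows=1000, min_accuracy=0.90):
    """Return True when retraining should be triggered."""
    enough_new_data = new_rows >= min_rows
    performance_degraded = accuracy < min_accuracy
    return enough_new_data or performance_degraded

print(should_retrain(new_rows=250, accuracy=0.94))  # healthy: no retrain
print(should_retrain(new_rows=250, accuracy=0.86))  # degraded: retrain
```

In practice this predicate would be evaluated by an orchestrator on a schedule, with the accuracy figure coming from the same quality-monitoring loop that watches the training data.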
As we come to the end, what course do you envision DataOps.live taking as it continues to raise the standards for maintaining and managing data quality while delivering maximum value?
We will enhance the value we deliver to our customers in multiple areas. Let’s start with data quality. By incorporating AI for anomaly and outlier detection, we can better address quality issues in real time, preventing downstream impact on analytics, machine learning, and AI-powered data products. I expect us to enhance our federated, automated governance by dynamically deploying derived access rules for the ever-evolving data privacy regulations (e.g., GDPR, CCPA, AI Act). Governance capabilities could include automated classification, tagging, and lineage tracking across data product domains.
Automated root-cause analysis is the second area of evolution, alongside data quality. By pinpointing the cause of failures across multiple complex pipelines, we will accelerate resolution time. This builds out our advanced observability, including predictive observability. Leveraging AI to predict potential pipeline failures or bottlenecks allows our users to resolve them proactively before they occur.
As always, we will provide deeper ecosystem integration. That starts with further strengthening our integration with Snowflake to streamline data workloads and workflows. We will also expand our partnerships with leading BI and analytics tools, as we are doing now with Sigma and HEX. We will ensure interoperability with AI and ML tools like AWS Sagemaker and Bedrock to enable our customers to build AI- and ML-powered data products.
Finally, we will continue to leverage our integration with advanced AI/ML tools to build intelligent pipelines. Using generative AI, we can suggest optimal data workflows and transformations based on your historical usage patterns. We are also expanding our No-Code/Low-Code interface DataOps.live Assist to make data engineering, data science, and data application building more productive while engaging with business users and tapping into their subject matter expertise.
In summary, we see a massive opportunity to revolutionize the speed with which you build, test, deploy, and lifecycle-manage data products.

Guy Adams
CTO & Co-founder at DataOps.live
Guy Adams is the co-founder and chief technology officer at DataOps.live, the Data Products company™. As a leading provider of DataOps and AIOps solutions, the company delivers productivity breakthroughs for data teams by enabling agile DevOps automation (#TrueDataOps) and a powerful Developer Experience (DX) for modern data platforms. Guy is also the co-founder of the truedataops.org movement. He is an experienced CTO and VP who is passionate about DataOps. Guy has spent 20+ years leading software development organizations. In his current role, he brings the principles and business value of DevOps and CI/CD to data.