High-quality training data is the ultimate bottleneck in AI. Sustainable model growth depends on a shift from public scraping to ethical licensing of proprietary data.
Protege is the platform for AI training data. Can you share what that means in practice — and what problem the company set out to solve in the first place?
Protege exists to solve one of AI’s biggest bottlenecks: access to real-world, high-quality training data. While model capabilities and compute have scaled rapidly, data has not kept pace. The world’s most valuable datasets remain locked away, sitting behind institutional walls, bound by privacy obligations, or fragmented across industries. We built Protege to bridge that gap, securely connecting the companies that hold proprietary data with the AI builders who need it to train, fine-tune, and evaluate their models.
In practice, this means we help data owners, ranging from healthcare providers to media distributors, monetize their data safely and ethically, while providing AI developers with a trusted, privacy-compliant pipeline of real-world information that accelerates innovation without compromising integrity.
Many experts say we’re hitting a “data bottleneck” in AI — that there simply isn’t enough high-quality, diverse data to keep improving large models. Do you agree? What does that mean for the next phase of AI development?
The early years of generative AI were powered by scraping the public internet, which by some estimates accounts for less than 0.000001% of all global data, but we’re quickly reaching the limits of that approach. More importantly, the world’s most valuable information isn’t online at all; it lives inside medical archives, film libraries, proprietary company databases, and other private systems that have never been structured for model training.
This means that the next leap in AI performance will need to come from unlocking richer, more world-representative data. When you train an AI system only on data that is readily available, or on data synthesized or manufactured by contractors, the result is a system that does not fully represent the real world and the data created by real people. Real-world data, by contrast, grounds intelligence in context, illustrating how humans actually speak, behave, and make decisions in complex, unstructured environments. The future belongs to those who treat data quality as the true infrastructure for intelligence, and real-world data will play an integral part in that development.
As lawsuits over scraped web data and copyright violations mount, how can AI companies ensure the datasets they rely on are legal and ethical? Is true consent-based data collection at scale realistic, and what do these challenges signal about the future of data ownership and industry responsibility?
The AI industry is entering a phase of accountability and increasing transparency. For years, scraping felt like the path of least resistance, until courts, creators, and consumers pushed back against it. Now we’re seeing a transition to AI data licensing, defined by consent, compensation, and compliance.
Transparent data licensing reshapes the relationship between data providers and AI developers by turning extraction into collaboration. Instead of scraping, companies can ethically license data from vetted partners, creating clear incentives for both parties. Data providers gain revenue, control, and recognition, while AI developers access cleaner, higher-quality datasets without the legal and IP risks that come with unlicensed use. A natural market is forming for data and content ethically licensed specifically for AI, and that’s where Protege fits in.
At Protege, ethical licensing is a non-negotiable principle. All data on our platform is legally sourced, transparently licensed, and de-identified where required. We work directly with data partners, from hospital networks to TV and movie rights holders, to craft licensing frameworks that protect their IP while making the data accessible for innovation. Our mission is to prove that ethical data sourcing isn’t a burden on progress; it’s the infrastructure that makes progress sustainable.
Responsible data use is becoming central to AI development. How does Protege approach the challenge of enabling AI systems to learn from sensitive data without compromising privacy or compliance?
From the outset, we designed Protege around privacy-by-design principles, integrating compliance into the product’s architecture. Every dataset is subject to strict de-identification, quality assurance, and encryption standards, allowing sensitive information such as clinical data to be used responsibly for AI training and evaluation. We also work directly with data-holder compliance teams to ensure alignment with HIPAA and international regulations. Within the Protege Data Lab, researchers continuously audit data integrity to ensure that data provenance is upheld from start to finish.
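To make the de-identification step concrete, here is a minimal sketch of how direct identifiers might be stripped and pseudonymized before a record is released for training. The field names, salting scheme, and record format are illustrative assumptions, not Protege’s actual pipeline.

```python
import hashlib
from typing import Any

# Hypothetical direct identifiers to remove, loosely modeled on
# HIPAA Safe Harbor categories (illustrative, not exhaustive).
DIRECT_IDENTIFIERS = {"name", "ssn", "email", "phone", "street_address"}

def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a salted one-way hash so records
    can still be linked without exposing the raw value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def deidentify_record(record: dict[str, Any], salt: str) -> dict[str, Any]:
    """Drop direct identifiers and pseudonymize the patient ID."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "patient_id" in clean:
        clean["patient_id"] = pseudonymize(str(clean["patient_id"]), salt)
    return clean

# Toy clinical record before and after de-identification.
raw = {
    "patient_id": "12345",
    "name": "Jane Doe",
    "email": "jane@example.com",
    "diagnosis_code": "E11.9",
    "visit_date": "2024-03-01",  # real pipelines may also generalize dates
}
print(deidentify_record(raw, salt="per-dataset-secret"))
```

In practice the salt would be managed per dataset so the same patient cannot be linked across unrelated releases; that detail, like everything above, is part of the sketch’s assumptions.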
Data protection is no longer a formality but a key differentiator. Organizations that demonstrate privacy-by-design will be the ones trusted to power the next generation of AI.
As AI systems evolve to process increasingly complex and diverse data, how can organizations ensure consistent approaches to identifying, measuring, and mitigating bias across different data types and model architectures?
If your training sets don’t reflect the full diversity of the world, neither will your models. For example, in healthcare, much of the data used to date by many of the largest AI builders comes from large academic medical centers and providers with the resources to maintain deep patient records and rich clinical documentation. Those systems provide the kind of context that these models need to produce descriptive and nuanced outputs. However, many rural or community hospitals only capture single points in time, with limited follow-ups and fewer structured notes, meaning their patients’ experiences are often underrepresented in the data being used to power these models.
If left unchecked, that imbalance could create a new kind of bias rooted not in intent, but in the depth of the data. Bridging that gap will be essential to ensure that as AI systems become more sophisticated, they continue to represent the full spectrum of human diversity they’re meant to serve. That kind of data coverage and diversity is something we care about deeply.
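As a hypothetical illustration of what auditing that kind of coverage gap could look like, the sketch below computes each data source’s share of a training corpus and flags groups below a chosen threshold. The facility categories and the 10% cutoff are invented for the example, not a published methodology.

```python
from collections import Counter

def representation_report(records, group_key="facility_type"):
    """Compute each group's share of a training corpus."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Toy corpus: academic centers dominate, rural clinics lag.
corpus = (
    [{"facility_type": "academic_medical_center"}] * 800
    + [{"facility_type": "community_hospital"}] * 150
    + [{"facility_type": "rural_clinic"}] * 50
)

shares = representation_report(corpus)
for group, share in sorted(shares.items(), key=lambda kv: kv[1]):
    flag = "  <-- underrepresented" if share < 0.10 else ""
    print(f"{group}: {share:.1%}{flag}")
```

A real audit would also weigh record depth (follow-ups, structured notes), not just record counts, which is exactly the dimension the academic-versus-community gap described above hinges on.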
Synthetic data is often seen as a workaround to privacy and licensing issues — but it can introduce bias or unrealistic patterns. What role should synthetic data actually play in AI training?
Synthetic data certainly has a role to play, particularly in augmenting or testing models; however, it’s not a substitute for real-world data. Models trained exclusively on synthetic inputs can appear impressive in a lab but fail under real conditions. Just as a flight simulator is valuable only because it’s built on real aerodynamics, AI training and evaluation can fully reflect the human experience only if it traces back to data produced by real people.
So ultimately, the future of data isn’t “synthetic versus real.” It’s a thoughtful blend, where synthetic data serves as a tool for scalability and experimentation, and real-world, ethically sourced data keeps model training and evaluation grounded, unbiased, and diverse enough to reflect the actual situations where these models will be deployed. Without the latter, we risk building models that look intelligent in the lab but perform poorly in the real world.
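One way to picture that blend: cap the synthetic share of a training mix and tag every example’s provenance so the ratio stays auditable. The sketch below is a toy illustration under those assumptions; the 20% cap and the record format are invented for the example.

```python
import random

def blend_training_set(real, synthetic, synthetic_fraction=0.2, seed=0):
    """Build a training mix with a capped synthetic share, tagging
    each example's provenance so the mix can be audited later."""
    rng = random.Random(seed)
    # Number of synthetic examples needed for the target fraction.
    n_synth = int(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    sample = rng.sample(synthetic, min(n_synth, len(synthetic)))
    mix = ([{"source": "real", "x": x} for x in real]
           + [{"source": "synthetic", "x": x} for x in sample])
    rng.shuffle(mix)
    return mix

mix = blend_training_set(real=list(range(80)), synthetic=list(range(1000)))
n_synth = sum(m["source"] == "synthetic" for m in mix)
print(f"{n_synth} synthetic of {len(mix)} total")  # 20 of 100
```

The useful property is not the specific ratio but that provenance travels with each example, so the synthetic share can be measured and adjusted rather than drifting silently.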
Beyond the cost of compute, the integrity of AI depends on the data it’s built on. What are the most pressing structural and ethical challenges in curating and governing the training data that shapes modern AI?
The world’s data is distributed across millions of institutions, each with its own policies, formats, and fears about sharing. That fragmentation creates friction, not just for innovation, but for oversight. Without a shared governance framework, there’s no consistent way to trace provenance, verify consent, or measure quality. Additionally, regulation often lags behind technology, and AI advances orders of magnitude faster than traditional policy cycles. We need adaptive standards that evolve with the field — principles that can flex as modalities shift from text to image to sensor data.
To move forward, we need shared frameworks for ethical data governance, including clear provenance and transparent licensing, that enable collaboration to be both safe and scalable. The infrastructure for responsible data exchange is what will separate short-term hype from lasting impact.
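As a sketch of what machine-readable provenance could look like, the example below defines a hypothetical dataset manifest recording the provider, license terms, consent basis, and a content hash for downstream verification. The fields are assumptions for illustration, not a proposed standard.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class DatasetManifest:
    """A minimal provenance record that could travel with a licensed
    dataset: who supplied it, under what terms, and a content hash so
    downstream users can verify the data hasn't been altered."""
    dataset_id: str
    provider: str
    license_terms: str      # e.g. "training-only, no redistribution"
    consent_basis: str      # e.g. "institutional agreement, de-identified"
    collected_through: date
    content_sha256: str

def fingerprint(records: list[dict]) -> str:
    """Deterministic hash of the dataset contents."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

records = [{"note": "example clinical summary", "code": "E11.9"}]
manifest = DatasetManifest(
    dataset_id="ds-0001",
    provider="Example Hospital Network",
    license_terms="training-only, no redistribution",
    consent_basis="institutional agreement, de-identified",
    collected_through=date(2024, 12, 31),
    content_sha256=fingerprint(records),
)
print(manifest)
```

A shared governance framework would standardize fields like these so provenance and consent can be checked mechanically at every handoff, rather than reconstructed after the fact.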
Industries like healthcare and finance face strict data privacy rules. What can the broader AI ecosystem learn from how these sectors approach secure, compliant data sharing?
Healthcare and finance have long operated under the principle that privacy and progress must coexist. They’ve built cultures of governance, including robust audit trails, explicit consent, de-identification, and clear accountability, that the broader AI industry can learn from. In healthcare, for example, every dataset we work with carries an enormous ethical responsibility, because behind every data point is a human life. The systems we’ve built to manage that data safely, from encryption protocols to institutional review standards, can serve as a blueprint for AI at large.
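To illustrate one of those governance patterns, here is a minimal sketch of a tamper-evident audit trail: each entry hashes the one before it, so any retroactive edit breaks the chain. This is a generic construction offered as an example, not a description of any specific system mentioned here.

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only log where each entry commits to the previous
    entry's hash, making after-the-fact tampering detectable."""
    def __init__(self):
        self.entries = []

    def append(self, actor: str, action: str, dataset_id: str):
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {"ts": time.time(), "actor": actor,
                "action": action, "dataset_id": dataset_id, "prev": prev}
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)

    def verify(self) -> bool:
        prev = "genesis"
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if body["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

log = AuditTrail()
log.append("researcher-42", "query", "ds-0001")
log.append("researcher-42", "export-deidentified", "ds-0001")
print(log.verify())  # True unless an entry was modified
```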
Regulation doesn’t have to stifle innovation. When handled well, it creates the trust that makes innovation possible. The AI ecosystem is maturing, and adopting those same guardrails is how it will earn public confidence.
What does responsible data governance look like in a future where AI systems are trained on both proprietary and open datasets? Who should set the standards?
No single organization or agency can define the rules for sharing, protecting, and utilizing information in AI. It must be a collective effort that strikes a balance between innovation and accountability.
In sectors like healthcare, for example, many of our existing frameworks were designed for a pre-AI era focused on static records and isolated systems. Today, data moves fluidly between platforms and is continuously reshaped by learning models. We need modern principles that reflect this new reality, clarifying how consent is given, how data is anonymized, and how provenance and ownership are maintained across the lifecycle of an AI system.
Responsible data governance will depend on collaboration across the entire ecosystem, encompassing technology companies, data holders, regulators, researchers, and the broader public.
Looking ahead, do you have any thoughts or predictions about how the way we source and share AI data will look a couple of years from now?
Over the next several years, I anticipate a significant shift from the free-for-all era of data scraping to a structured, fairly compensated, and continually improving data ecosystem. Builders of large foundation models will consolidate where and how they access the key data sources they need to push model progress forward. As a result, attention will move down the stack to the organizations building the data infrastructure that powers them.
And of all potential data sources, real-world data will remain the gold standard. We’ll also see more vertical specialization, with curated datasets designed for specific industries such as healthcare, finance, and robotics. And with that will come new models of compensation, where creators, rights holders, and individuals share in the value their data creates.

Bobby Samuels
Chief Executive Officer & Co-founder, Protege
Bobby leads Protege’s strategy and execution across product, go-to-market, and capital formation. He co-founded Protege in 2024 and has served as CEO since inception. Under his leadership, Protege has raised $35M in funding and scaled to $30M in GMV in its first full year of business. Previously, Bobby was General Manager of Privacy Hub at Datavant, where he helped drive the company’s growth leading up to its $7.0B merger with Ciox Health to create the largest neutral health data ecosystem in the U.S. Earlier, he led partnerships at LiveRamp, where he developed expertise in building neutral data networks. Bobby holds an M.B.A. from the Stanford Graduate School of Business and an A.B. from Harvard College, where he was President of The Harvard Crimson. He brings deep expertise in regulated data exchange and translating complex infrastructure into trusted AI enablement for enterprise partners.
