Spotlight on Data Infrastructure in the Age of AI

By Crystal Valentine, PhD

Generative AI has dominated private investing in 2023. Since the start of the year, we have seen a seemingly never-ending succession of announcements from new and established companies about how they are leveraging generative AI to produce better experiences for customers and more efficient processes for employees. As a result, a huge proportion of venture investments this year have focused on generative AI applications, large language model operations, and semiconductors. At Cross Creek, however, we believe that investments in data will also provide valuable leverage to companies as they develop data-rich AI applications. Data infrastructure companies are therefore poised to benefit from large AI tailwinds, and we believe there are great investment opportunities within the data category.

Before joining Cross Creek, I worked with incredibly talented and high-performing data practitioners, most recently as the Chief Data Strategy Officer at Eventbrite (NYSE:EB), where I led the company’s data teams. In talking with many of my former peers in industry and with my current colleagues on Cross Creek’s own data team, I have come to believe that data infrastructure represents an important focus area for companies looking to flex their AI muscles. After all, whether a company is focusing on prompt engineering or fine-tuning a foundation model, leveraging its unique data assets is crucial to providing context to the models. During Microsoft’s earnings call last month, Satya Nadella observed that “every AI app starts with data and having a comprehensive data and analytics platform is more important than ever.” Indeed, the unique context that a company’s proprietary data represents is the key ingredient in creating AI-powered tools that are relevant and tailored to its domain.

Data teams, often the unsung heroes of product innovation, are among the most technically skilled and in-demand developers within software engineering organizations. Data practitioners command high salaries, yet they remain woefully underserved by existing tools and platforms. It is our view that technologies that empower data teams to make the best use of their time provide great value to data-driven organizations, and we are excited about data infrastructure companies that address the needs of high-performing data teams.

Despite the centrality of data in the modern enterprise, the challenges facing data teams today are the same ones they have faced for well over a decade. Data practitioners wrestle with fundamental tradeoffs that arise in designing a versatile and extensible data architecture. We highlight some of these tensions and their consequences below:

Centralization vs. decentralization

First, there is the never-ending tug-of-war between the proliferation of local copies of data sets, which supports speed and agility, and the desire for a unified data platform that can accommodate full, aggregate analyses. Over the past four decades, we have seen the pendulum swing from multiple, disparate, special-purpose data management technologies toward a unified data lake-turned-swamp, then back toward a decentralized, domain-oriented data mesh architecture, and back again. Striking the right balance between centralization and decentralization is a fundamental challenge that data teams must navigate. Newer data transformation workflow technologies promise to let special-purpose data sets be created quickly within the data warehouse, which seems like a step in the right direction. Even so, data proliferation remains a challenge even if all the tables are stored in the same database management system (DBMS).
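To make that pattern concrete, here is a minimal sketch of an in-warehouse transformation in the spirit of SQL workflow tools such as dbt, using DuckDB as a stand-in warehouse. The tables, columns, and view names are hypothetical, invented purely for illustration.

```python
# A minimal sketch of in-warehouse transformation, using DuckDB as a
# stand-in warehouse. Table and column names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory "warehouse"

# A central, governed source table.
con.execute("""
    CREATE TABLE events (
        user_id INTEGER, event_type VARCHAR, occurred_at DATE
    )
""")
con.execute("""
    INSERT INTO events VALUES
        (1, 'signup',   DATE '2023-09-01'),
        (1, 'purchase', DATE '2023-09-03'),
        (2, 'signup',   DATE '2023-09-02')
""")

# A special-purpose, team-owned data set defined declaratively as a
# query over the central table -- it lives inside the same DBMS, so
# there is no exported copy to drift out of sync.
con.execute("""
    CREATE VIEW daily_signups AS
    SELECT occurred_at, COUNT(*) AS signups
    FROM events
    WHERE event_type = 'signup'
    GROUP BY occurred_at
""")

print(con.execute(
    "SELECT * FROM daily_signups ORDER BY occurred_at"
).fetchall())
# [(datetime.date(2023, 9, 1), 1), (datetime.date(2023, 9, 2), 1)]
```

Tools in this category layer version control, testing, and documentation on top of the same basic idea: the derived data set is a query, not a copy.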

Data democratization vs. accurate interpretation of data

As the value of data has become widely accepted within the enterprise, more users want access to data and insights. Data teams must negotiate the tension between wanting to democratize access to data and needing to ensure consistent and correct interpretation of the analyses done with it. As any analyst can tell you, knowing SQL is not sufficient to ensure that you draw the correct conclusions about a business from a raw data set. Data visualization tools and dashboards can provide insights to business users, but they still require development and maintenance time from data teams, and dashboard proliferation can strain reporting resources.
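As one concrete illustration of that point, consider the classic join fan-out mistake, sketched below with hypothetical tables and figures (DuckDB is used purely for convenience): a syntactically valid query quietly double-counts revenue.

```python
# Why SQL fluency alone does not guarantee correct conclusions: joining
# orders to shipments fans out rows, so a naive revenue sum counts an
# order once per shipment. All names and figures are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders (order_id INTEGER, amount DOUBLE)")
con.execute("INSERT INTO orders VALUES (1, 100.0), (2, 50.0)")
con.execute("CREATE TABLE shipments (order_id INTEGER, carrier VARCHAR)")
con.execute("INSERT INTO shipments VALUES (1, 'ups'), (1, 'fedex'), (2, 'ups')")

# Looks reasonable, but order 1 is counted once per shipment.
naive = con.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]
print(naive)  # 250.0 -- wrong: true revenue is 150.0

# Correct: aggregate before (or without) the fan-out join.
correct = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(correct)  # 150.0
```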

Product development speed vs. data quality

When application developers want to “move fast and break things,” it is often the data teams that have to fix things after the fact. Data teams, who are responsible for the accuracy and timeliness of data analyses, must negotiate the tension between wanting to empower product teams to move quickly and needing to ensure that downstream data pipelines don’t break because of schema changes. Data observability and data quality tools help data teams identify and remedy problems faster, but teams are still typically reacting to problems after they arise, which means fire drills and data downtime will never totally disappear.
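To illustrate the proactive alternative, here is a minimal, hypothetical sketch of the kind of schema contract check that observability and quality tooling generalizes: validate an incoming table before downstream transformations run, rather than after a dashboard breaks. The column names and schema here are invented for illustration.

```python
# A minimal sketch of a proactive schema check: validate an incoming
# table against an expected contract before downstream transformations
# run. Column names and types are hypothetical.
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "placed_at": "object"}

def check_schema(df: pd.DataFrame, expected: dict[str, str]) -> None:
    """Raise before the pipeline runs if the upstream schema has drifted."""
    missing = set(expected) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    wrong = {c: str(df[c].dtype) for c in expected
             if str(df[c].dtype) != expected[c]}
    if wrong:
        raise ValueError(f"unexpected dtypes: {wrong}")

# An upstream product change silently renamed a column...
incoming = pd.DataFrame({"order_id": [1], "total": [9.99],
                         "placed_at": ["2023-09-01"]})
check_schema(incoming, EXPECTED_SCHEMA)
# ValueError: missing columns: ['amount'] -- caught before, not after
```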

More data vs. fast analyses

As data volumes and varieties grow at an exponential rate, data teams are still expected to produce snappy analyses and reports against that data. They must negotiate the tension between wanting to collect and interpret data of all formats and types and needing to run fast analytical queries against unwieldy data at scale. New database, file, and streaming technologies promise to solve some of these pain points, but always by making tradeoffs elsewhere. There is a deep, fundamental difference between the design principles of transactional and analytical databases, and it remains difficult to bridge that gap.
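To make that design difference concrete, here is a minimal, pure-Python sketch (with invented data) of row-oriented versus column-oriented layout, the storage choice at the root of the transactional/analytical divide.

```python
# The design difference between transactional (row-oriented) and
# analytical (column-oriented) storage, in miniature. Data and field
# names are hypothetical.

# Row-oriented: each record is contiguous -- cheap to insert or update
# one order (the transactional access pattern).
rows = [
    {"order_id": 1, "customer": "a", "amount": 100.0},
    {"order_id": 2, "customer": "b", "amount": 50.0},
]
rows.append({"order_id": 3, "customer": "a", "amount": 75.0})  # fast point write

# Column-oriented: each attribute is contiguous -- an aggregate scans
# only the one column it needs (the analytical access pattern), but a
# single-row update now touches every column list.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [100.0, 50.0, 75.0],
}
total = sum(columns["amount"])  # reads one contiguous array, not whole rows
print(total)  # 225.0
```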

Governance vs. agility

Finally, with increasing regulation around data privacy and management, data teams face new requirements for data governance. They must negotiate the tension between wanting a centralized, well-governed data platform, where lineage and metadata management are straightforward, and needing to support many stakeholders who want to move quickly and build data-driven products using local copies of the data without a lot of red tape. Every decade we see a new generation of data governance and metadata management solutions, but, in my experience, the proliferation of data always seems to outpace our ability to keep catalogs up to date.
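To illustrate why centralization makes lineage tractable, here is a minimal, hypothetical sketch of a lineage registry: if every derived data set records its inputs at creation time, impact analysis is a simple graph walk, and it is precisely these edges that are lost once local copies leave the platform.

```python
# A toy lineage registry: derived tables record their inputs when they
# are created, so "what depends on this table?" is a graph walk.
# All table names are hypothetical.
LINEAGE: dict[str, list[str]] = {}

def register(table: str, inputs: list[str]) -> None:
    """Record which upstream tables a derived table was built from."""
    LINEAGE[table] = inputs

register("daily_signups", ["events"])
register("signup_report", ["daily_signups", "accounts"])

def downstream_of(table: str) -> set[str]:
    """Everything that (transitively) depends on `table`."""
    hits = {t for t, ins in LINEAGE.items() if table in ins}
    for t in list(hits):
        hits |= downstream_of(t)
    return hits

print(downstream_of("events"))  # {'daily_signups', 'signup_report'}
```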

It is these and other tradeoffs with which data teams must grapple. There is no one-size-fits-all platform that will work for every data team, and the right tradeoffs for a given team have more to do with company culture than anything else. Moreover, the appropriateness of a data team’s architectural decisions must be evaluated continuously as the needs of the organization and the types of data being processed evolve and grow.

At Cross Creek, we believe that data infrastructure represents a great area for innovation and investment. The growth of AI within the enterprise is catalyzing the creation of myriad new data solutions and products, and data teams are busier and more in demand than ever. While machine learning engineers and other experts are clearly critical to deploying AI-driven applications into production, data is the lifeblood of AI, and data teams are the heart and soul of an organization’s ability to leverage its data effectively. The companies that can design solutions that make data teams more efficient will be at the center of value creation in the new AI arms race.

Crystal Valentine, PhD