LinkedIn has described the inner workings of its AI/ML platform, which powers many of the company’s products. At the heart of this platform is VeniceDB, a NoSQL data store designed for feature persistence. This article looks at the architecture, key technologies, and lessons learned from operating the system.
LinkedIn’s AI/ML platform aims to make data scientists and engineers more productive. It provides opinionated, unified end-to-end capabilities for developing, experimenting with, and operating AI/ML workloads. One of its critical components is VeniceDB, a derived data storage platform with the features listed below.
VeniceDB Features
- High Throughput Asynchronous Ingestion: VeniceDB ingests data asynchronously from both batch sources, such as Hadoop, and streaming sources, such as Samza.
- Low Latency Online Reads: Clients can read with low latency either through remote queries or through in-process caching.
- Active-Active Replication: VeniceDB replicates writes between regions and uses CRDT-based conflict resolution so that concurrent writes in different regions converge on the same value.
- Multi-Cluster Support: Each region can contain multiple clusters, with operators controlling how datasets are assigned to clusters.
- Multi-Tenancy and Elasticity: Each cluster is shared by multiple tenants and can be scaled horizontally and elastically.
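To make the idea of CRDT-based conflict resolution more concrete, here is a minimal sketch of one of the simplest CRDTs, a last-writer-wins register. It is an illustration only: the article does not say which CRDTs VeniceDB uses, and the class name, the fields, and the region names below are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LwwRegister:
    """Last-writer-wins register (hypothetical, not VeniceDB's implementation)."""
    value: Any
    timestamp: int  # assumed write time, logical or wall-clock
    region: str     # assumed tie-breaker for writes with equal timestamps

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        # Keep the write with the larger (timestamp, region) pair. The pair
        # defines a total order, so merge is commutative, associative, and
        # idempotent, and every region ends up with the same value.
        if (other.timestamp, other.region) > (self.timestamp, self.region):
            return other
        return self


# Two regions accept conflicting writes for the same key.
us = LwwRegister(value={"clicks": 10}, timestamp=100, region="us-west")
eu = LwwRegister(value={"clicks": 12}, timestamp=105, region="eu-north")

# Whichever order the writes are replicated in, both regions converge.
assert us.merge(eu) == eu.merge(us) == eu
```

The property that matters is that merging is independent of delivery order, which is what lets every region accept writes without coordinating with the others first.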
Architecture Overview
VeniceDB bridges the offline, nearline, and online worlds. Its write path supports write operations at three granularities:
- Full Dataset Swap: Replacing an entire dataset with a new one.
- Row Insertion: Adding many rows to an existing dataset.
- Column Updates: Modifying specific columns of certain rows.
All three granularities can be written from Hadoop and Samza. In addition, any service can asynchronously produce single-row inserts and updates using the Online Producer library.
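The difference between the three write granularities can be sketched with a toy in-memory store. This is not VeniceDB's API: the class, its method names, and the sample keys and columns are hypothetical and exist only to show what each granularity changes and what it leaves alone.

```python
from typing import Any, Dict

Row = Dict[str, Any]


class ToyDerivedStore:
    """In-memory stand-in for a derived data store (hypothetical API)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Row] = {}

    def full_swap(self, new_dataset: Dict[str, Row]) -> None:
        # Full dataset swap: the new dataset replaces the old one wholesale.
        self.rows = {key: dict(row) for key, row in new_dataset.items()}

    def insert_rows(self, new_rows: Dict[str, Row]) -> None:
        # Row insertion: add (or overwrite) whole rows, keep all other rows.
        for key, row in new_rows.items():
            self.rows[key] = dict(row)

    def update_columns(self, key: str, columns: Row) -> None:
        # Column update: change only the named columns of a single row.
        self.rows.setdefault(key, {}).update(columns)


store = ToyDerivedStore()
store.full_swap({"member:1": {"age_bucket": 3, "ctr": 0.10}})
store.insert_rows({"member:2": {"age_bucket": 5, "ctr": 0.25}})
store.update_columns("member:1", {"ctr": 0.15})

# member:1 keeps age_bucket, only ctr changed; member:2 was added as a whole row.
assert store.rows["member:1"] == {"age_bucket": 3, "ctr": 0.15}
assert store.rows["member:2"] == {"age_bucket": 5, "ctr": 0.25}
```

In a real deployment these writes would arrive asynchronously from batch or streaming jobs rather than as direct method calls; the sketch only models the resulting state.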
Read Path
VeniceDB provides the following read APIs:
- Single Get: Retrieve the value associated with a single key.
- Batch Get: Fetch values for a set of keys.
- Read Compute: Project a subset of fields, or compute a function on fields, of the values associated with a set of keys.
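The three read APIs can be illustrated against a plain dictionary of feature values. The function names, the sample data, and the choice of a dot product as the computed function are assumptions for this sketch and should not be read as VeniceDB's client interface.

```python
from typing import Any, Dict, Iterable, List, Optional

# Hypothetical feature values keyed by member ID.
FEATURES: Dict[str, Dict[str, Any]] = {
    "member:1": {"embedding": [1.0, 0.0, 2.0], "country": "US"},
    "member:2": {"embedding": [0.5, 1.0, 0.0], "country": "DE"},
}


def single_get(key: str) -> Optional[Dict[str, Any]]:
    # Single get: the value for one key, or None if the key is absent.
    return FEATURES.get(key)


def batch_get(keys: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    # Batch get: values for every requested key that exists.
    return {k: FEATURES[k] for k in keys if k in FEATURES}


def read_compute_project(keys: Iterable[str], fields: List[str]) -> Dict[str, Dict[str, Any]]:
    # Read compute (projection): return only the requested fields per key.
    return {k: {f: FEATURES[k][f] for f in fields} for k in keys if k in FEATURES}


def read_compute_dot(keys: Iterable[str], field: str, query: List[float]) -> Dict[str, float]:
    # Read compute (function): dot product of a stored vector with a query
    # vector, so only one number per key is returned instead of the vector.
    return {
        k: sum(a * b for a, b in zip(FEATURES[k][field], query))
        for k in keys
        if k in FEATURES
    }


assert single_get("member:1")["country"] == "US"
assert read_compute_project(["member:2"], ["country"]) == {"member:2": {"country": "DE"}}
assert read_compute_dot(["member:1", "member:2"], "embedding", [1.0, 1.0, 1.0]) == {
    "member:1": 3.0,
    "member:2": 1.5,
}
```

The appeal of read compute in this sketch is that the projection or computation happens next to the data, so a caller that only needs one field or one score does not have to transfer whole values.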
This combination of flexible write granularities and low-latency reads makes VeniceDB well suited to feature persistence. AI applications write the outputs of their ML training into VeniceDB, which then serves that data to online inference workloads, and it remains a central part of LinkedIn’s AI/ML platform.