Procella: Some notes

Procella: Unifying serving and analytical data at YouTube

Google recently published an article on Procella, the query engine they built to handle analytic workloads for YouTube. Procella is built to address four reporting needs that are typically handled by different tools:

  • reporting and dashboarding
  • embedded statistics in pages
  • time-series monitoring
  • ad-hoc analysis

I definitely recommend checking out the article if databases interest you. A high-level summary and my notes to follow!

Summary:

Procella is a lambda-architecture database that supports both high-throughput/low-latency (sub-second) query processing and lower-throughput/higher-latency ad-hoc and reporting queries. Procella does this by leveraging a novel file format (Artus), aggressively caching metadata + physical data in memory, and adaptively optimizing queries.

The Artus file format stores column-level metadata in its file header, allowing predicates to be pushed all the way down to the filestore. Artus files have the same in-memory and on-disk representation, so deserialization is cheap + quick.

Procella is built such that the whole database can be served from memory, but with datasets like Google’s, typically only 2% of data resides in memory at a given time. Intelligent caching techniques + “process affinity” allow Procella to have a 99+% file handle cache hit rate, and a 90% cache hit rate for data in memory. All told, this allows Procella to execute and return query responses with millisecond latencies.

Finally, Procella has a dedicated “statistics-serving” mode in which the database is optimized for serving up stats very quickly, as in the View Count for YouTube videos.

Notes:

  • at scale, worker nodes fail and must be managed
    • tasks are executed on shared resources with imperfect isolation
    • task performance is unpredictable
    • failures + misbehaving tasks need to be accounted for
  • storage:
    • tables are stored across files (“tablets” / partitions)
    • columnar format: Artus
  • metadata:
    • does not use b-trees
    • instead:
      • zone maps
      • bitmaps
      • bloom filters (!)
      • sort + partition keys
    • this metadata is used in query planning to prune tablets (sketch below)
      • info stored in spanner/bigtable
      • also comes from file headers
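
To make the pruning idea concrete, here is a minimal Python sketch (mine, not Procella's code) of how per-tablet zone maps and bloom filters can rule tablets out before any data is read. The `BloomFilter`, `ColumnStats`, and `TabletMetadata` shapes are illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass


class BloomFilter:
    """Tiny bloom filter: probabilistic "might contain" checks, no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))


@dataclass
class ColumnStats:
    min_value: object        # zone map: smallest value of this column in the tablet
    max_value: object        # zone map: largest value of this column in the tablet
    bloom: BloomFilter       # membership check for equality predicates


@dataclass
class TabletMetadata:
    path: str                # where the tablet lives in the filestore
    columns: dict            # column name -> ColumnStats


def prune_tablets(tablets, column, value):
    """Keep only tablets that might contain rows where column == value."""
    survivors = []
    for tablet in tablets:
        stats = tablet.columns[column]
        if not (stats.min_value <= value <= stats.max_value):
            continue                              # zone map rules the tablet out
        if not stats.bloom.might_contain(value):
            continue                              # bloom filter rules it out
        survivors.append(tablet)                  # only these files get read
    return survivors


bloom = BloomFilter()
bloom.add("abc")
tablet = TabletMetadata("videos/tablet-0001",
                        {"video_id": ColumnStats("aaa", "zzz", bloom)})
print([t.path for t in prune_tablets([tablet], "video_id", "abc")])
```
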
    • table management
      • create/alter/drop ddl
      • for streaming tables:
        • expire / “age out” old data
        • downsampling / compacting
        • replace streamed data with batch data in a regular cadence
    • batch ingestion
      • ddl to register files with the database
      • the Registry Server maps tables to files (+ “secondary structures”; ie. metadata)
      • data isn’t scanned at registration time unless metadata is missing
        • metadata is processed lazily/out-of-band
    • realtime ingest
      • hit the ingestion service with rpc calls or pubsub streams
      • sends data to write-ahead log on colossus
      • data is temporarily stored in memory until it can be durably stored
      • queries will hit both memory and durable filestore as needed (combining results!)
      • query data with sub-second latency from time of ingest!
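
Here is a rough sketch of that combining step, assuming a made-up IngestionServer with an in-memory buffer, a stand-in for the write-ahead log, and a set of flushed tablets; none of these names come from the paper.

```python
class IngestionServer:
    """Toy lambda-style ingest path: recent rows stay in memory until flushed."""

    def __init__(self):
        self.wal = []               # stand-in for the write-ahead log on Colossus
        self.memtable = []          # recent rows, queryable immediately
        self.durable_tablets = []   # rows that have been flushed to durable storage

    def ingest(self, row):
        self.wal.append(row)        # durability first
        self.memtable.append(row)   # then visible to queries right away

    def flush(self):
        """Periodically move buffered rows into a durable tablet."""
        if self.memtable:
            self.durable_tablets.append(list(self.memtable))
            self.memtable.clear()

    def query(self, predicate):
        """Queries see both flushed tablets and the in-memory buffer."""
        for tablet in self.durable_tablets:
            yield from (row for row in tablet if predicate(row))
        yield from (row for row in self.memtable if predicate(row))


server = IngestionServer()
server.ingest({"video_id": "abc", "views": 1})
server.flush()
server.ingest({"video_id": "abc", "views": 2})
# Fresh rows are queryable immediately because results are combined at query time.
print(sum(r["views"] for r in server.query(lambda r: r["video_id"] == "abc")))  # 3
```
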
    • compaction
      • re-partition streamed-in datapoints into larger blocks
      • user-supplied SQL can filter/aggregate during compaction
      • once compacted, the metadata service is updated w/ new files
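
A small sketch of compaction, with Python callables standing in for the user-supplied SQL filter/aggregation (the paper gives no API details, so this is purely illustrative).

```python
from collections import defaultdict


def compact(small_tablets, keep=lambda row: True, group_key=None, agg=None):
    """Merge many small streamed tablets into one larger tablet.

    keep:      optional filter applied during compaction (rows dropped for good)
    group_key: optional grouping function used for downsampling
    agg:       reducer applied to each group when group_key is given
    """
    rows = [row for tablet in small_tablets for row in tablet if keep(row)]
    if group_key is None:
        return rows
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    return [agg(key, members) for key, members in groups.items()]


# Example: roll per-event rows up into per-video counts, dropping test traffic.
tablets = [
    [{"video_id": "abc", "is_test": False}, {"video_id": "abc", "is_test": True}],
    [{"video_id": "xyz", "is_test": False}],
]
compacted = compact(
    tablets,
    keep=lambda r: not r["is_test"],
    group_key=lambda r: r["video_id"],
    agg=lambda vid, rows: {"video_id": vid, "views": len(rows)},
)
# After compaction, the metadata service would be pointed at the new, larger file.
print(compacted)
```
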
    • running queries:
      • uses Metadata Server to prune files to read from filestore
      • builds a tree:
        • nodes: query blocks
        • edges: data streams (aggregate, execute remotely)
      • column encodings support pushing execution down to the file format
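
A toy version of that plan tree: nodes are query blocks, edges are streams of small row blocks wired together with generators, so nothing is fully materialized between operators. The operator names are mine, not Procella's.

```python
def scan(tablet, block_size=2):
    """Leaf query block: stream a tablet in small blocks (think cache-sized)."""
    for i in range(0, len(tablet), block_size):
        yield tablet[i:i + block_size]


def filter_blocks(blocks, predicate):
    """Intermediate query block: filter each block as it streams past."""
    for block in blocks:
        yield [row for row in block if predicate(row)]


def aggregate(blocks):
    """Root query block: fold the streamed blocks into a single result."""
    total = 0
    for block in blocks:
        total += sum(row["views"] for row in block)
    return total


tablet = [{"video_id": "abc", "views": v} for v in (1, 2, 3, 4)]
# scan -> filter -> aggregate, connected by generators (the "data streams").
print(aggregate(filter_blocks(scan(tablet), lambda r: r["views"] > 1)))  # 9
```
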
    • optimizations
      • aggressive caching in memory
        • metadata
        • file handles
        • data (using Artus) has same format in-memory or on-disk
        • source files are immutable, so caching is easy
      • affinity scheduling (sketch below)
        • data servers store subset (LRU) of data in memory (cache)
        • tasks are routed to data servers to maximize cache hits
        • this impacts high-availability
          • if a worker goes down, another worker can still serve the request
          • the fallback worker takes a cache miss and must re-read the data + metadata from the store
      • if there is enough memory, procella will act as an in-memory db
        • typical use: 2% of data fits in memory
        • BUT:
          • 99+% file handle cache hit rate
          • 90% data cache hit rate
        • so, not limited by memory, but can certainly take advantage of it
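
A minimal sketch of affinity scheduling, assuming tablets map to a preferred data server via a deterministic (rendezvous-style) hash and each server keeps an LRU cache; if the preferred server is down, the request falls back to the next one and eats the cache miss.

```python
import hashlib
from collections import OrderedDict


class DataServer:
    def __init__(self, name, cache_size=128):
        self.name = name
        self.alive = True
        self.cache = OrderedDict()   # LRU cache: tablet path -> cached contents
        self.cache_size = cache_size

    def read(self, tablet):
        if tablet in self.cache:                # cache hit: serve from memory
            self.cache.move_to_end(tablet)
            return self.cache[tablet]
        data = f"<contents of {tablet}>"        # cache miss: pretend to hit the filestore
        self.cache[tablet] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)      # evict the least recently used tablet
        return data


def route(tablet, servers):
    """Always prefer the same server for a given tablet to maximize cache hits."""
    ranked = sorted(
        servers,
        key=lambda s: hashlib.sha256(f"{tablet}:{s.name}".encode()).hexdigest(),
    )
    for server in ranked:     # affinity is loose: fall back when the preferred
        if server.alive:      # server is down, trading a cache miss for availability
            return server
    raise RuntimeError("no data servers available")


servers = [DataServer("ds-1"), DataServer("ds-2"), DataServer("ds-3")]
print(route("videos/2019-08-01/tablet-0007", servers).name)
```
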
    • Data format (Artus)
      • uses custom encodings, not something generic like LZO/LZW
      • encoded/compressed data is usable without needing to be decompressed
      • does multi-pass encoding:
        • like explain analyze but w/o needing to set encodings a-priori?
        • data characteristics inform the optimal encoding
      • similar list of encodings to Redshift (sketch below):
        • dict lookup (low cardinality)
        • run-length (the next 7 records are “drew”)
        • delta (change from prev record)
      • nested/repeated records are “pruned”
        • some sort of novel tree data structure for nested/repeated records (different from BQ?)
        • sparse values do not take up space in file format
      • rich file (and column) header information:
        • schema
        • sorting
        • min/max values
        • encoding info
        • BLOOM FILTERS!!!
        • important: this supports pruning files without reading any data
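
Here is a sketch of the "let the data pick the encoding" idea: one pass gathers simple stats, then a heuristic chooses dictionary, run-length, or delta encoding. The thresholds are made up; Artus's adaptive encoder is certainly more sophisticated than this.

```python
def rle_encode(values):
    """Run-length: ("drew", 7) instead of "drew" seven times."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def dict_encode(values):
    """Dictionary: a small lookup table plus integer codes (good for low cardinality)."""
    table = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return table, [table[v] for v in values]


def delta_encode(values):
    """Delta: store the change from the previous record (good for timestamps, counters)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def choose_encoding(values):
    """First pass gathers data characteristics; those pick the encoding."""
    distinct = len(set(values))
    runs = len(rle_encode(values))
    if runs <= len(values) // 4:
        return "run_length"
    if distinct <= len(values) // 10:
        return "dictionary"
    if all(isinstance(v, int) for v in values):
        return "delta"
    return "plain"


print(choose_encoding(["drew"] * 7 + ["sam"] * 5))   # run_length
print(choose_encoding(list(range(1000, 2000))))      # delta
```
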
    • Evaluation engine
      • most databases: use LLVM to compile the query down to machine code
        • compilation time becomes a bottleneck for small, low-latency queries
      • lots of complex C++ nonsense
        • make block sizes fit in L1 cache memory
        • templating / metaprogramming to “avoid virtual call overheads”
          • confusing to me: templates are compile-time, wouldn’t this be slower?
      • no intermediate representations of queries are materialized (!?)
    • partitioning and indexing:
      • multi-level partitioning + clustering
        • eg: partition on date, cluster by multiple dimensions
        • dim tables: partition+sort on dim key
        • supports tablet (partition) pruning and avoids “shuffling”
        • unclear to me how similar this is to BQ/Redshift
          • sounds like partitions can be non-dates!
      • unsure what the distinction between partitioning/sorting/clustering is in practice
    • distributed joins (strategy-selection sketch below)
      • broadcast
      • co-partitioned: used if fact and dim tables are partitioned on the same key
      • shuffle: both sides are large, no shared partition key
      • pipelined: cool!
        • used if RHS is a complex query with an expected small result set
        • RHS is evaluated first, then inlined and sent to LHS shards
      • remote lookup:
        • RHS (dim table) is large, but partitioned on the join key
        • LHS (fact table) is not partitioned on join key
        • RPC to RHS data servers to get values for the join
          • unclear to me how this execution differs from something like a broadcast join?
      • tail latency
        • if a data server takes longer than expected (longer than the median!), the master will invoke the same RPC call on another server
        • master node batches + rate-limits queries to data servers to avoid overloading them
        • priority is attached to RPC calls so data servers can schedule work according to query priority
          • different threads for high-priority vs. low-priority queries
          • big/slow queries are lower priority so they don’t hold up small/fast queries
            • a version of “workload management” that actually works, lol
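
A rough decision sketch for choosing among these join strategies; the TableInfo shape, the size threshold, and the precedence order are my assumptions rather than anything specified in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    rows: int
    partition_key: Optional[str] = None  # key the table's tablets are partitioned on
    is_subquery: bool = False            # side is a complex subquery, not a base table


SMALL = 100_000  # made-up threshold for "fits comfortably on every shard"


def pick_join_strategy(lhs, rhs, join_key):
    if rhs.rows <= SMALL and not rhs.is_subquery:
        return "broadcast"        # ship the small side to every shard
    if lhs.partition_key == join_key and rhs.partition_key == join_key:
        return "co-partitioned"   # matching tablets join locally, no shuffle needed
    if rhs.is_subquery and rhs.rows <= SMALL:
        return "pipelined"        # evaluate RHS first, inline the result into LHS shards
    if rhs.partition_key == join_key:
        return "remote lookup"    # RPC the RHS data servers for just the keys needed
    return "shuffle"              # both sides large and unaligned: repartition both


fact = TableInfo(rows=10**9, partition_key="date")
dims = TableInfo(rows=10**7, partition_key="video_id")
print(pick_join_strategy(fact, dims, join_key="video_id"))  # remote lookup
```
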
    • query optimization
      • procella uses virtual tables (like materialized views?)
        • generates multiple aggregates of the underlying base table
        • chooses the right aggregate at query-time
        • supports:
          • index-aware aggregate selection (?)
          • stitched queries:
            • combine tables in union all queries
          • lambda architecture awareness:
            • combine batch + streaming data
          • join awareness: virtual tables pre-join datasets?
      • Note: this was light on details, unclear to me if user-managed or system-managed optimizations?
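
Here is a sketch of what aggregate selection over virtual tables could look like, under the assumed rule that a pre-aggregated table can answer a query if its group-by columns are a superset of the query's and it carries the needed measures; the planner then picks the smallest match.

```python
from dataclasses import dataclass


@dataclass
class VirtualTable:
    name: str
    group_by: frozenset   # dimensions this aggregate is rolled up by
    measures: frozenset   # measures it carries
    rows: int             # rough size, used to prefer the cheapest match


def choose_virtual_table(tables, query_dims, query_measures):
    candidates = [
        t for t in tables
        if query_dims <= t.group_by and query_measures <= t.measures
    ]
    if not candidates:
        return None       # fall back to the base table
    return min(candidates, key=lambda t: t.rows)


tables = [
    VirtualTable("base", frozenset({"video_id", "country", "date"}),
                 frozenset({"views", "watch_time"}), rows=10**10),
    VirtualTable("daily_by_country", frozenset({"country", "date"}),
                 frozenset({"views"}), rows=10**6),
]
best = choose_virtual_table(tables, frozenset({"country"}), frozenset({"views"}))
print(best.name)  # daily_by_country: the smallest aggregate that can answer the query
```
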
  • Query optimizer:
    • static + adaptive optimization techniques
      • static: rule based optimizer, push down logic, logical rewrites
      • adaptive: as the query is running, collect a sample of data and use it to tune physical operators (sketch below)
        • eg. change join type (see above) based on cardinality, quantiles, etc
      • defer statistic collection to execution time, not ingestion-time
        • hard to collect stats on high-volume streaming data
        • this happens during the shuffle step, which acts as a “collection site” for the data
      • adaptive optimization can 2x query time (for stats collection) on small queries
        • low latency queries bypass adaptive optimization
          • users can hint the optimizer
        • much more beneficial on large queries
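
A sketch of the adaptive path: sample rows as they pass through the shuffle, derive rough stats, and only then commit to a physical join strategy and partition count; low-latency queries skip the extra pass entirely. Names and thresholds are assumptions.

```python
import random


def collect_stats(rows, sample_rate=0.01):
    """Pretend "collection site" during the shuffle: stats from a small sample."""
    sample = [row for row in rows if random.random() < sample_rate]
    return {
        # scale the sampled row count back up: crude, but cheap to compute
        "estimated_rows": int(len(sample) / sample_rate),
        # distinct join keys seen in the sample (a lower bound on the true count)
        "distinct_keys_in_sample": len({row["video_id"] for row in sample}),
    }


def choose_physical_plan(stats, low_latency=False):
    """Pick the physical operator only after seeing real data characteristics."""
    if low_latency:
        return "broadcast", 1     # low-latency queries bypass the adaptive pass
    if stats["estimated_rows"] < 1_000_000:
        return "broadcast", 1     # small build side: ship it to every shard
    # otherwise shuffle, sizing the number of partitions from the estimate
    partitions = max(16, stats["estimated_rows"] // 1_000_000)
    return "shuffle", partitions


rows = [{"video_id": f"v{i % 500}"} for i in range(200_000)]
stats = collect_stats(rows)
print(stats, choose_physical_plan(stats))
```
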
  • data ingestion
    • optimal query processing requires suitably partitioned/clustered/sorted files
      • procella provides an offline file generation tool
        • generate Artus files from source data in MapReduce pipelines
        • most useful for teams that want to import petabytes of historical data
      • not required; procella can ingest non-Artus files too (which types?)
    • serving embedded statistics
      • special “stats-serving” mode for these types of queries (sketch below)
        • eg. viewcount on youtube videos
        • when new data is registered, the data servers are notified immediately
          • avoids lazy-loading of data on first request
        • no remote disk access in serving mode
      • metadata is cached and updated asynchronously
      • query plans are aggressively cached
      • expensive optimizations (like adaptive joins + shuffles) are disabled
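
A sketch of the stats-serving registration flow, assuming a made-up registry that pushes new tablets to serving nodes so the data is warm before the first request instead of being lazily loaded.

```python
class StatsServingNode:
    """Serving node that keeps registered stats fully in memory (no remote reads)."""

    def __init__(self):
        self.view_counts = {}   # video_id -> count, always resident in memory

    def preload(self, tablet):
        """Called at registration time, so there is no lazy-load stall on first query."""
        self.view_counts.update(tablet)

    def view_count(self, video_id):
        return self.view_counts.get(video_id, 0)   # pure in-memory lookup


class Registry:
    """When new data lands, notify every serving node immediately."""

    def __init__(self, nodes):
        self.nodes = nodes

    def register(self, tablet):
        for node in self.nodes:
            node.preload(tablet)


nodes = [StatsServingNode(), StatsServingNode()]
registry = Registry(nodes)
registry.register({"abc": 1_000_001, "xyz": 37})
print(nodes[0].view_count("abc"))   # served from memory, no remote disk access
```
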
  • showing off:
    • over 1bn queries / day
    • 10s of petabytes queried per day
    • 10s of thousands of unique query patterns
    • end-to-end delay (ingest to analysis) is < 1 minute
    • query response times are < 1s for ~1bn rows scanned
  • future work:
    • make it easier for (google?) users to adopt Procella
      • tooling, isolation, quota mgmt, docs, monitoring
    • performance, efficiency, scale
      • improve optimizer
      • better adaptive techniques
    • enhanced SQL support
      • full text search
      • time-series
      • geospatial
      • graph queries