Tackling Heterogeneity in Apache Arrow Value Vectors Using Shapeless
Shapeless is a generic programming library for Scala, based on type classes and dependent types, that helps us tackle heterogeneity in a collection of objects and so cut boilerplate in our libraries. In this blog post we will see how we can leverage Shapeless to tackle heterogeneity in Apache Arrow value vectors, saving us a ton of boilerplate in a peculiar use case involving those vectors.
To start with, let's briefly visit Apache Arrow. The Apache Arrow docs define it as:
A language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
The key phrase for the purpose of this blog post is "memory format". Apache Arrow attempts to standardize a set of in-memory data structures, together with public APIs to interact with them, for analytics across all analytics platforms. This means that instead of serializing and deserializing data to and from platform-specific data structures, platforms can now work on and share results in the Apache Arrow memory format itself, thereby saving on serde costs.
In Apache Arrow, the ValueVector abstraction is used to store a sequence of values of the same type in an individual column. Say the records of your dataset are composed of integers, strings and booleans; the equivalent Arrow columnar representation would then be a VectorSchemaRoot composed of an IntVector, a VarCharVector and a BitVector. Each of these value vectors exposes APIs to allocate buffers, get data, set data and so forth.
Where you really start to get into trouble is when you attempt to generalize polymorphically over these value vectors, because they do not always share a common interface that declares these APIs. For instance, the allocateNew() API is declared in the ValueVector interface and implemented in the BaseVariableWidthVector and BaseFixedWidthVector base classes, from which VarCharVector and IntVector, respectively, inherit it. But there is no common interface that declares the get(int index) API. Each value vector declares and defines this API in its own concrete class, because each "get by row" returns a value of a different type. As we can see in the contrived example below, we are able to invoke the allocateNew() API polymorphically for IntVector and VarCharVector, because allocateNew() is declared in the common ValueVector interface. We can't do the same with get(int index) and similar APIs.
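The point about allocateNew() versus get(int index) can be sketched roughly as follows. This is a minimal illustration and not the original listing from the post: the object name PolymorphicAllocate, the helper allocateAll and the column names are hypothetical, and it assumes the Arrow Java arrow-vector artifact plus one of the Arrow memory implementations on the classpath.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, ValueVector, VarCharVector}

// Hypothetical example object; names are illustrative only.
object PolymorphicAllocate {
  // Works for any mix of vectors because allocateNew() is declared on ValueVector.
  def allocateAll(vectors: Seq[ValueVector]): Unit =
    vectors.foreach(_.allocateNew())

  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vectors: List[ValueVector] = List(
      new IntVector("id", allocator),
      new VarCharVector("name", allocator)
    )

    allocateAll(vectors)

    // The next line would not compile: ValueVector declares no get(int),
    // it only exists on the concrete vector classes.
    // vectors.foreach(v => println(v.get(0)))

    // Release the buffers before closing the allocator.
    vectors.foreach(_.close())
    allocator.close()
  }
}
```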
This is probably not a typical use case for value vectors: they are columnar in nature, so ideally you would want to work on each vector independently. But say you need to translate a set of values to vectors and vice versa on a record-by-record basis. In that case you would want to invoke the value vector APIs polymorphically to cut down on boilerplate. Let me elaborate on what I mean by that. Say you have a list of values List(1000, "shivamka1", true) and would like to iterate over it, setting each value in its respective vector. You would like to do this in a polymorphic fashion, as shown below. See lines 20-25.
The code snippet above, however, does not compile, because value vectors don't share a common interface that declares all the APIs. What you would end up doing instead is invoking the value vector APIs separately for each vector type, as shown below, to achieve the same result.
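A rough sketch of that per-type approach, for the record List(1000, "shivamka1", true), might look like this. It is a reconstruction under assumptions rather than the post's original code: VerboseRecordWrite, writeRecord and the column names are hypothetical, and the same Arrow dependencies as before are assumed.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{BitVector, IntVector, VarCharVector}

// Hypothetical example object; names are illustrative only.
object VerboseRecordWrite {
  // One hand-written call per vector type; nothing here can be shared
  // with a record that has a different shape.
  def writeRecord(row: Int, id: Int, name: String, active: Boolean,
                  ids: IntVector, names: VarCharVector, actives: BitVector): Unit = {
    ids.setSafe(row, id)
    names.setSafe(row, name.getBytes(UTF_8))
    actives.setSafe(row, if (active) 1 else 0)
    ids.setValueCount(row + 1)
    names.setValueCount(row + 1)
    actives.setValueCount(row + 1)
  }

  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val ids = new IntVector("id", allocator)
    val names = new VarCharVector("name", allocator)
    val actives = new BitVector("active", allocator)

    // Allocation is also spelled out vector by vector.
    ids.allocateNew()
    names.allocateNew()
    actives.allocateNew()

    writeRecord(0, 1000, "shivamka1", true, ids, names, actives)

    // Reading back needs a different get for each concrete type as well.
    println(ids.get(0))
    println(new String(names.get(0), UTF_8))
    println(actives.get(0) == 1)

    ids.close(); names.close(); actives.close()
    allocator.close()
  }
}
```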
This is clearly too verbose, and it doesn't allow us to generalize over different sets of value vectors representing different record types. This is exactly where Shapeless comes to our rescue. Shapeless has an abstraction called HList that allows us to piece heterogeneous objects together, and a type class called Generic that provides automatic mapping back and forth between ADTs and HLists. We can use these abstractions to define custom type classes and derivatives (type class instances) that piece together not only heterogeneous objects but also heterogeneous functionality. Let me explain what I mean by that. We already know that each value vector defines how to allocate new buffers, set values in the vector, and so on. What we can do is implement one derivative that converts our ADT to an HList, and further derivatives that piece together the results of invoking the vector API on each HList element.
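A minimal sketch of such a type class, assuming Scala 2 with Shapeless 2.3.x, could look like the following. The instance names follow the ones used in this post, but the bodies are a reconstruction and may differ from the original code; the private helper create is hypothetical.

```scala
import org.apache.arrow.vector.{BitVector, IntVector, VarCharVector}
import shapeless.{::, Generic, HList, HNil}

// Type class: "something whose buffers can be allocated".
trait AllocateNew[A] {
  def allocateNew(a: A): Unit
}

object AllocateNew {
  // Summoner, so that AllocateNew[T] resolves the implicit instance for T.
  def apply[A](implicit instance: AllocateNew[A]): AllocateNew[A] = instance

  // Hypothetical helper to cut down on anonymous class noise.
  private def create[A](f: A => Unit): AllocateNew[A] =
    new AllocateNew[A] { def allocateNew(a: A): Unit = f(a) }

  // Leaf instances: each one knows its own concrete vector class.
  implicit val intVectorAllocateNew: AllocateNew[IntVector] =
    create(_.allocateNew())
  implicit val varCharVectorAllocateNew: AllocateNew[VarCharVector] =
    create(_.allocateNew())
  implicit val bitVectorAllocateNew: AllocateNew[BitVector] =
    create(_.allocateNew())

  // Base case of the recursion: nothing left to allocate.
  implicit val hNilAllocateNew: AllocateNew[HNil] = create(_ => ())

  // Inductive case: allocate the head, then recurse into the tail.
  implicit def hListAllocateNew[H, T <: HList](
      implicit head: AllocateNew[H], tail: AllocateNew[T]): AllocateNew[H :: T] =
    create { case h :: t =>
      head.allocateNew(h)
      tail.allocateNew(t)
    }

  // Any ADT with an HList representation: convert it, then delegate.
  implicit def genericAllocateNew[A, R <: HList](
      implicit gen: Generic.Aux[A, R], repr: AllocateNew[R]): AllocateNew[A] =
    create(a => repr.allocateNew(gen.to(a)))
}
```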
Here, genericAllocateNew translates an ADT to an HList and invokes the allocateNew() API on the resulting HList. The hListAllocateNew derivative invokes the same API on each HList element, and finally there are derivatives such as intVectorAllocateNew that know how to invoke this API on each concrete value vector. We can define similar type classes and derivatives to fetch values from vectors for a given row index and piece the results together into our ADTs. See the rest of the code here on GitHub.
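As an illustration of what such a "fetch by row" type class might look like, here is a hedged sketch. It is not the code from the linked repository: GetValue, its instances, the create helper, the User and UserVectors case classes and ReadRowExample are all hypothetical names, and Scala 2 with Shapeless 2.3.x and the Arrow Java libraries is assumed. The type member Out records the type each vector returns, so the per-column results can be pieced together into an HList and then into a case class.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{BitVector, IntVector, VarCharVector}
import shapeless.{::, Generic, HList, HNil}

// Hypothetical type class: read the value(s) stored at a row index.
trait GetValue[V] {
  type Out
  def get(vectors: V, row: Int): Out
}

object GetValue {
  type Aux[V, O] = GetValue[V] { type Out = O }

  def apply[V](implicit instance: GetValue[V]): Aux[V, instance.Out] = instance

  private def create[V, O](f: (V, Int) => O): Aux[V, O] = new GetValue[V] {
    type Out = O
    def get(vectors: V, row: Int): O = f(vectors, row)
  }

  // Leaf instances: each concrete vector has its own get with its own return type.
  implicit val intVectorGet: Aux[IntVector, Int] =
    create[IntVector, Int]((v, row) => v.get(row))
  implicit val varCharVectorGet: Aux[VarCharVector, String] =
    create[VarCharVector, String]((v, row) => new String(v.get(row), UTF_8))
  implicit val bitVectorGet: Aux[BitVector, Boolean] =
    create[BitVector, Boolean]((v, row) => v.get(row) == 1)

  implicit val hNilGet: Aux[HNil, HNil] = create[HNil, HNil]((_, _) => HNil)

  // Read the head column, then the remaining columns, and cons the results.
  implicit def hListGet[H, T <: HList, HO, TO <: HList](
      implicit head: Aux[H, HO], tail: Aux[T, TO]): Aux[H :: T, HO :: TO] =
    create[H :: T, HO :: TO]((v, row) => head.get(v.head, row) :: tail.get(v.tail, row))

  // Case class of vectors: go through its HList representation.
  implicit def genericGet[A, R <: HList, O <: HList](
      implicit gen: Generic.Aux[A, R], repr: Aux[R, O]): Aux[A, O] =
    create[A, O]((a, row) => repr.get(gen.to(a), row))
}

final case class UserVectors(id: IntVector, name: VarCharVector, active: BitVector)
final case class User(id: Int, name: String, active: Boolean)

object ReadRowExample {
  // The HList of column values has the same shape as User, so Generic can rebuild it.
  def readUser(vectors: UserVectors, row: Int): User =
    Generic[User].from(GetValue[UserVectors].get(vectors, row))

  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vectors = UserVectors(
      new IntVector("id", allocator),
      new VarCharVector("name", allocator),
      new BitVector("active", allocator))
    vectors.id.allocateNew(); vectors.name.allocateNew(); vectors.active.allocateNew()
    vectors.id.setSafe(0, 1000)
    vectors.name.setSafe(0, "shivamka1".getBytes(UTF_8))
    vectors.active.setSafe(0, 1)
    vectors.id.setValueCount(1); vectors.name.setValueCount(1); vectors.active.setValueCount(1)

    println(readUser(vectors, 0))

    vectors.id.close(); vectors.name.close(); vectors.active.close()
    allocator.close()
  }
}
```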
In the following snippet, we define the ADT UserVectors and summon the type class derivative by calling the companion object's apply method, for example AllocateNew[UserVectors]. This triggers the compiler to perform implicit resolution and bring in all the other dependent derivatives required to compose the result, which in this case is allocating buffers for all the value vectors. This approach allows us to generalize over records of different types while cutting down on verbosity.
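The summoning step can be sketched as below. To keep the snippet compilable on its own, it restates a condensed version of the AllocateNew type class described earlier; the field names of UserVectors, the create helper and the SummonExample object are hypothetical, and Scala 2 with Shapeless 2.3.x and the Arrow Java libraries is assumed.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{BitVector, IntVector, VarCharVector}
import shapeless.{::, Generic, HList, HNil}

// Condensed restatement of the type class so this snippet stands alone.
trait AllocateNew[A] { def allocateNew(a: A): Unit }

object AllocateNew {
  def apply[A](implicit instance: AllocateNew[A]): AllocateNew[A] = instance

  private def create[A](f: A => Unit): AllocateNew[A] =
    new AllocateNew[A] { def allocateNew(a: A): Unit = f(a) }

  implicit val intVectorAllocateNew: AllocateNew[IntVector] = create(_.allocateNew())
  implicit val varCharVectorAllocateNew: AllocateNew[VarCharVector] = create(_.allocateNew())
  implicit val bitVectorAllocateNew: AllocateNew[BitVector] = create(_.allocateNew())
  implicit val hNilAllocateNew: AllocateNew[HNil] = create(_ => ())

  implicit def hListAllocateNew[H, T <: HList](
      implicit head: AllocateNew[H], tail: AllocateNew[T]): AllocateNew[H :: T] =
    create { case h :: t => head.allocateNew(h); tail.allocateNew(t) }

  implicit def genericAllocateNew[A, R <: HList](
      implicit gen: Generic.Aux[A, R], repr: AllocateNew[R]): AllocateNew[A] =
    create(a => repr.allocateNew(gen.to(a)))
}

// The ADT: one value vector per column of a "user" record.
final case class UserVectors(id: IntVector, name: VarCharVector, active: BitVector)

object SummonExample {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val user = UserVectors(
      new IntVector("id", allocator),
      new VarCharVector("name", allocator),
      new BitVector("active", allocator))

    // Implicit resolution composes genericAllocateNew, hListAllocateNew
    // and the three leaf instances into a single AllocateNew[UserVectors].
    AllocateNew[UserVectors].allocateNew(user)

    println(user.id.getValueCapacity > 0)

    user.id.close(); user.name.close(); user.active.close()
    allocator.close()
  }
}
```

A different record shape only needs its own case class of vectors; no further allocation code has to be written as long as a leaf instance exists for every vector type it contains.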
That's all for this post. Hope it helped! Do let me know if you need any clarifications.