Featured in Haskell Weekly issue 456
I am a firm believer in the purely functional programming approach, embodied by the Haskell programming language. While the Haskell community is not huge, it is large enough that I can work on most domains.
Most domains, but not all; data science remains hard to work on in Haskell. Since data science grew out not from software engineering, but from students and scientists, the best data science tools are found in other communities such as R and Python. If we focus further on machine-learning and “““AI”“” (ಠ_ಠ), then the distribution of high-quality tools is even more concentrated in the Python community.
I have started exploring what it would look like to build a Haskell-centric data science workflow more than a year ago, with the implementation of a Series data structure. While this was perfect for my use-case at the time (I wrote about it here and here), the typical data scientist is used to columnar the data structure known as the dataframe.
Recently, an effort to design a dataframe interface in Haskell has been spearheaded by Michael Chavinda, with a focus on exploratory data science. This effort trades type safety for easier interactivity, similar to Python’s pandas DataFrames.
In this blog post, I want to explore a different design tradeoff: what if one were to instead focus on type-safe expressiveness, with no regards to interactivity? What would such a dataframe interface look like?
The design below is based on some intermediate type-level shenanigans. I was inspired by the approach that the Beam SQL project took, based on higher-kinded types.
Let’s get imports out of the way:
{-# LANGUAGE DefaultSignatures #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeFamilies #-}
module HKTGenerics where
-- from `base`
import Data.Functor.Identity (Identity(..))
import Data.Kind (Type)
import GHC.Generics
-- from `vector`
import Data.Vector (Vector)
import qualified Data.Vector
Let’s consider an example: we want to represent a set of people of some sort. We would normally represent one person using a record type like so:
data SimpleUser
= MkSimpleUser { simpleUserFirstName :: String
simpleUserLastName :: String
, simpleUserAge :: Int
, }
Now, a dataframe of such users should be equivalent to:
data FrameUser
= MkFrameUser { frameUserFirstName :: Vector String
frameUserLastName :: Vector String
, frameUserAge :: Vector Int
, }
where each record is a Vector
(similar to arrays in other languages), representing a column. This is the main draw of dataframes:
data is stored in columns, rather than simply e.g. Vector SimpleUser
. This is not always best, but I will assume that the user
has knowledge of the tradeoffs.
The structures of SimpleUser
and FrameUser
are extremely similar. We can use higher-kinded types to unify them by introducing a
type parameter f
which represents the container for values:
data HKTUser f
= MkHKTUser { hktUserFirstName :: f String
hktUserLastName :: f String
, hktUserAge :: f Int
, }
Here, f
has type f :: Type -> Type
, just like the Vector
type constructor. HKTUser
is called a higher-kinded type, because
unlike a type like SimpleUser
, which has kind Type
, HKTUser
isn’t a type, but a type constructor. I won’t go into more detail
than this on higher-kinded types; consider watching the Haskell Unfolder episode on the subject.
Using this method of representing users, we can represent SimpleUser
as HKTUser Identity
, and FrameUser
as HKTUser Vector
.
Here, Identity
is the trivial container.
One more thing before we move on. While Identity
is a trivial functor, it still adds some overhead; HKTUser Identity
and SimpleUser
aren’t exactly equivalent. We can optimize this overhead away using a type family:
type family Column (f :: Type -> Type) x where
Column Identity x = x -- Optimization
Column Vector x = Vector x
Finally, we can unify the representations of SimpleUser
and FrameUser
:
data User f
= MkUser { userFirstName :: Column f String
userLastName :: Column f String
, userAge :: Column f Int
, }
The Column
type family can be thought of a type-level function: Column f a
is a type that depends on f
. To be clearer, let’s
create type synonyms:
type Row (dt :: (Type -> Type) -> Type) = dt Identity
type Frame (dt :: (Type -> Type) -> Type) = dt Vector
Using these synonyms, Row User
is equivalent to SimpleUser
. while Frame User
is equivalent to FrameUser
. We can operate on the
columns of dataframes easily, like so:
-- | Returns the longest first name out of a dataframe of users.
-- If the dataframe is empty, returns the empty string.
longestFirstName :: Frame User -> String
= Data.Vector.foldl' longest mempty . userFirstName
longestFirstName where
longest :: String -> String -> String
= if length x >= length y then x else y longest x y
With longestFirstName
, we can glimpse the performance advantage of using dataframes: the userFirstName
field is really
an array, and so finding the longest first name is an operation on an array of string rather than an array of User
.
How can we build a dataframe? We can turn rows of, well, Row User
into a single Frame User
like so:
buildUserFrame :: Vector (Row User) -> Frame User
buildUserFrame vs = MkUser { userFirstName = Data.Vector.map userFirstName vs
= Data.Vector.map userLastName vs
, userLastName = Data.Vector.map userAge vs
, userAge }
This is a little tedious; for every type of dataframe, we need to write our own dataframe construction function!
Can we write a function like Vector (Row t) -> Frame t
, which works for any t
? Yes we can.
Enter generics
I want to thank Li-yao Xia (Lysxia on the Haskell Discourse) for helping me figure out how to do what you’re about to read!
If you squint, every type we would want to turn into a dataframe has the same structure: a record type where every record
is either Vector a
or Identity a
(nesting dataframes is out of scope for today). We can provide functionality for any
suitable record type like that using Haskell generics1.
In Haskell, the Generic
typeclass has nothing to do with “generics” in other programming language. “Generic” programming in Haskell is done
with ad-hoc polymorphism (i.e. typeclasses). Instead, the Generic
typeclass in Haskell is used to transform any datatype t
into
a generic representation, called Rep t
, which can be used define functions which work over a large class
of types. Specifically, in our case, we want to create a function Vector (Row t) -> Frame t
which works
for any higher-kinded record type t
like User
.
There are many explanations of Haskell’s Generic
, such as Mark Karpov’s Generics explained
blog post. The wrinkle in our dataframe problem is that types such as User
are higher-kinded, and therefore some more care
is required.
Let’s define our problem. We want to create a typeclass FromRows
with a method, fromRows
:
class FromRows t where
fromRows :: Vector (Row t) -> Frame t
We also want to provide a default definition of fromRows
such that downstream users don’t have to manually write instances
of FromRows
:
fromRows :: (???) => Vector (Row t) -> Frame t
default= ??? fromRows
How can we write the default implementation of fromRows
?
The key concept to remember is that Generic
only works with types of kind Type
, i.e. Row User
but not User
. For every
higher-kinded type t
(like User
), we care about two concrete types: Row t
and Frame t
. Therefore, we need to index our
typeclass on both concrete types at once.
Enough word salad:
class GFromRows r -- intended to be `Row`-like
-- intended to be `Frame`-like
f where
gfromRows :: Vector (r a) -> (f a)
We need to provide instances relating to some (but not all)
generic constructs, including M1
(which is always required), K1
, and (:*:)
.
We start with the generic metadata type, M1
:
instance GFromRows r f => GFromRows (M1 i c r) (M1 i c f) where
gfromRows :: Vector (M1 i c r a) -> M1 i c f a
= M1 . gfromRows . Data.Vector.map unM1 gfromRows
Then move on to :*:
:
instance ( GFromRows r1 f1
GFromRows r2 f2
,
)=> GFromRows (r1 :*: r2) (f1 :*: f2) where
= let (xs, ys) = Data.Vector.unzip
gfromRows vs $ Data.Vector.map (\(x :*: y) -> (x, y)) vs
in gfromRows xs :*: gfromRows ys
Finally, onto the representation of fields of constructors, K1
:
instance GFromRows (K1 i r) (K1 i f) where
= K1 . Data.Vector.map unK1 gfromRows
Note that the above will not compile. For the instances of M1
and :*:
, we assumed that r
and f
already had an instance
of GFromRows
. In the instance for K1
, the compiler does not know about the relationship between r
and f
yet. In
some sense, the instance involving the representation K1
is the foundation on which other instances are defined.
We can refine our instance by enforcing that f
be an array of r
using (f ~ Vector r)
:
instance (f ~ Vector r) => GFromRows (K1 i r) (K1 i f) where
gfromRows :: Vector (K1 i r a) -> K1 i f a
= K1 . Data.Vector.map unK1 gfromRows
We can now fill in the default implementation of fromRows
:
class FromRows t where
fromRows :: Vector (Row t) -> Frame t
fromRows :: ( Generic (Row t)
defaultGeneric (Frame t)
, GFromRows (Rep (Row t)) (Rep (Frame t))
,
) => Vector (Row t)
-> Frame t
= to -- Turn `Rep (Frame t)` back into `Frame t`
fromRows . gfromRows -- Vector (Rep (Row t)) -> Rep (Frame t)
. Data.Vector.map from -- Turn every row into a `Reo (Row t)`
How can we use this? Behold:
data User f
= MkUser { userFirstName :: Column f String
userLastName :: Column f String
, userAge :: Column f Int
,
}deriving (Generic) -- This is new
What’s new here is that we need to derive a Generic
instance for User
. Then, the instance of FromRows User
requires
no method implementation:
instance FromRows User
Then, we can re-implement buildUserFrame
trivially:
buildUserFrame :: Vector (Row User) -> Frame User
= fromRows buildUserFrame
Further work: nesting dataframes
The implementation above works, but we assumed that every record type had fields of the form Column f a
.
How about nesting dataframes types? Consider this:
data Address f
= MkAddress { addressCivicNumber :: Column f Int
addressStreetName :: Column f String
,
}deriving (Generic)
instance FromRows Address
data Store f
= MkStore { storeName :: Column f String
storeAddress :: Address f
,
}deriving (Generic)
instance FromRows Store -- type error
Ideally, we would want this Store
type above to be equivalent to:
data Store f
= MkStore { storeName :: Column f String
-- Address gets unpacked into "adjacent" columns
addressCivicNumber :: Column f Int
, addressStreetName :: Column f String
,
}deriving (Generic)
I haven’t figured out how to do this yet, but it would be desirable.
This document is a literate Haskell module!