Intro
ReadStat is a C library for reading and writing various formats common in statistical analysis. I was first exposed to it (and mostly use it) through R's haven library for interacting with SAS data, a format common in insurance and pharma. I am also a frequent user of dbplyr for analyzing data living in (or at least served from) a database, and I thought it would be an interesting learning exercise to combine these topics by creating a postgres foreign data wrapper for ReadStat.
Postgres comes with a generic file foreign data wrapper (file_fdw) that can
serve data from any format readable by the COPY FROM command, e.g. CSV.
ReadStat ships with a command line converter that turns SAS data (or any other
supported filetype) into CSV. Wiring these together through file_fdw should
allow for a quick first implementation of serving SAS data through postgres.
Implementation
Dockerfile setup
This postgres Dockerfile clones and installs ReadStat for command line use.
FROM postgres:16

RUN apt-get update && \
    # readstat build deps
    apt-get -y install make git autotools-dev libtool gettext && \
    git clone https://github.com/WizardMac/ReadStat.git

WORKDIR /ReadStat

RUN git checkout v1.1.9
RUN ./autogen.sh && ./configure --prefix /usr/local
RUN make && make install
RUN echo "/usr/local/lib/" >> /etc/ld.so.conf
RUN ldconfig
Build the image.
docker build . -t readstatfilefdw:latest
Sample Data
I'll use the mtcars
dataset to test things out and demonstrate the
command line output of ReadStat.
tibble::as_tibble(mtcars, rownames = "model") |> haven::write_sas("data/mtcars.sas7bdat")
Converting data and printing to stdout:
readstat data/mtcars.sas7bdat - | head -n 6
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6.0 | 160.0 | 110.0 | 3.9 | 2.62 | 16.46 | 0.0 | 1.0 | 4.0 | 4.0 |
Mazda RX4 Wag | 21.0 | 6.0 | 160.0 | 110.0 | 3.9 | 2.875 | 17.02 | 0.0 | 1.0 | 4.0 | 4.0 |
Datsun 710 | 22.8 | 4.0 | 108.0 | 93.0 | 3.85 | 2.32 | 18.61 | 1.0 | 1.0 | 4.0 | 1.0 |
Hornet 4 Drive | 21.4 | 6.0 | 258.0 | 110.0 | 3.08 | 3.215 | 19.44 | 1.0 | 0.0 | 3.0 | 1.0 |
Hornet Sportabout | 18.7 | 8.0 | 360.0 | 175.0 | 3.15 | 3.44 | 17.02 | 0.0 | 0.0 | 3.0 | 2.0 |
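Although the output above is rendered as a table, readstat emits plain CSV on stdout, which is exactly what lets file_fdw consume it. A minimal sketch of parsing that output with Python's standard library (the sample text is hand-copied from the first rows of mtcars above):

```python
import csv
import io

# A few rows of readstat's CSV output, as produced by
# `readstat data/mtcars.sas7bdat -` (hand-copied sample).
sample = """model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6.0,160.0,110.0,3.9,2.62,16.46,0.0,1.0,4.0,4.0
Mazda RX4 Wag,21.0,6.0,160.0,110.0,3.9,2.875,17.02,0.0,1.0,4.0,4.0
Datsun 710,22.8,4.0,108.0,93.0,3.85,2.32,18.61,1.0,1.0,4.0,1.0
"""

# The first line is a header row, so DictReader maps column names to values.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
print(rows[0]["model"], rows[0]["mpg"])  # → Mazda RX4 21.0
```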
Setting up the FDW
I use the default postgres user and password, and mount my local data folder as a volume into the container:
docker container run -e POSTGRES_PASSWORD=password -i -p 5432:5432 -t -v ${PWD}/data/:/data/ readstatfilefdw:latest
Creating the extension and defining a table with appropriate data types is as follows:
CREATE EXTENSION file_fdw;

CREATE SERVER readstat_file_fdw FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE mtcars (
  model text,
  mpg numeric,
  cyl numeric,
  disp numeric,
  hp numeric,
  drat numeric,
  wt numeric,
  qsec numeric,
  vs numeric,
  am numeric,
  gear numeric,
  carb numeric
) SERVER readstat_file_fdw OPTIONS (
  PROGRAM 'readstat /data/mtcars.sas7bdat -',
  FORMAT 'csv',
  HEADER 'on'
);
CREATE FOREIGN TABLE
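Typing out the column list for wide SAS datasets gets tedious. A small helper can generate the DDL from a column list; the function name and its `numeric`-for-everything default mirror the table above but are my own convention, not part of file_fdw:

```python
def foreign_table_ddl(table, columns, program, server="readstat_file_fdw"):
    """Generate a CREATE FOREIGN TABLE statement for file_fdw.

    `columns` is a list of (name, sql_type) pairs; `program` is the
    command whose stdout file_fdw will parse as CSV.
    """
    col_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE FOREIGN TABLE {table} (\n  {col_defs}\n"
        f") SERVER {server} OPTIONS (\n"
        f"  PROGRAM '{program}', FORMAT 'csv', HEADER 'on'\n);"
    )

cols = [("model", "text")] + [
    (c, "numeric")
    for c in ["mpg", "cyl", "disp", "hp", "drat", "wt",
              "qsec", "vs", "am", "gear", "carb"]
]
ddl = foreign_table_ddl("mtcars", cols, "readstat /data/mtcars.sas7bdat -")
print(ddl)
```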
Note that the PROGRAM option tells postgres to call the readstat
CLI tool against the mtcars dataset. This means that, even though
the fdw is called a "file" fdw, it can operate on the output of
arbitrary command line tools. Here readstat /data/mtcars.sas7bdat -
converts the sas7bdat dataset and prints the results to stdout.
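The mechanics are simple: file_fdw spawns the command and parses its stdout as CSV rows. The same idea, sketched in Python (the Python interpreter itself stands in for the "program" so the example stays portable; substitute readstat in practice):

```python
import csv
import subprocess
import sys

# Any command that writes CSV (with a header row) to stdout can play
# the role that readstat plays for file_fdw. Here we fake one.
program = [sys.executable, "-c", "print('n'); [print(i) for i in range(10)]"]

proc = subprocess.run(program, capture_output=True, text=True, check=True)
rows = list(csv.DictReader(proc.stdout.splitlines()))
total = sum(int(r["n"]) for r in rows)
print(total)  # → 45
```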
Using the FDW
Now we can treat the external mtcars.sas7bdat
dataset as a regular
SQL table.
SELECT * FROM mtcars LIMIT 5;
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.000000 | 6.000000 | 160.000000 | 110.000000 | 3.900000 | 2.620000 | 16.460000 | 0.000000 | 1.000000 | 4.000000 | 4.000000 |
Mazda RX4 Wag | 21.000000 | 6.000000 | 160.000000 | 110.000000 | 3.900000 | 2.875000 | 17.020000 | 0.000000 | 1.000000 | 4.000000 | 4.000000 |
Datsun 710 | 22.800000 | 4.000000 | 108.000000 | 93.000000 | 3.850000 | 2.320000 | 18.610000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
Hornet 4 Drive | 21.400000 | 6.000000 | 258.000000 | 110.000000 | 3.080000 | 3.215000 | 19.440000 | 1.000000 | 0.000000 | 3.000000 | 1.000000 |
Hornet Sportabout | 18.700000 | 8.000000 | 360.000000 | 175.000000 | 3.150000 | 3.440000 | 17.020000 | 0.000000 | 0.000000 | 3.000000 | 2.000000 |
EXPLAIN ANALYZE SELECT * FROM mtcars LIMIT 5;
QUERY PLAN |
---|
Limit (cost=0.00..0.75 rows=5 width=384) (actual time=8.098..8.122 rows=5 loops=1) |
-> Foreign Scan on mtcars (cost=0.00..30.10 rows=201 width=384) (actual time=8.096..8.118 rows=5 loops=1) |
Foreign Program: readstat /data/mtcars.sas7bdat - |
Planning Time: 0.842 ms |
Execution Time: 8.861 ms |
SELECT cyl, count(*), AVG(hp) FROM mtcars GROUP BY cyl;
cyl | count | avg |
---|---|---|
4.000000 | 11 | 82.6363636363636364 |
6.000000 | 7 | 122.2857142857142857 |
8.000000 | 14 | 209.2142857142857143 |
Summary
Going through this exercise I've learned that postgres's built-in
file_fdw lets one get up and running very quickly with any dataset
supported by COPY, or with any command line tool that can produce
such a dataset, as ReadStat does.
For further work, I'd like to investigate the performance of this
file_fdw approach on a much larger dataset. I suspect that performance
will degrade on large datasets since they must be converted in full
before postgres may operate on them. Ideally, things like LIMIT and
WHERE filters could be pushed down to the ReadStat level, which I
don't think is possible in the implementation here.
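The streaming shape of a LIMIT pushdown can at least be illustrated outside postgres: read only the first few rows from the converter's pipe, then stop it early instead of draining the whole file. A sketch in Python, with a fake long-running converter standing in for readstat (all names here are my own; this is not how file_fdw behaves today, as far as I can tell):

```python
import subprocess
import sys

# A fake converter that would emit a million lines if left alone.
program = [
    sys.executable, "-c",
    "for i in range(1_000_000):\n"
    "    print(i)\n",
]

proc = subprocess.Popen(program, stdout=subprocess.PIPE, text=True)

# "LIMIT 5": consume only the first five lines of the stream...
first_five = [next(proc.stdout).strip() for _ in range(5)]

# ...then stop the converter early instead of reading the rest.
proc.terminate()
proc.wait()
print(first_five)  # → ['0', '1', '2', '3', '4']
```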