Intro

ReadStat is a C library used to read and write various formats common in statistical analyses. I was first exposed to (and mostly use) it through R's haven library for interacting with SAS data, a common format in insurance and pharma. I am also a frequent user of dbplyr to analyze data living in (or at least being served from) a database and thought would be an interesting learning exercise to combine these topics through creating a postgres foreign data wrapper for ReadStat.

Postgres comes with a generic file foreign data wrapper (filefdw) that can serve data from any format that can be read with the COPY FROM command, e.g. CSV. ReadStat ships with a converter to produce SAS (or other supported filetype) data as CSV. Wiring these together through a file fdw should allow for a quick first implementation of serving SAS data through postgres.

Implementation

Dockerfile setup

This postgres Dockerfile clones and installs ReadStat for command line use.

FROM postgres:16
RUN apt-get update && \
    # readstat build deps
    apt-get -y install make git autotools-dev libtool gettext && \
    git clone https://github.com/WizardMac/ReadStat.git
WORKDIR /ReadStat
RUN git checkout v1.1.9
RUN ./autogen.sh && ./configure --prefix /usr/local
RUN make && make install
RUN echo "/usr/local/lib/" >> /etc/ld.so.conf 
RUN ldconfig

Build the image.

docker build . -t readstatfilefdw:latest

Sample Data

I'll use the mtcars dataset to test things out and demonstrate the command line output of ReadStat.

tibble::as_tibble(mtcars, rownames = "model") |>
  haven::write_sas("data/mtcars.sas7bdat")

Converting data and printing to stdout:

readstat data/mtcars.sas7bdat - | head -n 6
model mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6.0 160.0 110.0 3.9 2.62 16.46 0.0 1.0 4.0 4.0
Mazda RX4 Wag 21.0 6.0 160.0 110.0 3.9 2.875 17.02 0.0 1.0 4.0 4.0
Datsun 710 22.8 4.0 108.0 93.0 3.85 2.32 18.61 1.0 1.0 4.0 1.0
Hornet 4 Drive 21.4 6.0 258.0 110.0 3.08 3.215 19.44 1.0 0.0 3.0 1.0
Hornet Sportabout 18.7 8.0 360.0 175.0 3.15 3.44 17.02 0.0 0.0 3.0 2.0

Setting up the FDW

Use the default postgres user and password, as well as mouting my local data folder as a volume into the container:

docker container run -e POSTGRES_PASSWORD=password -i -p 5432:5432 -t -v ${PWD}/data/:/data/ readstatfilefdw:latest

Creating the extension and defining a table with appropriate data types is as follows:

CREATE EXTENSION file_fdw;

CREATE SERVER readstat_file_fdw FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE mtcars (
    model text,
    mpg numeric,
    cyl numeric,
    disp numeric,
    hp numeric,
    drat numeric,
    wt numeric,
    qsec numeric,
    vs numeric,
    am numeric,
    gear numeric,
    carb numeric
)
SERVER readstat_file_fdw OPTIONS (
    PROGRAM 'readstat /data/mtcars.sas7bdat -',
    FORMAT 'csv',
    HEADER 'on'
);
CREATE FOREIGN TABLE

Note that the PROGRAM option is used to tell postgres to call the readstat cli tool against the mtcars dataset. This means that, even though the fdw is called a "file" fdw, it can operate on the results produced by arbitrary command line tools. Here readstat /data/mtcars.sas7bdat - converts the sas7bdat dataset and prints results to STDOUT.

Using the FDW

Now we can treat the external mtcars.sas7bdat dataset as a regular SQL table.

SELECT * FROM mtcars LIMIT 5;
model mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.000000 6.000000 160.000000 110.000000 3.900000 2.620000 16.460000 0.000000 1.000000 4.000000 4.000000
Mazda RX4 Wag 21.000000 6.000000 160.000000 110.000000 3.900000 2.875000 17.020000 0.000000 1.000000 4.000000 4.000000
Datsun 710 22.800000 4.000000 108.000000 93.000000 3.850000 2.320000 18.610000 1.000000 1.000000 4.000000 1.000000
Hornet 4 Drive 21.400000 6.000000 258.000000 110.000000 3.080000 3.215000 19.440000 1.000000 0.000000 3.000000 1.000000
Hornet Sportabout 18.700000 8.000000 360.000000 175.000000 3.150000 3.440000 17.020000 0.000000 0.000000 3.000000 2.000000
EXPLAIN ANALYZE SELECT * FROM mtcars LIMIT 5;
QUERY PLAN
Limit (cost=0.00..0.75 rows=5 width=384) (actual time=8.098..8.122 rows=5 loops=1)
-> Foreign Scan on mtcars (cost=0.00..30.10 rows=201 width=384) (actual time=8.096..8.118 rows=5 loops=1)
Foreign Program: readstat /data/mtcars.sas7bdat -
Planning Time: 0.842 ms
Execution Time: 8.861 ms
SELECT cyl, count(*), AVG(hp)
FROM mtcars
GROUP BY cyl
cyl count avg
4.000000 11 82.6363636363636364
6.000000 7 122.2857142857142857
8.000000 14 209.2142857142857143

Summary

Going through this exercise I've learned the built in file_fdw of postgres allows one to get up and running very quickly with any dataset that is supported by COPY or a command line tool that can produce such a dataset, as ReadStat does.

For further work, I'd like to investigate the performance of this filefdw on a much larger dataset. I suspect that performance will degrade on large datasets since they must be loaded entirely into memory before postgres may operate on them. Ideally, things like LIMIT and WHERE filters could be pushed down to the ReadStat level, which I don't think is possible in the implementation here.