Exploring correlations in large datasets with SQL in Node.js
How quickly can we uncover interesting insights by executing analytics queries with the DuckDB embedded SQL engine, using just Node.js?
--
Project Setup
Set up a new Node.js package:
$ mkdir nyc-taxi-correlations && cd nyc-taxi-correlations
$ yarn init -y
Add basic dependencies:
$ yarn add -D typescript ts-node nodemon
Add DuckDB, the star of the show, using DeepCrawl’s node-duckdb:
$ yarn add node-duckdb
✨ Done in 11.66s.
Download data files from the TLC Trip Record Data page into a data directory, and start coding (using VSCode):
$ mkdir data src
$ code .
Create the basic boilerplate in src/run.ts and start it:
$ yarn nodemon -e ts --exec ts-node src/run
Now, as we save changes to the code, it will re-run and output to the terminal.
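The boilerplate itself can stay minimal. Here's a sketch of what src/run.ts might contain, based on node-duckdb's DuckDB and Connection classes (check the library's README for the exact API in your version):

import { Connection, DuckDB } from "node-duckdb";

async function run(): Promise<void> {
  // An in-memory database is enough for ad-hoc analytics
  const db = new DuckDB();
  const connection = new Connection(db);

  // Placeholder query; our real queries go here
  const result = await connection.executeIterator("SELECT 1 as ready");
  console.log(result.fetchRow());

  // Release the native resources when done
  connection.close();
  db.close();
}

run().catch(console.error);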
Getting to Work
The Correlation Function
A quick search brought me back to the Pearson Correlation Coefficient, which we learned about in Stats 101. It takes values between -1 and 1, where 0 indicates no linear correlation between the variables, and -1 or 1 indicates a perfect negative or positive correlation.
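In its population form, the coefficient is simply the covariance of the two variables divided by the product of their standard deviations:

r = cov(X, Y) / (σ(X) · σ(Y))

which maps one-to-one onto DuckDB's covar_pop and stddev_pop aggregate functions.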
I put together a TypeScript function for rendering the calculation:
const r = (x: string, y: string) =>
`covar_pop(${x}, ${y}) / (stddev_pop(${x}) * stddev_pop(${y}))`;
Used in our SELECT statement:
const result = await connection.executeIterator(
  `SELECT count(),
      ${r("tip_amount", "trip_distance")} as tip_distance
    FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')`
);
Which results in:
{
  'count()': 7667792n,
  tip_distance: 0.539269014536253
}
So, it seems there's at least some positive correlation between tip amount and trip distance, albeit not a particularly strong one. I then added calculations for the correlation of tip with fare (excl. tip) and with number of passengers, the latter of which clearly has a very weak correlation.
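The extra expressions slot straight into the SELECT list. A sketch, assuming the standard TLC column names fare_amount and passenger_count for the fare and passenger columns:

const result = await connection.executeIterator(
  `SELECT count(),
      ${r("tip_amount", "trip_distance")} as tip_distance,
      ${r("tip_amount", "fare_amount")} as tip_fare,
      ${r("tip_amount", "passenger_count")} as tip_passengers
    FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')`
);

This added the following to the output: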
...
tip_fare: 0.026611783378787134,
tip_passengers: 0.001088787150935159
...
Dataset Performance
We ran this on a single CSV file representing 7.7m trips in one month, and it took about 5 seconds. What would happen if we added more CSVs?
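(The rough timings here come from wrapping each query with Node's built-in console.time/console.timeEnd:)

console.time("query");
const result = await connection.executeIterator(query); // query: the SQL string from above
console.log(result.fetchRow());
console.timeEnd("query"); // logs the elapsed time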
With one more CSV:
...
FROM (
  SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')
  UNION
  SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-02.csv')
) as trips
We get:
{
  'count()': 14687122n,
  tip_distance: 0.5327676857033936,
  tip_fare: 0.027202256434925805,
  tip_passengers: 0.0008883969400021263
}
That covers about 14.7m rows and 1.337GB of CSV data, read and calculated in 17.5s, and the correlation coefficients (r-values) still look fairly consistent with what we saw before.
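Each additional month is just one more branch of the UNION, so rather than hand-writing them, a small helper can build the subquery from a list of files (unionCsvs is a hypothetical helper, not part of node-duckdb):

// Builds a UNION of read_csv_auto() selects for the given CSV paths
const unionCsvs = (files: string[]): string =>
  files
    .map((f) => `SELECT * FROM read_csv_auto('${f}')`)
    .join(" UNION ");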
Not the Whole Story
After adding 4 more CSVs, now covering January-June, we notice an inconsistency with the previous r-values:
{
  'count()': 44459089n,
  tip_distance: 0.032890025637241575,
  tip_fare: 0.6334371687025606,
  tip_passengers: -0.0001533928725559738
}
Where did this difference come from? Perhaps it makes sense to group the data by month.
...
WHERE extract('month' FROM tpep_pickup_datetime) < 7
GROUP BY extract('month' FROM tpep_pickup_datetime)
(We had to add the WHERE clause because, once grouped, we noticed that the dataset contains some odd data from other months too.)
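Putting it together, the grouped query might look like this (a sketch, reusing the hypothetical unionCsvs helper from above; the month expression also goes into the SELECT list so that it appears in the output):

const month = "extract('month' FROM tpep_pickup_datetime)";
const files = [
  "data/yellow_tripdata_2019-01.csv",
  // ...one entry per downloaded month
];
const result = await connection.executeIterator(
  `SELECT ${month} as month,
      count(),
      ${r("tip_amount", "trip_distance")} as tip_distance,
      ${r("tip_amount", "fare_amount")} as tip_fare,
      ${r("tip_amount", "passenger_count")} as tip_passengers
    FROM (${unionCsvs(files)}) as trips
    WHERE ${month} < 7
    GROUP BY ${month}
    ORDER BY month`
);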
This results in consistent r-values for January, February and April, but with bigger differences for March, May and June:
...
{
  month: 3n,
  'count()': 7832325n,
  tip_distance: 0.028248351449362007,
  tip_fare: 0.9713848900417407,
  tip_passengers: -0.0001662104560847138
},
...
{
  month: 5n,
  'count()': 7565359n,
  tip_distance: 0.5258277351357648,
  tip_fare: 0.5614612919769764,
  tip_passengers: -0.002357654989447674
},
{
  month: 6n,
  'count()': 6940628n,
  tip_distance: 0.11414505917858382,
  tip_fare: 0.06219687943806697,
  tip_passengers: -0.001546597640316774
}
...
What happened with the data in these months? Could tipping behaviour really have been so different, with an r-value of nearly 1 for tipping by fare? I suggest we explore this further…
Code is available at https://github.com/jupiter/nyc-taxi-correlations/