After seeing many language-wars-style posts about R vs Python and the sort of comparisons being made, I realized that there aren’t many helpful side-by-sides that show you how to do x in y language (and vice versa).
I decided to see if I could contribute something to the discourse. I’m not trying to reinvent an analysis wheel; I just want to focus on how something is accomplished in one language versus the other, so I’m pulling from a few sources to have some code to translate, using the same data for both languages.
Since polars is new to me and I like learning new things, I’m using it for the examples.
Data
Data was obtained from the gapminder R package and written to parquet via R’s arrow::write_parquet for better interoperability between R and Python. Additionally, the file is small enough to pull the data as parquet from my GitHub repo.
packages
libraries
return top 10 rows in R
get quick info on the data with dplyr::glimpse()
Rows: 10,545
Columns: 9
$ country <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
$ life_expectancy <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
$ fertility <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
$ population <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
$ gdp <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
$ continent <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
$ region <fct> Southern Europe, Northern Africa, Middle Africa, Cari…
return top 10 rows in Python
get quick info on the data with pandas’ DataFrame.info() method
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10545 entries, 0 to 10544
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 10545 non-null object
1 year 10545 non-null int32
2 infant_mortality 9092 non-null float64
3 life_expectancy 10545 non-null float64
4 fertility 10358 non-null float64
5 population 10360 non-null float64
6 gdp 7573 non-null float64
7 continent 10545 non-null object
8 region 10545 non-null object
dtypes: float64(5), int32(1), object(3)
memory usage: 700.4+ KB
This will come back later, but it’s very easy to move your polars data into a pandas DataFrame.
Sri Lanka VS Turkey
simple dplyr::filter and dplyr::select
gapminder |>
  filter(year == 2015, country %in% c("Sri Lanka", "Turkey")) |>
  select(country, infant_mortality)
simple filter and select method chain
This is where you can start to see how powerful polars can be in terms of the way it handles lazy evaluation. One of the reasons dplyr is so expressive and intuitive (at least in my view) is the way it handles lazy evaluation. People who are tired of constantly needing to refer to both the data and the column in pandas will likely rejoice at polars.col!
Let’s just compare them all at once
same strategy; more countries
gapminder |>
  filter(
    year == 2015,
    country %in% c(
      "Sri Lanka", "Turkey", "Poland", "South Korea",
      "Malaysia", "Russia", "Pakistan", "Vietnam",
      "Thailand", "South Africa")) |>
  select(country, infant_mortality) |>
  arrange(desc(infant_mortality))
same as above
(gapminder
    .filter(
        (pl.col("year") == 2015) &
        (pl.col("country").is_in([
            "Sri Lanka", "Turkey", "Poland", "South Korea",
            "Malaysia", "Russia", "Pakistan", "Vietnam",
            "Thailand", "South Africa"])))
    .select(["country", "infant_mortality"])
    .sort("infant_mortality", descending = True)
)
Aggregates
grouping and taking an average
gapminder |>
  group_by(continent) |>
  summarise(
    mean_life_expectancy = mean(life_expectancy) |> round(2),
    .groups = "keep")
now with polars
I think this is probably a good enough intro to how you’d generally do things. Filtering and aggregating are probably the most foundational operations, and this could already get you started in another language without as much headache.
Scatterplots
I’m trying to strike a balance between dead basic plotly plots and some things you might want to do to make them look a little more the way you want. The great thing about customizing is that you can write functions to do specific things. In some instances you can create simple functions or just save a list of values you want to recycle throughout.
R + plotly
plotly_title <- function(title, subtitle, ...) {
  return(
    list(
      text = str_glue(
        "
        <b>{title}</b>
        <sup>{subtitle}</sup>
        "),
      ...))
}

margin <- list(
  t = 95,
  r = 40,
  b = 120,
  l = 79)

gapminder |>
  filter(year == 1962) |>
  plot_ly(
    x = ~fertility, y = ~life_expectancy,
    color = ~continent, colors = "Set2",
    type = "scatter", mode = "markers",
    hoverinfo = "text",
    text = ~str_glue(
      "
      <b>{country}</b><br>
      Continent: <b>{continent}</b>
      Fertility: <b>{fertility}</b>
      Life Expectancy: <b>{life_expectancy}</b>
      "),
    marker = list(
      size = 7
    )) |>
  layout(
    margin = margin,
    title = plotly_title(
      title = "Scatterplot",
      subtitle = "Life expectancy by fertility",
      x = 0,
      xref = "paper")) |>
  config(displayModeBar = FALSE)
A quick note about having plotly work inside of the RStudio IDE: as of the time of this writing it isn’t very straightforward, i.e., not officially supported yet. The plot will open in a browser window and it’s fairly snappy. The good thing is that on the reticulate side, knitting works! So I was able to put all this together via rmarkdown when I started this post and Quarto now that I’m finishing it (remember that any chunk will default to the knitr engine), which is pretty cool. We’re even using both renv and venv to manage the two environments in the same file.
Python + plotly
def plotly_title(title, subtitle):
    # bold title with a smaller subtitle on the next line
    return f"<b>{title}</b><br><sup>{subtitle}</sup>"

margin = dict(
    t = 95,
    r = 40,
    b = 120,
    l = 79)

config = {"displayModeBar": False}

(px.scatter(
    (gapminder.filter(pl.col("year") == 1962).to_pandas()),
    x = "fertility", y = "life_expectancy", color = "continent",
    hover_name = "country",
    color_discrete_sequence = px.colors.qualitative.Set2,
    title = plotly_title(
        title = "Scatterplot",
        subtitle = "Life expectancy by fertility"),
    opacity = .8,
    template = "plotly_white")
    .update_traces(
        marker = dict(
            size = 7))
    .update_layout(
        margin = margin)
).show(config = config)
plotly expects a pandas DataFrame, so we’re just using .to_pandas() to give it what it wants, but that doesn’t have to stop you from adding any filtering, summarizing, or aggregating before chaining the data into your viz.
Conclusion
Hopefully this is helpful. Feel free to reach out with any feedback or questions.