We Data - R and Python side-by-side for data wrangling

After seeing many language-wars-style posts about R vs. Python and the sort of comparisons being made, I realized that there aren’t many helpful side-by-sides that show you how to do x in y language (and vice versa).

I decided to see if I could contribute something to the discourse. I’m not trying to reinvent an analysis wheel; I just want to focus on how something is accomplished in one language versus the other, so I’m pulling from a few sources to have some code to translate, using the same data for both languages.

Since polars is new to me and I like learning new things, I’m using it for the Python examples.

Data

Data was obtained from the gapminder R package and written to parquet via R’s arrow::write_parquet for better interoperability between R and Python. Additionally, the file is small enough to pull the data as parquet from my GitHub repo.

packages

library(tidyverse)
library(plotly)
library(arrow, include.only = "read_parquet")
library(magrittr, include.only = "%<>%")

gapminder <- read_parquet("gapminder.parquet")

libraries

import polars as pl
import plotly.express as px

gapminder = pl.read_parquet("gapminder.parquet")

gapminder = (gapminder
  .with_columns([
    pl.col("country").cast(pl.Utf8),
    pl.col("continent").cast(pl.Utf8),
    pl.col("region").cast(pl.Utf8)  
  ])
)

return top 10 rows in R

gapminder |> head(10)

get quick info on the data with dplyr::glimpse()

gapminder |> glimpse()

Rows: 10,545
Columns: 9
$ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
$ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
$ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
$ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
$ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
$ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
$ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…

return top 10 rows in Python

gapminder.head(10)

get quick info on the data with pandas’ DataFrame.info() method

gapminder.to_pandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10545 entries, 0 to 10544
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           10545 non-null  object 
 1   year              10545 non-null  int32  
 2   infant_mortality  9092 non-null   float64
 3   life_expectancy   10545 non-null  float64
 4   fertility         10358 non-null  float64
 5   population        10360 non-null  float64
 6   gdp               7573 non-null   float64
 7   continent         10545 non-null  object 
 8   region            10545 non-null  object 
dtypes: float64(5), int32(1), object(3)
memory usage: 700.4+ KB

This will come back later, but it’s very easy to move your polars data into a pandas DataFrame.

Sri Lanka VS Turkey

simple dplyr::filter and dplyr::select

gapminder |>
  filter(year == 2015, country %in% c("Sri Lanka", "Turkey")) |>
  select(country, infant_mortality)

simple filter and select method chain

(gapminder
  .filter(
    (pl.col("year") == 2015) & 
    (pl.col("country").is_in(["Sri Lanka", "Turkey"]))) 
  .select(["country", "infant_mortality"])
)

This is where you can start to see how powerful polars can be in terms of the way it handles lazy evaluation: pl.col only describes a column expression, and polars evaluates it when a method such as filter or select runs it. One of the reasons dplyr is so expressive and intuitive (at least in my view) is due in large part to the way it handles lazy evaluation, letting you refer to columns by bare name inside a verb. People who are tired of constantly needing to refer to both the data and the column in pandas will likely rejoice at pl.col!

Let’s just compare them all at once

same strategy; more countries

gapminder |>
  filter(
    year == 2015, 
    country %in% c(
      "Sri Lanka", "Turkey", "Poland", "South Korea",
      "Malaysia", "Russia", "Pakistan", "Vietnam",
      "Thailand", "South Africa")) |>
  select(country, infant_mortality) |>
  arrange(desc(infant_mortality))

same as above

(gapminder
  .filter(
    (pl.col("year") == 2015) & 
    (pl.col("country").is_in([
      "Sri Lanka", "Turkey", "Poland", "South Korea", 
      "Malaysia", "Russia", "Pakistan", "Vietnam", 
      "Thailand", "South Africa"]))) 
  .select(["country", "infant_mortality"])
  .sort("infant_mortality", descending = True)
)

Aggregates

grouping and taking an average

gapminder |>
  group_by(continent) |>
  summarise(mean_life_expectancy = mean(life_expectancy) |>
              round(2), .groups = "keep")

now with polars

(gapminder
  .group_by("continent")
  .agg([
    (pl.col("life_expectancy")
        .mean()
        .round(2)
        .alias("mean_life_expectancy"))
    ])
  .sort("continent")
)

I think this is probably a good enough intro to how you’d generally do things. Filtering and aggregating are probably the most foundational operations, and this could already get you started in another language without as much headache.

Scatterplots

I’m trying to strike a balance between dead basic plotly plots and some things you might want to do to make them look a little more the way you want. The great thing about customizing is that you can write functions to do specific things. In some instances you can create simple functions or just save a list of values you want to recycle throughout.

+ plotly

plotly_title <- function(title, subtitle, ...) {
  return(
    list(
      text = str_glue(
        "
        <b>{title}</b>
        <sup>{subtitle}</sup>
        "),
      ...))
}

margin <- list(
  t = 95,
  r = 40,
  b = 120,
  l = 79)

gapminder |>
  filter(year == 1962) |>
  plot_ly(
    x = ~fertility, y = ~life_expectancy, 
    color = ~continent, colors = "Set2", 
    type = "scatter", mode = "markers",
    hoverinfo = "text",
    text = ~str_glue(
      "
      <b>{country}</b><br>
      Continent: <b>{continent}</b>
      Fertility: <b>{fertility}</b>
      Life Expectancy: <b>{life_expectancy}</b>
      "),
    marker = list(
      size = 7
    )) |>
  layout(
    margin = margin,
    title = plotly_title(
      title = "Scatterplot",
      subtitle = "Life expectancy by fertility",
      x = 0,
      xref = "paper")) |>
  config(displayModeBar = FALSE)

Python Plotly rendering

A quick note about having plotly work inside of the RStudio IDE: as of the time of this writing it isn’t very straightforward, i.e., not officially supported yet. The plot will open in a browser window and it’s fairly snappy. The good thing is that on the reticulate side, knitting works! So this site was able to put all this together via rmarkdown when I started this post and Quarto now that I’m finishing it (remember, a document with any R chunk will default to the knitr engine), so that’s pretty cool. We’re even using both renv and venv for the two environments in the same file.

+ plotly

def plotly_title(title, subtitle):
  return(f"<b>{title}</b><br><sup>{subtitle}</sup>")

margin = dict(
  t = 95,
  r = 40,
  b = 120,
  l = 79)
  
config = {"displayModeBar": False}

(px.scatter(
  (gapminder.filter(pl.col("year") == 1962).to_pandas()),
  x = "fertility", y = "life_expectancy", color = "continent",
  hover_name = "country",
  color_discrete_sequence = px.colors.qualitative.Set2,
  title = plotly_title(
    title = "Scatterplot", 
    subtitle = "Life expectancy by fertility"),
  opacity = .8, 
  template = "plotly_white") 
  .update_traces(
    marker = dict(
      size = 7))
  .update_layout(
    margin = margin)
).show(config = config)

plotly expects a pandas DataFrame so we’re just using .to_pandas() to give it what it wants, but that doesn’t have to stop you from adding any filtering, summarizing, or aggregating before chaining the data into your viz.

Conclusion

Hopefully this is helpful. Feel free to reach out with any feedback or questions.