Programmers’ preferences for package names

Author

Dheepak Krishnamurthy

Published

July 29, 2022

Keywords

python, julia, rust, R

Are there trends in how package names are chosen in various programming ecosystems? Do package authors favor names that alliterate with the name of the programming language? Let’s venture to find out.

First, let’s install the packages we need.

Code
using Pkg
Pkg.activate(@__DIR__)
# Pkg.add("Plots")
# Pkg.add("StatsPlots")
# Pkg.add("DataStructures")
# Pkg.add("HTTP")
# Pkg.add("JSON3")
# Pkg.add("DataFrames")
# Pkg.add("CSV")
# Pkg.add("CodecZlib")
# Pkg.add("Tar")

We can bucket the package names by their starting letter, count the number of packages in each bucket, and plot each bucket's share of the total, i.e. a relative frequency plot.
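
As a small worked example of the bucketing idea, here is a minimal sketch that uses only `Base`. The helper name `first_letter_shares` and the sample names are hypothetical and chosen only for illustration; the post's own implementation follows.

```julia
# Count how often each (lowercased) first letter occurs, then normalize
# the counts so that the shares sum to 1.
function first_letter_shares(items)
    counts = Dict{Char,Int}()
    for item in items
        letter = lowercase(first(strip(item)))
        counts[letter] = get(counts, letter, 0) + 1
    end
    total = sum(values(counts))
    Dict(letter => n / total for (letter, n) in counts)
end

# Hypothetical sample names, used only to illustrate the bucketing.
sample = ["pandas", "Plots", "requests", "polars"]
shares = first_letter_shares(sample)
```

Three of the four sample names start with "p" (case-insensitively), so `shares['p']` is 0.75 and `shares['r']` is 0.25.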

Code
using Plots
using DataStructures
using HTTP
using Markdown

function get_buckets(items)
  buckets = DefaultDict(0)
  items = strip.(items)
  for item in items
    buckets[lowercase(first(item))] += 1
  end
  total = sum(values(buckets))
  for (k, v) in buckets
    buckets[k] = v / total
  end
  (buckets, total)
end

function frequency_plot((buckets, total); lang, kind="packages")
  names = sort(collect(keys(buckets)))
  colors = DefaultDict("grey")
  percent = DefaultDict("")
  starting_letter = first(lowercase(lang))
  if kind == "packages"
    colors[starting_letter] = "orange"
    for (k, v) in buckets
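      # difference to the English word list, in percentage points
      p = round((buckets[k] - WORD_BUCKETS[k]) * 100, digits=1)
      # a negative number already prints its own minus sign, so only add "+"
      percent[k] = "\n($(signbit(p) ? "" : "+")$(p)%)"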
    end
  end
  ax = bar([buckets[n] for n in names], xticks=(1:length(names), names), fillcolor=[colors[n] for n in names], size=(1600, 1000), legend=false, yaxis=false)
  annotate!(1:length(names), [buckets[n] + (1 / (kind == "packages" ? 350 : 500)) for n in names], [("$(round(buckets[n] * 100, digits=1))%$(percent[n])", 8) for n in names])
  title!("Frequency of $kind in $lang (Total: $total)")

  summary = if kind == "packages"
    """
    The difference in percent of names of $lang packages starting with "$starting_letter" and words in the English language starting with "$starting_letter" is $(replace(strip(percent[starting_letter]), ")" => "", "(" => "")).
    """
  else
    ""
  end
  (ax, summary)
end

nothing
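
The labels in `frequency_plot` report the gap between two shares in percentage points. A minimal standalone sketch of that formatting is shown below; `signed_gap` is a hypothetical helper name that does not appear in the code above.

```julia
# Format the gap between two shares (fractions between 0 and 1) as a
# signed number of percentage points.
function signed_gap(package_share, word_share)
    gap = round((package_share - word_share) * 100, digits=1)
    # Negative numbers already print a minus sign, so only "+" is added.
    prefix = signbit(gap) ? "" : "+"
    "$(prefix)$(gap)%"
end

above = signed_gap(0.25, 0.125)  # packages over-represented for this letter
below = signed_gap(0.125, 0.25)  # packages under-represented for this letter
```

Adding the sign explicitly only for non-negative values keeps a negative gap from being printed with two minus signs.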

English

For a reference case, let’s plot the distribution of starting letters of words in the English language, using the list in /usr/share/dict/words on macOS 12.5.

Code
words = readlines("/usr/share/dict/words")
WORD_BUCKETS, WORD_TOTAL = get_buckets(words)
ax, summary = frequency_plot((WORD_BUCKETS, WORD_TOTAL), lang="/usr/share/dict/words", kind="words")
display(ax)

Python

For Python, we can fetch the package index of PyPI from https://pypi.org/simple and read the name of every package from the text of the links on that page.

Code
r = HTTP.get("https://pypi.org/simple")
data = String(r.body)
lines = strip.(split(data, "\n"));
links = filter(startswith("<a href=\""), lines); # filter all the lines that start with a link
packages = first.(match.(r">(.*)</a>", links)); # get the contents of these links, using a regex match
packages = filter(name -> isletter(first(name)), packages); # get only packages that start with a letter.

PYTHON_BUCKETS, PYTHON_TOTAL = get_buckets(packages)
ax, summary = frequency_plot((PYTHON_BUCKETS, PYTHON_TOTAL), lang="Python")
display(ax)
display("text/markdown", summary)

The difference in percent of names of Python packages starting with p and words in the English language starting with p is +3.1%.

Personally, I’m surprised this difference isn’t higher.
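
The link extraction used for PyPI can be tried offline on a small fragment. The HTML below is a hypothetical excerpt written in the style of the index page, not real output, and the pattern is a slightly stricter variant that stops at the first `<` instead of matching greedily.

```julia
# A hypothetical fragment in the style of the PyPI "simple" index.
html = """
<!DOCTYPE html>
<html>
  <body>
    <a href="/simple/numpy/">numpy</a>
    <a href="/simple/0-core-client/">0-core-client</a>
    <a href="/simple/Pillow/">Pillow</a>
  </body>
</html>
"""

lines = strip.(split(html, "\n"))
# Keep only the lines that start with a link.
links = filter(startswith("<a href=\""), lines)
# [^<]* stops at the first "<", so the capture cannot run past the link text.
link_texts = [String(match(r">([^<]*)</a>", link)[1]) for link in links]
# Drop names that do not start with a letter, as in the post.
letter_names = filter(name -> isletter(first(name)), link_texts)
```

Here `letter_names` keeps "numpy" and "Pillow" and drops "0-core-client", which starts with a digit.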

Julia

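When you install a package, Julia downloads the General registry into your home directory. The registry groups packages into one directory per starting letter, so we can walk it and keep only the directories one level below those letter directories to collect the names of all the packages in the registry (skipping the binary wrapper packages whose names end in _jll).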

Code
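general_folder = expanduser("~/.julia/registries/General")
packages = String[] # start from an empty list so the PyPI names from the previous section are not counted again
for (root, folders, files) in walkdir(general_folder)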
  for folder in folders
    if length(folder) > 1 && length(split(replace(root, general_folder => ""), "/")) == 2 && !endswith(folder, "_jll")
      push!(packages, folder)
    end
  end
end

JULIA_BUCKETS, JULIA_TOTAL = get_buckets(packages)
ax, summary = frequency_plot((JULIA_BUCKETS, JULIA_TOTAL), lang="Julia", kind="packages")
display(ax)
display("text/markdown", summary)

The difference in percent of names of Julia packages starting with j and words in the English language starting with j is +0.9%.
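
Walking the directory assumes the registry is unpacked on disk; newer Julia versions may instead keep it as a compressed archive. As an alternative sketch, the registry's `Registry.toml` lists every package in a `[packages]` table, which the `TOML` standard library can parse. The excerpt below is hypothetical: the UUIDs are placeholders and only three entries are shown.

```julia
using TOML  # standard library since Julia 1.6

# A hypothetical excerpt in the shape of a registry's Registry.toml;
# the UUIDs are placeholders, not real package UUIDs.
registry_toml = """
name = "General"

[packages]
aaaaaaaa-0000-0000-0000-000000000001 = { name = "Plots", path = "P/Plots" }
aaaaaaaa-0000-0000-0000-000000000002 = { name = "JSON3", path = "J/JSON3" }
aaaaaaaa-0000-0000-0000-000000000003 = { name = "Zlib_jll", path = "jll/Z/Zlib_jll" }
"""

registry = TOML.parse(registry_toml)
# Each entry maps a UUID to a small table holding the name and path.
pkg_names = sort([entry["name"] for entry in values(registry["packages"])])
# Skip the binary wrapper packages, as the directory walk does.
julia_packages = filter(name -> !endswith(name, "_jll"), pkg_names)
```

For a real registry, the same parsing would be applied to the contents of its `Registry.toml` file, and `julia_packages` could then be passed to `get_buckets`.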

Rust

https://crates.io conveniently has a data-access page that links to the latest database dump, which contains a CSV file with the names of all the packages.

Code
using DataFrames
using CSV
using Tar
using CodecZlib
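tmp = tempname()
download("https://static.crates.io/db-dump.tar.gz", tmp) # fetch the latest database dump
folder = open(tmp) do file
  Tar.extract(GzipDecompressorStream(file)) # unpack into a temporary directory and return its path
end
# the code expects exactly one top-level directory in the archive, holding data/crates.csv
filename = joinpath(folder, only(readdir(folder)), "data", "crates.csv")
packages = DataFrame(CSV.File(filename))[!, :name]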

RUST_BUCKETS, RUST_TOTAL = get_buckets(packages)
ax, summary = frequency_plot((RUST_BUCKETS, RUST_TOTAL), lang="Rust")
display(ax)
display("text/markdown", summary)

The difference in percent of names of Rust packages starting with r and words in the English language starting with r is +3.6%.

R

For R, similar to Python, we can parse the HTML from https://cran.r-project.org/web/packages/available_packages_by_name.html:

Code
r = HTTP.get("https://cran.r-project.org/web/packages/available_packages_by_name.html")
data = String(r.body)
lines = split(data, "\n")
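links = filter(line -> startswith(line, "<td><a href=\""), lines) # keep the table rows that start with a package link
# these lines start with <td>, so anchor the pattern on the <a> tag and capture only the link text
packages = first.(match.(r"<a href=[^>]*>(?:<[^>]+>)*([^<]+)", links))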
packages = filter(name -> isletter(first(name)), packages)

R_BUCKETS, R_TOTAL = get_buckets(packages)
ax, summary = frequency_plot((R_BUCKETS, R_TOTAL), lang="R")
display(ax)
display("text/markdown", summary)