Revealing a world of hidden dependencies with Libraries.io

A couple of weeks ago, we announced that Tidelift had joined forces with Libraries.io to make open source software work better for developers and users.

Libraries.io has done a lot of amazing things (many of which Havoc already wrote about), but one of our favorites has been their open data releases, such as last week’s release of the largest publicly available dataset of open source software packages in the world!

This dataset is unique in how well it helps us understand the inner workings of the open source universe, but a couple of aspects really stand out for me.


Mapping Open Source Dependencies

A huge portion of the Libraries.io effort has centered on tracking package dependencies, and as a result they’ve created an enormous map of millions of dependency relationships across all of open source. With this, we can not only see and analyze the range of dependencies on any given package, but also follow each of those dependencies down into their own dependencies.

Exploring these hidden dependencies (also called transitive or nested dependencies) in open source is really hard, but it’s incredibly important for solving many issues that plague the ecosystem, namely three big ones: licensing, security, and versioning.
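To make the idea of following dependencies "down into their own dependencies" concrete, here is a minimal sketch in Python. The package names and the `DIRECT_DEPS` map are made up for illustration, and the `transitive_dependencies` helper is hypothetical; it is not how Libraries.io stores or walks its data, just one simple way to traverse a dependency graph.

```python
from collections import deque

# Hypothetical direct-dependency map (illustrative names, not real data).
DIRECT_DEPS = {
    "my-app": ["web-framework", "logger"],
    "web-framework": ["http-parser", "template-engine"],
    "template-engine": ["string-utils"],
    "logger": ["string-utils"],
    "http-parser": [],
    "string-utils": [],
}


def transitive_dependencies(package, direct_deps):
    """Return every package reachable from `package`, excluding itself."""
    seen = set()
    queue = deque(direct_deps.get(package, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        queue.extend(direct_deps.get(dep, []))
    seen.discard(package)  # a cycle could lead back to the starting package
    return seen


all_deps = transitive_dependencies("my-app", DIRECT_DEPS)
# "Hidden" dependencies are the ones the app never declared directly.
hidden = all_deps - set(DIRECT_DEPS["my-app"])
print(sorted(hidden))  # ['http-parser', 'string-utils', 'template-engine']
```

In this toy graph, the application declares only two dependencies but actually relies on five packages, which is why licensing, security, and versioning questions have to be asked of the whole tree rather than just the top level.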

By mapping the dependencies between packages, Libraries.io was able to create a stat called “dependent repositories count,” which does exactly what it sounds like: it looks at a given application-level package and counts the total number of code repositories that require that package as a dependency.
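As a rough illustration of what a count like this involves, the sketch below tallies how many repositories list each package as a dependency. The repository names, the `REPO_MANIFESTS` data, and the `dependent_repositories_count` function are all hypothetical; the real Libraries.io pipeline is not described here and is certainly more involved.

```python
from collections import Counter

# Hypothetical data: each repository mapped to the packages it declares.
REPO_MANIFESTS = {
    "org/shop-frontend": {"express", "lodash"},
    "org/shop-api": {"express", "body-parser"},
    "someone/blog": {"express"},
    "someone/cli-tool": {"lodash"},
}


def dependent_repositories_count(manifests):
    """Count, for each package, how many repositories depend on it."""
    counts = Counter()
    for packages in manifests.values():
        # set() ensures a repository is counted at most once per package.
        counts.update(set(packages))
    return counts


counts = dependent_repositories_count(REPO_MANIFESTS)
for package, n in counts.most_common():
    print(package, n)
```

With this sample data, `express` is required by three repositories, `lodash` by two, and `body-parser` by one.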

This might seem straightforward, but in reality dependent repositories count is perhaps the single best measure of the popularity of an open source package. Unlike some existing metrics (downloads, GitHub stars and forks), which are essentially non-decreasing (the total count almost always increases or stays flat), the dependent repositories count is an active measurement that can go up and down based on present-day usage.
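The difference between a cumulative metric and an active one can be sketched with two made-up snapshots. Everything below (the snapshot data, the download figures, and the `count_dependents` helper) is invented for illustration and does not come from the Libraries.io dataset.

```python
from itertools import accumulate

# Hypothetical manifests at two points in time (illustrative only).
EARLIER = {"repo-a": {"old-lib"}, "repo-b": {"old-lib"}, "repo-c": {"old-lib"}}
LATER = {"repo-a": {"new-lib"}, "repo-b": {"old-lib"}, "repo-c": {"new-lib"}}

# Hypothetical downloads of "old-lib" during each period.
DOWNLOADS_PER_PERIOD = [1000, 200]


def count_dependents(manifests, package):
    """Number of repositories that currently declare `package`."""
    return sum(1 for packages in manifests.values() if package in packages)


# A cumulative total can only grow, even as real usage falls away.
cumulative_downloads = list(accumulate(DOWNLOADS_PER_PERIOD))
print(cumulative_downloads)  # [1000, 1200]

# The dependent count reflects the present: it drops when repos move on.
dependents = [count_dependents(snap, "old-lib") for snap in (EARLIER, LATER)]
print(dependents)  # [3, 1]
```

Here the download total keeps climbing while two of the three repositories have already switched to a different package, and only the dependent count shows it.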

Why is this important?

There are a couple of key reasons why this really matters.

The first is that stats such as downloads, stars, and forks don’t tell you how many developers are actually using a piece of software; just because someone downloaded or liked it doesn’t mean it’s running in their application.

The second is that dependent repositories count is the only one of these metrics that actively responds to the community’s preferences: it is the only measure that will decrease if the community stops using a package. This is incredibly powerful! It leverages the collective knowledge of open source developers across the globe, letting their choices determine which packages are the most critically interconnected.

What this looks like in practice

Here’s a real-world example. Below, I’ve included a table of the top 10 most-depended-upon packages in four popular open source languages: JavaScript, Python, Ruby, and PHP.

Of particular interest is the mix of packages that are the most used in their respective languages: we see some large, comprehensive frameworks (express, Django, rails, phpunit), but also a lot of smaller parsers and utilities. What’s more, many of these packages would be overlooked by other attention metrics.

Top 10 Open Source Software Dependencies

Rank  JavaScript           Python      Ruby           PHP
1     express              requests    rake           phpunit/phpunit
2     uglifier             Django      activesupport  psr/log
3     mocha                Flask       i18n           monolog/monolog
4     gulp                 six         rack           laravel/framework
5     grunt                Jinja2      builder        symfony/console
6     lodash               MarkupSafe  tzinfo         doctrine/inflector
7     body-parser          Werkzeug    rails          mockery/mockery
8     grunt-contrib-watch  gunicorn    multi_json     swiftmailer/swiftmailer
9     babel-core           mock        rack-test      symfony/yaml
10    chai                 Sphinx      thor           symfony/event-dispatcher

It’s worth noting that this isn’t a perfect metric either: some communities don’t track dependencies at all, and others have weaker data aggregation. Like any statistic, it can’t paint a flawless picture of the entirety of the open source ecosystem. What it is, though, is the most reliable and up-to-date measurement of the community’s current attitude about package usage.

Over the coming weeks and months, we’ll dig deeper into the data that Libraries.io is collecting to help the world better understand open source software.

If you are interested in learning more, consider signing up for our mailing list or following us on Twitter.

Keenan Szulik