Originally published on the Two Sigma Ventures blog.
At Two Sigma Ventures, we are strong believers in the power of open source software. As part of Two Sigma, we’re lucky to have nearly 1,000 software engineers (200+ with computer science-related PhDs) in our broader organization, many of whom actively use open source software in their daily work and contribute back to a variety of projects. Two Sigma is the creator of a number of popular open source projects, including BeakerX and Flint, and our colleagues include some of the original creators of Apache Arrow and pandas. Suffice it to say, we live and breathe open source.
For several years now we have been following the rise of open source software in the startup ecosystem, and Two Sigma Ventures has invested in several companies that leverage an open core business model or publish open source libraries. These include companies like Timescale, NS1, Radar Labs, GitLab, and one of our most recent investments, Replicated. We are excited about these businesses for a variety of reasons. We have seen firsthand how software created by developers, for developers, leveraging community-based development, can lead to incredible innovation. Moreover, we are excited about how enterprise software is moving towards bottom-up adoption, and how an open core business can lead to remarkably efficient customer acquisition and growth.
In that vein, we are excited to launch the Two Sigma Ventures Open Source Index to showcase what we consider the most popular and fastest growing open source projects in the world, which you can see in its entirety here for the first time today. We hope that the data we publish here, which will be updated regularly, will provide insights for many different types of people in the tech ecosystem – from entrepreneurs, to developers, to anyone interested in studying high-level trends. Eventually, we plan to add more data for each project and allow for more granular filtering and searching, and in the coming months we will be publishing more insights and analysis. In the spirit of the open source movement, we would love to share the raw data with you if you are interested in playing around with it. If you have any suggestions on how we can improve the Index, please don’t hesitate to reach out. And finally, if you’re building a commercial open source business, we couldn’t be more excited to hear your story!
Our Methodology:
We started by using the GitHub API to download all publicly available data on the top GitHub projects ranked by number of “Watchers.” Most other lists that rank open source projects use Stars as their “north star” metric, no pun intended. However, we believe that over time, GitHub Stars has become a vanity metric that is often gamed. When a user elects to “Watch” a project on GitHub, they receive notifications about the project and its relevant discussions, and when users are no longer interested in a project, they will often un-Watch that repository. Therefore, we believe that Watchers is a more telling signal of sustained project popularity than Stars. Additionally, we used license information and manual sorting to filter out non-technical projects, such as books, lists, and educational content.
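One practical detail when reproducing this kind of data pull: in the GitHub REST API's repository object, the field named `watchers_count` mirrors the Star count for legacy reasons, while `subscribers_count` holds the number of users actually watching the repository. The post does not say which fields were read, so the snippet below is only a minimal sketch under that assumption. The helper name `popularity_fields` and the sample payload are hypothetical.

```python
from typing import Any, Dict


def popularity_fields(repo: Dict[str, Any]) -> Dict[str, int]:
    """Extract Watchers and Stars from a repository object.

    `repo` is assumed to be the parsed JSON returned by the GitHub REST
    endpoint GET /repos/{owner}/{repo}.
    """
    return {
        # Users subscribed to notifications, i.e. "Watchers" in the UI.
        "watchers": repo["subscribers_count"],
        # Stars; the payload's `watchers_count` field repeats this value.
        "stars": repo["stargazers_count"],
    }


# Hypothetical payload, trimmed to the fields used above.
sample_repo = {
    "full_name": "example-org/example-project",
    "stargazers_count": 1200,
    "watchers_count": 1200,
    "subscribers_count": 85,
}
print(popularity_fields(sample_repo))
```

Reading `subscribers_count` rather than `watchers_count` is what keeps the Watchers metric distinct from Stars in this sketch.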
Our Index ranks projects using what we call the “TSV Score.” The score is a weighted average of the variables listed below, which we normalized to fit our scale of 0 to 100. The weights we chose are listed in parentheses.
- Watchers (40%) – the main metric we use to assess project popularity, as described above, is the number of Watchers per project.
- Watcher growth (25%) – we computed the delta in watchers over the past quarter and believe it gives us an important signal on which projects have momentum in the developer ecosystem.
- Contributors (15%) – the number of contributors gives us a sense of the size of a project’s developer community and the level of interest in it.
- Release cadence (10%) – we compute release cadence as the number of commits a project has had over its lifetime. While this can be influenced heavily by individual contributors’ commit patterns and doesn’t give us a sense of more recent contributions, we still believe this metric provides us an indication of the pace at which a project evolves and grows.
- Community health score (10%) – finally, we take into account GitHub’s own Community Health Score metric, which evaluates how well-maintained a repository and its docs are.
We understand that these weights are arbitrary and reflect just one perspective on what’s important in building a great open source community. We’d love to share the raw data with you and have you play around with various weights and other data sources. Let us know if you’d like a copy!
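For readers who want to experiment with different weights, the weighted average described above can be sketched in a few lines. This is a minimal sketch, not the Index's actual implementation: the post does not specify how each variable was normalized, so min-max scaling to 0 to 100 is an assumption here, and the function names, metric keys, and sample figures are all hypothetical.

```python
from typing import Dict, List

# Weights as listed in the post; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "watchers": 0.40,
    "watcher_growth": 0.25,
    "contributors": 0.15,
    "release_cadence": 0.10,
    "community_health": 0.10,
}


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to 0-100 (assumed scheme; the post does not name one)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All projects tie on this metric, so it cannot separate them.
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def tsv_scores(projects: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return a weighted average of normalized metrics for each project."""
    names = list(projects)
    totals = {name: 0.0 for name in names}
    for metric, weight in WEIGHTS.items():
        normalized = min_max_normalize([projects[n][metric] for n in names])
        for name, value in zip(names, normalized):
            totals[name] += weight * value
    return totals


# Made-up figures for three hypothetical projects.
sample_projects = {
    "project-a": {"watchers": 8000, "watcher_growth": 300, "contributors": 2500,
                  "release_cadence": 90000, "community_health": 100},
    "project-b": {"watchers": 5000, "watcher_growth": 900, "contributors": 400,
                  "release_cadence": 12000, "community_health": 85},
    "project-c": {"watchers": 1000, "watcher_growth": 100, "contributors": 50,
                  "release_cadence": 3000, "community_health": 60},
}

for name, score in sorted(tsv_scores(sample_projects).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.1f}")
```

Changing the values in `WEIGHTS` (keeping their sum at 1.0) is enough to see how sensitive the ranking is to any one variable.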
Key Insights:
While we’ll be sharing more in the coming months, our first iteration of the Index has revealed to us a number of fascinating, data-driven insights about the state of software today. See below for our initial findings, and we hope you will slice and dice the data and share any learnings back with us.
- Baidu’s Apollo project is the fastest growing in our Index – Baidu, unlike US counterparts in the autonomous vehicle space, has taken a collaborative and open approach to developing their self-driving car, working in conjunction with 40 other companies and open sourcing their core technology. Interest in this project has picked up substantially in the past quarter, as they have received approval for fully driverless road tests in both China and California.
- Startups penetrating the top 100 – Seven of the top 100 projects were created by private, venture-backed startups or are maintained by commercial entities built by the original project creators. These include Redis, HashiCorp (Terraform), Grafana and Vercel (Next.js) in the former category and Confluent (Apache Kafka), Databricks (Apache Spark), and Preset (Apache Superset) in the latter category. It is wonderful to see open source innovation being led by tech startups and we’re eager to help support these companies and others as they continue to innovate and build the next generation of great commercial open source software (COSS) businesses. We imagine this trend will continue and that many of the other projects on our list will lead to commercial entities eventually.
- The dominance of JavaScript – JavaScript has become a dominant web technology in the past decade and this is clearly evident in the top open source projects; 32 of the top 100 projects are written in JavaScript, including 4 of the top 10. The next most popular language is Java, which is the underlying language of 22 projects on our list.
- The tech titans are significant contributors, especially Google – many large technology companies create and maintain open source projects, but none has contributed more significantly than Google, which is responsible for 8 of the top 100 projects (tensorflow, flutter, kubernetes, material-design, guava, Angular, AngularJS, and Angular CLI). The next largest contributors are Microsoft with 3 projects (VS Code, typescript, PowerToys), Facebook with 2 projects (react, create-react-app) and Square with 2 projects (retrofit, okhttp).
- VS Code is the most popular code editor – VS Code has clearly grown in prominence, which we have anecdotally seen internally at Two Sigma. It is the 11th most popular open source project in our Index and is easily the most commonly used code editor among software developers, who love its ease-of-use and customizability.
- Kubernetes is here to stay as the default container orchestration technology – as computing infrastructure moves towards containerized architectures, we are excited about the promise of Kubernetes to help developers build more reliable and performant cloud applications. Kubernetes is not only the 5th most popular project overall, it is also the 3rd most contributed-to project, with 3,730 contributors. This signifies a highly active community of developers working to improve the core technology.
- The importance of data-driven software – several of the top projects, including the #1 project in our Index, are critical for building end-to-end machine learning and data science workflows. This includes everything from algorithms (tensorflow, scikit-learn, pandas, faceswap, Tesseract OCR) to data infrastructure (spark, kafka, redis) to visualization (superset, D3, Chart.js, ECharts, grafana). We have been predicting the rise of data-driven software for the past decade and are thrilled to see this trend reflected in the open source community, as machine learning and data science technologies become democratized.