
How and Why We Choose to Clone All Data on GitHub

  • root
  • September 23, 2021
  • 7 minute read

Debricked has achieved a not-so-small feat: we are now able to actively keep and maintain a clone of all data on GitHub! For what reason, you may ask? To understand all the whys and hows, we interviewed our Head of Data Science, Emil Wåréus.

Before we start with the questions, who are we talking to?



My name is Emil and I’m the Head of Data Science at Debricked. My team of five data engineers and I are the masters behind everything related to data. Also, I was the second employee at Debricked!

 


Let’s start with the million dollar question: why would anyone want a copy of all GitHub data? 

The short answer is: to have a better and faster representation of the data that we need to serve our customers. You see, we want to do big computations on all open source. Yes! You heard that right. On all open source.

If we only wanted to monitor a couple of thousand open source projects, we could do it through the API calls provided by default.

But our products and solutions are not meant to give customers partial coverage; they are supposed to be extensive. Therefore we decided to index all 28M projects on GitHub, and that’s not the end of it. Soon we will be adding other large repository hosts such as GitLab, and more.

But doing this, cloning all of GitHub that is, poses quite an interesting challenge because of the many different data structures and relational dependencies in the data. Some are loosely coupled and some are tightly coupled.

As a result, huge challenges arise regarding the time complexity of calculations on such a large dataset. For these reasons we decided to go on a journey and see if we could create an up-to-date local mirror of GitHub, refreshed hourly.

 

Why do you choose to store the data locally?

Haha! Yes, that may sound a bit unintuitive in the era of free cloud credits. But the truth is that it saves us a lot of money. Because the RAM and CPU load is high at all times, with the machines constantly chugging through data, running our own hardware is more cost-efficient.

Cloud is great when you have variable loads such as ML training. But when you are pushing a model like ours, scraping data with a lot of database interactions, it is a lot cheaper to host it on-prem.


 

For all the hardware geeks out there, what does a setup like this look like?

There are a couple of things at play. First, we have a relational database that is not too large, meaning not in the terabyte space. Second, in terms of cloning all the repositories we are looking at about 20 terabytes of data. Third, we have a graph database in which we want to understand how open source projects relate to one another.

For this we use Neo4j. Essentially we are doing graph computations to answer questions such as “what are the root dependencies” and “what open source impacts what open source”.
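To make the two graph questions concrete, here is a minimal sketch in plain Python rather than Neo4j. The project names and the `DEPENDS_ON` mapping are invented for illustration, and the traversal is an ordinary breadth-first search, not Debricked's actual queries.

```python
from collections import deque

# Hypothetical dependency data: each project maps to the projects it depends on.
DEPENDS_ON = {
    "web-app": ["http-lib", "json-lib"],
    "http-lib": ["tls-lib"],
    "json-lib": [],
    "tls-lib": [],
    "cli-tool": ["json-lib"],
}


def root_dependencies(project, depends_on):
    """Return the transitive dependencies of `project` that depend on nothing else."""
    seen, queue, roots = {project}, deque([project]), set()
    while queue:
        current = queue.popleft()
        children = depends_on.get(current, [])
        if not children and current != project:
            roots.add(current)
        for child in children:
            if child not in seen:  # also protects against dependency cycles
                seen.add(child)
                queue.append(child)
    return roots


def impacted_by(project, depends_on):
    """Return every project that directly or transitively depends on `project`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for parent, children in depends_on.items():
        for child in children:
            dependents.setdefault(child, set()).add(parent)
    seen, queue = set(), deque([project])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Calling `root_dependencies("web-app", DEPENDS_ON)` answers the first question for one project, and `impacted_by("tls-lib", DEPENDS_ON)` answers the second. At GitHub scale the same traversals would run inside the graph database instead of in application memory.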

The final component of the puzzle is to do the actual scraping and analytics. Here we have a cluster with a couple of hundred pods working in parallel to both gather and analyse the incoming data.

What people don’t realise is that there is a lot of complexity to this. Just cloning a repository is not enough. We need the history of it and we need to create multiple snapshots of the data for different points in time. This multiplies the problem by many factors.

And before I forget, all of that storage is of course on PCI Express NVMe SSDs, sending gigabytes of data for analysis every second. We must also thank AMD for creating such nice server CPUs with many cores! 🙂

 

20 terabytes sounds like a lot of data – what does it entail exactly?

There are about 10 000 – 30 000 new pull requests each hour on GitHub. This is 5-10 times more than about 5 years ago, and that’s only the open source contributions on GitHub! We are monitoring more than 40M repositories, 100M issues, 80M pull requests and 12M active users.

Thanks to our technology, both hardware and software, we are able to keep our model up to date within the hour. So, whenever someone makes a pull request, comments, stars something or follows someone new we can collect that data with minimal time lag.

 

How did you go about setting this up? Did you wake up one day and think “I need to clone GitHub”? What triggered it?

Haha, no. Of course not! We did GitHub scraping early on in the history of Debricked. It all started when we realised that a lot of vulnerabilities are discussed in issues and we wanted to link and classify those. Our research showed that this data is highly relevant.


About 2% of all issues are related to security, but only 0.2% are correctly labeled with a security tag. It is not until much later in time that those discussions are publicly disclosed as vulnerabilities through NVD and other sources of curated vulnerability information.

To address this, we developed our own machine learning models to classify issues as security related or not. This way we can provide customers who have high security demands with information about issues that may be a risk, but which have not yet turned into officially disclosed vulnerabilities.

I must brag and say that we have achieved state of the art in the security text/issue classification space and have surpassed previous research in the area.

 

How many vulnerabilities have you discovered this way so far?

About 200 000. In contrast, NVD (National Vulnerability Database), probably the most popular database for vulnerability information, has about 130 000, many of which are not directly related to open source. We are also investigating how we could properly disclose this data to the public.

 

How did this later on translate to gathering all of the GitHub data?

We realised that this data (issues and vulnerabilities) is not independent of pull requests and other activity, which could provide additional signal to our models. For this reason we started to scrape commits, releases, code diffs etc., and perform lots of interesting calculations on the data.

We discovered that with stronger links between vulnerabilities and the original source (the open source projects) we could increase the precision of our data dramatically. This turned out to be quite a technical machine learning solution.

We match the software bill of materials to vulnerabilities and leverage all of our different data points, which gives us superior data quantity and quality when it comes to accurately finding vulnerabilities in our customers’ proprietary code.
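As a rough illustration of what matching a software bill of materials against vulnerability data can look like, here is a minimal sketch. The advisory IDs, package names and the simple "first affected / first fixed" version range are all assumptions made for the example; the interview does not describe Debricked's actual matching logic, which also draws on commits, issues and other signals.

```python
# Hypothetical advisory data: package -> list of (id, first_affected, first_fixed).
ADVISORIES = {
    "http-lib": [("VULN-0001", (1, 0, 0), (1, 4, 2))],
    "json-lib": [("VULN-0002", (2, 0, 0), (2, 1, 0))],
}


def parse_version(text):
    """Turn '1.4.0' into (1, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def match_sbom(sbom, advisories):
    """Return (package, version, advisory id) for every affected SBOM entry."""
    findings = []
    for package, version_text in sbom:
        version = parse_version(version_text)
        for advisory_id, first_affected, first_fixed in advisories.get(package, []):
            # Affected when first_affected <= version < first_fixed.
            if first_affected <= version < first_fixed:
                findings.append((package, version_text, advisory_id))
    return findings


# A made-up SBOM: (package, pinned version) pairs.
sbom = [("http-lib", "1.4.0"), ("json-lib", "2.1.0"), ("tls-lib", "3.0.1")]
findings = match_sbom(sbom, ADVISORIES)
```

Here only `http-lib` 1.4.0 falls inside an affected range; `json-lib` 2.1.0 is already at the fixed version and `tls-lib` has no advisories. The quality of such matching depends entirely on how accurate the package and version-range data is, which is the point the answer above is making.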

 

But the data is not only used for security, right?

Correct! As our customers will know, we are also providing health data on open source projects. For example, we look at the trend of super users’ commits to a certain repository to build scores which will give us intel about the project’s well-being.

This is a good example of how complex the data can be. You want to monitor the contributors that know their way around a project and see that they keep contributing. This, in combination with all the other data we have on a project, can give us an indication of its health.
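One way to picture such a signal is to compare how much the most active contributors committed recently with how much they committed in the period before. The sketch below is only an illustration: defining "super users" as the top committers by total commit count, the three-month window and the sample data are assumptions, not Debricked's scoring method.

```python
from collections import Counter


def core_contributor_trend(commits, last_month, top_n=2, window=3):
    """Ratio of recent to previous commit counts from the top_n committers.

    commits: list of (month_index, author) pairs.
    Returns None when the previous window has no commits to compare against.
    """
    totals = Counter(author for _, author in commits)
    core = {author for author, _ in totals.most_common(top_n)}
    recent = sum(
        1 for month, author in commits
        if author in core and last_month - window < month <= last_month
    )
    previous = sum(
        1 for month, author in commits
        if author in core and last_month - 2 * window < month <= last_month - window
    )
    if previous == 0:
        return None  # not enough history to compare
    return recent / previous


# Made-up history over six months: the two core contributors are slowing down.
commits = [
    (1, "alice"), (1, "alice"), (2, "alice"), (3, "alice"), (4, "alice"),
    (1, "bob"), (2, "bob"), (3, "bob"), (5, "bob"),
    (6, "carol"),
]
trend = core_contributor_trend(commits, last_month=6)
```

A ratio well below 1, as in this made-up history, would suggest that the people who know the project best are contributing less, which is one of many inputs a health score could combine.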


So, in essence data has become our core here. Gathering, understanding and enriching data from GitHub and other sources has enabled Debricked to create services unheard of before.

 

This all sounds great! But what does this mean for current Debricked customers and potential future users?

The most impactful part of it is that you get more accurate information on vulnerabilities. We correctly map vulnerabilities to repositories at GitHub, commits, issues etc. which in turn increases our confidence that a vulnerability actually affects an open source project, and gives us insight into how it is affected.

This is a general problem in our industry: the precision of the vulnerabilities that you are presented with in your Software Composition Analysis tools, vulnerability scanners etc.

It is a difficult challenge to match the description of vulnerabilities and vulnerability information in, for example, email threads and other highly unstructured data sources to the actual software they affect. The next step is to find out if you are using that open source component, determine how you are using it, and decide whether you are in fact affected by the vulnerability.

Good tools have 85-95% precision (the share of reported vulnerabilities that are true positives). While we have seen competitors and free tools go as low as 60%, Debricked currently has a 90-98% precision rate on the languages we fully support. We manage this while being one tenth of the size of some of our competitors. This would not be possible without the data combined with our algorithms and raw talent in the team.
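For readers who want the definition behind these percentages: precision is the number of correct findings divided by all findings a tool reports. A minimal sketch, with made-up counts chosen to mirror the figures quoted above:

```python
def precision(true_positives, false_positives):
    """Share of reported findings that are real: TP / (TP + FP)."""
    reported = true_positives + false_positives
    if reported == 0:
        raise ValueError("no findings reported, precision is undefined")
    return true_positives / reported


def false_alarms_per_100(precision_value):
    """How many of every 100 reported findings are false alarms."""
    return round((1 - precision_value) * 100)


# Illustrative counts only, not measurements of any specific tool.
high = precision(true_positives=95, false_positives=5)
low = precision(true_positives=60, false_positives=40)
```

At 60% precision a team has to triage about 40 false alarms for every 100 reported vulnerabilities; at 95% that drops to about 5.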

 

A little bird whispered that this might open doors for some interesting new functionalities?

Yes! I can’t tell you all the secrets just yet… But by having all this data we are able to determine if a vulnerability affects a certain class or function. A vulnerability is only relevant if you are using the class or function containing the vulnerability in your software.

This can remove second-order false positives and increase the accuracy of your internal triaging of the vulnerability. You will know exactly where in your code the vulnerability is, and if the code is called. This greatly reduces the number of vulnerabilities you actually have to check, which will save a lot of time. So, stay tuned!
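The idea of checking whether vulnerable code is actually called can be sketched as a reachability search over a call graph. Everything below is hypothetical: the function names, the call graph and the advisory IDs are invented, and a real analysis would have to build the call graph from source code, which is the hard part.

```python
def reachable_functions(call_graph, entry_points):
    """Return every function reachable from the entry points via calls."""
    seen = set(entry_points)
    stack = list(entry_points)
    while stack:
        current = stack.pop()
        for callee in call_graph.get(current, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


# Hypothetical call graph: caller -> functions it calls.
CALL_GRAPH = {
    "app.main": ["app.load_config", "yamlish.safe_load"],
    "app.load_config": ["yamlish.safe_load"],
    "yamlish.safe_load": ["yamlish._scan"],
    "yamlish.unsafe_load": ["yamlish._scan", "yamlish._construct_object"],
}

# Hypothetical mapping from advisory to the function containing the flaw.
VULNERABLE = {
    "VULN-0003": "yamlish._construct_object",
    "VULN-0004": "yamlish._scan",
}

reachable = reachable_functions(CALL_GRAPH, ["app.main"])
relevant = sorted(
    advisory for advisory, function in VULNERABLE.items() if function in reachable
)
```

In this example the application depends on the `yamlish` library, but only `VULN-0004` is relevant, because `app.main` never reaches `yamlish._construct_object`. A dependency-level scanner would have reported both.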

 

This article is republished from hackernoon.com

