The Shadow Software Supply Chain


The following content is opinion. All views expressed are my own and do not represent the views or policies of my employer, past or present, or any other organization with which I may be affiliated. Read all disclaimers at beave.rs/disclaimer


This post was written as part of IGME 585 - Project in FOSS Development

Modern package management is one of the most transformative aspects of contemporary applications. Application teams have achieved historic efficiency as libraries deal with the repetitive grunt work, freeing corporate developers to focus on high-level domain-specific logic. The explosion of JavaScript, Python, Go, and other languages is tightly linked to the vast ecosystems of open-source libraries available.

These package managers make it easy to track dependencies for security and licensing concerns. With relative ease, software owners can assemble a tree of dependencies, their dependencies, and so on. Changes in dependencies can inadvertently have significant downstream effects (such as when a library that Rails depends on switched to GPL v2, causing license incompatibility for most Rails projects). Still, the fact that these downstream effects can be tracked is evidence of a mature software supply chain management system.
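To make the idea of a trackable dependency tree concrete, here is a minimal sketch that walks a dependency graph and flags any transitive dependency whose declared license is on a disallowed list. The `MANIFEST` data, package names, and licenses are all invented for illustration; a real tool would read this information from a lockfile or package index rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical manifest: package -> (declared license, direct dependencies).
# Every name and license below is made up for illustration.
MANIFEST = {
    "my-app": ("Proprietary", ["web-framework", "json-utils"]),
    "web-framework": ("MIT", ["mime-sniffer", "json-utils"]),
    "json-utils": ("Apache-2.0", []),
    "mime-sniffer": ("GPL-2.0-only", []),
}


def transitive_dependencies(root, manifest):
    """Return every package reachable from root (excluding root), breadth-first."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        _license, deps = manifest[current]
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def flag_licenses(root, manifest, disallowed):
    """List (package, license) pairs among root's dependencies with a disallowed license."""
    return [
        (pkg, manifest[pkg][0])
        for pkg in transitive_dependencies(root, manifest)
        if manifest[pkg][0] in disallowed
    ]
```

The point is not the traversal itself but that it is possible at all: every node has a name, a version source, and a declared license. Code pasted from a forum has none of these.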

The management of these interlocking components is conditional upon efficient collection and notification. The problem is that a shadow software supply chain, with unclear licensing and risky code, is almost indistinguishable from other contributions.

Stack Overflow and the Modern Software Engineer

Stack Overflow is a valuable resource for software engineers. It is a great place to learn how to solve problems that you may not otherwise know how to solve. However, especially with younger programmers, I fear that there is an over-reliance on third-party snippet websites.

It is not just Stack Overflow; countless blogs and YouTube tutorials cover most of what most programmers need. A person can get a SWEN undergrad degree with relative ease if they know what to search online. Copying and pasting from online sources has become central to young coders' meme culture. While those engineers should be trying to read through and re-implement what they find (adhering to security best practices), copying and pasting is so easy that there is little perceived need to understand what the code does, provided that it works.

Traceability and Fragmented Code

What distinguishes shadow source from open source, and why is it a concern? Shadow source is "shadow" because it is untrusted, untraceable, and indiscernible from internally developed code. When first-party code is used, at least one person internally understands what it does. Teams can then build SDLC practices around changes to bring them up to security standards. This can include threat modeling and development best practices to mitigate malformed-data bugs.

When any third-party code is used, be it through an open-source library or an online source, developers focus on the methods' inputs and outputs. As long as the inputs and outputs match what they expect, not much more thought is given to the internals, and the risks posed by malformed queries are overlooked.

Large open-source libraries reduce risk by undergoing intense security scrutiny, especially from large adopters. Like any software, these libraries will have bugs and exploits, but they are much more difficult to find. Package management makes it relatively easy to update if a bug is found by security researchers or through a bug bounty program. There will be some remediation nightmares (like Log4Shell), but those are few and far between.

The fundamental difference between large-scale OSS and shadow source is that shadow source lacks visibility from security researchers and notification infrastructure while still suffering from people not looking at the internals.

Online examples are often as straightforward as possible while relying on as few libraries as possible. This is a dangerous combination.

Security is not a concern, nor a responsibility, for online contributors. Writing secure code is unnecessary and a time-sink for people on unpaid forums (like Stack Overflow). Online, simple solutions drive engagement. Therefore, people with blogs or YouTube channels are actively incentivized to publish samples that are not production-ready.

The AI Problem

Shadow source is nothing new, but there is another factor that will cause it to explode in scale: generative AI. I first used GitHub Copilot in the early public beta period. While I still often write code by hand to prevent my skills from atrophying, it is a fantastic tool when I need to get something done quickly.

At the same time, generative AI enables subpar development practices. The ease with which developers can generate code means that getting a working solution can be nearly effortless, and standardization's productivity benefits become less visible.

Standardization brings a host of benefits for security, reliability, and quality. These benefits include easier remediation across a codebase if a bug is found, shorter onboarding when contributors switch teams, and faster development. The speed of development is the most convincing reason to invest in standardization.

AI removes that specific benefit while leaving the others in place. In fact, I have stopped using Copilot for projects whose long-term sustainability I care about because I have noticed that my software devolves into unmanageable spaghetti code with lots of repeated (but slightly different) blocks.

Widespread rapid remediation when a bug or vulnerability is discovered remains a concern, and small, needless variability increases the risk of a bug or vulnerability occurring. So, with this in mind, how do you shrink shadow code and promote standardization when the self-interested incentives are diminishing?

A Path Forward

AppSec teams must adapt their developer relations strategies to organizational culture and objectives. Generally, however, I believe that AppSec teams should try to support and guide teams to a more secure codebase rather than take a heavy-handed approach. I may expand on this philosophy in a future blog post, but it is too long a discussion for this post. The central insight is that for most teams, the best approach is to make the path of least resistance the one that also brings the most valuable security benefits.

Do It Once, Do It Right

I have long supported a "do it once, do it right" development philosophy. I coined this phrase on my high school robotics team to describe our investment in high-quality, reliable code, even when it required more upfront effort. If you invest in doing something the right way the first time, you will end up with a better solution in less time than if you put a quick solution together and have to replace it later.

Security and software leadership should try to push for a culture that values taking longer for a high-quality product rather than a "move fast and break things" attitude. Moving fast is good, but users have little tolerance for things breaking. Instead, teams should emphasize building a solid foundation that allows for faster iteration on top of it. That way, development teams can continue to think at a high level but have trust in the underlying technologies.

If a problem arises in one of the underlying technologies, it can be remediated once and pushed across the various applications. Since more effort can be devoted to developing some of the core technologies, the individual risk of an incident occurring is much less than if several different variations of a desired end goal are deployed.

A robust set of libraries reduces the need for shadow sources by providing an easy and trusted way to interact with common operations. For example, a generative AI might build SQL queries using string concatenation. If the application uses an ORM, then the ORM reduces the need to interact with direct SQL and provides a unified place for input sanitization.
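As a small illustration of the difference, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a full ORM. The table, the data, and both `find_user_*` functions are hypothetical. The first function builds the query by string concatenation, the pattern a copied or generated snippet might use. The second passes the value as a bound parameter, which is roughly what an ORM does on the developer's behalf.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two example users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])
    return conn


def find_user_concat(conn, name):
    # Anti-pattern: caller-supplied text becomes part of the SQL statement itself.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_user_param(conn, name):
    # The driver binds `name` as data, so it cannot change the shape of the query.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = make_db()
    payload = "x' OR '1'='1"
    print(find_user_concat(conn, payload))  # every row matches the injected condition
    print(find_user_param(conn, payload))   # no user has this literal name
```

Both functions behave identically on well-formed input, which is exactly why the concatenated version survives a quick "does it work?" check.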

As you move to higher-level abstractions, developers get a better developer experience, and security teams have more opportunities to influence data flows. A developer would then interact with the preferred methods out of self-interest because it is more convenient than using a shadow source.

Driving Adoption

Service-oriented security can be achieved through API proxies, third-party cloud providers, open source libraries, or first-party internal libraries. It is service-oriented in the sense that it relies on libraries and services to simplify the developer experience, and it puts security teams in the position of aligning to and serving the interests of development teams.

Open Source

Open-source libraries provide many useful abstractions, and for most teams, they are a very good starting point. Most of the functionality needed for applications does not provide any meaningful competitive advantage. Thus, rebuilding it in-house is simply unnecessary duplication.

While open source has seen widespread success in powering enterprise applications, further investments in adoption, creation, and standardization of libraries can help improve security and productivity. This still requires proper software supply chain management, but if developers can be given even easier interfaces without having to rely on shadow sources, it will increase traceability.

Cloud Providers

Cloud providers can be an easy way to quickly standardize on certain technologies and offload much of the security work to a third party. For small companies or companies that want to have as small an IT and security staff as possible, cloud services are likely one of the best ways to build quickly on secure code. They take responsibility for some aspects of security (though a secure implementation is still needed) and provide a very smooth developer experience.

Cloud providers are not an ideal solution for long-term growth, however. They create significant lock-in, which can create headaches if a product is ever deprecated or becomes prohibitively expensive. Cost is another major concern, especially for very large organizations. Compute resources are a very cost-competitive area, so the profit margins are small. Instead, the cloud service providers justify their value through services and APIs.

The costs can add up quickly, and the bureaucratic world of enterprise procurement means that you may end up paying more for an inferior product with a preferred vendor than you could otherwise get. There is certainly a value argument for cloud services in some contexts. However, as costs continue to increase, it will become less justifiable for many organizations.

Internal Abstractions

Internal abstractions provide value when a company has many different products being developed or is in a regulated environment. These abstractions can be better tailored for specific business constraints, creating a better developer experience. However, they require much greater investment than open-source or cloud solutions.

A good use of internal abstractions is as a wrapper around trusted open-source libraries. With this approach, there can be both a well-tested open-source library and additional opportunities for security engineering teams to funnel data flow and add their own controls. The developer experience for the wrapper libraries is likely better than that of the core library.
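A minimal sketch of such a wrapper is below, using password hashing as the example and Python's standard `hashlib`, `hmac`, and `secrets` modules as the "trusted library" underneath. The function names, the stored-string format, and the iteration count are assumptions made for illustration, not a recommendation for any particular policy.

```python
import hashlib
import hmac
import secrets

# Hypothetical internal policy values; a real team would own these centrally.
_ITERATIONS = 600_000
_SALT_BYTES = 16


def hash_password(password: str) -> str:
    """Hash a password with PBKDF2-HMAC-SHA256 and return a self-describing string."""
    salt = secrets.token_bytes(_SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, _ITERATIONS)
    return f"pbkdf2_sha256${_ITERATIONS}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Check a password against a string produced by hash_password."""
    try:
        scheme, iterations, salt_hex, digest_hex = stored.split("$")
        if scheme != "pbkdf2_sha256":
            return False
        candidate = hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
        )
        expected = bytes.fromhex(digest_hex)
    except ValueError:
        # Malformed stored value: treat as a failed verification.
        return False
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, expected)
```

Application developers only ever call `hash_password` and `verify_password`. Because the parameters are stored alongside each hash, the security team can raise the iteration count in one place without breaking existing records, which is the "remediate once" property described earlier.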

Code Reviews

Automated and manual code reviews are an important step for software security. As part of the merge approval process, reviewers should look for functions that may be generalizable but were instead put inline. Extra concern should be given to inline, high-risk functionality that is repeated in multiple parts of the codebase (such as database queries).
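On the automated side, even a crude check can surface candidates for a human reviewer. The sketch below uses Python's `ast` module to flag calls to a method named `execute` whose first argument is assembled with string operators or an f-string. The function name `find_dynamic_sql` is hypothetical, and the heuristic is deliberately simple: it will miss queries built in a variable beforehand or with `.format()`, and it can flag harmless code.

```python
import ast


def find_dynamic_sql(source: str):
    """Return line numbers of .execute(...) calls whose first argument is built dynamically."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr != "execute" or not node.args:
            continue
        first = node.args[0]
        # BinOp covers "..." + x and "..." % x; JoinedStr is an f-string.
        if isinstance(first, (ast.BinOp, ast.JoinedStr)):
            findings.append(node.lineno)
    return sorted(findings)


if __name__ == "__main__":
    sample = (
        "cur.execute('SELECT * FROM t WHERE id = ' + user_id)\n"
        "cur.execute('SELECT * FROM t WHERE id = ?', (user_id,))\n"
    )
    print(find_dynamic_sql(sample))
```

A check like this works best as a prompt for discussion in review ("should this go through the shared data-access layer?") rather than as a hard gate.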

Security-Oriented Generative AI

I do not pretend to understand how generative AI works. However, since there are security-oriented partners dominating the industry (Microsoft and Google), code generation could be pushed to emphasize strong security practices and permissive licenses.

In short tests, ChatGPT and Google Bard tend to output highly insecure results, but the output becomes markedly more secure when "securely" is added to the prompt. For example, in a test of ASP.NET code, adding "securely" shifted the output from raw string concatenation in SQL queries to language-specific SQL construction functions (which have built-in sanitization).

These vendors should also place heavier training weights on large, permissively licensed projects. Much of the problem with shadow source is that the code provided has not been tested, and the licensing is unclear. Licensing for AI generation is still unclear (though sources like blogs and forums are even less clear), but tracking attribution and prioritizing permissive licensing would help as the licensing issues get resolved in courts.

Summary

Shadow source code is pervasive on online forums and will continue to grow with generative AI. The mainstream adoption of generative AI will make it easier to produce poorly structured and poorly written code, leading to much more vulnerable applications and longer remediation. As maintaining poor codebases becomes easier, teams will need to be more intentional about adhering to best practices. The best way to encourage best practices is to make the most secure path also the path of least effort for developers, leading to natural adoption.

