Feed Me Weird Things

2026-03-08 hylaeus

Please send me links to public repositories containing SuperCollider code. Get in touch if you'd like to share more private source code with me, if you are concerned I might have your source code and you'd like me to remove it, or if you'd like to volunteer to help curate a large collection of SuperCollider source code.

Hadron Needs Real-World Inputs

I've been working in a vacuum on Hadron for a few years now while building many of the fundamentals required for an interpreter, such as a lexer, parser, and "semantic analyzer," a concept borrowed from Clang, LLVM's C++ compiler.

However, as I'm starting to work on a code formatter, and building up some support code for the language server, it's become readily apparent that I need to build a database of as much extant SuperCollider source code as possible. The more diverse styles and idioms of SuperCollider usage I can capture, the better.

Building a sample source code corpus affords a number of advantages for Hadron, and possibly some benefit for the SuperCollider project in general:

I can validate Hadron's parser against the SuperCollider reference parser I maintain, Sparkler. This builds my confidence that both parsers hold up against real-world code written by actual users.
With caveats, it can be used for benchmarking and optimization of any SC parser, including the existing SCLang lexer/parser stack (which is getting a rewrite, exciting!).
Fuzzing benefits from a large set of input examples, so my Hadron fuzzing efforts will certainly become more efficient.
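
To make the parser-validation idea a bit more concrete, here is a minimal sketch of a differential check in Python. It is not how Hadron or Sparkler are actually driven; `parse_a` and `parse_b` are hypothetical stand-ins for whatever wrappers call the two parsers, and the only assumption is that each raises an exception when it rejects its input. The function reports every corpus file on which the two parsers disagree.

```python
def outcome(parse, source):
    """Run one parser and reduce the result to 'accept' or 'reject'."""
    try:
        parse(source)
        return "accept"
    except Exception:
        # Assumption: a parser signals rejection by raising.
        return "reject"


def compare_parsers(sources, parse_a, parse_b):
    """Return (name, outcome_a, outcome_b) for each input the parsers disagree on.

    `sources` is an iterable of (name, text) pairs, for example file paths
    paired with their contents. `parse_a` and `parse_b` are hypothetical
    callables wrapping the two parsers under comparison.
    """
    disagreements = []
    for name, text in sources:
        a = outcome(parse_a, text)
        b = outcome(parse_b, text)
        if a != b:
            disagreements.append((name, a, b))
    return disagreements
```

A check like this only compares accept/reject verdicts. Comparing the parse trees themselves would be stricter, but it would need a shared tree format, which this sketch does not assume.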

With that in mind, I've started a collection of publicly available SuperCollider source code. I made a corpus repository in the Hadron organization on Codeberg, but I've made it private to members of the Hadron organization only. That keeps it from being an obvious target for AI scrapers, and it leaves room for folks who might be willing to share private code with the corpus, since the access controls are already in place.

Corpus Terms Of Use

I will not share the corpus outside of a select group of Hadron developers. While portions of the corpus may very well be distributed by their own authors, I will never distribute the corpus.
The source code will not be used to train any generative large language model. We may make some small-scale statistical inferences from it, for example benchmarking Hadron code against parts of it, but no part of the work will be used to generate new source code.
Any contribution to the corpus, even those publicly available, can be removed at any time by simply contacting me with the request.

Current Effort

I've already gathered around a million and a half lines of code from three sources:

All registered Quarks. The quarks repository has a directory.txt file containing repository links for each quark. I wrote a Python script to add each one as a submodule to the corpus.
The SuperCollider class library, distributed with SuperCollider. I established a Hadron fork of the SuperCollider repository that only contains the class library, and added that fork as a submodule. I'll keep it roughly up to date on releases.
I scrolled through a search of GitHub for projects listing SuperCollider as the primary language. This revealed many of the already-registered Quarks but also uncovered a lot of other interesting projects. These I have organized by author in the corpus.
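
For the curious, the Quarks step above can be sketched roughly as follows. This is not my actual script, just a minimal Python illustration. It assumes directory.txt holds one `name=url` entry per line (check that against the real file before relying on it), and the `quarks` path prefix is a hypothetical layout choice. It builds the `git submodule add <url> <path>` commands without running them, so they can be inspected first.

```python
from pathlib import PurePosixPath


def parse_directory(text):
    """Parse 'name=url' lines into (name, url) pairs.

    Assumption: one entry per line in the form name=url. Blank lines,
    '#' comments, and lines without '=' are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, url = line.split("=", 1)
        entries.append((name.strip(), url.strip()))
    return entries


def submodule_commands(entries, prefix="quarks"):
    """Build one `git submodule add <url> <path>` argv list per entry.

    `prefix` is a hypothetical directory inside the corpus repository.
    """
    return [
        ["git", "submodule", "add", url, str(PurePosixPath(prefix) / name)]
        for name, url in entries
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from inside the corpus checkout. Keeping command construction separate from execution makes a dry run trivial.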

I still need to comb through the Awesome SuperCollider lists, and I think some folks are moving to Codeberg, so it's worth a search there, too. I also think GitHub could use another pass.

Are We The Baddies?

I had a queasy moment when trawling through GitHub for SuperCollider code. I was getting a little close to the line for my comfort in terms of starting to resemble Large Language Model scraper behaviours. So far, I have taken the opt-out approach to the publicly available SuperCollider source code. That is, I have taken the fact that the author has posted their code on GitHub as implied consent for inclusion in the corpus. This seems to me to be much less of a stretch than implied consent for inclusion in the training set of a for-profit generative language model. But it still feels like a stretch.

I'm going to raise awareness about the corpus by circulating this blog post in a few spots known to be popular with SuperCollider users, in the hopes of both generating additional submissions of source code and gathering feedback about the corpus itself.

These are, in some ways, "complicated" ethical times. "Complicated" is a polite way of saying that as AI eats my industry I've seen a bunch of folks I respected seem to lose sight of some of the basics in terms of how harmful this technology is, and how dangerous. It makes me wonder about my own ethical sensibilities, and I find myself worrying and questioning when my plans start to resemble those of the robots.

As always, I'd love to hear your thoughts.