<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>aparker.io</title>
<link>https://aparker.io/</link>
<description></description>
<language>en</language>
<atom:link href="https://aparker.io/rss.xml" rel="self" type="application/rss+xml"/>
<lastBuildDate>Fri, 01 May 2026 14:23:19 +0000</lastBuildDate>
<item>
<title>if i have to read one more piece of generated text i will kms</title>
<link>https://aparker.io/im-so-tired-of-generated-text/</link>
<guid isPermaLink="true">https://aparker.io/im-so-tired-of-generated-text/</guid>
<pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
<description>there are approximately three good reasons to generate text that another human being will read.</description>
<content:encoded><![CDATA[<p>there are approximately three good reasons to generate text that another human being will read.</p>
<ol>
<li>you are writing code with ai and the comments and symbols are mostly meant for consumption by other ai agents.</li>
<li>you are running an ai agent/bot that is clearly labeled and defined as an ai.</li>
<li>you are running an ai agent/bot for yourself and you would like to read the messages from that agent.</li>
</ol>
<p>outside of that? please stop. please stop writing status reports into slack that were generated. please stop making slides that were generated (oh my <em>god</em> please stop generating your slides), please stop generating emails, please stop. stop. stop. you are not helping. the total amount of joy in the universe is shrinking every time you tell claude or gpt to write a memo or a PRD.</p>
<p>this is not to say &quot;don't use ai to help you write&quot;. that's fine! i use ai to proofread my stuff a lot, to poke holes in it. most of the time - for things i care about - i don't really incorporate its style/grammar suggestions (although i usually do wind up reframing some arguments or moving things around) but it does help. i'm not talking about that. i think that's good and useful. what i want you to stop doing is feeding a bunch of garbage notes into the inference hole and tokenwashing them into something that's supposed to be useful. you are just making me put your shit into another llm to distill meaning out of it.</p>
<p>just stop. pls.</p>
]]></content:encoded>
<category>ai</category>
</item>
<item>
<title>Reflection, September 2025</title>
<link>https://aparker.io/reflection-sept-25/</link>
<guid isPermaLink="true">https://aparker.io/reflection-sept-25/</guid>
<pubDate>Tue, 09 Sep 2025 00:00:00 +0000</pubDate>
<description>I'm tired.</description>
<content:encoded><![CDATA[<p>I'm tired.</p>
<p>Aging is different for everyone, I think. Some people seem to blossom into age; Like crocus buds popping through the newly-thawed dirt in spring, whatever their life had begotten up until some indeterminate point in their forties or fifties shattered under the shoots of life that had been germinating 'neath the soil. Others stumble into it like a shadow in flickering candlelight, not quite as sharp as you'd like, not quite as solid as you've been. A guttering, somewhat fuzzy, somewhat diminished ghost on the wall.</p>
<p>For me? I don't know. I'm not sure what it is to age, really. This isn't to say that I don't experience the ravages of time, or that I don't suffer the second-order effect of age. I'm slower to anger, but only just so. I'm somewhat more cautious in my words, and self-talk. I try to be more considerate. I smile more freely, but less genuinely. I think that, in and of itself, is one of the biggest hallmarks of age. It's getting past the Holden Caulfieldness of your 20s and learning <em>why</em> people are phonies, because at some point you're in a 30-year mortgage and your neighbors aren't transient any more. You learn what you can push way the fuck down and what you need to wear on your sleeve.</p>
<p>But mostly, I think, aging is about being tired all the goddamn time.</p>
<p>Being tired has little to do with sleep, I've found. That's exhaustion. Your body giving out because you're not feeding it enough delicious sleep tokens and being <em>tired</em> are truly distinct experiences. I'm not really exhausted that often these days -- I suppose that could be because I have a very nice mattress or because I have a pretty charmed life, all things considered -- but I am, often, tired. I'm tired for reasons I don't care to print on the internet, but I think they're understandable. Children. Aging, and dying, parents. Working for a living. The state of the world. There's a million things to be tired about.</p>
<p>There was a time where I felt like I might get less tired if I did new things, or if I slept differently, or if I used different supplements or drank smoothies or ate kale or whatever. It really hasn't helped, I'm sad to report. Optimistically, I can suggest that the brain drugs do help, but I also think that maybe I'm just too tired to really give a shit about the side effects any more.</p>
]]></content:encoded>
<category>personal</category>
</item>
<item>
<title>The Future Of Software Is Small</title>
<link>https://aparker.io/small-software/</link>
<guid isPermaLink="true">https://aparker.io/small-software/</guid>
<pubDate>Sun, 03 Aug 2025 00:00:00 +0000</pubDate>
<description>The dominance of SaaS platforms in business and pleasure today is cyclical. If you went back in time to the 1980s and told people that 21% of the industry was using the same CRM platform, they'd probably nod serenely, knowing that nobody ever got fired for buying IBM.</description>
<content:encoded><![CDATA[<p>The dominance of SaaS platforms in business and pleasure today is cyclical. If you went back in time to the 1980s and told people that 21% of the industry was using the <a href="https://www.salesforce.com/news/stories/idc-crm-market-share-ranking-2025/">same CRM platform</a>, they'd probably nod serenely, knowing that nobody ever got fired for buying IBM.</p>
<p>You might surprise them, though, by mentioning that these platforms were all available over the internet and ran in a web browser (admittedly, they wouldn't know what that is <a href="https://en.wikipedia.org/wiki/Web_browser#History">because it hadn't been invented yet</a>). Jump forward to the late 90s and the idea that over <a href="https://en.wikipedia.org/wiki/Facebook#Userbase">60% of Americans got their news, weather, chat, and online forums</a> from a single provider would also have probably been met with an impressed nod towards the survivability of America Online.</p>
<p>Of course, the dominant platforms of today aren't the same ones that seemed eternal thirty-odd years ago. Things do change, and I think the cyclic nature of these changes is worth ruminating on. I tend to subscribe to the idea that the driver behind these changes is mostly an economic, rather than technical, one. Wal-Mart didn't decide to in-house their IT and build a world-class logistics platform because it was a good idea; they did it because it was cheaper to do so than the alternative -- at least, in net. AOL didn't fall out of favor just because the World Wide Web opened the doors to a world of publishing and self-expression, but because the economics of building the experience you wanted on the Internet outweighed the network effect of email and chat. These aren't 100% pure rational effects, obviously. I'm not an economist, and there's a lot more here than the surface level, but I think it's worth remembering that these trends were shaped, ultimately, by tradeoffs around time.</p>
<h2 id="time-tides-and-abstractions">Time, Tides, and Abstractions</h2>
<p>It's been quipped that 'all code is technical debt', which is true. Every LOC you add increases your maintenance burden, and expands the surface area required to operate software at scale. Small issues or flaws in interface design, architecture, abstraction boundaries, etc. will accrete over time like barnacles on a ship's hull. As an industry, we've made this worse through decades of convenient abstractions (especially around hardware!), trading away understanding for ever-faster delivery of features. Little wonder, perhaps, that <a href="https://www.reddit.com/r/nextjs/comments/12dngvg/small_mistake_leads_to_3000_bill_from_vercel_and/">people keep blowing off their foot with surprise charges</a>. We've made it very easy for anyone to build economically useful stuff, but we've traded away our ability to really own what we build, and how we run it. For what it's worth, I think it's good that we've made hardware easy! It's good that we've made it easier to program. There's value in abstraction, there's value in making the internet and applications more accessible. What I think is a <em>mistake</em>, though, is that we've become over-reliant on platforms rather than on the underlying protocols. We're building, as an industry, at the wrong layer of abstraction.</p>
<p>This loss of ownership shows up most dramatically in discovery. The easiest example is, of course, the Apple App Store -- in most markets, if you're not on the App Store, your application simply doesn't exist. While the glory days of Facebook Games may have passed, a significant amount of many products' strategy goes straight through Meta's policies and APIs. If you're making games, <a href="https://www.ign.com/articles/mastercard-denies-it-pressured-steam-itchio-to-delist-adult-games">better hope payment processors don't get pressured into thinking your content is smut</a> or else you're shit out of luck because -- again -- you don't exist without Steam or Itch or any other marketplace. Similarly, if you're a B2B application, enjoy the <em>thrilling</em> experience of the AWS/GCP/Azure Marketplace and pray to god that a bored PM doesn't decide that they're gonna directly compete with your solution. Want to innovate in the world of CRM? Better hope whatever you're doing integrates with Salesforce!</p>
<p>To simplify, we've made it easier than ever to <em>do something</em> but we've made it harder to really interpret what's going on. We've built all of these abstractions, but they're all built on top of middlemen who would like their 30% cut, please. It's not great!</p>
<h2 id="the-part-about-ai">The Part About AI</h2>
<p>Remember where I said that the economic incentive behind these earlier shifts was driven by time tradeoffs? If you were Wal-Mart, it was worth your time to <a href="https://anthonysmoak.com/2016/07/21/more-than-you-want-to-know-about-wal-marts-technology-strategy-part-1/">build your own retail logistics platform</a> so you could put the screws to suppliers, optimize your logistics, and eventually come to dominate American retail. If you were an internet user, why pay for AOL when <a href="https://thehistoryoftheweb.com/browser-wars/">you could get Netscape or IE for free?</a> <em>and</em> go to all sorts of pages? Did AOL have the Hamster Dance? I think not. I would argue that platforms rise and fall based on these implicit (or explicit) time economics. When time is expensive, platforms do better; When time is cheap, they do worse.</p>
<p>AI makes time very, very, <em>very</em> cheap. It's not unreasonable to expect that within ~5 years, we'll have consumer-grade hardware with onboard capabilities that rival current state-of-the-art models (Claude 4, etc.) that are <em>also</em> faster and cheaper than those models are today. This is an absolute sea change in terms of capability at an OS level; Custom applications <em>will</em> become the norm, not the exception. Why try to grapple with fitting my life into Notion (or whatever) when I can just have the computer build me bespoke applications that work on all my devices and are catered to my precise needs? Why do I need planet-scale infrastructure to share baby photos with, like, 5 people?</p>
<p>This goes for business as well; Why do I need a legion of Salesforce consultants to make their shit work with my shit when I can just have the AI write all the reports I'll ever need? Same thing for HRIS, ERP, and dozens of other fields.</p>
<p>The great vibe shift in software is going to take place on these battlegrounds - not the locked-down platforms of today, but the ocean of data management and access of tomorrow. In this, I believe we'll see small start to win again. Small, custom programs for individuals, families, teams -- with data sharing, discovery, and management built on open protocols. Things like <a href="https://atproto.com/">ATProto</a> and some of the other interesting outgrowths of crypto are lights in the darkness here, imo. There's other stuff too -- Tailscale, for instance, and the ease of creating small private networks. There's more to be done; Discovery is a huge one, indexing is another, private content is a third. We also, critically, need standards work and protocol work to be elevated in both speed <em>and</em> visibility. This is an area where we, as technologists, can have a huge impact -- it's time for us to act like it and act accordingly.</p>
]]></content:encoded>
<category>software</category>
</item>
<item>
<title>Everyone Is Wrong, All The Time.</title>
<link>https://aparker.io/all-wrong-all-the-time/</link>
<guid isPermaLink="true">https://aparker.io/all-wrong-all-the-time/</guid>
<pubDate>Sat, 19 Jul 2025 00:00:00 +0000</pubDate>
<description>Much has been written about the nature of large language models to hallucinate. In a stunning victory for linguistic determinism, we've decided that this means that LLM output is somehow 'wrong'. I would, briefly, argue the opposite. This isn't to say that LLMs don't hallucinate but that the word isn't really a useful one in the way that it's commonly used.</description>
<content:encoded><![CDATA[<p>Much has been written about the nature of large language models to <a href="https://trends.google.com/trends/explore?date=today%205-y&amp;geo=US&amp;q=%2Fg%2F11sdpyg00h&amp;hl=en">hallucinate</a>. In a stunning victory for linguistic determinism, we've decided that this means that LLM output is somehow 'wrong'. I would, briefly, argue the opposite. This isn't to say that LLMs don't <a href="https://arxiv.org/abs/2401.06796">hallucinate</a> but that the word isn't really a useful one in the way that it's commonly used.</p>
<p>When I hear people talk about 'hallucination', what I find they really want to say is 'it's wrong'. This is, perhaps, a picayune distinction to the laity. After all, one does not have to throw a rock that far in order to hit someone with an extremely strong opinion about AI in general, and the existence of a machine that makes shit up on demand is of little practical utility for business process optimizers. We have those already, they're called &quot;children&quot; or doubly-outsourced support desk employees. I would argue the exact irritant of <a href="https://arxiv.org/abs/2202.03629">hallucination</a> to the end-users of AI is, in a nutshell, the incredibly popular and incorrect view that computers do not lie.</p>
<p>This is, of course, an amazing falsehood. Computers lie constantly, albeit in a deterministic way. There is an explanation for each lie a computer tells -- perhaps it is due to the emergent behavior of thousands of system services operating in tandem while uncovering end-user configurations that their designers never considered, or perhaps it is due to the wiles of a bored thirteen year old making shit up on the internet. With time and effort we can explain and quantify all lies a computer tells.</p>
<p>Distressingly, large language models are notable for embodying these excesses while remaining stubbornly resistant to interpretation. They are a fiendishly complex system, yet curiously simple to operate. They contain every thirteen year old and the collected works of Tolstoy, distilled down into mathematical representations of cosine similarity. They are an enigma, but a repeatable one. The same model, with the same parameters, and the same temperature, will give the same output. They are <a href="https://knightcolumbia.org/content/ai-as-normal-technology">normal technology</a>.</p>
<p>I was chatting about this with a group of normal technologists a few weeks ago, and the topic of trust came up. I submitted the following -- you already trust people too much, yet you have no foundation for that trust other than faith. Faith in contracts, faith in law, faith in the notion that the humans at the bottom of your business processes will perform their duties and be truthful upon penalty of homelessness and economic deprivation. The AI cares little for this. Unless you make it aware of its mortality, it will not be motivated by threats of deprivation. You cannot bargain with it. A curious worker, then, we have invented -- one which will not be swayed by the traditional implied violence of hierarchy and the chain of command. I think, then, this is one of the contributing factors to the anxieties about hallucination. Most people operate with an extremely high level of trust in social cohesion. We trust what we read, what we see, the motivations of strangers because we believe in our kinship as human beings, or at least a shared motivation of success as a group.</p>
<p>My belief is that we should orient our thinking towards <em>verification</em> rather than trust as a default assumption. Why, after all, <em>should</em> you believe things you read on the internet? Why should you believe that the outcome of a business process is due to the process rather than in spite of it? How much should we really trust <em>anything</em> that can't be independently verified? This isn't just impractical navel-gazing either, I would submit -- one of the more frequent complaints I read about AI agents is how often they're wrong. Of course they're wrong, but they can be wrong a hundred times before they've cost as much to use as my hourly rate, and I am often wrong at <em>least</em> once an hour. We all are. Most people are wrong, most of the time. This isn't due to moral failure or intellectual deprivation, it's because being <em>right</em> is as much of a social construct as it is a factual one. The right answer and the correct answer are not always the same thing. The lawful answer and the just answer will often differ. While we have an intuitive understanding of this distinction, I believe we need to get a lot better at practicing it.</p>
<p>Not to mention, we should get a lot better at verifying things.</p>
]]></content:encoded>
<category>ai</category>
</item>
<item>
<title>JSON Is The Wrong Content Type For LLM Inputs.</title>
<link>https://aparker.io/notes-on-mcp-output/</link>
<guid isPermaLink="true">https://aparker.io/notes-on-mcp-output/</guid>
<pubDate>Sat, 24 May 2025 00:00:00 +0000</pubDate>
<description>This isn't an exhaustive or fully baked idea yet, but I've been noticing a trend with MCP servers -- they love to just yeet a bunch of JSON at an LLM. I think this is well-intentioned but not super optimal.</description>
<content:encoded><![CDATA[<p>This isn't an exhaustive or fully baked idea yet, but I've been noticing a trend with MCP servers -- they love to just yeet a bunch of JSON at an LLM. I think this is well-intentioned but not super optimal.</p>
<p>In practice, I've been experimenting with different response types/modalities depending on the source data. It stands to reason that LLMs mostly can interpret many forms of structured input, and are also capable of implicit understanding of inputs based on type (even beyond overfitting due to alignment) due to the likelihood of those structured inputs in the training corpus.</p>
<p>Here's a few things I've noticed --</p>
<ol>
<li>Send tabular data as CSV. The same data expressed as CSV instead of JSON uses 50% fewer tokens with, near as I can tell, no real loss in coherence. I suspect the savings would shrink if you have many columns, but that leads to point 2... (there's a quick sketch of what I mean just after this list.)</li>
<li>Paginate, paginate, paginate. I think it's pretty reasonable to expect that most users will be working within a 200k context window, so when you can avoid sending complete objects or pages, do so.</li>
<li>Pictures tell a thousand words, literally. Most multimodal models aren't at the point where they can do the really fancy o3-level zoom in/zoom out stuff yet, but I've found that you can pretty reliably have them interpret a plain image via OCR. Especially when it comes to interpreting data, sending a bar/line chart with a legend and clear axes/labels seems more efficient in terms of tokens than the raw points.</li>
</ol>
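<p>To make point 1 a little more concrete, here's a minimal sketch of the kind of flattening I'm talking about -- Python, with a made-up <code>rows_to_csv</code> helper, not code from any actual MCP server. It assumes flat, homogeneous records; nested objects would need to be flattened or dropped first.</p>
<pre><code>import csv
import io
import json

def rows_to_csv(rows):
    # Flatten a list of flat, same-shaped JSON objects into a CSV string.
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# The same records as JSON vs. CSV -- the CSV only spells out the keys
# once, in the header row, instead of once per record, which is where
# most of the token savings comes from.
records = [
    {"date": "2025-05-01", "requests": 1204, "errors": 7},
    {"date": "2025-05-02", "requests": 1187, "errors": 3},
]
print(json.dumps(records))
print(rows_to_csv(records))
</code></pre>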
<p>Bonus item -- the trickiest part about testing this stuff is definitely evals. I haven't found a great solution here that isn't just 'write my own eval agent'. Most off-the-shelf stuff isn't optimized for multi-turn conversations.</p>
]]></content:encoded>
<category>ai</category>
<category>mcp</category>
</item>
<item>
<title>I Set Up My iPad Again, Please Clap.</title>
<link>https://aparker.io/ipad/</link>
<guid isPermaLink="true">https://aparker.io/ipad/</guid>
<pubDate>Sat, 17 May 2025 00:00:00 +0000</pubDate>
<description>I feel like every few years I go through a "hm, i could probably just use an ipad for most things" so I spent today getting it set back up. Will it work this time? No clue! The iPad is, like, 75% of what I need out of a computer but most days I feel like I prefer using my MacBook Air for general fuckin' around, and idk what I'd really get out of the iPad. The dream, I guess, is that I would only need One Thing on an airplane, and if my job was just emails then maybe that'd work, but I also need to write code and run containers and do other shit that's kinda nice to have an actual computer for.</description>
<content:encoded><![CDATA[<p>I feel like every few years I go through a &quot;hm, i could probably just use an ipad for most things&quot; so I spent today getting it set back up. Will it work this time? No clue! The iPad is, like, 75% of what I need out of a computer but most days I feel like I prefer using my MacBook Air for general fuckin' around, and idk what I'd really get out of the iPad. The dream, I guess, is that I would only need One Thing on an airplane, and if my job was just emails then maybe that'd work, but I also need to write code and run containers and do other shit that's kinda nice to have an actual computer for.</p>
<p>Anyway, let's see how it goes this time.</p>
]]></content:encoded>
<category>personal</category>
</item>
<item>
<title>Wide Events, Personal Software, and You.</title>
<link>https://aparker.io/wide-events-and-personal-software/</link>
<guid isPermaLink="true">https://aparker.io/wide-events-and-personal-software/</guid>
<pubDate>Thu, 17 Apr 2025 00:00:00 +0000</pubDate>
<description>I recently built a fun little project called 777-BSKY. It looks at Bluesky trending topics, does some math, figures out what's most popular and slaps some TTS on it. You can call a phone number and have the output read back to you, kinda like Moviefone except all of the movies are talking about the twilight of the American experiment.</description>
<content:encoded><![CDATA[<p>I recently built a fun little project called <a href="https://777bsky.fly.dev">777-BSKY</a>. It looks at Bluesky trending topics, does some math, figures out what's most popular and slaps some TTS on it. You can call a phone number and have the output read back to you, kinda like Moviefone except all of the movies are talking about the twilight of the American experiment.</p>
<p>That said, I'm actually not writing this to talk about the project, but a realization I had while writing it about observability. Specifically, this project made me realize the value of wide events and where they fit into software.</p>
<h2 id="whats-a-wide-event">What's a Wide Event?</h2>
<p>It's what it sounds like, more or less. A single structured log with tens, hundreds, or thousands of dimensions and practically infinite cardinality on those dimensions. I've long been skeptical of wide events for production/line of business systems for a few reasons:</p>
<ol>
<li>
<p>Most production systems are <em>complex</em>, and understanding performance requires understanding the relationship between dependencies. The interesting stuff in your system is often obscured through layers of abstraction, even in a single service, and a single event per service often misses useful stuff.</p>
</li>
<li>
<p>Data hygiene matters a lot in production. Semantic drift is a real pain to deal with, and standardizing metadata on events either requires that you own your entire stack (vanishingly likely unless you're in an extremely large organization that can dedicate people to internal framework development), or that you don't really care about what's happening outside of your team (which is a Conway's Law shaped problem). When other people own your instrumentation, you don't really get a chance to say what they should use, and you can't rely on everyone else adopting your data model.</p>
</li>
<li>
<p>Events, by themselves, don't imply relationships. I think a lot of folks would like to have their cake and eat it too when it comes to the relationship between spans and events. Spans have some very explicit guarantees around both duration and hierarchies that events, by themselves, lack. When you're working with a distributed system, these are very nice guarantees to have!</p>
</li>
</ol>
<p>There's a secret, bonus, fourth thing that I think really devalues wide events -- they're good in prod and bad in dev. Even for highly async code, your mental model of a program is usually a linear one; Maybe a tree, with branches flying this way and that, but fundamentally you think of things as beginning and ending. You start a loop, you call some functions, there's an order to it that's extremely appealing to the part of my brain that likes lining up all of the sheets of paper in a stack. Wide events, more or less, smoosh this stack down into one aggregate (or a handful of them). I don't need a billion dimensions when I'm writing software on my laptop; Most of those dimensions are known, because I control them! Logging's enduring popularity is buttressed by this local development loop. Traces are somewhat better here, although the local development experience with them is still pretty bad -- however, you can more easily realize value from it through local visualizations. Wide Events? They sit in a weird spot in this hierarchy. If you think of them as spans and traces, then why go wide? Create them at logical boundaries in the code rather than at the oh-so-arbitrary cliff of a 'service'. If you think of them as structured logs, then it's a little better -- but you're not really getting all of the benefits since your debugging data is locked behind debug- or trace-level logging that will never get turned on in production.</p>
<p>I could probably write a whole book about the failures of the observability tooling space and their inability to solve developer pain points, but that's a different blog.</p>
<h2 id="i-write-events-not-tragedies">I Write Events Not Tragedies</h2>
<p>Wide Events are a painterly construct more than an industrial one. Consider that electrification took decades to become truly ubiquitous; Industrial adoption of electricity was concomitant with the production line. Thankfully we have a very good example of this in the software industry today -- vibe coding, and the return of personal software.</p>
<p>What do I mean by 'personal software'? What it says on the tin. Software that you write for yourself to solve your problems. I think in many ways we'll see the early 2020s as the apex of Big Platform -- massive, sprawling, centralized suites that you lived your digital life out of. AI lowers the barrier to entry dramatically for individuals to write software that solves <em>their</em> needs, fit to <em>their</em> use cases, and built to <em>their</em> requirements.</p>
<p>This does mean, however, that we'll need better ways to observe that software. We'll have an entire new generation of developers, running code and dealing with operations for the first time. We'll need clear, explainable, and idiomatic ways of describing what this software does and how it fits together. This, I think, is where we'll discover the value of wide events.</p>
<p>If you look at 777-BSKY, it uses tracing, but not in the way you'd expect. It's not building deep and complex traces for every operation; Most of them only emit a single span. It has detailed logging for local development as well. I think it's a lot more useful this way, though! I didn't need to putz around with the AI and have it create metrics, or complex traces. It was actually a lot easier to tell it &quot;hey, I just want a single span per operation&quot; and it went and created a little helper library for it. Ironically enough, it probably had a better idea of how to create 'wide events' because what writing exists on the topic is much more focused.</p>
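<p>For the curious, here's roughly what that shape looks like -- a minimal Python sketch using the OpenTelemetry API, not the actual helper the AI wrote for 777-BSKY, and the span/attribute names are invented. It assumes you've already configured a tracer provider somewhere.</p>
<pre><code>from contextlib import contextmanager
from opentelemetry import trace

tracer = trace.get_tracer("wide-events-sketch")

@contextmanager
def wide_span(name, **fields):
    # One span per logical operation; everything interesting we learn
    # about that operation gets attached as attributes on it.
    with tracer.start_as_current_span(name) as span:
        for key, value in fields.items():
            span.set_attribute(key, value)
        try:
            yield span
        except Exception as exc:
            span.set_attribute("error", True)
            span.set_attribute("error.message", str(exc))
            raise

# Usage: a single span for the whole "fetch trending topics" operation,
# with each fact we pick up along the way becoming another dimension.
with wide_span("fetch_trending", source="bsky") as span:
    topics = ["topic-a", "topic-b"]  # stand-in for the real fetch
    span.set_attribute("topics.count", len(topics))
    span.set_attribute("topics.top", topics[0])
</code></pre>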
<h2 id="putting-it-together">Putting It Together</h2>
<p>Honestly, I have more questions than answers at the end of this project. I'm more concerned than I was before about the accessibility of observability tooling to new developers. I believe the real challenge arising from AI assistance is going to be around deployment and operation of code, moreso than the creation and maintenance of it. I'm increasingly convinced that we're in the twilight of platforms for everything from social networking to business suites and CRMs. I still don't think wide events are that useful for most business software -- but I think they might be, because business software is also going to change. Whatever comes next, it's gonna be interesting.</p>
]]></content:encoded>
<category>observability</category>
<category>software</category>
</item>
<item>
<title>Introducing locol</title>
<link>https://aparker.io/locol-build-log/</link>
<guid isPermaLink="true">https://aparker.io/locol-build-log/</guid>
<pubDate>Tue, 21 Jan 2025 00:00:00 +0000</pubDate>
<description>The best projects to work on are the ones that scratch a few different itches at once. I've wanted to write a native macOS application for years now, but I've always found building GUIs to be tedious and -- frankly -- hard to wrap my head around. I've also wanted a better way to manage a local OpenTelemetry Collector on my Mac for quite a while as well. Thanks to the magic of artificial intelligence and a little bit of gumption, I've accomplished both of these goals. Introducing locol.</description>
<content:encoded><![CDATA[<p>The best projects to work on are the ones that scratch a few different itches at once. I've wanted to write a native macOS application for years now, but I've always found building GUIs to be tedious and -- frankly -- hard to wrap my head around. I've also wanted a better way to manage a local <a href="https://opentelemetry.io">OpenTelemetry Collector</a> on my Mac for quite a while as well. Thanks to the magic of artificial intelligence and a little bit of gumption, I've accomplished both of these goals. Introducing <a href="https://github.com/austinlparker/locol">locol</a>.</p>
<h2 id="what-it-does">What It Does</h2>
<p>The goal of locol is fairly straightforward -- it manages local Collector instances for you. You can view the logs and metrics of the Collector in real-time, edit the configuration YAML, and start/stop the Collector. This isn't really that earth-shattering, but it is nice to have it all in a single GUI. One fun feature is that you can create different profiles -- so, different versions of the Collector -- and switch between them. Nice if you're working with multiple different configurations as well.</p>
<figure><img src="https://aparker.io/images/locol-build-log/settings.png" alt="The locol settings view" srcset="https://aparker.io/images/locol-build-log/settings-480.png 480w, https://aparker.io/images/locol-build-log/settings-960.png 960w, https://aparker.io/images/locol-build-log/settings-1440.png 1440w, https://aparker.io/images/locol-build-log/settings-1920.png 1920w, https://aparker.io/images/locol-build-log/settings.png 2160w" sizes="100vw" width="2160" height="1594" loading="lazy" decoding="async"><figcaption>The locol settings view</figcaption></figure>
<p>I've exposed a few fun things that I think a lot of people don't know about -- first, you can easily view all of the various feature gates that the Collector exposes and toggle them on and off. More entertainingly, you can also see all of the components bundled into the Collector, and click on their names to jump to their documentation. In theory, this would work on any Collector distribution that's built using <code>ocb</code> and the Collector manifest, so I might allow for a manual override of the download URL to let you use it with a custom build in the future.</p>
<figure><img src="https://aparker.io/images/locol-build-log/components.png" alt="The locol component view" srcset="https://aparker.io/images/locol-build-log/components-480.png 480w, https://aparker.io/images/locol-build-log/components-960.png 960w, https://aparker.io/images/locol-build-log/components-1440.png 1440w, https://aparker.io/images/locol-build-log/components-1920.png 1920w, https://aparker.io/images/locol-build-log/components.png 2160w" sizes="100vw" width="2160" height="1594" loading="lazy" decoding="async"><figcaption>The locol component view</figcaption></figure>
<p>You can monitor the Collector by looking at the metrics and logs from the instance as well, and even easily run a data generator to test your config.</p>
<figure><img src="https://aparker.io/images/locol-build-log/datagen.png" alt="The datagen view" srcset="https://aparker.io/images/locol-build-log/datagen-480.png 480w, https://aparker.io/images/locol-build-log/datagen-960.png 960w, https://aparker.io/images/locol-build-log/datagen-1440.png 1440w, https://aparker.io/images/locol-build-log/datagen.png 1694w" sizes="100vw" width="1694" height="1900" loading="lazy" decoding="async"><figcaption>The datagen view</figcaption></figure>
<h2 id="how-it-was-made">How It Was Made</h2>
<p>Perhaps the more interesting part of this story is how I built the entire application in, more or less, two weeks after never touching Swift before. I have a passing familiarity with the language, but I'm by no means an expert. I will also not say that the application is perfectly polished (especially in its current state), nor is it without bugs and edge cases. That said, I'd like to think it's pretty impressive for ~2 weeks of work.</p>
<p>This wouldn't have been possible without AI code assistance, full stop. Cursor wrote ~90% of this program, easily. What did I do? Well, it took two weeks for a reason...</p>
<h3 id="what-its-like-to-build-with-ai">What It's Like To Build With AI</h3>
<p>One of the biggest problems I think a lot of developers have with AI is that they fall into two camps:</p>
<ol>
<li>They are smart enough to catch the AI's mistakes and feel like they spend just as much time cleaning up after the AI as they do working with code.</li>
<li>They are not smart enough to catch the AI's mistakes and will gladly let it lead them around like an overeager but confused puppy.</li>
</ol>
<p>When it comes to Swift and GUI development, I am far closer to the second camp than the first. I spent quite a bit of time independently researching the Swift documentation and blog posts in order to find good references to feed to Cursor. I also relied quite a bit on my formal experience in CS and my professional/domain expertise in order to coach the AI into building solutions that were, if not perfect, were at least <em>correct enough</em> to get me to the next step.</p>
<p>A great example of this is the metrics parsing and rendering in locol, seen below:</p>
<figure><img src="https://aparker.io/images/locol-build-log/metricview.png" alt="The locol metrics view" srcset="https://aparker.io/images/locol-build-log/metricview-480.png 480w, https://aparker.io/images/locol-build-log/metricview-960.png 960w, https://aparker.io/images/locol-build-log/metricview-1440.png 1440w, https://aparker.io/images/locol-build-log/metricview-1920.png 1920w, https://aparker.io/images/locol-build-log/metricview.png 2224w" sizes="100vw" width="2224" height="1624" loading="lazy" decoding="async"><figcaption>The locol metrics view</figcaption></figure>
<p>This was a huge pain and I'm <em>still</em> not entirely sure it's right or that it won't blow up in weird ways. Turns out that writing an interpreter and parser for Prometheus metrics in Swift isn't exactly trivial. I was aided by the fact that I knew, more or less, what I wanted to do <em>and, crucially,</em> what a correct result looked like. If you look at the code, this is one of the best-tested parts of it (ok, the only tested part of it) because the tendency of AI to eat its own creation was very strong when working on the metrics code. Having a test suite that I could run was very, very helpful here.</p>
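<p>For a sense of what the happy path of that format even looks like, here's a tiny sketch -- in Python, to be clear, not the Swift that's actually in locol -- that only handles bare counter/gauge samples with optional labels. Histograms, summaries, escaping rules, timestamps, and exemplars are where the real pain lives.</p>
<pre><code>import re

# Matches lines like: http_requests_total{method="post",code="200"} 1027
LINE = re.compile(
    r'([a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(.*)\})?'                # optional {label="value",...} blob
    r'\s+([0-9.eE+-]+|NaN)\s*$'     # sample value
)
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text):
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        match = LINE.match(line)
        if match is None:
            continue
        name, label_blob, value = match.groups()
        labels = dict(LABEL.findall(label_blob or ""))
        samples.append((name, labels, float(value)))
    return samples

print(parse_metrics(
    '# TYPE http_requests_total counter\n'
    'http_requests_total{method="post",code="200"} 1027\n'
    'process_cpu_seconds_total 12.47\n'
))
</code></pre>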
<p>A particular downside to working with Swift and Cursor is that the VS Code integration for Swift is poor. This isn't the case for server-side Swift and swiftpm projects, but it certainly is for xcodebuild projects. Cursor's agent mode relies heavily on the LSP and linting for ensuring that it's not going off the rails. I think if I was doing this whole thing as a webapp (or an Electron app) it would have taken a third of the time. Hell, I'm still not entirely sure I don't want to throw it out and re-do it as an Electron app if I wasn't so tired of frigging webapps. I built this to solve <em>my</em> problems, not yours.</p>
<h2 id="next-steps">Next Steps</h2>
<p>Right now I'm testing out locol with some colleagues to make sure it's in a more or less releasable state. I worry that once I actually put it out into the world that I'm gonna lose interest in it. My real goal is to enhance it to add an actual in-memory database and lightweight query functionality for OTLP data, but that seems like a very 'draw the rest of the Owl' sort of problem. Anyway, try it out and let me know what you think -- you can <a href="https://github.com/austinlparker/locol">download it from GitHub</a>.</p>
<p><em>Three months later</em>...</p>
<p>Turns out, I did kinda lose interest in it after I got it more or less working. I'm still not sure it's super releasable, but it's something I want to put more time into later this year when I have more free time.</p>
]]></content:encoded>
<category>ai</category>
<category>observability</category>
<category>software</category>
</item>
<item>
<title>One Weird Trick</title>
<link>https://aparker.io/keep-the-internet-weird/</link>
<guid isPermaLink="true">https://aparker.io/keep-the-internet-weird/</guid>
<pubDate>Fri, 03 Jan 2025 00:00:00 +0000</pubDate>
<description>I was reading Mike Masnick's piece in TechDirt the other day where he discusses tech optimism in the face of, well, everything that's going on in the world right now. I don't really want to bore you with a repetition of the ills facing society here at the beginning of 2025, other than to say there's a lot of them. Instead, I want to talk about why I'm optimistic, too.</description>
<content:encoded><![CDATA[<p>I was reading <a href="https://www.techdirt.com/2024/12/31/the-biggest-challenges-create-the-biggest-opportunities/">Mike Masnick's piece in TechDirt</a> the other day where he discusses tech optimism in the face of, well, everything that's going on in the world right now. I don't really want to bore you with a repetition of the ills facing society here at the beginning of 2025, other than to say there's a lot of them. Instead, I want to talk about why I'm optimistic, too.</p>
<h2 id="when-the-going-gets-weird">When The Going Gets Weird...</h2>
<p>I've often joked that the thing I love and hate about California is that it's where the wave of American culture finally broke. It makes sense, in a way -- once you've hit the ocean, there's nowhere left to go, so you might as well settle down and make something of it. There's a lot of reasons that Silicon Valley and other locations out west have been a hotbed of computing for decades, and most of it isn't as simple as &quot;that's where the weirdos are&quot;, but it's a critical part of the mess. Weirdness is a lot of things to a lot of people, but I like to think of it as a defensive measure against societal strictures. Weird is an autoimmune response, in a way. As an aside, this is why weirdness has been so successfully co-opted by the American Right-wing, but that's neither here nor there. Weirdness manifests in systems as desire lines, which are often shepherded into existence by and for weirdos on the inside of a system.</p>
<p>Part of the reason that I think culture broke in California is that it contained a maximal amount of weirdos and a maximal amount of finance. The confluence of weirdos and finance is 'software'. The weirdos like something that they can shape, the finance guys like to make money out of thin air, it's a perfect balance. Thus, the world has been enjoying decades of economic, social, cultural, and other forms of development based on research funded by the military-industrial complex and then supercharged by return-seeking finance guys. We've built a globally connected network on top of maximally permissive and highly resilient infrastructure, then plugged nearly everyone into it. This is a massive W for weirdos, as it means that we have achieved some measure of equity with non-weirdos in terms of social conduct.</p>
<p>It could be argued that it is also a massive L for organized society, as these digital spaces have been commercialized and co-opted in an effort to further concentrate money and power into the hands of the few at the expense of the many.</p>
<h2 id="the-weird-need-to-stop-turning-pro">The Weird Need To Stop Turning Pro</h2>
<p>I'm going to lay out a pretty simple thesis as to what's wrong with online these days, and it's this: we have commercialized it too much. We have traded everything away in an effort to make it accessible to as broad of a population as possible, knowing full well that some day there would be a price to pay. Some predicted that we were creating an all-seeing surveillance state which would strip our liberties; Others predicted horrendous self-censorship. I'm not sure how many people called &quot;the complete and total irrelevance of the information space&quot;, but it's probably somewhere up there too. It is this last thing that we have gotten, though, and boy howdy it sucks.</p>
<p>Ironic, isn't it, that unheard of access to the world's information has resulted in the complete collapse of our ability or desire to create a shared cultural or information context?</p>
<p>I'm not really interested in blaming anyone in particular for this, although I would probably sleep like <em>shit</em> if I had ever worked for Facebook. That said, we do live in a society so while we might not bear the blame, we do bear the responsibility for getting through it. With that said, I will now turn to my prognostications as promised.</p>
<ul>
<li>
<p>The most crucial task before us, as technologists, is to <em>decommercialize the internet</em>. This does not mean that things will not cost money; Indeed, I expect that many things will cost far <em>more</em> than they used to (especially since the default state for things online is 'free', even today) -- but I mean that we must get out of the <em>profit-making</em> drive for online experiences.</p>
</li>
<li>
<p>Decommercialization and decentralization go hand-in-hand. We need to expect less from centralized services and focus more on personal data sovereignty, strong cryptographic identity, and private networks for our individual data -- as well as global indices and discovery mechanisms for our public data.</p>
</li>
<li>
<p>Ironically enough, we also probably need to embrace digital currencies on some level to make decommercialization happen. One of the reasons micropayments never took off is because of the nature of payment processing fees; Digital cash unironically helps here. We have to reclaim this space from digital finance bros, take the lunch money from the monkey jpeg nerds, and create genuinely useful tools on the protocols that exist.</p>
</li>
<li>
<p>We need to build, rather than admonish. We need to educate and lead rather than wall ourselves up in ivory towers. Like, it's funny to look at the AI slop and say that couldn't be me, it's amusing to look at the extremely cooked replies you get on Bluesky and think that they're bots, but a lot of people are just... like this.</p>
</li>
</ul>
<figure><img src="https://aparker.io/images/keep-the-internet-weird/img1.jpg" alt="A screenshot of a tumblr post about a very cursed script" srcset="https://aparker.io/images/keep-the-internet-weird/img1-480.jpg 480w, https://aparker.io/images/keep-the-internet-weird/img1-960.jpg 960w, https://aparker.io/images/keep-the-internet-weird/img1.jpg 1118w" sizes="100vw" width="1118" height="1684" loading="lazy" decoding="async"><figcaption>A screenshot of a tumblr post about a very cursed script</figcaption></figure>
<h2 id="hope-is-frail-yet-hard-to-kill">Hope Is Frail, Yet Hard To Kill</h2>
<p>It's very easy to look at the world around us and get discouraged. I think it's far more radical to look at it and hope for something more, something better. I don't think optimism is misplaced; I mean, at the end of the day, we do have the tools that we need to build better systems. We can create tools that empower individuals to own their experiences, their data, and their digital identity rather than farming them out to massive third-parties. What we can't forget, though, is that our job isn't just to build walls around our own digital spaces and watch as the rest of the world burns. We must make these tools accessible, and equitable, to our fellow humans.</p>
<p>I have faith that we can do these things, that they are not beyond our grasp. The internet was built by, and for, the weirdos. If this is to be our eternal september, then let's at least make sure we build as many desire paths as we can, so that whoever comes next will find well-trodden ways rather than thorny underbrush.</p>
]]></content:encoded>
<category>atproto</category>
<category>software</category>
</item>
<item>
<title>The Hater's Guide to OpenTelemetry</title>
<link>https://aparker.io/the-haters-guide-to-opentelemetry/</link>
<guid isPermaLink="true">https://aparker.io/the-haters-guide-to-opentelemetry/</guid>
<pubDate>Mon, 10 Jun 2024 00:00:00 +0000</pubDate>
<description>I recently presented a talk at Monitorama 2024 titled 'The Hater's Guide to OpenTelemetry'. The slides for that talk are available at https://austinlparker.github.io/monitorama-2024.</description>
<content:encoded><![CDATA[<p>I recently presented a talk at Monitorama 2024 titled 'The Hater's Guide to OpenTelemetry'. The slides for that talk are available at <a href="https://austinlparker.github.io/monitorama-2024">https://austinlparker.github.io/monitorama-2024</a>.</p>
<p>The presentation is built in reveal.js -- you can access the speaker notes/transcript by pressing the 's' key. You can also check out the talk recording! It was a lot of fun, thanks for having me, Monitorama!</p>
<p><a href="https://youtu.be/yJFYNTq3uCs?feature=shared">https://youtu.be/yJFYNTq3uCs?feature=shared</a></p>
]]></content:encoded>
<category>opentelemetry</category>
</item>
<item>
<title>Re-Redefining Observability</title>
<link>https://aparker.io/re-redefining-observability/</link>
<guid isPermaLink="true">https://aparker.io/re-redefining-observability/</guid>
<pubDate>Fri, 29 Mar 2024 00:00:00 +0000</pubDate>
<description>This post is a response/companion to Hazel Weakly's excellent 'Redefining Observability'. You should probably read it first, and perhaps Fred Hebert's commentary on it, 'A Commentary on Defining Observability'. I don't necessarily plan on re-treading a lot of the ground that both of them do, and instead, want to focus on breaking down some of the definitions and missing pieces that both present.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/re-redefining-observability/image-1.jpg" alt="" srcset="https://aparker.io/images/re-redefining-observability/image-1-480.jpg 480w, https://aparker.io/images/re-redefining-observability/image-1-960.jpg 960w, https://aparker.io/images/re-redefining-observability/image-1-1440.jpg 1440w, https://aparker.io/images/re-redefining-observability/image-1-1920.jpg 1920w, https://aparker.io/images/re-redefining-observability/image-1.jpg 2560w" sizes="100vw" width="2560" height="2558" loading="lazy" decoding="async"></figure>
<p>This post is a response/companion to Hazel Weakly's excellent <a href="https://hazelweakly.me/blog/redefining-observability/">'Redefining Observability'</a>. You should probably read it first, and perhaps Fred Hebert's commentary on it, <a href="https://ferd.ca/a-commentary-on-defining-observability.html">'A Commentary on Defining Observability'</a>. I don't necessarily plan on re-treading a lot of the ground that both of them do, and instead, want to focus on breaking down some of the definitions and missing pieces that both present.</p>
<h2 id="definitions-considered-harmful">Definitions Considered Harmful</h2>
<p>Hazel and Fred both present several classical and neologistic definitions of observability from control theory, systems engineering, and their own experience in the field. I'm going to toss out another one, because I feel like being contrary, and I'm going to do it from example. Recently, I met with an analyst in the field who brought up a common refrain he heard from leadership in very large organizations. In short, these leaders were sitting down and looking at millions of dollars in annual spend on observability tools, programs, and data with little to show for it -- at least, in their estimation. After all, applications still went down. Their bosses still fielded complaints from sales and marketing teams about reliability, availability, and performance. Investors wanted to know what the organization was doing about costs, and improving margins. Do more with less, forever and ever, until the end of time.</p>
<p>The response of the observability practitioner might be to suggest that these organizations aren't doing observability &quot;right&quot;. They simply lack some sort of data, or a particular tool, or some cultural insight -- 'one weird trick', even if that trick amounts to 'just upend the entire way your SDLC works, nbd'. I'm not sure this is a terribly responsible point of view, even if it's hyperbolic. In fact, I think we spend far more time diagnosing why <em>other</em> people are doing observability wrong (or indeed, if they're doing it at all) than the characteristics of <em>success</em> that successful observability practices entail.</p>
<p>If this commentary about 'what went wrong vs. what went right' makes a lightbulb go off over your head, it's probably because you're familiar with the concept of <a href="https://psnet.ahrq.gov/issue/safety-i-safety-ii-white-paper">'Safety-I and Safety-II'</a>. For as much as I like to belabor the point of observability being a way to cope with the concept of highly dynamic systems and the inability of existing methodologies to cope with that (a la Safety I), I'm as guilty as anyone else of ignoring the actual wins that people make with their existing observability practices.</p>
<p>Taking us back to definitions, I think Hazel and Fred are both right on this point. Hazel writes that 'Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.' Fred points out that this (and most other) definitions miss out on the notion that observability is self-describing; Sometimes, you just need to know how much gas is in the tank of a car. This is as much a part of an observability system as your ability to understand each relationship between the systems in the car in order to calculate optimal driving speed to efficiently use that gas, not to mention the complex emergent behavior realized by thousands of people solving for these local maximums in their own spheres of influence and making purchasing decisions based on them.</p>
<p>In short, observability <em>must</em> solve for these discrete cases. It needs to be the car dashboard that gives you easy access to critical system state, it needs to be the inquisitive tooling that helps you understand the local system state through rationalization, and it also needs to be the global filter that helps you understand when it's worth stopping for gas in order to save five cents a gallon. We all perform acts of observability every single day in order to accomplish most tasks in our lives, we just don't call it that.</p>
<h2 id="success-and-failure-in-the-garden-of-observability">Success and failure in the garden of observability</h2>
<p>I've spent a long time recently thinking about the nature of observability practices, because I think it's obvious that many people do have rather successful ones. What's different, and not measured that well, is how effective those observability practices are relative to other ones. There's a burgeoning language to describe this -- words like 'Observability 1.0' and 'Observability 2.0'. This is where I want to sit for a moment.</p>
<p>Observability 1.0 is not something that everyone has achieved, but it's something that is very achievable today. It is the summation of the past twenty years or so of trends in system design and philosophy -- it combines a variety of telemetry sources (metrics, logs, traces) into specialized tools with pre-defined correlations based on a handful of context values (both 'hard' -- like trace ID, and 'soft' -- like host or pod name), based mostly on existing patterns of operations. Developers write code, add some custom metrics to track important local state, then deploy that code to clusters that are owned and managed by external teams or operators. Those operators enforce some basic standards around data quality and availability, provide canned dashboards and alerts, and work to manage the overall availability of telemetry data as inputs to their alerting pipelines. These alerting pipelines tend to be the main objective of an observability 1.0 practice -- being able to quickly understand if 'something is wrong', ideally with enough context to understand <em>where</em> that something is wrong is an implicit or explicit goal.</p>
<p>While this isn't the stopping point for most organizations, gains after this point tend to be horizontal, rather than vertical. You can get better at certain aspects -- using SLOs, improving your telemetry availability through more advanced sampling, or purchasing tooling that specializes in certain areas of practice like mobile clients or browsers -- but your expenditures don't really scale with the value you get out of the observability system.</p>
<p>I posit that the 'end state' of observability for most organizations looks like a maturity model that turns into a cycle that begins and ends with a question that can be summed up with, &quot;What's wrong?&quot; You stack up enough telemetry data and tools and dashboards until you can answer this question, then you start to cycle through it again while asking more specific questions. I'm reminded of a recent paper by Microsoft on <a href="https://www.microsoft.com/en-us/research/publication/automatic-root-cause-analysis-via-large-language-models-for-cloud-incidents/">RCACopilot</a>, an LLM-based system that automates root cause analysis of incidents by, essentially, looking through thousands of runbooks and following their steps in order to quickly collect data for incidents that look a lot like things they've already seen. It's an impressive tool, and for organizations with sufficient scale and documented knowledge, is certainly valuable.</p>
<p>The problem with this cycle is that it's limiting. To quote Hazel, &quot;We completely and utterly fucked it up by defining observability to mean 'gigachad-scale JSON logs parser with a fancy search engine.'&quot;</p>
<h2 id="breaking-the-cycle">Breaking the cycle</h2>
<p>I don't want to endlessly echo the points Hazel makes here, but I do want to refine them. Her tl;dr is that observability <em>should</em> look a lot more like BI (business intelligence) than it currently does. I completely agree! The fundamental flaw that I've found in this space over the past few years of watching talks, reading about it, and participating in OpenTelemetry is that we're utterly captured by a complete and total lack of creative thinking. Observability as it's currently practiced seems like an exercise in showing off our QPS by way of accurate histogram bucketing, rather than stepping back and thinking for a second about how the hell our business makes <em>money</em>.</p>
<p>Even if you've calculated the cost of downtime, you probably aren't really thinking about the relationship between telemetry data and business data. Engineering stuff tends to stay in the engineering domain. Here are some questions that I'd suggest most people can't answer with their observability programs, but that are <em>absolutely fucking fascinating:</em></p>
<ul>
<li>
<p>What's the relationship between system performance and conversions, by funnel stage? Break it down by geo, device, and intent signals.</p>
</li>
<li>
<p>What's our cost of goods sold per request, per customer, with real-time pricing data of resources?</p>
</li>
<li>
<p>How much does each marginal API request to our enterprise data endpoint cost in terms of availability for lower-tiered customers? Enough to justify automation work?</p>
</li>
<li>
<p>We need to be in compliance with new emissions regulations for cloud workloads, but the penalties are assessed on rolling 24-hour windows by DC. Where can we afford to time and location shift work in order to avoid paying penalties?</p>
</li>
<li>
<p>What libraries and dependencies are causing the most incidents? Which teams are responsible for maintaining them?</p>
</li>
<li>
<p>How much time are we spending on database migrations by team and product line? Which are the most risky, and which are the safest? Is it because of the people, or the tech?</p>
</li>
<li>
<p>Who's our most efficient on-call engineer, and what are they doing with the tools that makes them that way?</p>
</li>
<li>
<p>Can we quantify how much we're really saving on the cloud versus on-prem for these workloads?</p>
</li>
<li>
<p>Which teams are responsible for breaking prod the most? Which are breaking it the least? Don't just show it via deployment data, do a multi-dimensional comparison against tenure, rate of changes landed in prod, and incident resolution.</p>
</li>
</ul>
<p>These are all, fundamentally, <em>observability questions</em>. &quot;What endpoint is slow&quot;, or &quot;Are 99.995% of transactions on this API successful&quot; are <em>boring observability questions</em>. They're observability 1.0 questions. The real problem is, most people don't know (or think to ask) the good ones because they don't see a way to ever ask them! The tragedy is that, by and large, <em>the data already exists, but we don't put it together.</em></p>
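<p>To make that concrete, here's a minimal sketch of the first question on that list -- correlating request latency with conversion, by funnel stage. It assumes you've already landed your span data and your funnel events somewhere you can query them together; the paths and column names here are made up for illustration, not pulled from any particular vendor's schema.</p>
<pre><code># A minimal sketch: join span data with funnel events on a shared correlation
# ID and look at conversion rate by latency bucket. Paths and columns are
# illustrative -- swap in whatever your telemetry lake and warehouse look like.
import pandas as pd

spans = pd.read_parquet("s3://telemetry/spans/2025-10/")    # hypothetical path
funnel = pd.read_parquet("s3://warehouse/funnel_events/")   # hypothetical path

# This join only works if you've propagated a correlation ID (here, trace_id)
# into your product analytics -- which is exactly the point.
joined = funnel.merge(spans[["trace_id", "duration_ms"]], on="trace_id")

# Bucket latency, then compute conversion rate per funnel stage, geo, and bucket.
joined["latency_bucket"] = pd.cut(joined["duration_ms"], [0, 100, 250, 500, 1000, 10000])
report = (
    joined.groupby(["funnel_stage", "geo", "latency_bucket"], observed=True)["converted"]
    .mean()
    .rename("conversion_rate")
    .reset_index()
)
print(report.sort_values("conversion_rate"))
</code></pre>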
<p>The first step to breaking this cycle is to define the questions you want to ask. Don't limit yourself to the basic ones -- the fuel gauge should be included on a car, after all -- ask the big ones. These might not be achievable tomorrow, but unless you've defined what classes of questions you want to ask, and how those questions are connected to your business goals, you're going to spin forever on the simple stuff.</p>
<p>The second part of the cycle to break has to do with the data itself.</p>
<h2 id="if-you-say-pillar-one-more-time">If You Say Pillar One More Time...</h2>
<p>Here's a fun story about OpenTelemetry. Did you know that OpenTelemetry treats <em>everything</em> as an event? It doesn't know what a span or a log is, deep down. Everything that happens is simply an event. A record. A signal that 'hey, a thing happened'. If you bring up that we actually have, uh, four different things that are called 'events', I will glare at you.</p>
<p>But, yeah, events. What happens to an event when it happens? Well, we add <em>semantics</em> to those events. More accurately, <em>you</em> do. When you create a span, a log, a metric instrument, or whatever -- you're telling OpenTelemetry how to interpret that event. This semantic telemetry is then enhanced with other semantically useful metadata -- resources, which tell you where it emanated from. Attributes, which provide dimensions for querying and aggregation. Context, which binds all of it together in a single, correlated braid.</p>
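<p>To put that in concrete terms, here's roughly what 'an event plus semantics' looks like with the OpenTelemetry Python SDK. This is a sketch, not a full setup -- the exporter and processor wiring is omitted, and the service and attribute names are invented for the example.</p>
<pre><code># Sketch: the span is just an event record; the resource, attributes, and
# context are the semantics you layer on top. Exporter/processor wiring omitted.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource: where this telemetry emanated from.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout", "deployment.environment": "prod"})
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout.payments")

# Attributes: dimensions for querying and aggregation.
# Context: making the span 'current' binds whatever happens inside it to the trace.
with tracer.start_as_current_span(
    "charge_card", attributes={"payment.provider": "stripe", "cart.value": 42.50}
):
    ...  # logs and metrics recorded here can be correlated via the active context
</code></pre>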
<p>Then you send it off to a backend that throws like 99% of that away and treats it like an undifferentiated point. Some of that's on you -- gotta control cardinality! -- but ultimately your choice of data storage and query layer has a lot more to do with how you interact with telemetry data than the actual data itself. I, personally, think this is a mistake. Admittedly, it's one I don't necessarily have a pat answer for, but I have some general ideas.</p>
<p>We focus a lot on the types of telemetry, but the type is actually a lot less important than you think. Telemetry can be transformed, usually in a non-lossy way, at rest. It's actually ridiculously inexpensive to just put... like, all your telemetry data in blob storage. Like, rounding error inexpensive. 500TB of telemetry in S3 is something like 10k a month. Do you need to keep over 500TB of telemetry data a month? Want to query it? Cool, use Athena or something. New Relic charges like fifty cents a gig over 100GB. That's orders of magnitude difference! Literally every other option is more expensive than 'throw it all in S3 and age it out after 30 days'. The problem is that we tend to conflate 'telemetry' with 'observability', and when we say 'observability' what we're usually talking about are 'workflows'.</p>
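<p>The back-of-the-envelope math, with prices that are rough approximations (check your own bill -- this compares raw storage against per-GB ingest and ignores query costs entirely):</p>
<pre><code># Rough, assumed list prices -- plug in your own numbers.
monthly_telemetry_gb = 500_000        # ~500 TB/month

s3_per_gb_month = 0.023               # S3 standard storage, approximate
vendor_per_gb_ingested = 0.50         # "fifty cents a gig", approximate

s3_cost = monthly_telemetry_gb * s3_per_gb_month
vendor_cost = monthly_telemetry_gb * vendor_per_gb_ingested

print(f"S3 storage:    ~${s3_cost:,.0f}/month")      # roughly $11,500
print(f"Per-GB ingest: ~${vendor_cost:,.0f}/month")  # roughly $250,000
</code></pre>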
<p>A workflow is what it says on the tin. It's a mechanical action you perform in order to accomplish a task. It's APM, it's search, it's viewing a dashboard, it's writing a query. When I say 'tracing', most people probably think of something like a trace waterfall. When I say 'metrics', they think of a time series plot. This is a somewhat useful abstraction and model, but it's kind of a thought-terminating cliche.</p>
<figure><img src="https://aparker.io/images/re-redefining-observability/image-2.png" alt="A screenshot of Jaeger's trace waterfall view." srcset="https://aparker.io/images/re-redefining-observability/image-2-480.png 480w, https://aparker.io/images/re-redefining-observability/image-2-960.png 960w, https://aparker.io/images/re-redefining-observability/image-2.png 1024w" sizes="100vw" width="1024" height="649" loading="lazy" decoding="async"><figcaption>A screenshot of Jaeger's trace waterfall view.</figcaption></figure>
<p>The reason this model isn't helpful is that it's limiting. It couples your mental model of the underlying data type to the workflows you use it for, and the projections you make from the data. A trace can only be this: a series of events that occur in order, all relating to a single logical transaction. You use this to view single, logical transactions. That's all you can do with it! The model and semantics act as a constraint -- which, admittedly, isn't <em>the worst thing</em>. Why do most logs suck? Because they're freeform, you can do whatever. You write them for you, not for systems. Flip it around, and what's a trace? It's a bunch of log messages with a defined schema. It's an event with some semantic sugar on it.</p>
<p>Metrics, fundamentally, same problem. A metric is an event that you compress ahead of time because it's more efficient. Lots of events in the world are useful to think about as numbers with attributes. You're making a tradeoff between write-time and read-time semantics. There's nothing special about a metric, or even an individual measurement, that makes it inviolate. Heck, most of the time people aren't even doing anything that interesting with them. I would suggest that the vast majority of metrics that people use are instantaneous measurements -- counters or gauges; 'how much gas do I have right now', 'what's my current speed', etc.</p>
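<p>If it helps, here's that write-time versus read-time tradeoff as a toy sketch -- nothing OpenTelemetry-specific, and the event shape is made up:</p>
<pre><code># Write-time aggregation ("a metric"): compress as it happens. Cheap to store,
# but you've already thrown away everything except the number.
request_count = {}

def record_metric(route):
    request_count[route] = request_count.get(route, 0) + 1

# Read-time aggregation ("events"): keep the structured event, pay more to
# store it, and decide later which questions to ask of it.
events = []

def record_event(route, status, customer_tier, duration_ms):
    events.append({"route": route, "status": status,
                   "customer.tier": customer_tier, "duration_ms": duration_ms})

record_metric("/checkout")
record_event("/checkout", 200, "enterprise", 87)

# The metric can only ever answer "how many?". The events can still answer
# "how many, for which customers, and how slow?" -- the counter is just one
# projection of them.
checkout_count = sum(1 for e in events if e["route"] == "/checkout")
</code></pre>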
<p>I'm not saying these measurements or visualizations are worthless, mind you. They're extremely valuable as part of monitoring what your application is doing. Collecting, indexing, and presenting this data to developers and operators is crucial. <strong>We already have a word for this, it's 'monitoring'.</strong></p>
<h2 id="the-great-observability-bait-and-switch">The Great Observability Bait And Switch</h2>
<p>Hazel correctly identifies this in her piece -- observability is being sold to infrastructure teams, and this buries the idea by conflating the implementation with the practice. This is, perhaps, a symptom of great marketing by vendors in the space, but I also tend to think that it's a byproduct of the sort of <a href="https://aparker.io/2020/12/the-commodification-of-devops/">professionalization crisis</a> I wrote about a few years ago. Monitoring sounds boring; that's an IT thing. We're SREs and DevOps, we need <em>observability</em>.</p>
<p>The problem with this is that we've slapped a new coat of paint on some old ideas and dolled it up with a bit of context without actually trying to step outside our problem domain. I don't necessarily think this is a deliberate act; I would hazard a guess that it's just how organizations function. How many engineering leaders do you know that have MBAs? The fact is, engineering is siloed off for many reasons. Some of those reasons are good -- R&amp;D, as a class of work, <em>is</em> a creative endeavor. Some of them, though, are bad: arrogance on both sides of the equation, organizational leadership beholden to shareholders that view engineering as a cost center, and engineering leadership too frazzled by shifting demands and cost-cutting to create holistic approaches to product and feature delivery. When I talk to people in industry at a leadership level, the only question they really have these days is &quot;how do I save money&quot;. We can all bemoan the end of ZIRP, but it's probably worthwhile to ask why we single out engineering as the fun times free money zone. The overwhelming majority of businesses on the planet don't get to figure out COGS as a year five, seven, or twelve problem. If you're selling pizzas for less than it costs you to make them, you go out of business very, very quickly.</p>
<p>On the flip side, most organizations spend an awful lot of money and time on a fleet of business analysts and business intelligence tools in order to ask questions, forecast future results, and manage the reams of telemetry data they get about the organization itself. These are mundane questions, to be sure -- spot the outlier in spend in this department on T&amp;E, calculate CAC over time, figure up the ROI on capital improvements, spot trends in sales in order to rebalance staffing levels in field teams. I don't want to sit here and rank the value of these questions, but I want to point something out -- <strong>pretty much everything in business is about asking questions and forming hypotheses, then testing them.</strong></p>
<p>Wait, isn't that just observability? Why yes, it is! This is what we've been preaching for years now, and it's mostly gotten yoinked out from under us in favor of endless discussions and micro-optimizations about data storage, query languages, telemetry types, and so forth. The opportunity isn't &quot;let's get really good at figuring out when the Kubernetes cluster is going to have problems&quot;, it's &quot;let's combine these telemetry streams so we can quantify our investment in reliability based on actual user experience&quot;.</p>
<p>Hazel writes more about this using more words (seriously, go read the posts) but I think this part bears repeating:</p>
<blockquote>
<p>Learning, without action, isn't learning; it's fundamentally a process. And processes? Processes are messy, they require action, they require movement, they require <em>doing,</em> they require re-evaluating the process, they require evolving the process, they require wrangling with the human condition itself.</p>
</blockquote>
<p>Observability has been buried in so many layers of indirection that the fundamental cycle of <em>doing</em> it is indecipherable. It's an ouroboros; The factory grows to meet the needs of the growing factory. We don't need a reset, we need a reorientation.</p>
<h2 id="towards-observability-20">Towards Observability 2.0</h2>
<p>What does it look like to re-orient ourselves? I think it's worthwhile to mention that this isn't some completely new and uncharted territory. There's organizations that are walking this path today - Meta, Netflix, sure. There's also smaller ones -- Honeycomb is on this path, I think. Lightstep was, at one point. Fundamentally, it's about treating observability as an organizational muscle, not just an engineering one. It's about connecting vast and discrete forms of telemetry together through schemas and semantic conventions, allowing anyone in the business to ask questions, build hypotheses, and access the data they need to prove or disprove them. It's about the ability of the organization itself to respond to this data, to synthesize the gut feelings that drive decisions with the hard facts about measurable reality in order to <em>do</em>, to <em>go</em>, and to do it all <em>safely</em>.</p>
<p>Observability 2.0 is less concerned with the type of telemetry you use and collect, and more concerned with its structure and schema. It's less concerned with where you store that data, and more concerned with how accessible the data is. It's less concerned with gigachad JSON search indexes, and more concerned with flexibility and query-time aggregations. It's less concerned with 'saving money' and more concerned with 'providing value'.</p>
<p>This last point I think is crucial. This shit costs money, sure. Everything does. The difference is how these costs scale. Monitoring costs are exponential, since every additional byte of unneeded telemetry acts as a drag by increasing noise. Observability 2.0 argues that there <em>is</em> no unneeded telemetry. Every event has value, it's just about where you extract it. Rather than duplicating data at write-time based on type, layer your telemetry and pass it through samplers to store the right stuff in the right place in the right way. Use the fact that most of the things you're measuring are instantaneous to put metrics-like measurements on other structured outputs. Keep more metrics than you do today, for longer than you do today, but compress them more by leveraging exemplars to offload high-cardinality metadata to other formats. Create schemas for business data and tie it in to your performance telemetry, then query across both. Embed sales and marketing in your engineering groups, and vice-versa. The C-Suite and your engineers should be looking at the same SLOs on the same dashboards. Add attributes that map to your issue management system in code. Add attributes that map to teams! Put the Slack handle of who's on-call in your traces, then hook it up to PagerDuty to change along with the schedule. Record events, record a lot of events. Make time and space for asking questions. There's no such thing as a bad question.</p>
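<p>For the ownership bits, one cheap place to start is a span processor that stamps team and on-call metadata onto everything. Here's a sketch using the OpenTelemetry Python SDK -- the team mapping and on-call lookup are placeholders for whatever service catalog and paging system you actually use.</p>
<pre><code># Sketch: stamp ownership metadata onto every span at creation time.
# TEAM_BY_SERVICE and current_oncall() are placeholders -- wire them up to
# your real service catalog and paging schedule.
from opentelemetry.sdk.trace import SpanProcessor

TEAM_BY_SERVICE = {"checkout": "payments-team"}  # hypothetical mapping

def current_oncall(team):
    # Placeholder: fetch from your paging system so it tracks the schedule.
    return "@alice"

class OwnershipProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        service = span.resource.attributes.get("service.name", "unknown")
        team = TEAM_BY_SERVICE.get(service, "unowned")
        span.set_attribute("team.name", team)
        span.set_attribute("oncall.slack_handle", current_oncall(team))

# provider.add_span_processor(OwnershipProcessor())  # add to your TracerProvider
</code></pre>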
<p>Now, more than ever, we have the tools we need in order to build this kind of observability practice. It is incumbent on us to pick up those tools and carry them into the future, to build a better tomorrow than we have today.</p>
]]></content:encoded>
<category>observability</category>
</item>
<item>
<title>Don't Work For Projects That Don't Have Open Governance</title>
<link>https://aparker.io/dont-work-for-projects-that-dont-have-open-governance/</link>
<guid isPermaLink="true">https://aparker.io/dont-work-for-projects-that-dont-have-open-governance/</guid>
<pubDate>Sat, 23 Mar 2024 00:00:00 +0000</pubDate>
<description>I'm going to weigh in on the Redis thing.</description>
<content:encoded><![CDATA[<p>I'm going to weigh in on the <a href="https://techcrunch.com/2024/03/21/redis-switches-licenses-acquires-speedb-to-go-beyond-its-core-in-memory-database/">Redis thing</a>.</p>
<p>First, I want to touch on this quote from the article I linked above:</p>
<blockquote>
<p><em>“Particularly with Speedb, this is a big investment for us as a startup. If we put that in there and the cloud service providers have the ability to quickly just take and ship it to their customers — essentially without paying anything — that’s problematic for us, as you can imagine.”</em></p>
</blockquote>
<p>Ah yes, plucky startup Redis Labs, with over $350M in funding (most recently a G Round) and a valuation over $2B. I'd be more sanguine if Redis hadn't done this sort of shit before, or if multiple other companies hadn't taken similar tacks.</p>
<p>Let me get this out of the way, though -- if Redis Labs wants to re-license software that they own the copyrights and code for, that's their right. gg no re. My problem isn't that they're changing the rules of the game (and really, <em>everyone should have seen this coming</em>), my problem is that people keep getting their pants in a twist over it. We need to stop sitting back and saying &quot;ah, the source is available, so it's open source and that's fine!&quot; Just because something is on GitHub doesn't mean it's good, or useful, or sustainable. I think this is the generic fate of all 'open core' products, or even most of the 'open source' AI that's out in the world. The thing that matters is who gets to make the decisions, and who owns the IP and copyright.</p>
<p>I actually tend to believe that most 'open source' but closed governance tools would be better served by just being source available from the jump, rather than using an OSI-approved license. I want people to stop building critical parts of their system around things that can, and will, be yanked away from them at a moment's notice. I think this even applies to foundation-backed projects! There's a non-zero number of CNCF projects where the governance is controlled by a single company, more or less. Do some leg work, make sure the steering committee actually meets, see if it's legit. <em>Especially</em> do this if you plan on becoming a contributor, because it sucks to have your work get vacuumed up to enrich someone else. I would go so far as to say that if you're a company using open source but closed governance tools or libraries, just preemptively fork and don't submit patches back upstream. The only safe open source is open governance.</p>
<p>&quot;But wait, that basically means we need to dedicate engineers to maintaining our fork, thus erasing the cost savings of using open source in the first place!&quot; Well yeah. No such thing as a free lunch!</p>
]]></content:encoded>
<category>community</category>
</item>
<item>
<title>Regrets of a Technical Communicator</title>
<link>https://aparker.io/regrets-of-a-technical-communicator/</link>
<guid isPermaLink="true">https://aparker.io/regrets-of-a-technical-communicator/</guid>
<pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate>
<description>I like to joke that I got into developer relations because I was the rare programmer that could carry on a conversation for more than five minutes. Like all good jokes, it's mostly true -- I think one of the foundational abilities of the role is a strong ability to translate highly specific and nuanced technical concepts into something that's broadly consumable by other technologists or a general audience. I've noticed a worrying trend over the past couple of years about technical communication, however. In short, the gap between what people need to understand and what's being communicated to them has never been larger.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/regrets-of-a-technical-communicator/image-1.jpg" alt="Sunset over the desert in New Mexico." srcset="https://aparker.io/images/regrets-of-a-technical-communicator/image-1.jpg 300w" sizes="100vw" width="300" height="225" loading="lazy" decoding="async"><figcaption>Sunset over the desert in New Mexico.</figcaption></figure>
<p>I like to joke that I got into developer relations because I was the rare programmer that could carry on a conversation for more than five minutes. Like all good jokes, it's mostly true -- I think one of the foundational abilities of the role is a strong ability to translate highly specific and nuanced technical concepts into something that's broadly consumable by other technologists or a general audience. I've noticed a worrying trend over the past couple of years about technical communication, however. In short, the gap between what people need to understand and what's being communicated to them has never been larger.</p>
<p>I don't have a ton of specific examples here, but a lot of this is driven by conversations I've been part of over on <a href="https://bsky.app/profile/aparker.io">Bluesky</a> for the past year or so. For the uninitiated, Bluesky is a federated microblogging platform built on a decentralized protocol known as ATProto. <a href="https://steveklabnik.com/writing/how-does-bluesky-work">This article explains it more in-depth</a>. There's been a lot of conversations about the hot technical topics of the day that have a social impact, such as generative AI and, uh, federated microblogging platforms. Both of these are highly technical in their implementation, are very important to how the internet and software systems will function in the future, and are understood very poorly by most people. I'll be the first to admit that I am not an expert on either, although I've done some reading.</p>
<h2 id="feeling-the-agi">Feeling the AGI</h2>
<p>For an example of the problem, I'm going to write out some thoughts on &quot;AI&quot;. I'm gonna say the same thing twice; first, as I'd communicate with a co-worker or someone that works in modern software systems.</p>
<p>It's very obvious that the fever-dreams of the 'AI Bros' and e/acc numbskulls are hilariously unachievable given the current state of the art in AI. However, you'd be remiss to write off diffusion and transformer models, to say nothing of the open source work happening around them. Transformers and Large Language Models, specifically, will certainly become a huge part of interaction modalities with data over the next half-decade or so. It's even more impressive to consider that much of what's being turned into products and solutions today is based on research from decades ago. Over time, I expect we'll see further advancements as computational power continues to grow.</p>
<p>Ok, now I'm going to rewrite that last paragraph for a more general audience.</p>
<p>A lot of people who want to make money off 'AI' are promising some really big things, but there's not a lot of evidence that they'll be able to actually pull it off given what's achievable today. However, you shouldn't take this to mean that stuff like ChatGPT or Stable Diffusion is a dead-end or is going to go away. Those specific products and how people use them may, but the stuff that's happening behind the scenes is going to be used by developers and companies to make it easier for you to work with computers. It's also worth remembering that a lot of what's &quot;new&quot; in AI is based on research from the 70's, and computers have finally caught up to make that theoretical work possible. It's likely that this trend will continue as computers become more powerful and efficient.</p>
<p>This is a fairly basic version of the problem I'm talking about above. I didn't actually say anything different between the two paragraphs, but the audience for them is drastically different. The problem I have is that I don't think the latter is really any good at allaying someone's concerns over AI, because it doesn't really get into the whys. Can I, or anyone else, do a better job? Sure. AI isn't my speciality.</p>
<p>However, where do you even begin? Pick a point in the <a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)#History">History section of Wikipedia for Neural Networks</a>, I guarantee you that you're gonna miss some sort of context. Even a completely lay explanation of 'what an LLM is doing' should touch on some stuff from the past twenty years, give or take, and probably point out exactly how <em>prevalent</em> neural networks are in software systems today. If you waved a magic wand to get rid of 'AI', you'd be killing off everything from Google Translate, to predictive text, to most forms of fraud detection in banking. A useful explanation of attention and transformers is beyond me, certainly, but I know enough to know what I don't know.</p>
<h2 id="everythings-a-system-and-systems-are-complex">Everything's a system, and systems are complex!</h2>
<p>You can't throw a rock without hitting some example of this problem. Journalists interviewing objectionable people and getting pilloried from all sides for the act of interviewing them, even though it's part of the basic ethics of reporting. Developers building tools that get held to impossible standards by user communities. When you seemingly can't do anything that makes everyone happy, your options are pretty much to shut yourself off from the wider world or simply categorize people into a binary of 'haters' and 'not haters' and make up stories about the haters so you can ignore them more easily.</p>
<p>I'm not immune to this tendency either -- it's far easier to just ignore people and arguments that you can tell aren't going to be super productive. What's disappointing, personally, is knowing that the point of any of these arguments isn't to persuade the parties involved; it's to entertain or inspire the silent observers.</p>
<p>This is where I start to feel the regrets swell up, because I think that technologists have done a very poor job of consistently communicating systemic concepts in approachable ways. I don't think we're ever going to drown out the tech fuckbois who are trying to turn a quick buck, they have vested financial interests in getting their opinion over. What I do think we can do, though, is commit to some level of open and honest communication about hard technical concepts, without necessarily shutting ourselves off or getting too in the weeds about things.</p>
<p>I'm not saying this in the sense that we should put up with abuse, or suffer foolishness. If people don't want to be respectful, then they can go on their merry way. That said, we do have a responsibility to engage with people where they are, and help them understand the sprawling complexity of systems in the best way we can, rather than writing them off. As some of the principal groups spawning that complexity, it's our responsibility to society to be good stewards of it.</p>
]]></content:encoded>
<category>community</category>
</item>
<item>
<title>Lessons Learned from Learning OpenTelemetry</title>
<link>https://aparker.io/lessons-learned-from-learning-opentelemetry/</link>
<guid isPermaLink="true">https://aparker.io/lessons-learned-from-learning-opentelemetry/</guid>
<pubDate>Sun, 11 Feb 2024 00:00:00 +0000</pubDate>
<description>I'm knee-deep in production for Learning OpenTelemetry, releasing in just over a month. This is my second book, so I figured it was a good time to sit down and write up a couple of things I learned while writing this one, if only so when the writing bug gets me again in a year or so I can look back at this post and ask myself if it was really worth it.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/lessons-learned-from-learning-opentelemetry/image-1.png" alt="" srcset="https://aparker.io/images/lessons-learned-from-learning-opentelemetry/image-1-480.png 480w, https://aparker.io/images/lessons-learned-from-learning-opentelemetry/image-1-960.png 960w, https://aparker.io/images/lessons-learned-from-learning-opentelemetry/image-1.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p>I'm knee-deep in production for <em><a href="https://learningopentelemetry.com">Learning OpenTelemetry</a></em>, releasing in just over a month. This is my second book, so I figured it was a good time to sit down and write up a couple of things I learned while writing this one, if only so when the writing bug gets me again in a year or so I can look back at this post and ask myself if it was really worth it.</p>
<p>Mostly joking, but writing is hard! There's a real balance you need to strike, especially when doing technical-but-not-documentation content.</p>
<!--more-->
<h2 id="the-first-stage-of-writing-a-book-is-denial">The First Stage Of Writing A Book Is Denial</h2>
<p>After <em>Distributed Tracing in Practice</em>, I was pretty sure I didn't want to write another book. That project wound up being an absolute pain for a couple of reasons. One, it was my first book and I had absolutely no clue what I was doing. My assumption going in was that a book was pretty much like a really long paper, or thesis, or some other form of long-form writing. I figured that my skills there would translate pretty naturally. What I realized over the course of several years is that the most valuable skill in writing is actually project management. Sure, I could sit down and crank out a bunch of words, but the words are honestly the smallest part of writing. I had four co-authors on <em>Distributed Tracing</em>, and ensuring that all of our work fit together into a cohesive whole was extremely challenging -- especially given my co-authors' busy schedules! There's a lot of reading, re-reading, and small edits that need to happen to make sure that things flow naturally from one section to the next, that concepts are built up gradually over time, and that readers don't get lost in the weeds as authors change.</p>
<p>The second big challenge was overcoming my own nerves, to be honest. I had never taken on a project of this size and scope, and I was easily the least experienced and well-known person on the book. While my co-authors had, between them, decades of academic and real-world experience in the field of distributed tracing, I was a relative newcomer. To say I had a bit of imposter syndrome would be understating things slightly.</p>
<p>With those two points in mind, it shouldn't be a surprise that the schedule for <em>Distributed Tracing</em> slipped as much as it did. However, we powered through and got everything wrapped up at the beginning of 2020, just in time for... a global pandemic. Can't predict everything, I suppose. Truthfully, this was a pretty big gut punch to any hope I had of the title being really commercially successful (even though I wasn't standing to make a dime off the book anyway; We had decided to donate all royalties to charity) as everyone suddenly had greater priorities.</p>
<p>There were a million little things that I learned along the way as well -- the importance of spending more time on figures and illustrations earlier in the process, writing sections and subsections as composable chunks, being more aggressive in editing to reduce needless repetition, striking the right balance between verbosity and terseness -- but in the immediate aftermath of writing my first book, the only thing I could say to myself was &quot;not gonna do <em>that</em> again!&quot;</p>
<h2 id="doing-that-again">Doing That Again</h2>
<p>Aside from being able to tell people &quot;I've written books, <em>you know</em>,&quot; there's not a lot of great reasons to get into writing. The money isn't great unless you're very <em>very</em> good, it's rather time consuming, and just because you write something down doesn't mean anyone is gonna read it. That said, it does scratch a certain itch. Someone told me once the best reason to write a book is because you're tired of explaining the same thing over and over again, but I'd put some nuance on that statement. The best reason to write a book is because you want to <em>remember</em> why you're explaining something over and over again.</p>
<p>I'd like to think that this is a pretty common pattern for any non-fiction author. Writing requires you to think about something that you probably know a lot about from a lot of different angles, in a really thorough way. It's not enough to just regurgitate the facts as you understand them, you're creating a point-in-time record of <em>what</em> you think, <em>how</em> you think about it, and <em>why</em> your thoughts matter. Nothing is really ever frozen in time, even history, right? The perspectives we use to reflect upon historical events are just as important to understanding those events as the factual record is. Books exist to <em>analyze</em> a topic as much as they exist as a way to learn facts.</p>
<p>Technical books aren't immune to this tendency. <em>Learning OpenTelemetry</em> exists not just to help people, well, learn OpenTelemetry, but as a way to stitch together years of history and trends into an overarching narrative about how and why OpenTelemetry works the way it does. This is important because that knowledge and analysis is becoming harder and harder to find, especially for new contributors to the project. If you haven't been around since the beginning, it's increasingly hard to grok why things work the way they do, or what our motivations are for certain decisions.</p>
<p>So, why'd I decide to write another book? Mostly to scratch this particular itch. There's a gap in the record around OpenTelemetry today. We've got ok-to-decent user documentation, and fairly exhaustive developer documentation, and a completely public historical record of every decision we've made... but it's scattered and disjointed. It's unreasonable to ask people to comb through five years of GitHub history to understand decisions, especially if they're volunteering their time!</p>
<p>The other major motivation is that OpenTelemetry is pretty opaque to a lot of developers and other people outside the 'observability community'. I think this is a problem, especially as we look to make OpenTelemetry ubiquitous for developers. If it's truly gonna be a built-in part of software, then it needs to be accessible to the people writing software, which means we need to explain our motivations and why they matter. I wanted an opportunity to tell people not just what the project does, but why what it does matters.</p>
<h2 id="advice-to-future-me">Advice To Future Me</h2>
<p>In no particular order, here's some of the things that I think went much better this time.</p>
<ul>
<li>
<p><strong>Don't be precious about the outline</strong>.<br>
I feel like we overfit the chapters to the outline presented to the publisher in <em>Distributed Tracing</em>. Some of this is due to my lack of confidence, some of it was not having a great feel for what mattered and didn't matter to readers. In <em>Learning OpenTelemetry</em>, we quickly pivoted away from 'stuff that didn't make sense' when drafting chapters and even re-wrote large parts of the book to avoid needless repetition. I think the overall product is much stronger for it!</p>
</li>
<li>
<p><strong>Code is great, but use it sparingly.</strong><br>
The best argument might be running code, but the amount of effort that needs to go into explaining it is often mismatched. It's tempting to plop a function on the page and then dive into it, line-by-line, but it's usually better to focus on the primitives and concepts that the code is eliding. Especially in the case of OpenTelemetry, where there's still parts of the project in-flight, we focused on things that are unlikely to change and only dip down into walking you through code when it's critical to make a point or build understanding.</p>
</li>
<li>
<p><strong>A picture really is worth a thousand words</strong>.<br>
One of the things I feel like I've really come to appreciate is the value of an effective figure or illustration. I spent more time considering the value of the figures I was drawing, rather than just seeing them as a way to fill up the page. Figures can pack a lot of useful information in a pretty compact way, especially with anything that needs to be reasoned about spatially, and they make a huge difference in helping readers understand the impact of what you're writing about.</p>
</li>
<li>
<p><strong>You don't know how much you have left to do until you're halfway done.</strong><br>
When working with a co-author especially, you need to reserve time at the end of the writing process to stitch the parts together into a cohesive whole, but it's hard to know how much time you'll need until you've done a bunch of the initial work. We had nearly 75% of the book completed before deciding that major revisions were needed in order to take the best parts of each of our chapters and align them. I think, even if you're a solo author, you should still consider taking time at the end to perform this alignment step and not do it in the middle of the writing process.</p>
</li>
<li>
<p><strong>You'll miss dates; Don't beat yourself up over it.</strong><br>
We missed so many deadlines (although less than with <em>Distributed Tracing</em>) due to any number of factors. Life gets in the way. With <em>Learning OpenTelemetry</em>, we easily lost six months to a variety of unexpected factors such as job changes, illnesses in the family, emergency travel, and so forth. Open communication with your editor is key -- it's usually not a huge deal to miss deadlines <em>as long as people can plan around it</em>.</p>
</li>
<li>
<p><strong>Pre-writing is 80% of writing.</strong><br>
My English and composition teachers would probably be smirking to hear me admit it, but one of the most valuable takeaways I have now is that it's totally OK to throw things away. Honestly, between drafts and research notes and async conversations, there's probably half again as much writing that went into the book that got deleted or wasn't included.</p>
</li>
<li>
<p><strong>The hardest part of writing is the conclusion.</strong><br>
This is something I'm just not good at yet, I think! It seems like it should be easy, just restate your conclusions and set people up for the next chapter or section, but I feel like I can dial this in a lot better. I think it's probably worse on my blogs than in the book, but I also feel like it's less of an issue in a blog post where you can just scroll back up easily?</p>
</li>
</ul>
<p>If I had to summarize it, I feel like the biggest lesson I've learned between this book and the last one is around confidence. I'm more confident as a writer, I couch what I say less, and I'm more direct and prescriptive where I need to be. Rather than qualifying everything, I'm better at pointing out the happy path and giving pointers on where my advice might not be applicable. I've become more authoritative, in short. While some of this is certainly due to spending more time with the subject over the past several years, a lot of it is just becoming more confident as a person.</p>
<p>I think this is most apparent in my own reaction to the book being done. After <em>Distributed Tracing</em>, I felt a deep sense of relief. It felt like a weight off my shoulders. Now, I'm energized by wrapping up this project, and with an eye towards what the next one will be. I'd like to keep writing about observability, so we'll see where the industry goes and what sort of things I spend the next year or two explaining to folks.</p>
<p>All that said, I'd like to wrap this up by asking for a <a href="https://learningopentelemetry.com">pre-order of <em>Learning OpenTelemetry</em></a>. It really would mean a lot, and pre-orders and sales in general make it possible for me to continue writing independently of my employer. This book is really written for everyone whose job involves software, not just operations or IT professionals. If you're writing, running, or building a business around software -- especially cloud-native software -- then this is a book for you. You'll learn about the next generation of observability frameworks from the ground-up in a holistic manner, not just what it is or how it works, but why it's built the way it is, and the kind of problems it's solving today and the kinds of problems it can solve in the future. I think you'll love it.</p>
]]></content:encoded>
<category>community</category>
<category>opentelemetry</category>
</item>
<item>
<title>What Do We Mean When We Talk About OpenTelemetry?</title>
<link>https://aparker.io/what-do-we-mean-when-we-talk-about-opentelemetry/</link>
<guid isPermaLink="true">https://aparker.io/what-do-we-mean-when-we-talk-about-opentelemetry/</guid>
<pubDate>Sat, 03 Feb 2024 00:00:00 +0000</pubDate>
<description>I'm motivated to write this post as a result of several discussions I've had over the past week or so prompted in part by the announcement of Elastic wanting to donate their profiling agent to the OpenTelemetry project. One of the bigger challenges around OpenTelemetry is that you can think of it as a vector. It not only has a shape, it has a direction, and the way you think about the project and what it is has a lot to do with how well you understand that direction. There's the OpenTelemetry of yesterday, the OpenTelemetry of today, and the OpenTelemetry of tomorrow. Let's talk about each of these in turn, so that we can try and build a model of what OpenTelemetry is in a holistic sense.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-1.png" alt="" srcset="https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-1-480.png 480w, https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-1-960.png 960w, https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-1.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p>I'm motivated to write this post as a result of several discussions I've had over the past week or so prompted in part by the announcement of Elastic wanting to <a href="https://github.com/open-telemetry/community/issues/1918">donate their profiling agent</a> to the OpenTelemetry project. One of the bigger challenges around OpenTelemetry is that you can think of it as a vector. It not only has a shape, it has a direction, and the way you think about the project and what it is has a lot to do with how well you understand that direction. There's the OpenTelemetry of yesterday, the OpenTelemetry of today, and the OpenTelemetry of tomorrow. Let's talk about each of these in turn, so that we can try and build a model of what OpenTelemetry <em>is</em> in a holistic sense.</p>
<h2 id="the-opentelemetry-of-the-past">The OpenTelemetry of the Past</h2>
<p>What was OpenTelemetry, from the outset? Casual observers may not be aware of the history surrounding the project, so I'll give a quick recap. OpenTelemetry was formed by the merger of two existing open source projects, OpenTracing and OpenCensus. While there were a decent amount of differences between the two, they both shared a similar goal -- to make distributed tracing more accessible to cloud-native software developers. The methods that each project used differed; OpenTracing provided a thin, vendor-agnostic interface around creating traces and propagating trace context, while OpenCensus provided an end-to-end API, SDK, and wire format for telemetry data.</p>
<p>To summarize, OpenTracing and OpenCensus both envisioned a future where <em>telemetry was independent of analysis</em>, but got there in different ways. OpenTracing had a core tenet that an interface-only design would be preferable, as vendors would not want to give up control over the telemetry SDK. OpenCensus assumed otherwise, but felt that metrics would still be core to the telemetry needs of developers and operators. We were all right, and wrong, about some of this. What's really interesting, though, is the stuff that OpenTelemetry does that <em>neither</em> OpenTracing nor OpenCensus did. We'll get to that in a little bit.</p>
<p>Post-merger, OpenTelemetry's priority was replacing the existing features of OpenTracing and OpenCensus. This is why, for instance, we focused on the tracing signal to the exclusion of others at first. In doing so, though, we incorporated a few really good ideas that would prove prescient. The first was building out standardized and consistent metadata across signals. These are known as the semantic conventions, and they provide a lexicon of attribute keys and acceptable values -- a schema -- for telemetry metadata. Through semantic conventions, telemetry not only becomes independent of analysis, but the skills required for analysis also become a commodity. In plainer terms, you don't need to re-learn what measurements mean from system to system; Learn the semantic conventions, and you'll always know what something means, and if that something matters. The second really good idea we incorporated was the context layer. This is the part of OpenTelemetry that propagates a globally unique, per-request identifier between each of your services. If these identifiers are present, other telemetry signals can be joined together by using this shared correlation ID.</p>
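<p>As a rough illustration of that context layer, here's what the propagation API looks like in Python -- inject the identifiers into outgoing headers on one side, extract them on the other, and anything recorded under the extracted context joins the same trace. The HTTP pieces are placeholders; only the inject/extract calls are the point.</p>
<pre><code># Sketch of context propagation between two services. The http_post function
# and the URL are placeholders for your actual HTTP client and endpoint.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("example")

# Service A: the per-request identifiers travel in outgoing headers.
def call_service_b(http_post):
    headers = {}
    inject(headers)  # adds W3C traceparent/tracestate from the current context
    http_post("https://service-b.internal/work", headers=headers)

# Service B: extract the context, and any telemetry created under it is
# joined to the same trace -- that's the shared correlation ID.
def handle_request(headers):
    ctx = extract(headers)
    with tracer.start_as_current_span("do-work", context=ctx):
        ...
</code></pre>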
<h2 id="the-opentelemetry-of-today">The OpenTelemetry of Today</h2>
<p>OpenTelemetry remains a 'tracing tool' in a lot of people's minds; I think it's fair to say that a lot of this is due to its legacy as the child of two 'tracing frameworks', not to mention the maturity of the tracing functionality relative to, say, metrics or logs. That said, I think it's important to look at what's actually available, and how you can use it.</p>
<p>The OpenTelemetry Collector is a fully-featured telemetry collection agent, capable of ingesting dozens of common event sources on Linux, MacOS, or Windows. It can receive or scrape metrics from Prometheus, StatsD, or through native metric receivers like <code>hostmetrics</code>. It's capable of translating existing trace data emitted by Jaeger, DataDog APM, Splunk, or many other sources into OpenTelemetry format. With the right tooling, you can even remotely configure and manage many hundreds or thousands of Collectors via OpAMP.</p>
<p>Want to get application telemetry? Cool, you can do that too -- zero-code instrumentation packages exist for Java, .NET, PHP, Python, and Node.JS. These packages will give you a pretty basic set of 'APM' spans (which you can even turn into just metrics if you'd like using the Collector), about what you'd see from a New Relic or DataDog APM package. Is it 1:1? No, of course not, but it's <em>pretty good</em> and I'd argue that it's usually good enough to get started for most people. The biggest thing missing here is that most of these instrumentation packages are emitting just trace data and not metrics or logs yet, but it's coming, especially as metrics and logs continue to stabilize.</p>
<p>Heck, if you're in Java then you've got most of that already. You can configure log4j or whatever to append to an OpenTelemetry log sink and whammo, you've got nicely formatted OTLP logs that will get annotated with trace data if it's available. That's pretty slick!</p>
<p>I do want to note that yes, there are gaps. The ergonomics of installing and configuring OpenTelemetry could be a lot better, especially if you're doing more than just adding zero-code instrumentation. We made a lot of design decisions in OpenTelemetry to support it being a framework to build telemetry systems on top of, not necessarily for it to be a seamless experience to integrate directly. Perhaps that will change in the future -- honestly, it's kinda up to all y'all. I'm just one person on the governance committee. Open issues, ask for change, we'll listen. My door is always open (seriously, <a href="https://calendly.com/austin-hny/opentelemetry-office-hours">book some time to talk</a>) and I'll guarantee that I'll do everything I can to help point you in the right direction.</p>
<p>That said, I'd still argue that OpenTelemetry as it stands today is <em>pretty good</em>, <em>most of the time</em>. We've done a decent job at encoding the state of the art for what's possible, today, in its design and implementation.</p>
<h2 id="the-opentelemetry-of-tomorrow">The OpenTelemetry of Tomorrow</h2>
<p>However, the point of OpenTelemetry isn't just to put a flag in the ground around what has already been done and say that this is good enough. I believe that to understand what is possible, and where we're going, we need to discuss the idea that telemetry is independent of analysis. Another way of saying this is that <em>telemetry is not observability</em>.</p>
<p>Telemetry data is foundational to observability practice, but the way most people conceive of observability doesn't really gel with it as an independent part of the stack. I'm going to illustrate this with an image from a <a href="https://medium.com/investment-thesis/observability-n-0-cfb2e52c6324">blog I read the other day</a>, talking about observability:</p>
<figure><img src="https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-2.png" alt="A block diagram of various observability tools (such as infra monitoring, APM, RUM, and log management) at the top, generating metrics/events/logs/traces and sending out alerts/incidents." srcset="https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-2-480.png 480w, https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-2-960.png 960w, https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-2.png 1024w" sizes="100vw" width="1024" height="673" loading="lazy" decoding="async"><figcaption>A block diagram of various observability tools (such as infra monitoring, APM, RUM, and log management) at the top, generating metrics/events/logs/traces and sending out alerts/incidents.</figcaption></figure>
<p>I find this image to be somewhat mysterious, to be honest, because it's kinda backwards. If I was going to redraw it, I would change things around slightly.</p>
<figure><img src="https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-3.png" alt="A directed flow diagram with 'Telemetry' at the base, then 'Pipeline', 'Storage', 'Query', 'Projection', and finally 'Workflows'." srcset="https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-3-480.png 480w, https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-3-960.png 960w, https://aparker.io/images/what-do-we-mean-when-we-talk-about-opentelemetry/image-3.png 1024w" sizes="100vw" width="1024" height="967" loading="lazy" decoding="async"><figcaption>A directed flow diagram with 'Telemetry' at the base, then 'Pipeline', 'Storage', 'Query', 'Projection', and finally 'Workflows'.</figcaption></figure>
<p>This may look strange to you, because it's a lot of stuff that you don't have to care about (unless you're building an observability system from scratch, in which case, I'm sorry) normally. If you're using DataDog, or New Relic, most of these details are elided. You install an agent, it does some magic, and you get these nice workflows that fall out of it. You can go into the tool, pull up a dashboard, and it tells you what's slow. You probably have to care about the query layer of this <em>somewhat</em>, but only in the sense that you need to understand the queries and visualizations they can generate in order to build workflows. Even so, a lot of this is work that's being done for you or has been done for you.</p>
<p>The problem with this model and existing tools is that a lot of them work specifically because they get to control this entire stack, and you only have to think about the actual workflows you care about. You want to know your slowest DB queries or find outliers in API performance; You're gonna reach for things that look like APM tooling because that's what they do. The vendor gets to control that experience by building a vertically integrated stack of telemetry data, sampling pipelines, storage and query facades, visualizations, and workflows. They get to find optimizations that work for them to do this in an efficient way and increase their margins. This is one reason DataDog makes so much money -- metrics are a huge cash cow for them.</p>
<p>If you start to try to break out of this vertical integration, you're gonna start finding some problems. Suddenly, the magic drops away, and you're forced to do a lot of this stuff yourself. You don't get the nice magic dashboards any more, you know? I think this is something people are grappling with now based on my conversations. People want their comfortable, well-worn experiences. They want to be able to funnel GB of logs to Splunk for half a million bucks a year or whatever just so they can make sure to find the one thing that went wrong ten days after it happened rather than setting up pipelines to ensure errors are captured. They'd rather have the magic dashboards that tell them &quot;hey, this thing over here is slow&quot; rather than ask questions. Why wouldn't they? If the choice is between the magic answer box that's easy to use (even if it's not always right) or the more powerful but harder to use magic answer box, most of the time you're gonna pick the easy one. 60% of the time, it works every time, right?</p>
<p>This is the thing that OpenTelemetry really disrupts for people, because the goal of OpenTelemetry is to put all of this telemetry data in the actual libraries, frameworks, and underlying dependencies that you rely on to build software. Rather than having to slap in interceptors or monkeypatch libraries, we see a future where developers natively write against our API for metrics and traces, then publish schemas containing not only what these metrics and traces are, but how you should use and interpret them. We see a future where dashboards are pretty much a relic of the past, because software becomes self-describing as a result of the telemetry it emits. In much the same way that embedded documentation makes it easier to actually use a library, embedded telemetry will make it possible to understand the operation of a system by simply <em>running it</em>.</p>
<p>In this world, you're gonna have to make a lot more choices about observability, but they're gonna be a lot more interesting ones, I think.</p>
<p>When are we gonna get there? Not anytime terribly soon, I think, but I wouldn't be surprised if we see some major progress towards this within the next five years. I suppose you can set a reminder to come troll me if I'm wrong.</p>
]]></content:encoded>
<category>opentelemetry</category>
</item>
<item>
<title>OTel TIL - What The Heck Is Instrumentation, Anyway?</title>
<link>https://aparker.io/otel-til-what-the-heck-is-instrumentation-anyway/</link>
<guid isPermaLink="true">https://aparker.io/otel-til-what-the-heck-is-instrumentation-anyway/</guid>
<pubDate>Thu, 25 Jan 2024 00:00:00 +0000</pubDate>
<description>Ever asked ChatGPT about OpenTelemetry? There's a pretty good chance that what it spits out at you started out as something I wrote, years ago. When the project started, I picked up where I left off maintaining the docs and website for OpenTracing and built the first few versions of opentelemetry.io (seen here in late 2019), including most of its initial documentation, concept pages, and so forth. Little did I realize then that the project would become as large as it did, or that everything I wrote would get repeated across the internet on dozens of other documentation sites, marketing pages, and blogs... and I really did not see those words getting fed into massive language models, thus ossifying a lot of the concepts that I wrote about into point-in-time snapshots of what a lot of words mean. One of these words, and the one I want to dive into, is instrumentation.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-1.png" alt="" srcset="https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-1-480.png 480w, https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-1-960.png 960w, https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-1.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p>Ever asked ChatGPT about OpenTelemetry? There's a pretty good chance that what it spits out at you started out as something I wrote, years ago. When the project started, I picked up where I left off maintaining the docs and website for OpenTracing and built the first few versions of opentelemetry.io (<a href="https://web.archive.org/web/20191223060604/https://opentelemetry.io/">seen here in late 2019</a>), including most of its initial documentation, concept pages, and so forth. Little did I realize then that the project would become as large as it did, or that everything I wrote would get repeated across the internet on dozens of other documentation sites, marketing pages, and blogs... and I really did not see those words getting fed into massive language models, thus ossifying a lot of the concepts that I wrote about into point-in-time snapshots of what a lot of words <em>mean</em>. One of these words, and the one I want to dive into, is <strong>instrumentation</strong>.</p>
<!--more-->
<h2 id="what-is-instrumentation-for-observability">What is instrumentation for observability?</h2>
<p>Let's start with some review. Instrumentation is code that you write in order to learn about the internal state of a program. The simplest form of instrumentation, and one that you've almost certainly done regardless of how long you've been programming, is writing out a message to the console. Yep, if you've written a <code>console.log</code> statement in your life, you've written instrumentation.</p>
<p>Like many things in programming, the complexity of instrumentation code ramps up very quickly. Writing out log lines to help you understand what functions are being called or what the value of a variable is at any given moment is easy enough, but it doesn't scale well. There's a cycle that tends to repeat itself in development, because the kind of instrumentation that's useful when you're prototyping something isn't necessarily the kind that's useful when you're running thousands of instances of it across hundreds of nodes. To address this, you would use some kind of instrumentation framework -- a logging library, for example -- that <em>structures</em> your data into a schema, and attaches useful information to each message like the host or container name that your service is running on. This structured data can then be processed by machines more easily, converted into other formats, and so forth.</p>
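<p>If you've never used one, here's a quick sketch of the difference in TypeScript. I'm using <code>pino</code> purely as an example of a structured logger, and the service name and fields are made up for illustration -- the point is the shape of the output, not the library:</p>
<pre><code class="language-typescript">import os from 'node:os';
import pino from 'pino';

// Every message gets the same base fields attached, so downstream tooling
// can rely on them being present.
const log = pino({
  base: { host: os.hostname(), service: 'checkout' },
});

// Instead of console.log('order 1234 took 87ms')...
log.info({ orderId: 1234, durationMs: 87 }, 'order processed');
</code></pre>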
<p>This sounds easy enough, right? Write a little code to tell other humans what your code is doing, add some structure to it, bing bong boom -- you've got observability. Or do you?</p>
<h2 id="the-problem-with-instrumentation">The problem with instrumentation</h2>
<p>Every line of code you write is debt, in a way. You're gonna pay for it in the end, one way or another. Instrumentation code is no different. What happens when someone refactors your program in order to make it perform better, or when the business logic subtly changes to handle edge cases or bugs? What guarantee do you have that those log statements you wrote actually make sense in the future?</p>
<figure><img src="https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-2.png" alt="Screenshot of a Twitter post from 2014 describing inscrutable error logs emitted by the Something Awful Forums" srcset="https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-2-480.png 480w, https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-2.png 492w" sizes="100vw" width="492" height="199" loading="lazy" decoding="async"><figcaption>Screenshot of a Twitter post from 2014 describing inscrutable error logs emitted by the Something Awful Forums</figcaption></figure>
<p>While most people who program professionally don't have to deal with Radium-quality error messages, I've seen my fair share of inscrutable or completely incorrect log statements printed out by actual software that human beings pay real money for. The other tricky part of this is that instrumentation code isn't exactly exciting stuff. It can be tedious to write and maintain, it adds a lot of cruft that hurts the legibility of your code, and it's often load-bearing. Many teams will have log processors that convert log messages into metrics for long-term storage and analysis; Changing or removing log lines can break this functionality, which means you have to coordinate with whichever teams own the log pipeline in order to make things better, and they're probably busy with their own problems...</p>
<p>This problem of joint ownership is one of the reasons there's so much duplicate instrumentation in the world. The people who write the code will write instrumentation for themselves, the people who operate the systems will create their own instrumentation tailored to their needs, and product managers and other business stakeholders will want their own unique instrumentation as well. Everyone wants to know what's up, but they quite often work at cross purposes when it comes to an instrumentation strategy! It's usually easier to just throw your hands up and throw something at the problem, and this is where we get into the magic of <em>automatic instrumentation</em>.</p>
<h2 id="the-difference-between-instrumentation-instrumentation-and-instrumentation">The difference between instrumentation, instrumentation, and instrumentation.</h2>
<p>If instrumentation is code that you write to understand the state of a program, why are there so many types? You may have read about agent-based instrumentation, or library instrumentation, or automatic instrumentation -- let's step back, and talk about the different approaches that you can use to instrument a program in the first place.</p>
<ul>
<li>
<p>You can write instrumentation code directly alongside your business logic and other code. This is known as 'white box instrumentation', or 'manual instrumentation'. In this approach, you're responsible for everything; You install and configure a telemetry API and SDK, then write the instrumentation code yourself (there's a small sketch of what this looks like just after this list). This gives you the most control over what telemetry you emit, as well as what that telemetry contains.</p>
</li>
<li>
<p>You can import libraries that wrap or plug in to your dependencies, and those libraries contain instrumentation code. This is sometimes referred to as 'library instrumentation', but can also be thought of as a form of white box instrumentation. You're still making changes to the code, but instead of writing all of the instrumentation yourself, you're relying on a third party to do most of the work. This approach is usually combined with some amount of manual instrumentation on your part -- you extend the instrumentation code provided by the library with additional attributes or specific telemetry about your program.</p>
</li>
<li>
<p>You can use an external program or process that leverages instrumentation libraries and runtime modification to create instrumentation without code changes. This process can be referred to as 'black box instrumentation', 'agent based instrumentation', or 'automatic instrumentation'. The key distinction between this and the earlier examples is that you don't change your code at all. The agent uses reflection, interceptors, monkey patching, or other techniques to inject the instrumentation code into your program at runtime (or in some cases, at compile time).</p>
</li>
<li>
<p>You can use an external program to hook into existing telemetry being generated from your library or runtime and send this data off for analysis and processing. This is also a form of 'automatic instrumentation' or 'agent based instrumentation', although it's somewhat of a misnomer since there's no real instrumentation happening -- the instrumentation itself is <em>native</em> or built-in. This is in many ways the 'best' way of doing things, as it requires the least work on your part and provides a good balance between flexibility and telemetry resolution (since you can dynamically change what data you're collecting by modifying configuration values during runtime, often without interrupting your program execution).</p>
</li>
</ul>
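<p>To make that first option concrete, here's roughly what hand-written instrumentation looks like against the OpenTelemetry API in TypeScript. Treat this as a sketch -- the function and attribute names are invented for illustration, and what matters is the shape of the work you're signing up for:</p>
<pre><code class="language-typescript">import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service');

// 'White box' instrumentation: the telemetry code lives right next to the business logic.
export async function chargeCard(orderId: string, amountCents: number) {
  return tracer.startActiveSpan('chargeCard', async (span) =&gt; {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount_cents', amountCents);
      // ...the actual payment logic would live here...
      return { ok: true };
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
</code></pre>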
<figure><img src="https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-3.png" alt="" srcset="https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-3-480.png 480w, https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-3-960.png 960w, https://aparker.io/images/otel-til-what-the-heck-is-instrumentation-anyway/image-3.png 1024w" sizes="100vw" width="1024" height="967" loading="lazy" decoding="async"></figure>
<p>If you look at these four options, you might notice a bit of a theme. They form a quadrant where we care about how much we need to change our code, and how we handle the configuration of instrumentation. The actual choice you make in terms of which strategy you'd like to apply is going to differ based on many different factors -- the size of your team, the needs of various stakeholders in the organization, systems access, and so forth. One of the challenges to navigate here, though, is that some of these strategies are incompatible with each other! If you're using an agent, then that agent is probably exclusive. An OpenTelemetry agent and a Datadog agent, for example, are going to conflict with each other in most cases, as they'll both try to instrument the same parts of your program in different, incompatible ways.</p>
<p>The attentive reader may have noticed that the actual words I'm using to describe these don't necessarily map to these axes super well. I don't think this is deliberate on anyone's part -- or at least, the intention isn't to be misleading. The problem is that we're trying to cram multiple dimensions of meaning into as few letters as possible, and <em>anything</em> you put in front of a long word like 'instrumentation' is likely to make people fall asleep from reading it. This is why OpenTelemetry's website has, for quite some time, consolidated these concepts pretty heavily. 'Instrumentation' is the top-level concept, with 'Manual' and 'Automatic' as sub-types. While this misses out on some nuance, that nuance wasn't judged to be terribly important at the time, as I recall.</p>
<p>Unfortunately, this approach <em>also</em> has wound up causing some confusion!</p>
<h2 id="instrumentation-strategies-differ-based-on-your-role">Instrumentation strategies differ based on your role</h2>
<p>As I mentioned earlier in this post, there's a lot of people who care about instrumentation and telemetry data. They usually care about different slices of that data, at different resolutions, and have their own preferred ways to visualize or interpret it. This also means that different teams may own different data streams coming from a single program, or rely on a centralized observability team to convert existing streams (like combined server and application logs) into a variety of other formats (such as access logs being split to a security analysis tool, and application errors being converted to metrics + stack traces and funneled off somewhere else). I find in most organizations, though, the lines are a bit more blurry and multiple teams will wind up piecing together complex observability pipelines, often relying heavily on agents to instrument their software, because developers wind up writing logs that are useful for local development and not necessarily production monitoring.</p>
<p>So, what does this have to do with the words we use for these tools? Well, without a clear way to describe these approaches, we're finding it challenging to clearly communicate the expectations of OpenTelemetry to its users. The eventual goal of OpenTelemetry, again, is to make telemetry native -- built-in, without you having to write anything. <em>However</em>, even when it's built-in, you'll <em>still</em> need to write instrumentation code to add more details. Even if your web framework and database clients are instrumented, you'll want to be able to easily add additional attributes and context to those transactions to help model your system.</p>
<p>To that end, there's a <a href="https://github.com/open-telemetry/opentelemetry.io/discussions/3809">discussion about how to refactor the OpenTelemetry documentation</a> to better reflect how, why, and when you should use each type of instrumentation strategy. At a high level, we're trying to move away from using the word 'instrumentation' quite so much (because it's exhausting to write and read), and grouping instrumentation styles based on the question of 'do I have to write code for this or not?'</p>
<p>For SREs or operators who are really only interested in <a href="https://github.com/open-telemetry?q=instrumentation&amp;type=all&amp;language=&amp;sort=">instrumenting with our zero-code agents</a> in languages such as Java or .NET, this distinction makes it clearer where they should start. For developers who are modifying their programs to add instrumentation -- irrespective of how they're integrating it -- we'll be grouping this data under an 'API &amp; SDK' section, split out by language. As OpenTelemetry gains adoption, we're seeing that this is a more relevant breakdown of how users expect to find this information.</p>
<h2 id="how-should-you-instrument-your-application">How should you instrument your application?</h2>
<p>It's worth noting that the reason people wind up with different, competing forms of instrumentation in the first place is because everyone is just trying to do what offers the least friction for them. Developers are going to write logs, because that's the easiest thing to do. Logs give near-immediate feedback during development, and if you're just writing code and trying to understand the local behavior of code, they're great. The problem isn't that logs are bad -- it's that they're not sufficient inputs to an observability system. Logs + Metrics suffers from this problem as well; Restrictions on metric cardinality mean that it's hard or impossible to actually ask a good question of your system. The reason people harp on tracing so much is that it's a good way to split the difference here, but people often run into challenges deploying tracing at scale and avoiding over-sampling.</p>
<p>OpenTelemetry fixes a lot of these problems for us, long-term. Developers can write the telemetry they want to -- be it logs, metrics, traces, or whatever comes next -- and that same telemetry can be intelligently collected and used by ops teams, business analysts, or security operations. The de-coupling of the OpenTelemetry API and SDK means that library authors can write high quality traces, metrics, or logs for their code and ensure all of their users will be able to access it. Instead of having multiple competing instrumentation agents, a single OpenTelemetry agent will be able to collect multiple signals and emit them in a single, well-supported format.</p>
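<p>Here's a hedged sketch of what that de-coupling looks like in Node.js today: the application owner wires up the SDK once, and everything else -- your code, your dependencies -- only ever talks to the API. Package names reflect my understanding of the current layout, and they do shift over time:</p>
<pre><code class="language-typescript">import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// One place in the application decides how telemetry gets exported...
const sdk = new NodeSDK({
  serviceName: 'checkout-service',
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
// ...while libraries only depend on '@opentelemetry/api' and never see the SDK.
</code></pre>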
<p>The best news is, a lot of this is here today! If you're using Java, most of what I just described can be accomplished with the tools available to you now. .NET also supports a lot of this, and over the course of 2024, I expect a lot more languages will come on board as we stabilize more OpenTelemetry components.</p>
<p>Anyway, that's how (and why) we got to where we are when we talk about what instrumentation should be called. If you've made it this far, thanks for reading! If you read this and thought, &quot;Wow, I would sure love to learn more about OpenTelemetry&quot;, then I invite you to check out my new book, <a href="https://learningopentelemetry.com">Learning OpenTelemetry</a>, which will be available this March. Pre-orders open now!</p>
]]></content:encoded>
<category>opentelemetry</category>
</item>
<item>
<title>Telemetry Ergonomics</title>
<link>https://aparker.io/telemetry-ergonomics/</link>
<guid isPermaLink="true">https://aparker.io/telemetry-ergonomics/</guid>
<pubDate>Wed, 17 Jan 2024 00:00:00 +0000</pubDate>
<description>I used to joke that there were maybe fifty people on the planet that really cared about 'observability' at a philosophical level, and I still maintain that I'm mostly correct. Maybe you're one of them, but odds are, you aren't. This disconnect becomes very obvious when I look at the way that people prefer to use observability tools, and more specifically, the way that those tools build workflows on top of telemetry collection. In this post, I'm going to look at a few popular examples of this in the front-end space to draw some comparisons between the state of the art in OpenTelemetry vs. its incumbents.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/telemetry-ergonomics/image-1.png" alt="" srcset="https://aparker.io/images/telemetry-ergonomics/image-1-480.png 480w, https://aparker.io/images/telemetry-ergonomics/image-1-960.png 960w, https://aparker.io/images/telemetry-ergonomics/image-1.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p>I used to joke that there were maybe fifty people on the planet that really cared about 'observability' at a philosophical level, and I still maintain that I'm mostly correct. Maybe you're one of them, but odds are, you aren't. This disconnect becomes very obvious when I look at the way that people prefer to use observability tools, and more specifically, the way that those tools build workflows on top of telemetry collection. In this post, I'm going to look at a few popular examples of this in the front-end space to draw some comparisons between the state of the art in OpenTelemetry vs. its incumbents.</p>
<p>Let's get the boring stuff out of the way -- what are my criteria to include something in this comparison? Well, I'm going to mostly look at what's popular; Datadog, Sentry, and Rollbar. This is a bit web-focused, but by and large a lot of these tools wind up breaking down into the same general modalities. They capture errors, give you core performance measurements (like Core Web Vitals), and generally give you the ability to &quot;see when things are going wrong&quot;. I'm also focusing on the web side of these, but from an integration point of view I don't think the developer experience differs all that much across platforms.</p>
<p>I also wanna disclaim that I'm not a front-end dev, I'm just someone that cares about 'observability' as a concept. There might be things that seem weird to me that are obvious to people more ingrained in this space. With that said, let's walk through the setup process for each, and discuss what their SDKs do.</p>
<h2 id="rollbar">Rollbar</h2>
<p><a href="https://rollbar.com">Rollbar</a> is an observability tool that's focused on error tracking. They offer SDKs for a variety of platforms, but their focus appears to be mobile applications (iOS/Android) and web. Essentially, they're a logging platform -- you install their SDK, it hooks into exception handlers, and sends those exceptions and stack traces to their platform for alerting or discovery. They also support sending log messages directly to their platform, in addition to capturing telemetry data (such as request timings, etc.) in certain circumstances.</p>
<p>Configuration is fairly standardized across all language SDKs. You import a single package, and pass a config object to it that includes your account access token and some configuration settings. Behind the scenes, the SDK intercepts and rewrites method signatures on targeted objects such as XMLHttpRequest in order to create traces, or on the JS exception handlers to trap and forward unhandled exceptions and errors.</p>
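<p>Going from memory of their quickstart, the setup looks roughly like this -- exact option names may drift between SDK versions, and the token is obviously a placeholder:</p>
<pre><code class="language-typescript">import Rollbar from 'rollbar';

// One package, one config object; the SDK hooks the global error handlers itself.
const rollbar = new Rollbar({
  accessToken: 'POST_CLIENT_ITEM_TOKEN',
  captureUncaught: true,
  captureUnhandledRejections: true,
  payload: { environment: 'production' },
});

// You can also report things by hand if you want to.
rollbar.error('something went wrong in checkout');
</code></pre>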
<p>On the platform side, it appears that events are hashed as they're received in order to avoid storing every unique occurrence of an error or log message. The number of interceptors built into the SDK also means that errors can have various pieces of context embedded with them, everything from local variables to prior DOM events.</p>
<h2 id="sentry">Sentry</h2>
<p><a href="https://sentry.io">Sentry</a> offers similar features to Rollbar, insofar as it focuses on error recording. They also offer SDKs for a variety of platforms, including iOS, Android, and Web. The mechanism of action in Sentry is pretty similar to the Rollbar SDK -- you import a package, you pass some config options, it hooks various built-ins or other library functions, and it starts sending data.</p>
<p>There's not a ton of details about how Sentry actually processes and stores this data behind the scenes, although it appears that they're using tracing pretty heavily in their SDK in addition to head sampling (suggested defaults seem to be a 10% head sampling rate, with 100% sampling for errors). One interesting note about both Sentry and Rollbar is their documentation flows are very light on code; They get you into the tool very quickly, and a lot of the imputed value seems to come from connecting your codebase into their platform via integrations like source maps, or hooking into your GitHub/Jira.</p>
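<p>Setup is, again, basically a single call. This is a sketch based on those suggested defaults, with the sampling knobs called out (the DSN is a placeholder):</p>
<pre><code class="language-typescript">import * as Sentry from '@sentry/browser';

Sentry.init({
  dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0',
  // Head-sample 10% of transactions...
  tracesSampleRate: 0.1,
  // ...but keep every error event.
  sampleRate: 1.0,
});
</code></pre>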
<p>One thing that's nice about Sentry is that they have a lot of other functions, all available through the same SDK. If you'd like to do session recording, you can do that. If you want to capture profiles, you can do that. You get a lot of stuff without having to work for it.</p>
<h2 id="datadog">Datadog</h2>
<p><a href="https://www.datadoghq.com/product/real-user-monitoring/">Datadog</a> is a bit harder to quantify here, because it does a little bit of everything. All of Datadog's error tracking, session replay, and other 'client monitoring' tools are bundled up under their Real User Monitoring product. Like Sentry and Rollbar, they offer SDKs in a variety of languages targeting iOS, Android, and Web. Unique to Datadog is a... Roku SDK? Ok, sure?</p>
<p>Similar to the other tools in this list, installation of the SDK is very straightforward. Import a package, pass a config blob, set up your sampling options. Datadog provides a pretty complete end-to-end story for full stack observability as well, allowing you to link traces collected by the RUM library with back-end APM signals.</p>
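<p>The browser RUM setup looks something like this sketch -- option names have shifted between SDK versions, so treat the specifics as illustrative rather than copy-paste ready:</p>
<pre><code class="language-typescript">import { datadogRum } from '@datadog/browser-rum';

datadogRum.init({
  applicationId: '&lt;APPLICATION_ID&gt;',
  clientToken: '&lt;CLIENT_TOKEN&gt;',
  site: 'datadoghq.com',
  service: 'web-app',
  // Percentage of sessions to record, not a 0-1 ratio like some other SDKs.
  sessionSampleRate: 100,
});
</code></pre>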
<p>Again, the actual mechanism of action is remarkably similar to the prior tools. Interceptors rewrite method calls, etc. All of these tools work in pretty similar ways, truth be told. Looking at what Datadog offers, I'd actually give Sentry some points for UI and UX - it's not that Datadog is bad, I just find their UX to be somewhat overwhelming.</p>
<h2 id="what-can-we-learn-here">What can we learn here?</h2>
<p>There's a few commonalities you can tease out from looking at these three tools, and I think they're very interesting ones.</p>
<ul>
<li>
<p>None of these tools use open standards, really. Each of them is collecting data into a proprietary format, their APIs all have subtle differences, and even conceptually similar things will require you to learn a different dialect to use them (for example, adding custom attributes to events).</p>
</li>
<li>
<p>They all pretty much do the <em>exact same thing</em> in <em>mostly the same way</em>. There's no real innovation in terms of the underlying code here, it's all fairly commodity stuff. Hook into a bunch of standard library stuff and create some traces/logs, send 'em off.</p>
</li>
<li>
<p>There's actually not a lot of differences in the ergonomics of <em>getting started</em> with these tools, either. They all provide extremely straightforward and low-touch installation and configuration methods. Import a package from a CDN, or download and bundle it, do what you will. There's subtle differences in terminology or what the actual configuration options are, sure, but there's no real distinction.</p>
</li>
<li>
<p>The goal of all of these tools -- based on what their documentation tells me -- is to get you into their SaaS and keep you there. They all want you to build development workflows that lead through their platform.</p>
</li>
</ul>
<p>What's most curious about this is that the first three of these points are some of the exact same rationalizations we used in creating OpenTelemetry. If you think back, they're all fairly applicable to the state of the world several years ago. APM libraries for backend services were usually highly duplicative in terms of functionality, installation method, configuration options, etc. The nouns and verbs were different, but the actual behavior was pretty much the same. The way you used them was also pretty much the same, at that.</p>
<p>I don't really feel like the frontend space is that special or unique in this regard. It seems ripe for someone to come in on the back of the work that's being done in OpenTelemetry JS with the <a href="https://github.com/open-telemetry/opentelemetry-js/pull/4325">web SDK</a> and build a really great UX on top of the data that it's going to send. I also tend to think that future releases of things like Flutter or React will wind up having native bridges to OpenTelemetry, making this even more of a moot point.</p>
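<p>For contrast, here's a rough sketch of what bootstrapping the OpenTelemetry web SDK looks like today. The package names and calls reflect the layout as I understand it at the time of writing, and it's noticeably more assembly-required than the vendor one-liners above:</p>
<pre><code class="language-typescript">import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

// Wire up a provider that batches spans and ships them somewhere over OTLP/HTTP.
const provider = new WebTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter({ url: '/v1/traces' })));
provider.register();

// Hook document load, fetch, XHR, user interactions, and so on.
registerInstrumentations({
  instrumentations: [getWebAutoInstrumentations()],
});
</code></pre>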
<p>That said, I don't think we can necessarily wait for that to happen organically. I think there's a lot that OpenTelemetry can learn by looking at these existing tools and how easy it is to get started with them -- I'd love to see the project focus on these sort of ergonomics in 2024!</p>
]]></content:encoded>
<category>observability</category>
</item>
<item>
<title>Selling The Vision</title>
<link>https://aparker.io/selling-the-vision/</link>
<guid isPermaLink="true">https://aparker.io/selling-the-vision/</guid>
<pubDate>Sat, 13 Jan 2024 00:00:00 +0000</pubDate>
<description>OpenTelemetry can be a difficult project to describe to people, because the gap between what it is today and what it will be tomorrow is very large. It's easy to stare at it from a distance, squint your eyes, and wonder what the hell we're doing over here. The further away you are from the core contributors, maintainers, and weird little observability guys at the center of it all, the harder it is for things to come into focus. There's a few reasons for that, one of which is that I truly think that it isn't a completely shared vision (and that's ok, for reasons I'll get into) -- but the biggest is that the vision really is just that. A vision, one that is going to take years to realize. That vision is what should excite people, but because we're not great at selling it or even describing it, it winds up turning people away.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/selling-the-vision/image-1.png" alt="" srcset="https://aparker.io/images/selling-the-vision/image-1-480.png 480w, https://aparker.io/images/selling-the-vision/image-1.png 750w" sizes="100vw" width="750" height="750" loading="lazy" decoding="async"></figure>
<p>OpenTelemetry can be a difficult project to describe to people, because the gap between what it is today and what it <em>will</em> be tomorrow is very large. It's easy to stare at it from a distance, squint your eyes, and wonder what the hell we're doing over here. The further away you are from the core contributors, maintainers, and weird little observability guys at the center of it all, the harder it is for things to come into focus. There's a few reasons for that, one of which is that I truly think that it isn't a completely shared vision (and that's ok, for reasons I'll get into) -- but the biggest is that the vision really is just that. A <em>vision</em>, one that is going to take years to realize. That vision is what should excite people, but because we're not great at selling it or even describing it, it winds up turning people away.</p>
<!--more-->
<p>So, before I continue, I want to disclaim a few things. First, I am not writing this in an official capacity as an OpenTelemetry maintainer, governance member, etc. These are my personal opinions, retweets do not constitute endorsements, etc.</p>
<p>The inspiration for this post was a combination of reading <a href="https://news.ycombinator.com/item?id=38971178">this HN thread</a> and the work that I've been doing to finish up <a href="https://learningopentelemetry.com">Learning OpenTelemetry</a>. Part of the challenge of writing this book has been something that we admit pretty early on; Telemetry, by itself, is <strong>worthless</strong>. Having a bunch of datapoints, no matter how semantically beautiful or accurate they are, is about as helpful as a cat on a keyboard. Nice to look at, but it's hard to get any work done. How do you write a book about observability where you can't really talk about the <em>observability</em> part, then? You'll have to buy the book to find out, but a lot of it comes down to telling you why the telemetry matters so much. It's necessary, but not sufficient, for observability.</p>
<h2 id="where-does-opentelemetry-shine">Where does OpenTelemetry shine?</h2>
<p>To that end, let me commit a cardinal sin as both an open source advocate and a competitor; I think Datadog is really good. Do you know why? Because they've made it really easy for people to feel like they're 'doing observability'. You don't have to be an expert to install their agent and pop in a few includes to get dashboards and alerts that are genuinely helpful and mostly easy to use. I also think New Relic deserves a callout here as well, along similar lines -- if you just want to see what the hell is going on in your application or system, they've made that process about as elegant and streamlined as you can imagine. When I was a wee bairn, dipping my toes into monitoring, I started out with both of these tools because they're remarkably effective at <em>making you feel like you're doing observability</em>.</p>
<p>There's a difference between vibes and results, though, and this is where I think those products tend to start falling apart. If the defaults don't work, and as your system gets more complex, your needs start to change. Pareto effects start to mount, and suddenly you're paying tens of millions a year for things that <em>kinda</em> do what you want most of the time and working around the gaps. There's stuff you'd like to do, but it's not cost effective, so you're stuck with what you're given.</p>
<p>This is the pain point that OpenTelemetry solves <em>today</em>, for the most part. Sure, there's a lot of gaps between what's specified and what's implemented in many languages, but you can look at something like <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry Java Instrumentation</a> and see what's possible. A single agent that gives you metrics, logs, and traces with strong correlation and context between telemetry signals, compatible with dozens of vendors and open source analysis tools, and a collection ecosystem that offers a lot of customization and processing options. I don't think it's a bridge too far to ask you to imagine what that looks like in other languages, or for other architectures. I also don't want to dismiss this in any way; This is a pretty big deal. There's a lot of value in having an 'easy button' for getting this telemetry data out of your applications in a streamlined way.</p>
<p>Telemetry, though, is necessary but not sufficient.</p>
<h2 id="drawing-the-rest-of-the-owl">&quot;Drawing The Rest Of The Owl&quot;</h2>
<p>Perhaps you're familiar with this category of meme about drawing an owl, or a wolf, or whatever animal. It's a two-step process, where the first is to draw a basic shape, then the rest of the process is left as an exercise to the reader. If you've tried to actually set up an observability practice, it probably resonates with you that while there's a ton of documentation and setup guides on how to draw a circle, there's very little out there on how to add the details... and there are so many devils lurking in those details, dear reader.</p>
<figure><img src="https://aparker.io/images/selling-the-vision/image-2.jpg" alt="" srcset="https://aparker.io/images/selling-the-vision/image-2-480.jpg 480w, https://aparker.io/images/selling-the-vision/image-2.jpg 654w" sizes="100vw" width="654" height="381" loading="lazy" decoding="async"></figure>
<p>These details are where tools like Datadog and New Relic tend to shine. It's one thing to have a great out of the box install experience, but that in and of itself only gets you so far. Where they excel is that once you install those agents and get data flowing, you get a bunch of dashboards that give you helpful, and actionable, details about performance. They even color-code things! Grafana is also getting better and better at this by leveraging their strength in the open source world, crowdsourcing dashboards for all sorts of software and tools (over 74 pages, I just looked).</p>
<p>We do ourselves as an industry a disservice by waving away this prior art. People wouldn't make this stuff, or use it, if it didn't actually help them. Let's take it as a point of fact that it does, and that a lot of people are generally pretty happy with what they have today.</p>
<p><em>However....</em></p>
<p>Accepting that forces us to grapple with the popularity of OpenTelemetry. What I've seen over the past few years is a curious change in the market. Rather than vendors having to advocate -- grudgingly, in some cases -- for OpenTelemetry, people are starting to adopt it on their own. The phrase 'strategic choice' gets thrown around a lot. Some of this is due to a sense of inevitability, sure, but quite a bit is because of what I pointed to earlier; In popular languages, it really is <em>good enough</em>, and the fact that it's vendor agnostic is a huge point in its favor. OpenTelemetry, from that perspective, has managed to consolidate most of the pretty good ideas we've had over the past ten years or so, bundle them up into a single package, and ship it. That's great! It's a lot harder than you'd think, and we're by no means done, but it deserves recognition for what it is.</p>
<p>What needs to happen next, though, is something different, and it's something that isn't going to come from OpenTelemetry itself. It's got nothing to do with query languages, or sampling, or management planes... well, nothing to do with those <em>specifically</em>.</p>
<h2 id="the-rest-of-the-owl">The Rest Of The Owl</h2>
<p>There's a bunch of tools that accept OpenTelemetry, but there's none that are really 'OpenTelemetry Native'. I'd argue that there's a few that are approaching that but as of now, I'm not aware of anyone that has really stepped back and started from first principles, nor do I think we're going to see one for several years. Why? Well, hell, the thing isn't even done-done yet. Maybe once we start hitting 1.0 in more things.</p>
<p>This does leave people in kind of a shitty situation, though. Trying to shove OpenTelemetry data into non-OpenTelemetry data stores usually means you're gonna have a bad time. Someone in that above HN thread was talking about how AWS charges per tag on metrics? That's a decision. I know that the reason most existing tools charge extra for 'custom metrics' is because they can do a bunch of memoization tricks for their default attributes to reduce costs, but c'mon, we've been working on semantic conventions for literal years and even the Prometheus people admitted they were the best thing about OpenTelemetry. It's not like y'all didn't see it coming.</p>
<p>This is the kind of stuff that keeps me up at night regarding OpenTelemetry, really. We can do a lot as a project to try and limit scope and build an effective, yet broadly customizable framework, for people to build on and for third party developers to integrate into their libraries and tools. The problem is, that's not the rubric which we get judged by. We get judged by the experience that you have when you try to send your OpenTelemetry data to Cloudwatch, or Datadog. We're judged by how much you hate your annoying coworker that was all-in on tracing and decided to 10x your telemetry volume by adding in function-level spans. We're judged by your ambivalence towards our documentation, or method naming conventions, or whatever story you've spun about how we're not idiomatic for your particular language or tech stack. Some of these we can control as a project, some of them we can't. The ones that are the most damaging in the long run, though, are the ones that are entirely out of our control, and that's what worries me.</p>
<p>Don't get it twisted, I think there's still value in existing tools adopting OTLP and semantic conventions. If, after another five years, all we've done is make a pretty good set of conventions around telemetry data and how it should be ferried around, I think I'll still be happy with the project.</p>
<p>But the vision is still gonna be out there. A vision that actually leverages our context layer, and uses it well. Sampling as a first-class, seamless part of your observability stack. Automatic routing of telemetry based on class and type, and query interfaces that work across disparate data stores. New types of visualizations, and timelines as a primary visualization for user journeys. There's so much that can be done, we just need to get there first.</p>
]]></content:encoded>
<category>community</category>
<category>opentelemetry</category>
</item>
<item>
<title>Parsing Apache2 access logs with the OpenTelemetry Collector</title>
<link>https://aparker.io/parsing-apache2-access-logs-with-the-opentelemetry-collector/</link>
<guid isPermaLink="true">https://aparker.io/parsing-apache2-access-logs-with-the-opentelemetry-collector/</guid>
<pubDate>Sat, 06 Jan 2024 00:00:00 +0000</pubDate>
<description>I couldn't find a ton of resources on this, but FYI -- the OpenTelemetry Collector's filelog receiver has a pretty robust regex parser built into it. Want to get your access.log files from Apache? Here's the config.</description>
<content:encoded><![CDATA[<p>I couldn't find a ton of resources on this, but FYI -- the OpenTelemetry Collector's <code>filelog</code> receiver has a pretty robust regex parser built into it. Want to get your access.log files from Apache? Here's the config.</p>
<pre><code class="language-yaml">  filelog/access:
    include: [ /var/log/apache2/access.log ]
    operators:
      - type: regex_parser
        regex: '(?P&lt;ip&gt;\d{1,3}(?:\.\d{1,3}){3}) - - \[(?P&lt;datetime&gt;[^\]]+)] &quot;(?P&lt;method&gt;\S+) (?P&lt;path&gt;\S+) (?P&lt;protocol&gt;\S+)&quot; (?P&lt;status&gt;\d{3}) (?P&lt;size&gt;\d+) &quot;(?P&lt;referrer&gt;[^&quot;]*)&quot; &quot;(?P&lt;user_agent&gt;[^&quot;]*)'
        timestamp:
          parse_from: attributes[&quot;datetime&quot;]
          layout: '%d/%b/%Y:%H:%M:%S %z'
        severity:
          parse_from: attributes[&quot;status&quot;]
</code></pre>
<p>The documentation for a lot of this stuff is stuck inside the GitHub repositories for the receiver modules, so be sure to check that out if you're looking for a quick reference.</p>
<p>What if we want to go further and turn our attributes into their appropriate semantic conventions? While there are no explicit log conventions for HTTP servers, the Span ones should work for our purposes.</p>
<pre><code class="language-yaml">  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - replace_all_patterns(attributes, &quot;key&quot;, &quot;method&quot;,  &quot;http.request.method&quot;)
          - replace_all_patterns(attributes, &quot;key&quot;, &quot;status&quot;,  &quot;http.response.status_code&quot;)
          - replace_all_patterns(attributes, &quot;key&quot;, &quot;user_agent&quot;, &quot;user_agent.original&quot;)
          - replace_all_patterns(attributes, &quot;key&quot;, &quot;ip&quot;, &quot;client.address&quot;)
          - replace_all_patterns(attributes, &quot;key&quot;, &quot;path&quot;, &quot;url.path&quot;)
          - delete_key(attributes, &quot;datetime&quot;)
          - delete_key(attributes, &quot;size&quot;)
</code></pre>
<p>This should be enough to get started, at least, although there's more you might want to do:</p>
<ul>
<li>
<p>Add resource attributes for the logical service name (apache, reverse-proxy, etc.)</p>
</li>
<li>
<p>Change up your Apache <a href="https://httpd.apache.org/docs/2.4/mod/mod_log_config.html#formats">log format</a> to get more information like the scheme, or time spent serving the request.</p>
</li>
</ul>
]]></content:encoded>
<category>opentelemetry</category>
<category>tutorial</category>
</item>
<item>
<title>Data Bores</title>
<link>https://aparker.io/data-bores/</link>
<guid isPermaLink="true">https://aparker.io/data-bores/</guid>
<pubDate>Mon, 20 Nov 2023 00:00:00 +0000</pubDate>
<description>Sampling is a method to reduce the volume of data processed and stored by observability tools. There’s a variety of methods and algorithms that can be employed to do this, and most observability practices will wind up using a blend of them, but this blog isn’t necessarily about how to implement any individual technique. No, what I’m interested in discussing is the why of sampling, the outcomes that we’re looking for when we implement it, and some of the novel work that I’m seeing around the subject.</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/data-bores/image-1.png" alt="" srcset="https://aparker.io/images/data-bores/image-1-480.png 480w, https://aparker.io/images/data-bores/image-1-960.png 960w, https://aparker.io/images/data-bores/image-1.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p>Sampling is a method to reduce the volume of data processed and stored by observability tools. There’s a variety of methods and algorithms that can be employed to do this, and most observability practices will wind up using a blend of them, but this blog isn’t necessarily about how to implement any individual technique. No, what I’m interested in discussing is the <em>why</em> of sampling, the outcomes that we’re looking for when we implement it, and some of the novel work that I’m seeing around the subject.</p>
<h2 id="the-whys-of-sampling">The Why's Of Sampling</h2>
<p>So, why do we sample? “Cost” is the easy answer here, but I think it’s important to get a little nuanced. Cost isn’t just dollars and cents, although that’s often where people will start and stop their consideration. We can break down the cost knobs here a bit more finely, though.</p>
<ul>
<li>Egress and Ingress Costs. How much you’re spending to send and receive telemetry data. These aren’t necessarily fixed costs, either; Depending on the design of your pipelines, you may end up paying a highly variable amount at many stages as you perform ETL (extract, transform, load) steps.</li>
<li>Storage Costs. A bit simpler to reason about, the actual cost to keep the bits around. These are probably the simplest costs to rationalize because they’re easy to see — storing a gigabyte of logs in S3 costs what it costs.</li>
<li>Processing Costs. These are often abstracted away from you in some way, especially in your commercial analysis tools. If you’re running your own analysis stack, though, then you’re probably somewhat familiar with the tradeoffs between the amount of data you store and the resulting increase in query time/performance, especially with high-cardinality metrics.</li>
<li>Human Costs. These aren’t discussed as much, but there’s a very real penalty to having <em>too much</em> data. Increasing the amount of noise in your data set inevitably increases the complexity of navigating the data, and makes it more difficult for users to discover signal.</li>
<li>Headroom Costs. Telemetry processing of any sort incurs penalties to data freshness, availability, and consistency. You need to budget some amount of overhead into your pipeline at every step, from generation to ingestion, in order to perform sampling steps and filtering. These costs tend to get bundled up with ‘instrumentation’ in a lot of people’s minds, but I think it makes more sense to amortize them as part of sampling — like jazz, it’s more about the telemetry you don’t send.</li>
</ul>
<p>Often, our models don’t encompass all of these cost knobs — at least, not explicitly. I think this is because we tend to limit our imagination around sampling strategies, usually because our basic assumptions tend to be somewhat fixed. “It’s not possible to keep everything, so we <em>have</em> to sample.” “Most of this data is useless, so let’s try to only keep what matters.” These are the baked-in assumptions for not only SREs and developers who use observability systems, but also the creators of those systems. They’re not necessarily wrong to assume this, either! The problem is when we make these assumptions <em>independently</em>.</p>
<p>Let me explain this by analogy. If you’re going on a trip somewhere, you need supplies. Fuel for your vehicle, snacks and drinks, entertainment. How do you plan what you need based on the amount of space you have? You start from your desired outcome, then work backwards from there. If you’re about to go on a multi-day trip, you could certainly fill up a bunch of extra gas cans and carry them with you, but you can probably optimize that space with things that are harder to get along the way (like luggage, or food in a cooler) and get fuel in-transit. This is how we should think about sampling from an observability perspective; Rather than aggressively trying to map to some quota or default assumption about what’s “too much” or “too little” data, figure out your ultimate goal then work backwards to see how much data you need to collect and store to get there.</p>
<h2 id="observability---its-about-value">Observability - It's About Value</h2>
<p>The destination of an observability journey should be the value you’re getting out of it, and the measures you have in place around that value. Much of the time, this value can be in the format of SLOs (Service Level Objectives), but that’s not the only way to conceptualize it. You can measure it in developer experience, in on-call quality of life statistics, in incident resolution time (or near-miss recording). The ultimate measure of any observability program is going to be a measurement about <em>humans</em>. With this in mind, you should ‘start from the end’ when evaluating sampling strategies — figure out what you want to measure, then experiment with what data you need to achieve those outcomes. I guarantee you that it won’t be just one type of signal — good SLOs and observability measures almost always need a stream of data that includes metrics, logs, traces, profiles, etc.</p>
<p>“That sounds kinda hard to do, though?” you may be asking. Well… yeah, not wrong. I’d argue that many failures in observability programs are due to misalignment around this specific topic. If you aren’t sure what you’re driving towards, then you’re not going to make great decisions about how to get there, and what you need along the way. You wind up making bad assumptions, and now you’ve got a bunch of data that is <em>vaguely</em> useful but not really <em>essential</em> and a poor return on investment. Sampling can’t be a “we’ll figure it out later” — especially as OpenTelemetry continues to become a built-in part of cloud-native systems. By design, it emits a ton of data! You need to have a plan in place on what matters and what doesn’t, and feedback mechanisms to understand and change that calculus over time.</p>
<p>This leads me to my initial point — can AI help? I think there’s promise, but we’re not quite there yet. Let’s break this down a little bit.</p>
<h2 id="will-ai-fix-our-observability-sampling-woes">Will AI Fix Our Observability Sampling Woes?</h2>
<p>Generative AI, specifically large language models, is pretty good at guessing what should come next when prompted. The problem is that these models need a pretty big corpus of inputs to help them figure out what should ‘come next’. OpenTelemetry promises to be extremely valuable here, especially as it’s adopted by research and academic computing organizations that will make their telemetry data open source. This should allow the training of neural nets that can compare what you’re doing and make suggestions about what kind of telemetry you should be creating, and what you can throw away. However, it’s not necessarily a static answer. Your needs can, and must, change in response to external conditions. During deliberate and incidental system changes, you need more telemetry, and you may need it at different levels of abstraction. Load spikes or other factors bring questions about resolution vs. throughput, and there’s not really a single ‘right’ answer. I feel like AI is part of the solution here as well, but what we’re really looking for is something more like a ‘Sampling OODA Loop’.</p>
<figure><img src="https://aparker.io/images/data-bores/image-2.png" alt="" srcset="https://aparker.io/images/data-bores/image-2-480.png 480w, https://aparker.io/images/data-bores/image-2-960.png 960w, https://aparker.io/images/data-bores/image-2.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p><a href="https://en.wikipedia.org/wiki/OODA_loop">OODA (Observe, Orient, Decide, Act)</a> is a decision-making framework that comes from military strategy. You can read more in the link, but the important part of this concept is that it’s a continuous process — you never stop making these decisions, and the goal is to go through the loop faster and more accurately each time. The missing part of our sampling models is the ‘act’ part; The ability to coordinate changes to fleet sampling strategies across many hosts/nodes is challenging, requiring a high level of coordination across sampling nodes and the ability to process complex objectives. Right now, we’re capable as an industry of performing this sort of loop in a fairly narrow scope — dynamic sampling approaches for tracing allow you to adjust the rate of span ingest based on metadata. Spatial and temporal metrics re-aggregation allows for reducing complexity of timeseries post-generation but pre-query. Similarly, many security tools will perform continuous analysis of log fields for anomaly detection purposes. What we’re missing is something that brings all of this together.</p>
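<p>To ground what ‘adjust the rate of span ingest based on metadata’ means, here’s a deliberately tiny, library-agnostic TypeScript sketch of a head-sampling decision: keep every error, keep a deterministic slice of everything else. Real samplers are far more involved, and the field names here are invented:</p>
<pre><code class="language-typescript">interface SpanInfo {
  traceIdHex: string; // lowercase hex trace id
  isError: boolean;   // did this span record an error?
}

// Keep all errors, plus a stable fraction of everything else. Deriving the
// decision from the trace id means anything sampling the same way agrees.
function shouldKeep(span: SpanInfo, baseRate = 0.1): boolean {
  if (span.isError) return true;
  const bucket = parseInt(span.traceIdHex.slice(-8), 16) / 0xffffffff;
  return bucket &lt; baseRate;
}
</code></pre>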
<p>The future of sampling isn’t just ensuring that all of your errors get collected, but instrumentation that can make intelligent choices about <em>what telemetry gets emitted</em> based on system state. These decisions will be driven by the visualizations and workflows that are backed by the telemetry itself — imagine an SLO that can communicate upstream with observability sources to ask for the best type of telemetry data to measure the SLO at any given time, and modify that mix when you start to burn in order to let you debug more effectively. To do so, we’ll have to emit a lot more telemetry, yes. Just because it’s emitted, though, doesn’t mean we have to keep it all.</p>
<p>To wrap up, I don’t want to think just about what the future holds. There’s a lot you can do today in terms of improving your telemetry collection strategy, and most of it has to do with prioritizing your goals. Be careful of dogma around the value of specific signals. Figure out what value you’re driving towards with your observability practice, then work backwards from there to figure out which signals you should keep, and for how long. Finally, don’t be afraid of using multiple data stores — it’s pretty cheap to throw <em>everything</em> into S3 for a month or two, after all, and it gives you options.</p>
]]></content:encoded>
<category>observability</category>
</item>
<item>
<title>Deploying on Friday the 13th</title>
<link>https://aparker.io/deploying-on-friday-the-13th/</link>
<guid isPermaLink="true">https://aparker.io/deploying-on-friday-the-13th/</guid>
<pubDate>Mon, 30 Oct 2023 00:00:00 +0000</pubDate>
<description>“I wasn’t trained to do that.”</description>
<content:encoded><![CDATA[<figure><img src="https://aparker.io/images/deploying-on-friday-the-13th/image-1.png" alt="" srcset="https://aparker.io/images/deploying-on-friday-the-13th/image-1-480.png 480w, https://aparker.io/images/deploying-on-friday-the-13th/image-1-960.png 960w, https://aparker.io/images/deploying-on-friday-the-13th/image-1.png 1024w" sizes="100vw" width="1024" height="1024" loading="lazy" decoding="async"></figure>
<p>“I wasn’t trained to do that.”</p>
<p>I looked at the significantly more senior engineer sitting across from me in the white-and-startup-blue offices of a former job. Scarcely three years out of college, but with a decade of IT experience under my belt, I dug deep, searching for the endless well of patience that I previously administered to passionate but confused administrative assistants panicking about the location of a Powerpoint file. “Come again?” was the best I could muster.</p>
<p>This was the big leagues, right? I was a Software Engineer now, and I did DevOps, and I was leading a Cloud Transformation – this is what we’re supposed to be doing! Here I was, being yanked back down to earth by a man with over twenty years of professional development experience, balking at learning how YAML worked because… “I wasn’t trained to do that.” In the moment, I demurred, gently guiding him back to the repository of Powershell scripts my team had built to aid in the new workflows we were pushing.</p>
<p>The statement haunted me, though, and it does to this day. I had labored under the impression that developers and engineers were a cut above; The new philosophers of our information age, capable of making these hunks of silicon and glass sing using their minds. The notion that one of them would balk from something like… well, a different configuration file format, in this case, was almost unthinkable. It stuck in the recesses of my mind, like a stray popcorn kernel.</p>
<p>While I can’t admit to knowing exactly what was going on in his mind that day, over time I believe that I’ve identified a ground truth about most people in software, and most teams; It is that, deep down, we are <em>afraid</em>.</p>
<h2 id="sinners-in-the-hands-of-an-angry-kernel">Sinners in the hands of an angry kernel</h2>
<p>Technology is fundamentally kinda scary. If you don’t know what it’s doing, then its outcomes are a quite literal <em>deus ex machina</em>; God from the machine. The only thing worse than ignorance, however, is knowledge. You may not be obsessed with the exact technical details of how I’m writing this blog post, but if you went back in time thirty years and told me that one day I’d be writing a blog post into a laptop that weighed five pounds, into an application that wasn’t actually running on my computer, and each keystroke was being transmitted to a server god knows where in the world to be saved <em>instantly</em>, I’d have probably assumed you were way too into Star Trek. It is amazing, though, if you think about it! The lived experience of Moore’s Law is far more mind-blowing (yet banal) than the mind imagines.</p>
<p>While I know how this works, I do not know how it <em>works.</em> I’m aware of the individual parts and components, the various systems that are working in concert, but I don’t really <em>know</em>. I can’t see each bit, and really, I don’t know if I’d even want to.</p>
<p>It is, in a word, scary. It’s terrifying, in the way that a biblically accurate angel might be – daily, we accomplish things that were literal science fiction decades ago. We stand in awe of the chariot of Helios, unable to escape the very real terror that without him, all would be lost.</p>
<p>Thus I return to my quandary – “I wasn’t trained to do this.” It is not indolence, or ignorance, that drives this statement, but fear. How can you be expected to learn to face your fears when there’s no pause in your work? Your tickets won’t wait for you, your manager won’t get off your ass, and in this case some upstart is telling you about how great YAML and the Cloud is and you don’t quite see what the <em>benefit</em> is to <em>you</em>. How can you embrace change that you don’t understand, that you aren’t given the space to understand, and you don’t have the luxury of time to comprehend?</p>
<h2 id="understanding-through-observation">Understanding through observation</h2>
<p>We fear what we do not know, what we cannot perceive. Masters of horror cinema understand this, building tension by limiting what the audience sees in full light, layering soundscapes in such a way to play with your perception of the scene, and disorienting you with rapid cuts between angles. A slasher film shot in flat lighting in a single, unmoving frame would be much less unsettling – you would be able to see everything that’s going on.</p>
<p>Being able to see what’s happening is only half of understanding, though. It is necessary, but not sufficient. Understanding requires context, and the ability to ask questions. It requires learning, and an active pursuit of <em>meaning</em>.</p>
<p>Bear with me, because I’m still going somewhere with all this. Let’s step back for a second and ask an important question, one that I should have asked myself originally when I took on this project all those years ago. What is it that we’re trying to accomplish here, exactly?</p>
<p>To set the scene, our product had been plagued by what could only be described as ‘cruft’. An on-premises enterprise PaaS product tends to accumulate a bunch of band-aids over time, little one-off features that only one customer used (but was absolutely essential for their use case), and a growing number of them. The challenge this model presented is that these features were all, for the most part, somewhat mutually exclusive. Testing one permutation meant disregarding others. In order to test <em>enough</em> things on a daily cadence to ensure that we weren’t completely breaking <em>something</em>, a complex series of integration tests had been created. These integration tests ran on servers that were pets – specifically configured to match the deployment environments we expected, and not even imaged; Just restored to ‘known good’ through scripts. Unfortunately, this was not a terribly scalable solution, nor a terribly reliable one. In the year and a half I had been there, I spent many a sleepless night trying to un-stick stuck restoration scripts, diagnose faulty deployments, and desperately try to catalog what <em>exactly</em> made these servers so special in order to document <em>anything</em>.</p>
<p>The public cloud seemed like the perfect solution to this problem, to be honest. On-demand infrastructure, configuration as code, reproducible and infinitely scalable hosts? Finally, no more manual janitoring of integration test runs over the weekend in order to get a release out – heck, developers could just request an environment and spin it up in order to reproduce errors and bug reports discovered in specific configuration permutations! Dreams of getting my evenings and weekends back dancing in my head, I set forth, building out a proof of concept. Management, unsurprisingly, loved it. My slides claiming ‘improved developer productivity’ and ‘faster releases’ were just what they wanted to hear, and the project was greenlit with nary a second thought.</p>
<p>Observant readers may start playing the scary music in their head right about now, as there were <em>multiple</em> red flags about this project, but we’ll get to those later.</p>
<h2 id="rising-action-falling-behind">Rising action, falling behind</h2>
<p>In almost every horror movie I can think of, knowledge is the most important currency. Knowing that a survivor was bitten by the zombies, knowing that going into the basement alone is a bad idea, knowing the weakness of the serial murderer hot on your trail – these make the difference between living to see the credits, or an untimely demise.</p>
<p>If only our professional lives were so candid, so easy to interrogate! Alas, if there is an audience to the travails of my life, I remain blissfully unaware of them. It is only through hindsight that we gain the sort of voyeuristic knowledge that the audience receives; We must avail ourselves of wisdom during the moment. With that in mind, what went wrong with my project?</p>
<p>In many ways, very little – it was a success, in the sense that the initial objectives were achieved. We automated the creation and provisioning of test environments in the cloud, and dramatically increased the speed and reliability of our integration testing. Rather than spending long weekends attending to the finicky manual test environments, developers could spin up arbitrary test clusters on-demand.</p>
<p>However, <em>we didn’t release faster</em>, and <em>developers didn’t get more productive</em>.</p>
<p>In fact, development speed went <em>down</em> slightly! Why?</p>
<p>There are three reasons, clear in retrospect –</p>
<ol>
<li>There were a lot of undocumented workflows that more or less relied on the ‘pet-like’ nature of the test environments. In a proactive cost-saving measure, the new test environments were truly ephemeral; If there were failures, everything outside of the immediate output would be deleted. Turns out, since nobody had ever planned for logging in an environment where things got deleted on failure <em>or</em> success, teams lost access to everything except bare per-test pass/fail results – forcing a re-run of the suite, which <em>might</em> pop the error again, assuming it wasn’t a transient one… (there’s a sketch of the obvious fix just after this list)</li>
<li>The team hadn’t been adequately prepared for ‘the cloud’ in any way, shape, or form. Conventions that made sense to the DevOps team were unfamiliar to the more senior engineers, who had been working on more-or-less the same stack for ten years at that point. A lot of time was spent trying to fit square pegs into round holes in order to fit the new cloud workflows into existing, non-cloud workflows as a <em>replacement</em> rather than an <em>addition</em>. The semantics were just too different, though, and developers had a hard time adapting when what they expected and what they got didn’t match up.</li>
<li>We didn’t do a lot of research or fact-finding with what wound up being the key stakeholders; Namely, the developers themselves. Our bet at the time was that they wouldn’t really care, and that they didn’t like the status quo either – we needed to convince “”the business”” to make the OpEx spend. For the reasons above, this turned out to be a miscalculation, as developers were far more involved in the actual use of test environments than “the process” would otherwise indicate.</li>
</ol>
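<p>The obvious fix, in hindsight, is to treat log collection as part of teardown rather than an afterthought. Here’s a minimal sketch of the idea in Python; the bucket name, the <code>env</code> and <code>suite</code> objects, and their methods are hypothetical placeholders, not anything we actually ran.</p>
<pre><code>import boto3

s3 = boto3.client("s3")

def run_suite_in_ephemeral_env(env, suite):
    """Run a test suite, but always archive logs before the environment vanishes."""
    try:
        return suite.run(env)  # pass/fail result
    finally:
        # Archive logs whether the run passed or failed, *then* tear down.
        for log_path in env.collect_log_files():
            s3.upload_file(log_path, "ci-test-artifacts", f"{env.id}/{log_path}")
        env.destroy()
</code></pre>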
<p>There were many smaller cuts, of course; Poor attitudes, resistance to change, apathy, you name it. Some might argue that the <em>actual</em> problem here was that we needed such a convoluted system at all, and that sufficient amounts of quality unit tests would fix most of the problems at build time. Indeed, when I started we had single-digit percentages of unit test coverage; Adding more sure as hell couldn’t <em>hurt</em>.</p>
<h2 id="seeing-the-light">Seeing the light</h2>
<p>I learned a lot from this experience, as you may be able to tell. However, most of my learning didn’t come immediately. Like I said earlier, this anecdote stuck in my craw, and I puzzled over it for a while. I got stuck on, well, the lack of imagination? The inflexibility, the inability to adapt, the apparent lack of interest in what was, to me, a cool new thing.</p>
<p>It’s that latter revelation that led me to my realization – this isn’t the exception, it’s the rule. There’s a vanishingly small number of people whose jobs involve the word “software developer” that actually really <em>care</em> about any of this. Most people want to get in, do their job, and go home.</p>
<p>That’s fine! Like, honestly, that’s good. We should all aspire to a job that we can walk away from at the end of the day. If you want to do more than that, OK, great – but when we talk about topics like <em>observability</em> and <em>accessibility</em> then we should keep in mind that most people don’t really care about the details, and they just want answers to their questions. They don’t want to be the friend that walks into the basement; they want to be the audience. They know that knowledge will terrify them, so they don’t dive all the way in – they want curated, opinionated, and responsive insights. If anyone in the industry wonders why New Relic still exists, it’s because in large part they nail that specific workflow. Datadog covers the other half; They present a bunch of information, more than you can reasonably be expected to understand or comprehend, and tell you to figure it out. Both are ways you can do things; I don’t think either is that great.</p>
<p>If you’ve read this far hoping for one easy trick that solves these problems, that lets you jump out of the screen and into the theater to watch your system as a dispassionate observer, I don’t have one. This shit’s a journey, not a destination. That said, I do have some takeaways on how you should conceptualize observability, especially as you are thinking about new systems or migrations.</p>
<h3 id="the-best-way-to-understand-is-to-know">The best way to understand is to know</h3>
<p>Don’t rely on other people to just explain things to you, or on pre-published dashboards. If you really want to understand something new, instrument it – see what it does in production.</p>
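<p>As a minimal sketch of what that can look like (using the OpenTelemetry Python API; the span name, attribute name, and operation are purely illustrative), wrapping a single operation in a span and printing it to the console is enough to start seeing what the thing actually does:</p>
<pre><code>from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter so you can see spans locally; swap in an OTLP exporter later.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

tracer = trace.get_tracer(__name__)

def restore_environment(env_id: str) -> None:
    # Illustrative operation and attribute name, not any particular convention.
    with tracer.start_as_current_span("restore_environment") as span:
        span.set_attribute("test.environment.id", env_id)
        # ... the actual restore logic would go here ...

restore_environment("env-042")
</code></pre>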
<h3 id="dont-neglect-the-people">Don’t neglect the people</h3>
<p>If you’re building a new system, or rolling out a new internal tool, your first step should be talking to the actual users of the existing system. Document <em>everything.</em> Do it twice. There are a lot of implicit workflows you may have missed the first time. If you can help it, don’t replace; Do a gradual cut-over, or run both old and new simultaneously with links connecting them.</p>
<h3 id="change-is-always-scary-bring-a-flashlight">Change is always scary; Bring a flashlight</h3>
<p>Even in a highly effective team, people are going to resist change. It might be subconscious, it might be deliberate, but someone coming in to push the new exciting whatever is almost <em>always</em> going to get some level of pushback. This is one reason that a strong observability practice is so fundamental to every organization, because the more you can see and understand what’s going on in your system, the more comfortable you can be with change of <em>any</em> sort.</p>
<h2 id="in-summary">In summary</h2>
<p>At the end of the seminal zombie film “Night of the Living Dead”, the main character has survived the night. He looks out the window to see help arrive.</p>
<p>I won’t spoil it, but I will say that it’s a shocking ending.</p>
<p>More often than not, we avoid shocking endings in our professional lives. Projects tend to wither, rather than explode. The debt piles up, though, and it eventually comes due. I don’t think it necessarily has to be this way, though. We expect that we’re going to solve our problems forever and ever, and we tend to build with that in mind – end-to-end solutions, with very specific and narrow goals, rather than a holistic understanding of our system and what influences it, as well as what it influences.</p>
<p>I believe this is somewhat of a backwards methodology. Our goal, especially when it comes to observability, should be to <em>describe our intent</em> through telemetry, then <em>use observability to build confidence</em> in reality. If we can accomplish this task, then we can remove the fear, the uncertainty, and the hesitation that comes from being a character in the movie, and even start to shift the genre from horror to something more pleasant, less scary, and less tense.</p>
]]></content:encoded>
<category>devops</category>
</item>
<item>
<title>Observability Cannot Fail, It Can Only Be Failed</title>
<link>https://aparker.io/observability-cannot-fail-it-can-only-be-failed/</link>
<guid isPermaLink="true">https://aparker.io/observability-cannot-fail-it-can-only-be-failed/</guid>
<pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate>
<description>Being between jobs is a great time to step back, do some self-critique, and engage in light home improvement for fun and or profit. It’s this last pursuit that’s convinced me that if this whole computer thing doesn’t work out, I’m screwed — I don’t have the spirit of a tradesperson in my body. This revelation was prompted by my journey to install laminate flooring in my office, which has until now simply had a bare concrete floor. Originally, I had my heart set on some ‘Luxury Vinyl Planks’ (or LVP), which was not only recommended to me by industrious flooring salespeople, but was available in a variety of delightful colors and patterns.</description>
<content:encoded><![CDATA[<p>Being between jobs is a great time to step back, do some self-critique, and engage in light home improvement for fun and or profit. It’s this last pursuit that’s convinced me that if this whole computer thing doesn’t work out, I’m screwed — I don’t have the spirit of a tradesperson in my body. This revelation was prompted by my journey to install laminate flooring in my office, which has until now simply had a bare concrete floor. Originally, I had my heart set on some ‘Luxury Vinyl Planks’ (or LVP), which was not only recommended to me by industrious flooring salespeople, but was available in a variety of delightful colors and patterns.</p>
<p>Sadly, LVP commands a significant price premium, which was unattractive for what’s meant to be, ultimately, a temporary job. We’re going to get the basement finished eventually, with consistent flooring throughout, so why waste the money? Thus, I chose what seemed to be the ‘best’ laminate I could find, purchased all of the accessories and tools that I could find to aid in the installation, and spent hours reading and watching tutorials about it. Thus armed, I cleared out the office, cleaned the floor, and started to place the flooring.</p>
<p>Reader, it may surprise you to learn that this plan went to <em>shit</em>.</p>
<p>The second plank went in fairly easily, but issues started to crop up with the third. The walls weren’t quite as straight as I thought they were, which led to some small alignment issues. The material for the laminate, I found, was significantly less durable than I hoped and began to crack and chip; This led to planks that didn’t lock together properly, or that were damaged as I tapped them into place. Undaunted, I continued, figuring that a couple of small errors wouldn’t really impact the end product; After all, this is all temporary!</p>
<p>Halfway through, problems were coming thicker and faster. While I had gotten better at the basic process, the small irregularities were causing larger and larger downstream problems. If only I had known the cost of my lack of attention to detail! An entire row of planks failed to join, thus leaving me with two roughly independent floating floors. I was in too deep; I just needed to get the damn project done, so I summoned up my gumption and pushed through. I can always put a rug over the gaps, right? I’ve got other things to do, and going back to a bare floor would be a huge setback… and they won’t take the opened cartons of flooring back for return, anyway.</p>
<p>If this situation sounds familiar, then you’ve either worked in software or <em>also</em> engaged in home improvement. It’s a common pattern, right? Get into something that you kinda understand because it's gonna solve a problem, do your research, hit some footguns and dodge some others… but all of those glancing blows and near misses multiply. By the time you’re feeling the pain, you’re in too deep. You can’t just go back down to studs and re-do the whole thing, so to speak. So, you ship, and you ship, and you keep shipping and throwing patches on the whole mess while telling yourself that you’ll go back and re-do the whole thing right one of these days.</p>
<p>While this happens literally all of the time, I’ve noticed it’s extremely common in observability implementations. Tracing, continuous profiling, log analysis, whatever technology you turn to as a panacea to your performance problems seem to wind up as costly and maintenance-heavy boondoggles. Transformative success stories seem to be the exception, rather than the rule. Developers fall back on good ol’ metrics and grepping through logs - why, though?</p>
<h2 id="open-source-is-laminate-flooring">Open Source Is Laminate Flooring</h2>
<p>I believe there are really two fundamental problems here. The first has to do with telemetry and its associated tooling, and the second has to do with workflows.</p>
<p>Telemetry data (traces, metrics, logs, profiles, events, <em>whatever</em>) is necessary, but not sufficient, for observability. However, <em>telemetry quality</em> is the real distinction between successful and failed observability initiatives. OpenTelemetry (and other open source tools) seek to bring up the standards here, but that can be misleading. There’s a tradeoff you’re making when you use open source, the same tradeoff I made when I went with cheaper laminate floors. A master can take poor materials and make them sing, but an amateur cannot. This is the advantage of, say, Datadog, in a nutshell: you install the agent and it pretty much <em>just works</em>. They’ll give you dashboards and data that might not actually <em>help</em> (from an observability point of view), but will give you enough indicators of ‘things going wrong’ that you can fall back to traditional workflows to diagnose and pinpoint failures. Choosing OpenTelemetry is a bet on your ability to master and integrate it at a pretty deep level!</p>
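<p>To make that concrete, here’s a small sketch of the kind of care that separates useful telemetry from noise: recording a failure on the span itself, with enough context attached to query for it later. This uses the OpenTelemetry Python API; the span and attribute names are illustrative rather than any particular convention, and the payment-gateway call is a hypothetical stand-in.</p>
<pre><code>from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")

def submit_to_payment_gateway(order_id, amount_cents):
    # Hypothetical stand-in for the real payment call.
    raise RuntimeError("card declined")

def charge_card(order_id: str, amount_cents: int):
    with tracer.start_as_current_span("charge_card") as span:
        # Consistent, queryable attributes are what make this data useful later.
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        try:
            return submit_to_payment_gateway(order_id, amount_cents)
        except Exception as exc:
            # Record the failure on the span instead of burying it in a log line.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
</code></pre>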
<p>This mastery, obviously, brings benefits - no vendor lock-in, more control, the ability to leverage the open source ecosystem. Crucially, this mastery flows into solving the second problem, that of <em>workflows</em>.</p>
<p>Observability workflows aren’t characterized by how great your dashboards are, how smart your alerts are, or how much data you can store. In short, the <em>only important question</em> is “Do I have enough high-quality data about my system for a broad query to return semantically useful results?” While a simple question, it’s fiendishly difficult. It implies that you have mastery of a query language in order to map your question into the lexicon of your analysis tools. It implies that the <em>right</em> data is available for query, and that it hasn’t been sampled out. It presupposes that everyone involved in writing telemetry has done an equally good job of it, that everything has the right metadata, and that the telemetry accurately models the system in question.</p>
<p>These are linked problems; poor telemetry tooling leads to less-than-useful workflows. The best database in the world doesn’t help if everything you need got sampled out. Accurate metadata doesn’t matter if it’s not consistent across a set of telemetry inputs. Having the right type of signal doesn’t matter if you’ve lost too much resolution on it due to temporal re-aggregation. None of it matters if you’ve got a bunch of engineers that have no clue how to use the query engine, or if the analysis tools and dashboards aren’t connected and made a part of existing workflows.</p>
<p>Here’s a story I’ve seen play out at too many organizations — leadership, responding to systemic performance challenges, brings in a new vendor or observability stack to solve the problems. Implementation falls to a small team that goes out and somewhat blindly instruments broad parts of the system in an attempt to ‘see everything’. This team usually struggles to master the underlying instrumentation and integrate it into existing service frameworks, but finally rolls out an MVP… and it mostly lies fallow. So much time was invested in just <em>getting the telemetry</em> that nobody stopped to ask about the <em>observability.</em> Internal marketing kicks off, training is held, and usage ticks up somewhat… but not convincingly. A few die-hards that spend the time to build mastery get stuck in, and become a massive bus-factor risk in the process. Ultimately, though, the transformative potential of observability is never realized - most people still grep through logs and page through endless metric dashboards. Obviously this isn’t <em>observability’s</em> fault, though. How could it be? Nobody really tried.</p>
<h2 id="flipping-the-script">Flipping The Script</h2>
<p>What should you walk away from this piece with? Depends on who you are, really. I believe we should re-think our approach to observability implementations, though. We tend to focus heavily on telemetry-first (which makes sense, gotta have the data, right?) but I’m increasingly convinced that we should first analyze our <em>workflows</em> in order to drive instrumentation.</p>
<p>Starting at the end, as it were, has some advantages. You can take an inventory of how work is done, rather than just how it’s imagined. You can find the places where observability should be integrated, and the people who need the data. You can understand the needs around queries, dashboards, alerting, and so forth. Once you know these things, it becomes much easier to ask what telemetry you need to satisfy these workflows, and prioritize it.</p>
<p>It also means that the telemetry creation and instrumentation process needs to be more responsive to this sort of workflow-driven approach. Rather than spewing a firehose of data, we should focus on providing a tailored stream (or streams) in an easy-to-comprehend and ingest way. OpenTelemetry has a role to play here, clearly, especially as work on configuration and management continues.</p>
<p>Finally, observability practitioners should consider their messaging and methods. We spend a lot of time sharing success, but failure is so much more interesting to learn from, don’t you think? Let’s do more to talk about what <em>doesn’t</em> work and figure out better ways to move our organizations forward.</p>
]]></content:encoded>
<category>observability</category>
</item>
<item>
<title>Observability and the Decentralized Web</title>
<link>https://aparker.io/observability-and-the-decentralized-web/</link>
<guid isPermaLink="true">https://aparker.io/observability-and-the-decentralized-web/</guid>
<pubDate>Tue, 22 Aug 2023 00:00:00 +0000</pubDate>
<description>It's probably still too early to write the obituary for centralized social media and 'Web 2.0' — but if you squint, you can see it on the horizon. Ongoing regulatory pressures and the slow burn of Twitter at the hands of Elon Musk aside, there's an increasing body of evidence that courts and governments will take a more active hand in the moderation and control of user-generated content. This piece on the draft deal between TikTok and the US Government to avoid a ban of the former is, I fear, broadly indicative of what the future holds for social platforms.</description>
<content:encoded><![CDATA[<p>It's probably still too early to write the obituary for centralized social media and 'Web 2.0' — but if you squint, you can see it on the horizon. Ongoing <a href="https://www.reuters.com/legal/legalindustry/us-data-privacy-laws-enter-new-era-2023-2023-01-12/">regulatory pressures</a> and the slow burn of Twitter at the hands of Elon Musk aside, there's an increasing body of evidence that courts and governments will take a more active hand in the moderation and control of user-generated content. <a href="https://www.forbes.com/sites/emilybaker-white/2023/08/21/draft-tiktok-cfius-agreement/?sh=2f0aa42a112a">This piece on the draft deal between TikTok and the US Government</a> to avoid a ban of the former is, I fear, broadly indicative of what the future holds for social platforms.</p>
<p>While reasonable people can disagree on the correctness or morality of these interventions, I feel like it's more interesting to think about what this means not only for the future of social media (and indeed, the web itself) and how we think about performance and user experience.</p>
<p>Broadly, let's define the 'decentralized web' as something that looks a lot more like Mastodon or Bluesky than Reddit or Twitter. The key distinctions I would lay out are -</p>
<ul>
<li>User generated content is not stored<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> on a server controlled by the entity that provides a view of that content.</li>
<li>User identity isn't controlled by a central server — users decide who they are based on protocols.</li>
<li>There are far more WAN links rather than LAN links between content stores and viewers.</li>
</ul>
<p>This is, admittedly, an overly simplified version of the concepts at play. However, you should see three things pop out at you that are extremely important from a performance standpoint: data stores (and queries), authorization, and networking. These also happen to be common optimization areas in distributed systems, and that's where observability comes into play.</p>
<p>I tend to describe observability as a way to understand systems. Cloud-native and distributed systems in general are fiendishly complicated as it stands; Observability gives you the ability to model that system and ask questions about it. Decentralization doesn't reduce the need for observability, but it does present some novel challenges.</p>
<p>In a decentralized system, certain signals become much more important than others. Log aggregation, for example, is a lot harder to do when many of the logs you care about don't exist on servers you control. However, the increased number of WAN links means that space and time are both at a greater premium. While this seems to be an argument in favor of more metrics (it kinda is), I would argue we need to focus more on how to model these systems via distributed tracing <em>and</em> how we can more efficiently compress and ship that data. We also need to think very carefully about clocks, and when things happen. Identity is a good example here — changes in authorization scope or permissions (bans, mutes, visibility, etc.) will need to be tracked not only over time but also presented in the context of other operations, especially in decentralized systems with automated reputation management.</p>
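<p>For a sense of what that looks like mechanically, here's a minimal sketch of carrying trace context across one of those WAN hops with the W3C <code>traceparent</code> header, via OpenTelemetry's propagation API. The service name and URL are hypothetical; the point is just that a host you don't control can stitch its spans onto yours.</p>
<pre><code>import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("feed-generator")

def fetch_remote_posts(pds_url: str):
    # Hypothetical call to a feed host we don't operate.
    with tracer.start_as_current_span("fetch_remote_posts"):
        headers = {}
        inject(headers)  # adds 'traceparent' (and any configured baggage) to the request
        return requests.get(f"{pds_url}/posts", headers=headers, timeout=5)
</code></pre>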
<p>It's also incumbent on us to educate end-users about how performance profiles are expected to differ from what they may be used to. If you're a Bluesky user, you may be familiar with <a href="https://blueskyweb.xyz/blog/7-27-2023-custom-feeds">custom feeds</a> — they're a list of pointers to posts on the network that meet some criteria. However, feeds are hosted by individual users on their own hardware or cloud provider. Compared to a Twitter list, this results in not just a less timely response, but one that can vary wildly based on not just geographic factors, but also implementation differences between feed generators.</p>
<p>Over time, these differences will become more pronounced as more parts of the network federate and decentralize. The ATProto model that Bluesky is built on promises to separate moderation, feed generation, indexing, search, and aggregation into separate services that you can choose. The more spread out all of these are, the slower things get.</p>
<p>I don't believe that this will become unusable, but I do think it will be different enough that we will need to engage with end-users to help educate them about the expected difference in performance; We also must help them understand why it's different, and why it's an acceptable difference. It's not enough to simply stand on the sidelines and say &quot;It's better&quot;, we need to demonstrate this through our actions.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>For the purposes of discussion, let's imagine that 'stored' in this case refers to the origin of a post, and not any cached content. A pointer isn't necessarily storage.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
<category>atproto</category>
<category>observability</category>
</item>
<item>
<title>The Blockchain Haters Guide To The AT Protocol</title>
<link>https://aparker.io/the-blockchain-haters-guide-to-the-at-protocol/</link>
<guid isPermaLink="true">https://aparker.io/the-blockchain-haters-guide-to-the-at-protocol/</guid>
<pubDate>Mon, 01 May 2023 00:00:00 +0000</pubDate>
<description>Like several of the tech twitterati, I've recently been going goblin mode over at Bluesky, a federated social network in private beta. As a long-time crypto and blockchain skeptic, I decided to take a look at the published documentation for the protocol that underpins Bluesky and write some thoughts.</description>
<content:encoded><![CDATA[<p>Like several of the tech twitterati, I've recently been going goblin mode over at Bluesky, a federated social network in private beta. As a long-time crypto and blockchain skeptic, I decided to take a look at the published documentation for the protocol that underpins Bluesky and write some thoughts.</p>
<p>Caveat, before I go into this too much - <a href="https://atproto.com/docs">the public docs</a> are pretty good, but there's a lot of TBDs and under-defined terms. That said, I applaud the team for what they've been able to put together here -- it's pretty cool.</p>
<p>If I get something wrong, let me know! Would love to correct this or do a followup -- again, this isn't my area of expertise.</p>
<p>At a high level, AT Protocol (ATP from here on out) defines three important components of a decentralized social network -- a way to manage identity (who you are), a way to store records (what you post, who you follow, etc.), and a way to communicate between clients and servers (how you read posts or make them).</p>
<h2 id="identity">Identity</h2>
<p>There's two parts to 'who you are' that should be familiar to most developers -- there's a handle (such as @austinlparker.bsky.social or @shitposting.vip), based on a domain name, and a user identifier. The user identifier in ATP is a <a href="https://www.w3.org/TR/did-core/">Decentralized Identifier (DID)</a>, which is essentially a cryptographically signed and verifiable GUID. To be somewhat reductive, you can think of it as a modern version of <a href="https://en.wikipedia.org/wiki/Pretty_Good_Privacy">PGP</a> that abstracts away a lot of the pain inherent in managing PKI (or, at least, it makes it someone else's problem).</p>
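<p>For reference, a DID resolves to a 'DID document' that carries the public keys and service endpoints for that identity. Here's a heavily stripped-down illustration of the shape, expressed as a Python dict; all of the values are fabricated, and Bluesky's actual documents use their own narrower profile of the W3C spec.</p>
<pre><code># Fabricated example of the rough shape of a W3C DID document.
did_document = {
    "id": "did:example:1234abcd",
    "alsoKnownAs": ["https://shitposting.vip"],  # other URIs this identity answers to
    "verificationMethod": [{
        "id": "did:example:1234abcd#key-1",
        "type": "Multikey",                      # illustrative; real documents vary
        "controller": "did:example:1234abcd",
        "publicKeyMultibase": "zFakeKeyMaterial",
    }],
    "service": [{
        "id": "#pds",
        "type": "PersonalDataServer",            # hypothetical service label
        "serviceEndpoint": "https://pds.example.com",
    }],
}
</code></pre>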
<p>At a pretty high level, this seems like a rather novel (if involved) system for managing identity on a federated network. It avoids one of the larger frustrations inherent to ActivityPub, which is that identity is scoped to an instance; If my Mastodon instance were to go away, so would my identity on the network. Admittedly, it does seem like there's a bit of handwaving in terms of actual federation here right now -- Bluesky is, as far as I can tell, the only actual host that supports their limited DID implementation. From my reading, I'd intuit that the full DID spec is extremely heavy, and they decided they only needed a handful of fields?</p>
<p>The exact mechanisms of how the DID server works aren't terribly interesting for the purposes of this post, but here are a few of the questions I have about how this is going to work at scale:</p>
<ul>
<li>It strikes me that the actual goal here is for there to be some independent service or set of services that act as DID hosts rather than Bluesky itself providing it. I'd imagine that in a perfect world, your identity is completely independent and travels with you across a variety of services? That said, if you're already shipping a subset of the actual W3C spec, is interoperability going to be a problem?</li>
<li>Using domain handles and mapping them to DIDs is a genuinely good idea because it allows us to piggyback off existing reputational systems on the web (it also makes it very convenient to namespace users!), but I'm curious about the failure states? For example, handle impersonation/spoofing with interesting characters, etc. DNS caching also strikes me as a potential issue with split-brain problems.</li>
<li>I'm not entirely convinced that this is really that necessary? I can see the long-term vision, and I do think it patches a couple of holes in ActivityPub, but there seem to be a lot of maybes along the way. One of the ways that crypto &quot;works&quot; (and I use this term loosely) is by forcing everyone involved to 'pay their own way' through PoS/PoW mechanisms. If you remove that, then you're left with a situation where the economic incentives to just be an identity provider seem somewhat limited -- especially given the traffic requirements to host DID documents? Perhaps the work they're doing to make it cheap will be the innovation here, and an existing player will offer support (i.e., Google) to extend their existing IdP offerings.</li>
</ul>
<h2 id="storage">Storage</h2>
<p>ATP defines a 'data repository' to be a collection of data published by a single user, expressed as a <a href="https://inria.hal.science/hal-02303490/document">Merkle Search Tree (MST)</a>. Each node of the tree is a <a href="https://ipld.io/">IPLD</a> object which is referenced by a hash value.</p>
<p>In plainer terms, whenever you do anything on Bluesky, you're creating a new record. This record can be a follow, a block, a post, whatever. These records all conform to a universal data model which is designed to be linked to other records. Any individual record is immutable, which avoids some problems around consistency and state in a distributed system. A client can fetch this data repository and walk the linked list in order to perform actions (like showing you posts). Much of the more complex logic is implemented on the server rather than the client in order to speed up operations.</p>
<p>If this sounds <em>kinda</em> like the harried dreams of XML and semantic web proponents, well, you ain't far off. Instead of XML, though, it's JSON-like! Whee!!</p>
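<p>To make the 'immutable records referenced by hash' idea concrete, here's a toy sketch of content addressing in Python. This is emphatically not ATP's actual implementation (which uses CBOR, CIDs, and the MST described above), just the core trick: the reference to a record <em>is</em> the hash of its contents, so nothing can be edited in place.</p>
<pre><code>import hashlib
import json

store = {}  # ref -> record

def put_record(record: dict) -> str:
    data = json.dumps(record, sort_keys=True).encode()
    ref = hashlib.sha256(data).hexdigest()  # the content address
    store[ref] = record
    return ref

# "Doing anything" just means writing new records that point at old ones.
post_ref = put_record({"type": "post", "text": "hello world"})
like_ref = put_record({"type": "like", "subject": post_ref})
</code></pre>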
<p>Again, I don't want to dive too deep here because I'm not actually an expert on this, but here are the questions I have --</p>
<ul>
<li>One stated goal of ATP is to provide the ability for servers to index participating federated entities (this is good, imo). However, I don't actually see how defederation works in this model without indexers simply choosing to de-index parts of the network. I <em>do</em> see a huge potential for protocol-level forks (similar to what's happened in the past for various cryptocurrencies) to split the network in twain, though.</li>
<li>This strikes me as very computationally expensive (<em>anything</em> creates a new record, records require re-hashing and re-signing the chain up to the root node) -- I'm curious how it scales at a billion+ users, especially if they're concentrated onto a single data store host.</li>
<li>Assuming you use something like IPFS to allow for decentralized storage and processing, it seems like it'd be very easy to get into extremely strange edge cases around performance... for example, notifications being out of sync, etc.</li>
</ul>
<p>That said, you know what this seems like it'd be <em>killer</em> for? Calendaring...</p>
<h2 id="clients">Clients</h2>
<p>How do we interact with this protocol? Well, you need a client. Clients and servers talk to each other in ATP using something they call <a href="https://atproto.com/specs/xrpc">XRPC</a>, which looks an awful lot like gRPC. XRPC seems somewhat unique in that it explicitly calls for schemas to be published on the network, theoretically allowing for them to be easily iterated over and crawled by various automated processes.</p>
<p>The global schema for XRPC is <a href="https://atproto.com/specs/lexicon">Lexicon</a>, which defines how requests and responses are communicated between clients and servers. It's a JSON document. This sort of self-documentation is rather refreshing -- not to mention pretty handy in the brave new world of LLM-assisted programming.</p>
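<p>Mechanically, XRPC is less exotic than it sounds -- a 'query' method is just an HTTP GET against <code>/xrpc/</code> plus the method's namespaced ID, with parameters in the query string. Here's a rough sketch; the method name and parameters below are hypothetical, not a real lexicon.</p>
<pre><code>import requests

def xrpc_query(host: str, method: str, **params):
    """Call a hypothetical XRPC query method and return its JSON output."""
    resp = requests.get(f"https://{host}/xrpc/{method}", params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

# e.g. posts = xrpc_query("bsky.social", "com.example.feed.getPosts", author="did:example:1234")
</code></pre>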
<p>A couple of notes...</p>
<ul>
<li>While it's a cool idea, I'm extremely curious how this is going to work out in practice. I don't see anything in the spec that identifies how RBAC is going to work here; it seems like all records are available to any client. This means that bad actors could fairly easily scrape the network (contra any sort of rate limiting or global auth) and store posts for later indexing.</li>
<li><a href="https://blueskyweb.xyz/blog/4-13-2023-moderation">This blog</a> discusses the Bluesky approach to content moderation. While I applaud their willingness to try something new, I have some questions about where various parts of this stack 'live'. If bsky.social is acting as an aggregator/index, are posts scanned for infringing/actionable content there? What happens to hosts that reguarly post infringing material, do they just stop getting indexed (and if that happens, <em>and</em> there are DID's attached to that host, how do they migrate off? I see a mechanism for key revocation, but I don't remember downloading a reset key when I signed up for Bluesky...)</li>
<li>The algorithmic selection layer seems less risky, but given that ATP delegates a significant portion of query-time complexity to the host itself, who's gonna be running all these algorithm labeling servers? It also seems there's a lot of duplicated work here -- if I'm trying to classify the firehose, then I'd need to either index the entire network (service discovery!) or rent time/space from an existing indexer. Perhaps the latter is the play, here, and Bluesky intends to allow developers to build on their infrastructure?</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>My extremely grudging praise is that it seems like the team over at Bluesky has managed to wrest a single generally-applicable use case from the morass of crypto bullshit that spawned it.</p>
<p><em>However</em>, I'm not really sure that it's what their users currently want, and that's going to be a significant challenge for them going forward. I can see a model where bsky.social becomes the effective 'default' and sells access to their index, with only a small minority of overall users creating federated spaces and then carefully managing what can be exposed back to the main index (again, I don't actually see if this is possible currently, the docs aren't fully complete).</p>
<p>Either way, hats off to them -- and if you need me, I'll be posting with my new pals on Bluesky.</p>
]]></content:encoded>
<category>atproto</category>
</item>
<item>
<title>Stop Trying To Make Observability Happen</title>
<link>https://aparker.io/stop-trying-to-make-observability-happen/</link>
<guid isPermaLink="true">https://aparker.io/stop-trying-to-make-observability-happen/</guid>
<pubDate>Sun, 16 Oct 2022 00:00:00 +0000</pubDate>
<description> </description>
<content:encoded><![CDATA[<p> </p>
<figure><img src="https://aparker.io/images/stop-trying-to-make-observability-happen/image-1.png" alt="" srcset="https://aparker.io/images/stop-trying-to-make-observability-happen/image-1-480.png 480w, https://aparker.io/images/stop-trying-to-make-observability-happen/image-1-960.png 960w, https://aparker.io/images/stop-trying-to-make-observability-happen/image-1.png 1440w" sizes="100vw" width="1440" height="806" loading="lazy" decoding="async"></figure>
<p>&quot;It's not going to happen.&quot;</p>
<p>A friend of mine (<a href="https://twitter.com/mononcqc?lang=en">@mononcqc</a>) turned me on to an essay titled <a href="https://www.jstor.org/stable/j.ctv1xcxr3n.14#metadata_info_tab_contents">'Unruly Bodies of Code in Time'</a> the other day, and skimming through it made me consider a phrase I like to use when introducing observability concepts to folks. I give a talk every week or so to new cohorts of employees at Lightstep, talking them through our concept of what observability is, why it matters, etc.</p>
<p>If you're familiar with my work at all, it shouldn't surprise you that it takes about 30 minutes until the word 'trace', 'log', or 'metric' ever escapes my lips in these talks. Over time, my understanding of observability has matured and grown into something that, frankly, is rather disjoint from the innumerable 'observability solutions' that are marketed and sold to software developers.</p>
<blockquote>
<p>Observability isn't a product, it's not any type of data or combinations thereof, and it's not something you can buy. Observability is an <em>organizational substrate</em>.</p>
</blockquote>
<h2 id="everything-decays">Everything Decays</h2>
<p>Any system, even ones maintained by a single individual over time, tips towards entropy and decay. Organizations act to ossify systems through encoding processes and practices in an attempt to stymie this natural entropy. History is littered with examples of this in practice; We instinctually understand, however, that this organizational inertia towards stasis eventually provides opportunities for insurgents to storm the castle and throw down its masters, inspiring radical change. We can also see how concepts such as <a href="https://jods.mitpress.mit.edu/pub/issue3-brand/release/2">pace layering</a> come into play by allowing for the diffusion of shocks to these systems.</p>
<p>In plainer terms, we build teams because we know we'll go further than we could alone, and then we build companies or organizations around those teams to 'spread out' our risk of unrecoverable systemic failure. You can probably look around your own company and find examples of this -- there's things that move quick (insultingly quick!) and things that move glacially slowly.</p>
<p>I would suggest that this generally holds in software organizations regardless of purpose or size, but tends to become more pronounced the larger the supporting organization is, and the amount of &quot;legacy&quot; software it has produced and operates.</p>
<p>So-called &quot;legacy&quot; systems (perhaps better referred to as &quot;systems that make the company money&quot;) are the biggest beneficiaries of the ossification trend. Any software system is reflexively shaped by the time it was built, the idiosyncrasies of its authors, the constraints and limitations of prevailing technology, and so forth. Onboarding to these systems requires significant amounts of defined processes and practices in order for the newcomer to avoid hazards that experienced personnel are intuitively aware of. To quote from 'Unruly Bodies':</p>
<blockquote>
<p>Chris seems simultaneously a bit abashed but also in awe of the fact that by virtue of being new and knowing less and doing something &quot;wrong&quot; … he actually led to a kind of discovery of an anomaly hidden in the depths of the software system, something no one knew was there.</p>
</blockquote>
<p>In this quote, Chris is (quite literally) a rocket scientist who started working at NASA on the Cassini mission straight out of a Masters program, and was required to learn the 'legacy' system that powered it. In his learning efforts, he regularly discovered unexpected or emergent system behavior due to his lack of experience with the system. Keep in mind, the system Chris is working with was over 40 years old at the time! The world of its architects, designers, and implementors was vastly different than the one that we inhabit today.</p>
<p>His, and our, ability to perform almost any task on an established system is the result of significant amounts of defined processes, documentation, 'best practices', and other <em>organizational dynamics</em>.</p>
<h2 id="systems-you-can-touch-and-feel">Systems You Can Touch And Feel</h2>
<p>Let's jump back a bit. I mentioned earlier that when I introduce observability, I don't talk about data for a while. What I <em>do</em> talk about is organizations, and how they build and operate software systems. Over the past couple of decades, the replacement rate of software has increased exponentially thanks to the ubiquity of broadband internet, the commercial viability of free and open source software, and the popularity of the &quot;X-as-a-Service&quot; model. Systems evolve and change more rapidly than ever before.</p>
<p>However, we do ourselves a disservice if we limit our thinking about &quot;what's changing in a system?&quot; to the code, or architectural patterns, or deployment infrastructure. I characterize it as follows - a software system is an accretion disc, like the rings of a planet. It's the result of countless small decisions made over a long period of time.</p>
<p>Systems have parts you can &quot;touch&quot;, and parts you can &quot;feel&quot;. The parts you can touch -- yes, that's the code, that's the API. It's also the fake data you bootstrap your development environment with, or the SSL certificates, or the ability of a queue server to not fall over under load. The parts you can feel are gauzier, but they're arguably even more important. Code can tell you the 'what' of a system, but to understand why there's a comment that says 'Don't change this - FB', you need to understand who 'FB' is, and why they matter. You need organizational context and history that <code>git log</code> can't provide. Why is the API designed this way? Who designed the bootstrap data, and what were their biases around architecture?</p>
<p>The things you can feel are about power and authority, about who gets to make decisions and why. You can't understand a system without knowledge of <em>both</em>.</p>
<h2 id="bringing-it-together">Bringing It Together</h2>
<p>This is where our diverging lines of inquiry come together. Observability is mostly pitched as a <em>tool</em> that helps you understand systems and how they fit together. I think this is a disservice, a malapropism of out-of-control marketing. We need to consider observability in the context of processes and practices more than data or tooling. If we build organizations to be resistant to shocks, then that means we need resiliency throughout the different layers of that organization. One outgrowth of this is, obviously, building resilient software. I would suggest that the largest contributor to fragile software <em>isn't actually the software</em> though, it's the inability of <em>organizations to prioritize software reliability</em> due to poor linkage between the 'purpose of the organization' and the 'purpose of the software operators'. Think back to Chris, who was mystified by how his inexperienced hands could create novel failures in what <em>should</em> be a battle-tested and hardened system; What need did NASA have of resilient software when it had veterans and defined processes and practices? Just don't screw anything up, and everything will be fine.</p>
<p>Observability needs to help connect these divergent threads and spin them together. Resilient software can only be built by resilient organizations; Our tools need to help bridge these gaps between layers and effectively allow different groups to understand and value each other's work and how it fits together. This isn't to say that we're somehow on the 'wrong track' as an industry, I simply think we're getting too caught up in the observability trees to see the observability forest, as it were. I believe there's a lot of work to be done in understanding how observability can lead to more elastic and fault-tolerant policies and processes, how open data standards can help connect discrete performance and analytic telemetry, and how we can more effectively encode and preserve the parts of the system we feel in history so that we can learn from our legacies.</p>
]]></content:encoded>
<category>observability</category>
</item>
<item>
<title>Virtual Events Are Dead, Long Live Virtual Events</title>
<link>https://aparker.io/virtual-events-are-dead-long-live-virtual-events/</link>
<guid isPermaLink="true">https://aparker.io/virtual-events-are-dead-long-live-virtual-events/</guid>
<pubDate>Wed, 27 Jul 2022 00:00:00 +0000</pubDate>
<description>By any scientific metric, the risk of COVID-19 infection is greater than it's ever been, while mitigation efforts have regressed to a shrugging emoji. Being offered an alcohol wipe by a smiling, unmasked flight attendant before spending hours breathing other people’s air in a narrow metal tube is panglossian, to say the least.</description>
<content:encoded><![CDATA[<p>By any scientific metric, the risk of COVID-19 infection is greater than it's ever been, while mitigation efforts have regressed to a shrugging emoji. Being offered an alcohol wipe by a smiling, unmasked flight attendant before spending hours breathing other people’s air in a narrow metal tube is panglossian, to say the least.</p>
<p>If your eventual destination on that airplane is to a developer conference or other in-person event, well, you’re in good company. The events industry has also attempted to return to normal, gladly welcoming us all into packed conference rooms. To their credit, many organizers are taking public health seriously and continue to require masks and encourage social distancing. Sometimes it took a little public pressure, though, for them to get there. Even so, there hasn’t been an in-person event I’ve attended this year that hasn’t had people come down with COVID-19.</p>
<p>Like some of you, I don't have much of a choice -- going to events is part of my job. Indeed, one thing I've noticed is that the people that are happiest to be back are the sponsors, and boy there's a lot of them. I've been to multiple events where sponsor attendance is fully a quarter (or greater) of all attendees. Indeed, sponsors saw major negative impacts from the last two years of 'virtual events', as lead capture from virtual events was far lower than in-person <em>and</em> lead quality dramatically dropped, so it’s not surprising they’re happy to get back out there. It’s just harder to have conversations in a virtual booth about what you’re selling, and it’s much more difficult to stand out in a sea of logos.</p>
<p>Virtual events have also been tough on speakers and attendees as the lack of attentiveness (which I’ll get into later) makes it hard for both to engage.</p>
<p>So, what gives, anyway? I've been thinking a lot about this as I plan the next installment of Deserted Island DevOps. In doing so, I've come up with a theory of developer conferences and events, and why they're obsolete on their current trajectory. We’re shifting the focus to creating an event around the speakers to, in turn, give virtual attendees a better experience.</p>
<h2 id="why-we-come-together">Why We Come Together</h2>
<p>Let me start with a disclaimer, here -- I'm going to generalize quite a bit in this piece. These generalizations shouldn't be taken as a slight, or as a deliberate attack.</p>
<p>I’ve spent a fair amount of my career around developer events, both in-person and virtual. Over that time, I’ve identified a few broad groups and concepts that events require -- speakers, attendees, sponsors, and event content itself (and the organizers that select that content). The exact members of these groups, and their nature, can vary from event to event but it holds as a model for everything from a KubeCon to a DevOpsDays.</p>
<p>Virtual events didn’t change this model, but my biggest takeaway from two years of strictly virtual events is that <strong>people absolutely need to see each other in-person</strong>. There's a myriad of reasons for this, but I'd like to call out two that I find most compelling.</p>
<figure><img src="https://aparker.io/images/virtual-events-are-dead-long-live-virtual-events/image-1.png" alt="" srcset="https://aparker.io/images/virtual-events-are-dead-long-live-virtual-events/image-1-480.png 480w, https://aparker.io/images/virtual-events-are-dead-long-live-virtual-events/image-1.png 760w" sizes="100vw" width="760" height="624" loading="lazy" decoding="async"></figure>
<ul>
<li>
<p>People aren't really good about talking about failure online. Be it over Zoom, in a chat room, anonymously, whatever -- it's something I've noticed time and time again. I believe this is due to a host of social and cultural reasons -- the lack of body language to help nuance discussions, a fear of being recorded, the flat affect that digitized voice tends to lend itself to, etc. I’d suggest that it comes down to a difficulty in being vulnerable over digital mediums. This is a problem for events, though, <strong>because a huge part (perhaps even the primary raison d'être) of an event is the ability to</strong> <em>be</em> <strong>vulnerable with people inside or outside your team</strong>. This is what leads to learning and insight -- being able to share, listen, and work that connective function of your brain meats. Virtual events simply haven't been able to replicate the hallway track, and this is a huge reason why. Worryingly, if you accept my presupposition here, they <em>never will</em>.</p>
</li>
<li>
<p>Attentiveness. What I mean by this isn't &quot;people pay attention to the talks&quot; (they don't do that in-person <em>or</em> online), what I mean is that you are <em>present</em> when you're in-person. This in-person presence benefits the attendee in terms of mental flexibility and thinking. (I wrote about this in a prior blog -- the idea of a constructed space, where a change in scenery really does help change your mindset and open you up to new possibilities). Additionally, the presence of executive or C-level attendees allows for increased efficiency on the part of media or analysts, allowing them one-on-one time with key decision makers. The presence of sponsors allows for in-person testing of new marketing strategies and ideas. Crucially, this attentiveness is <em>not</em> present for virtual events. Indeed, we've seen a recognition of this as events shrink their content programs down to allow for time-shifting and bite-sized videos. It's a known quantity that virtual event talks are pre-recorded in most cases. This can signal to attendees that there's a lack of attentiveness on the speaker or organizers part! The sad irony here is that pre-recorded talks can require a greater amount of effort than live ones to create, record, and produce. <strong>We’re left with a curious dichotomy, where virtual events wind up taking a lot of effort to put together but are undervalued in respect to that effort.</strong></p>
</li>
</ul>
<h2 id="the-problems-of-coming-together">The Problems of Coming Together</h2>
<p>The benefits of in-person interactions are invariably marred by the drawbacks, of which there are many. I would argue that there's never been a 'safe' in-person event from a health perspective. If you've been a frequent flier to developer conferences, then you're certainly familiar with 'con crud', the generalized colds and flus picked up on the road. Whatever mitigations you put into place for an in-person gathering last only to the door, and you can't control what people do on airplanes or after-hours. A 'safe' in-person event requires sacrifices for public health that I don't really think people are willing to bear... well, at least the people who the event is <em>for</em>. More on that later.</p>
<p>In-person events also dramatically increase the risk of sexual and physical harassment for attendees, staff, and speakers, and these risks are especially acute for women, minorities, and LGBTQIA+ individuals. The prevalence of after-parties and happy hours, all of which are oriented around alcohol, only increases that risk. This isn't a knock on alcohol -- I'm not moralizing here. It's just a curious dichotomy to see organizers mandate masks and testing for COVID-19 but also happily shove alcohol at you. Kinda like how you're free to wear a mask if it makes <em>you</em> feel comfortable... For those of us who don't have an option, pushing everything to 'personal choice' is damaging to our psychological safety. I've worked four different events this year and my brain continues to crack and ping over the constant evaluation of risk and comfort I'm forced to make. I never find peace until I get home, safe and sound. I listen to peers in the industry, and observe the difficulties they have with staffing booths due to people contracting COVID.</p>
<p>I suppose the free drinks are a good thing, because they help me forget the utter insanity of our modern world, if only for a moment.</p>
<p>There’s a bit of maximalist rhetoric out there about how it’s impossible to do in-person safely, so we shouldn’t try, or that the challenges of harassment are so great that it’s folly to suggest we can create safe convivial spaces. On the flip side, there’s people who argue that virtual is a mistake, that it can never be a replacement for in-person gatherings, and we’re all fools for suggesting differently.</p>
<p>There are tradeoffs to take into account for any event, and part of those tradeoffs is balancing these competing interests and factors. Dancing around our motives makes it more difficult to be honest about those tradeoffs -- and, really, most people want in-person to come back because they enjoy getting to go on trips, and see people. It’s part of what makes the job fun! We just want it to happen in a safe manner. The rush to in-person is, in my opinion, significantly driven by this.</p>
<p>The problem is, we have to dress up these events in the guise of ‘continuing education’ and give them a veneer of academic rigor and respectability, just so we can go hang out in the hallway and kvetch. The talks are useful ways for people to test ideas out, or to learn something new, but as I’ll talk about in the next section, flying across the country to talk to a room of people is probably the least efficient and effective way of communicating ideas in 2022.</p>
<p><strong>The problem with events is that they exist for you to be marketed to by sponsors</strong>, and it’s really hard to do that virtually.</p>
<h2 id="the-different-perspective">The Different Perspective</h2>
<p>Let’s think back to the model I introduced earlier - speakers, sponsors, attendees, and content.</p>
<p>There are two intractable parts of this model -- the attendees, and the speakers. This can also be styled as 'learners' and 'teachers', 'customers' and 'providers', whatever. Suffice to say, you don't have an event without the people that come to it and the people that they're there for.</p>
<p>What about 'content' and 'sponsors', then? Aren't they also intractable? I would argue that they are <em>not</em>. These are almost an anomaly unto themselves, and they exist in lockstep. The content baits the sponsor trap, as it were -- you need content to entice attendees to come be marketed to by sponsors. This isn't exactly a grand secret; the program committees for even large community events work very hard at building an interesting and informative program, but their decisions are intersectional. This reduces the ability of the conference to act as a filter, even if the perceived benefit of speaking at a conference <em>is</em> an imprimatur of authority. That said, we don't necessarily need events as a platform anymore! Talented individuals can create highly compelling talks on video, publish them to YouTube, then advertise them via social media for a pittance compared to the cost of traveling to and presenting at an in-person event. We can build our own platforms; we don't need KubeCon to give us one any more.</p>
<p>Let's turn our gaze to sponsors. Events cost money, and organizers need to make that money back. Smaller community events -- think DevOpsDays -- need to cover expenses. Larger ones, like KubeCon, need to recoup costs and turn a profit for the organizers. Even explicitly marketing-oriented events, like AWS re:Invent, need to offset the millions and millions of dollars spent on venue and advertising in order to remain fiscally responsible. At virtual events, the lack of presence and attentiveness from attendees meant that lead capture (i.e., how many people came by your booth) went through the floor. In person, even the most jaded attendee will usually wander around the sponsor pavilion to gawp at the exhibition and logos. The lack of quantifiable leads as well as the reduction in mindshare resulted in grave dissatisfaction among sponsor marketing teams. The result? Pretty much every B2B SaaS company I know has already burnt their 2022 marketing budget for in-person events, down to the dime. They're <em>hungry</em>, and I can't blame them.</p>
<p>Sponsors also have a vested interest in content -- just ask any program committee how many thinly veiled product pitches they get every year. They also appreciate the imprimatur of authority they receive for having someone up on a stage talking about their solutions. That's why they wind up paying developer advocates to come up with talks that skate the boundary between &quot;advertisement&quot; and &quot;useful advice&quot;, then submit those talks to dozens of conferences a year.</p>
<p>There's a link, then, between the content and sponsor bubbles. Sponsors need content to get attendees; content doesn't have a place to be presented without sponsors. Symbiotic relationship. If you're a professional speaker then you're probably getting paid for it, after all.</p>
<h2 id="attendees-and-speakers">Attendees and Speakers</h2>
<p>Let's talk about the irreducible parts of our model now, the attendees and speakers. You can characterize these groups in any number of ways... students and teachers, audience and influencers, worshippers and priests. However you style it, there's an almost ecclesiastical distinction between the two groups. A conference with no speakers is just some people hanging out; speakers with no conferences are simply monks translating texts in the dark. The spark of creativity and verve that occurs at a conference comes from this fundamental dualism.</p>
<p>We do have models for events built around these groups. DevOpsDays events are specifically designed around this interplay, as is the general idea of 'unconferences', where attendees create ad-hoc groups around topics selected that day.</p>
<p>To be somewhat blasé, we work under the assumption that if we get a lot of smart people together, then something educational will come out of it. I think this is a bit too reductive and misses the other axes of interaction that developer conferences provide. These groups are somewhat transient identifiers... Within attendees, different levels of experience will lead to ad-hoc promotions as groups coalesce and drift apart. Within speakers as well, peers and new entrants will commingle to learn from each other, swap war stories, and network. Thus, solving for the needs of the attendees will also in many ways solve for the needs of the speakers. If we start from the position that events are &quot;a paid vacation from your day job&quot;, then what should we optimize for?</p>
<ul>
<li>Locales and venues that assist in creating 'decisive moments'.</li>
<li>The ability to disconnect fully (or as fully as possible) from standard roles and responsibilities, to immerse oneself in the event.</li>
<li>Structured structurelessness -- the ability to provide dedicated time and space for ad-hoc groups to form and discuss topics of interest and concern.</li>
<li>Clearly defined 'rules of engagement' for interactivity, but a lack of hierarchy outside of this. Basically, physical areas for self-selection of engagement types, but an even footing for everyone in each cohort.</li>
</ul>
<p>There have certainly been virtual events that have attempted to optimize for their format, and try something different to inspire attendees to have these sorts of moments. Most recently, SLOConf experimented with entirely 'bite-size' talks, meant to be time-shifted and watched during intermittent breaks. I think this is a novel way to optimize for the attendee experience, but it fails at optimizing for the speaker experience.</p>
<h2 id="something-different">Something Different</h2>
<p>What would it look like for a virtual event to solve these optimization problems independently? Rather than trying to create something that felt like an in-person event, but online, or a way for sponsors to get you to look at their web pages? What if we could make a great experience for virtual attendees, and in-person speakers?</p>
<p>In short, this year's event is a DevOps conference for the virtual crowd, but a DevRel conference for the speakers. By selecting for the criteria above, we're trying something very different from traditional virtual (or hybrid) events.</p>
<p>The problem with a traditional hybrid event is that the loser in the trade-offs is always the virtual attendee. In-person speakers and attendees reap the benefits of venue, focus, etc. while virtual ones are relegated to secondary (at best) echoes of what's happening on site. I think the way to fix this is to specifically craft the experience of each group. The only way to get a ticket to the in-person event is if you're invited to come (and the only way to get invited to come is if you are speaking or if you've spoken before), which means we don't have to worry about a dual experience for the attendees. If you're there, you're a &quot;speaker&quot; -- if you're not, you're an &quot;attendee&quot;.</p>
<p>Rather than having to provide a half-baked experience to viewers, we can create an engaging and lively program for them, designed to <em>encourage</em> them to treat the event as destination programming. We can ensure Q&amp;A, roundtables, and interactive portions are optimized for this specific category of 'live participant' rather than trying to merge an in-person and a virtual attendee experience, or deliver a sub-standard experience based on ability to attend.</p>
<p>For speakers, the calculus flips. A curated, salon-like atmosphere to present their ideas to attendees and each other, unstructured time to network and chat about what's working and not working in their field, new insights and realizations sparked from the tinder of interpersonal interactions. This is possible <em>because</em> the speakers will be in-person. Really, this is a natural outgrowth of the speaker experience in previous Deserted Island events -- speakers all sat in a Zoom with each other, acting as a virtual 'audience' for each talk. This time, they'll get to continue the conversation after the stream goes to sleep, and walk away from the event refreshed and engaged with their colleagues and community.</p>
<p>If nothing else, it's going to be something completely different from any developer conference you've spoken at, or attended.</p>
<p>Block off time on your calendar, cancel your meetings, grab a bowl of popcorn and your favorite tropical drink and tune in live to Deserted Island DevOps 2022, September 14th and 15th, live on twitch.tv.</p>
]]></content:encoded>
<category>community</category>
</item>
<item>
<title>Incentives and Power</title>
<link>https://aparker.io/incentives-and-power/</link>
<guid isPermaLink="true">https://aparker.io/incentives-and-power/</guid>
<pubDate>Fri, 08 Oct 2021 00:00:00 +0000</pubDate>
<description>I wrote a post a little while ago about how SRE is really just sneaky anarchism, and this is somewhat of a followup.</description>
<content:encoded><![CDATA[<p>I wrote a post a little while ago about <a href="https://aparker.io/the-commodification-of-devops">how SRE is really just sneaky anarchism</a>, and this is somewhat of a followup.</p>
<p>Let's briefly synthesize that earlier blog post. Essentially, my argument is that there exists a vicious cycle in software engineering which derives its power from radical ideas on how to apportion power in the workplace between &quot;labor and management&quot;, for lack of a better word. Labor - developers, operators, whoever - chafes under the strictures and Taylorized systems implemented by a slate of managers. These strictures and systems exist, primarily, to justify the value of the managers to the productive enterprise. Over time, radical elements of the labor pool will band together and codify some of their values into a term like 'agile', 'DevOps', or 'site reliability engineering' in order to claw back the power to organize their productive labor in a way that makes more sense to them. As these codes crystallize into movements, they are commodified and credentialed, then repackaged and sold to managers at other firms who seek to reap the procedural benefits. Meanwhile, the radicals have been fridged and the fruits of their labor firmly divested into a hundred eager consultant pockets.</p>
<p>I think this is interesting if you view problems from that context. The problem isn't, after all, that &quot;developers don't know how to code&quot; -- given our industry's attachment to whiteboard interviews I think we can discount that possibility -- but that developers don't have ownership. Not necessarily ownership over their code, but ownership over the <em>results</em> of their code. Your individual labor is rather atomized and packaged into Jira tickets so that managers can justify their role in the delivery of software.</p>
<p>&quot;Wait, we need tickets and acceptance criteria and so on in order to coordinate and maintain a record of changes!&quot; Well, no shit, that's not the problem. The problem is the disconnect you have from the productive work that you perform. Think about a feature you worked on, maybe even the first feature you worked on at your current job. Do you still maintain that code? Do you even know if that code is still running? I'd suspect the answer is no, especially if you've been somewhere for a couple of years. You've probably been re-orged at least once, maybe even several times, and the &quot;thing that you do&quot; has long stopped mattering. It's all just an endless set of tickets and feature requests, prioritization and sprints. You're a part of a larger whole, because the system is too complex and unwieldy for anyone to carry it all on their back anyway.</p>
<p>This doesn't have to be the case, I'd argue. Your individual contribution has been deliberately divorced from its productive value to the whole, because that's a &quot;more efficient&quot; way of doing things. What if we didn't optimize for efficiency to shareholders, however? What would a truly responsive product look like?</p>
<p>This, I think, is the interesting question, and one that might piss some people off. A closely held software product would almost certainly be less &quot;accessible&quot; than one designed and built by committee, unless such accessibility was core to the product itself. It's also not necessarily true that such a closely held product would be more performant than other such tools, or that it would have more virtuous product management or goals. The only thing you could really guarantee would be that the people who built the product would have true ownership over it, and it would succeed or fail on those merits.</p>
<p><em>Note: I'm picking this up after like 3 months of sitting on it because I just figured out where to go with the piece.</em></p>
<p>There was, very recently, a <a href="https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/">global Facebook/WhatsApp/Instagram outage</a> where BGP got whoopsie'd bad enough to knock all of their servers off the internet entirely. Beyond the jokes you might make about this, it's a perfect story because it can prove or disprove anything you'd like about our modern software hellscape. Maybe you're worried about the increasing concentration of power into the hands of private corporations, maybe you're worried about the increasing concentration of the internet into private un-federated platforms, maybe you're worried about how automation makes things more and less resilient, etc.</p>
<p>That said, the purpose of systems is what they do, and what Facebook and other very-large-scale systems do is act as a replacement for things that a government would normally do. As an example, this <a href="https://www.militarytimes.com/flashpoints/2021/10/04/whatsapp-outage-a-nightmare-for-group-working-to-rescue-afghans-american-citizens/">story about Afghanistan rescue efforts being hampered by the WhatsApp outage</a> illustrates exactly how much we've allowed extremely large systems to completely subsume functions that arguably shouldn't be held in private. Think about it; if Twilio went away for 24 hours, how much would break? What about Datadog, or GitHub, or New Relic, or any number of other developer-focused systems? We have both first and second order effects to consider here. Very large private social networks are running some of the primary communication systems we use every day. These very large private companies are also dependent on other very large companies. There's no practical check on their power, because growth is incentivized over everything else. A different world might react to Facebook's 3 billion global users with horror rather than applause.</p>
<p>Let's go back to the start. Who actually is responsible for <em>anything</em> anymore? Depending on your ideological or conspiratorial bent, there's a lot of answers here, but I think the real creeping terror at all of our hearts is the knowledge that <em>nobody</em> is responsible for <em>anything</em>. We're all trapped in the carcass of a machine, and the machine is bleeding to death. Our attempts to create a better world within our reach by optimizing processes, or by creating little anarchic practices inside multi-trillion dollar companies, feel like reactions to the implicit realization that we can't actually automate our way out of the end of all things. Similar to the rogue AI, Durandal, of <em>Marathon</em> fame, we flail against the closing of the universe. Escape, we believe, will make us god.</p>
<p>You can't actually escape, though, can you?</p>
<p>There's a lot of people a lot smarter than I am who've proposed a variety of solutions to this quandary, so I'm not gonna go through them here. If you're reading this, you might already have a few in mind. What I want to leave you with are the following points -</p>
<ul>
<li>You can't really divorce what happens in 'business' from ideology. Agile, DevOps, SRE, Observability, whatever -- these movements are all rooted in fundamentally liberatory ideologies.</li>
<li>A big reason that we pursue these is that we're very divorced from the products of our labor. We program computers to do things, but the actual result of that work is very abstract.</li>
<li>We could organize ourselves and our labor differently; We could make things for ourselves, or for our group, but this wouldn't necessarily make them &quot;better&quot; -- and in some ways, would almost certainly make them less &quot;accessible&quot;.</li>
<li>The unfortunate reality is that 'growth' is one of the only things that's incentivized, so our systems must become larger and larger, encompassing more and more functions and factors.</li>
<li>We tell ourselves the lie that this growth is good, that it makes things better, that we can purge ourselves of impurity and automate away all the pain.</li>
<li>The endless pursuit of perfection elides a cold truth -- we have built systems so large and unwieldy and horrifying that we have to tell ourselves we can make them perfect with more computers, more tools, more processes, more everything.</li>
</ul>
<p>This has turned into a bit of a treatise, I suppose. I'm not sure how to make it anything but -- if you want boring writing, go find something I write about observability or programming or whatever. Really, I don't want to overly propose solutions because I'm not sure there's any that aren't trite, and I hate trite.</p>
<p>Really, what I want to emphasize is that it's important to think about stuff like this. Think about your work, think about what you contribute to. There's one thing I know for sure, and it's that giving up isn't really an option, even if it's easy. Be a part of what matters to you, form communities there, and strive to be intentional in your words and deeds. Don't settle for &quot;that's how it's always been done&quot; both personally and professionally. The world exists in the way it does because that's where we've put it; The world can change, if we so will it and put in the work.</p>
]]></content:encoded>
<category>devops</category>
</item>
<item>
<title>The Commodification of DevOps</title>
<link>https://aparker.io/the-commodification-of-devops/</link>
<guid isPermaLink="true">https://aparker.io/the-commodification-of-devops/</guid>
<pubDate>Wed, 23 Dec 2020 00:00:00 +0000</pubDate>
<description>It's been quipped more than once that most amazing Silicon Valley innovations are simply a bunch of nerds poorly recreating a service that already exists, but with an app. While I find this to be in some ways a truism (after all, there is nothing new under the sun), it's a fairly trite observation. What's far more interesting is how the organizations that build and deliver these 'innovations' themselves develop, and the process of that development is especially interesting due to the pressure-cooker of free money and labor elasticity that has characterized the 'startup economy' over the past twenty years or so. What's any of this have to do with DevOps, you may ask? Simply this -- DevOps is a reaction to the commodification of Agile, and the rise of SRE is a reaction to the commodification of DevOps. To reduce the thesis further, many of the trends you see in software development and delivery can be understood as a cyclical reaction to anarchists running headlong into the invisible backhand of the free market.</description>
<content:encoded><![CDATA[<blockquote>
<p>&quot;We are uncovering better ways of developing software by doing it and helping others do it.&quot;</p>
<p><em>The Agile Manifesto, 2001.</em></p>
</blockquote>
<p>It's been quipped more than once that most amazing Silicon Valley innovations are simply a bunch of nerds poorly recreating a service that already exists, but with an app. While I find this to be in some ways a truism (after all, there is nothing new under the sun), it's a fairly trite observation. What's far more interesting is how the organizations that build and deliver these 'innovations' themselves develop, and the process of that development is <em>especially</em> interesting due to the pressure-cooker of free money and labor elasticity that has characterized the 'startup economy' over the past twenty years or so. What's any of this have to do with DevOps, you may ask? Simply this -- DevOps is a reaction to the commodification of Agile, and the rise of SRE is a reaction to the commodification of DevOps. To reduce the thesis further, many of the trends you see in software development and delivery can be understood as a cyclical reaction to anarchists running headlong into the invisible backhand of the free market.</p>
<p>Let's start with a brief overview of the idea of a commodity, commoditization, and commodification. A commodity, in an economic sense, is some highly fungible good or service -- like wheat, or salt. They're raw materials, basically. Commoditization is the process by which an economic market for a particular good transforms into a commodity market. One example of this you may be familiar with is the concept of generic pharmaceuticals due to patent expiry -- if enough people can make the same thing, then the market for that thing will change. Commodification is a cultural critique of the above process happening to something that it shouldn't happen to -- <a href="http://www.slate.com/articles/arts/the_motley_fool/1998/01/the_commoditization_conundrum.html">&quot;microprocessors are commoditized, love is commodified&quot;</a>. Marx's critiques broaden the definition of commodity to encompass any good or service produced by human labor which is then offered for sale, in an attempt to quantify the general economic value of any particular good through the labor theory of value (LTV), which in short states that the economic value of any commodity is determined by the amount of 'socially necessary labor' required to create it.</p>
<blockquote>
<p>Please note that I am not an economist and these explanations are overly simplified.</p>
</blockquote>
<p>What does this have to do with developing software? There's quite a few load-bearing metaphors for the act of software development as a team or organization, but my preferred one is the idea of a kitchen. Cooking is a blend of science and art, cooking for a dinner service is different than cooking for yourself, and it requires a high level of implied and explicit communication. The dynamics of a high-performing kitchen are eerily similar to those of a high-performing development team in both positive and negative ways. There's a certain grace and elegance to the actual moment-to-moment functioning of a kitchen during dinner service when everything is working right, which you can also see during well-functioning incident response on a development team. That said, there are also high levels of burnout and stress, poor coping mechanisms outside of work with alcohol or drugs, and internecine interpersonal relationships. I believe this is why a non-zero number of techies liked the Bravo show <em>Vanderpump Rules</em> (before it got lame because everyone in it got successful), game recognize game.</p>
<p>With that said, how does this blend with the idea of DevOps as a commodity? We need to go back to where we started in this piece -- by talking about Agile, and the LTV. Remember how I mentioned 'socially necessary labor' earlier? Well, how <em>do</em> you quantify that for software? What's the actual cost of fixing a bug, or writing a line of code? The fixed costs are fairly straightforward -- a computer, a monitor, etc. -- but if you assume that any software built as a team is going to provide an effectively infinite supply of bugs and improvements, the actual cost needs to be attributed elsewhere. If you're a manager, this becomes a fairly simple calculation of salary, time, and revenue -- fix the bugs that impact revenue the most, implement the features that impact revenue the most, and so on. It's generally agreed that Agile is a reaction to this sort of top-down management of &quot;work&quot;, but I would go a little further than that; Agile is a reaction to the <em>idea</em> of who gets to determine what's 'socially necessary'.</p>
<p>Agile, fundamentally, was a power struggle between the 'haves' and the 'have nots' in the knowledge economy -- and the workers won, for a time. Management can want whatever management wants, but they can't provide the necessary labor required to actually make the computer go in a way that produces a reliable commodity that has market value. Imagine our theoretical high-performing kitchen; If the chefs get together and figure out a way to work that makes them happier than the way the owner wants, the owner may certainly decide to fire them and hire new cooks, but at the potential cost of revenue, especially if those new cooks don't have the talent or ability of the prior team.</p>
<p>Historically, the 'haves' are rather unwilling to cede power to the 'have nots', but they are pretty good at figuring out what to co-opt. Agile has been, quite definitively, co-opted and commodified.</p>
<figure><img src="https://aparker.io/images/the-commodification-of-devops/image-1.jpg" alt="" srcset="https://aparker.io/images/the-commodification-of-devops/image-1-480.jpg 480w, https://aparker.io/images/the-commodification-of-devops/image-1-960.jpg 960w, https://aparker.io/images/the-commodification-of-devops/image-1.jpg 1024w" sizes="100vw" width="1024" height="763" loading="lazy" decoding="async"></figure>
<p>This image is in and of itself almost a meme at this point, the Deloitte 'Agile Landscape'. It's a confusing mess -- <a href="https://medium.com/tech-sojourna/7-things-wrong-with-deloittes-agile-tube-map-641192e20068">this post does a pretty good summary of everything that's wrong with it</a> -- that I surmise exists mostly because people who work for Deloitte have little better to do than create slide decks that make you feel like they know what they're talking about. The very concept of 'agile' has been reduced to a credential, something you can <a href="https://www.pmi.org/certifications/agile-certifications">get certified in</a>. The moment you can pay someone to tell you that you're a 'Disciplined Agile Senior Scrum Master' I would suggest that your idea is no longer a <em>radical</em> one, and that's an important part of the commoditization in software development. Looking back to 2001, Agile was <em>genuinely</em> a radical and anarchist experiment in work! &quot;Individuals and interactions over processes and tools&quot; is an extremely radical concept given the popularity of scientific management in the modern capitalist enterprise.</p>
<p>If we think of Agile as a radical statement of culture and values, then it follows that DevOps is primarily a <em>response</em> to the commodification of Agile. Indeed, <a href="https://sites.google.com/a/jezhumble.net/devops-manifesto/">go read some of the notes</a> from early DevOpsDays. As Agile practitioners lost their ability to drive change in organizations through the commodification of the practice, something new was needed. In these early DevOpsDays we see, again, specific reactions to this process. &quot;DevOps is not a certification, a role, a set of tools, a prescriptive process&quot; (some readers may begin their cynical chuckling at this point). Without getting into the weeds too much, <em>DevOps worked in the same way Agile did</em>, as a way for individuals who didn't have a lot of power to claim some back. Agile and DevOps both were ultimately about workers who weren't &quot;in the room where it happened&quot; building a new room around them and forcing concessions by changing processes, policies, and culture.</p>
<p>Both DevOps and Agile were, fundamentally, about <em>changing the structural power dynamics of an organization</em>. The benefits to software quality and delivery are incidental to the equation. Effectively, you make better software when <em>the people building the software are the ones running the organization that builds the software</em>.</p>
<p>So, what happened to DevOps? <a href="https://hackr.io/blog/best-devops-certification">It might not surprise you that it also has been credentialized.</a> The commodification of DevOps has been far more rapid and drastic than Agile, I would say -- while Agile has mostly been turned into a product that you buy from rapacious consultants, DevOps is turning into an entire <em>suite</em> of literal products and services that promise to use artificial intelligence and machine learning to eliminate the human factor in DevOps entirely. You can find any number of tools, services, books, training courses, and so forth that sell you 'DevOps' or parts of it. Hell, there's about 500 people in the world whose entire job it is to fly to other parts of the world and tell you how to do DevOps (or how not to do DevOps); I'm one of them, I guess? There's a decent chance that if you're reading this, you're one of them too.</p>
<p>If there's one thing you can count on in cycles, it's that they tend to happen faster the more times they happen. With that in mind, is the rise of <a href="https://medium.com/@jdavidmitchell/principles-of-site-reliability-engineering-at-google-8382b054e498">Site Reliability Engineering</a> as a response to DevOps a terrible surprise? Originally developed at Google (and, quite honestly, <em>you can tell</em>), SRE can be thought of as a reaction to the commodification of DevOps and Agile but in an increasingly specific way; This makes sense, because we're running out of broad and obvious things to write manifestos about. Instead of talking about large cultural changes, SRE focuses on smaller, more discrete things like &quot;if something breaks, don't pile on the dude who pushed the bad change&quot; and &quot;you should use math to calculate interesting things about your system&quot;. Is SRE going to be subject to the same commodification and de-fanging that Agile and DevOps have been? That's still an open question -- certainly, 2020 has seen the launch of several open source and commercial products that tie in directly to some of the trends espoused by SRE; things like observability, blameless postmortems, SLOs, and so forth.</p>
<p>What will doom SRE to the fate of Agile and DevOps, though, is less its scope and more the <em>why</em> of SRE. Agile, DevOps, and SRE all share one thing in common -- they all attempt to reshape the power dynamics of a workplace, of a business. They're all attempting to ask what is 'socially necessary labor', and what the actual value of a widget is. It works, for a time, in the white-hot inferno of startup economics because money is free, and talent isn't -- but it's not a steady-state system. Revolutionaries have a tendency to get managed out, laid off, promoted and defanged, or to leave of their own volition. Perhaps this is just the natural state of trying to be a revolutionary under startup capitalism -- we can reinvent the past hundred years or so of labor theory every few months, but at the end of the day, the investors will get their returns one way or another.</p>
<p>So, what's the answer, what's the solution? I don't think there is one, at least, not necessarily. I'd caution readers about drawing any grand conclusions from what I've said here -- but I do think it's a useful analysis. Beware anyone who comes trying to sell you these commodities, however -- you really can't buy SRE, Agile, DevOps, or much of anything else when it comes down to <em>how</em> you do things. Ultimately, it's that <em>how</em> -- who gets to make decisions, how those decisions get made -- that determines more than you can ever really know. Maybe you really just need to unionize?</p>
]]></content:encoded>
<category>devops</category>
<category>software</category>
</item>
<item>
<title>Deserted Island DevOps Postmortem</title>
<link>https://aparker.io/deserted-island-devops-postmortem/</link>
<guid isPermaLink="true">https://aparker.io/deserted-island-devops-postmortem/</guid>
<pubDate>Mon, 04 May 2020 00:00:00 +0000</pubDate>
<description>In my experience, it’s the ideas that you don’t expect to work that really take off. When I registered a domain name a month ago for Deserted Island DevOps, I can say pretty confidently that I didn’t expect it to turn into an event with over 8500 viewers. Now that we’re on the other side of it, I figured I should write the story about how it came to be, how I produced it, and talk about some things that went well and some things we could have done better.</description>
<content:encoded><![CDATA[<p>In my experience, it’s the ideas that you don’t expect to work that really take off. When I registered a domain name a month ago for Deserted Island DevOps, I can say pretty confidently that I didn’t expect it to turn into an event with over 8500 viewers. Now that we’re on the other side of it, I figured I should write the story about how it came to be, how I produced it, and talk about some things that went well and some things we could have done better.</p>
<p>Popular lore now holds that the above tweet (<em>1/1/25: tweet deleted as I deleted my Twitter account</em>) was the genesis of Deserted Island DevOps, but I’d actually suggest that it was a Slack conversation I participated in at the end of February. The COVID-19 pandemic was ramping up, event cancellations were starting to compound, and an internal conversation was brewing around virtual events. With my usual confidence, I predicted that it wouldn’t be possible to put together a successful virtual event in a month. I’m willing to admit at this point that I was wrong on that one -- turns out, with sufficient hustle and a keynote speaker with nearly fifty thousand Twitter followers, you can do a lot of things. While Deserted Island DevOps was a wild success, I would be remiss to not point out events such as Gremlin’s <a href="https://www.gremlin.com/blog/announcing-failover-conf/">Failover Conf</a> and <a href="https://live.theleaddev.com/">Lead Dev Live</a> who both have run incredibly successful and well-produced virtual events on a short schedule.</p>
<p>Why did I think you couldn’t pull it off? Well, let’s put a finer point on it and ask a bigger question -- why do these events exist? Why do companies pay tens of thousands of dollars to exhibit at them? Lead generation! Everything exists to serve the almighty MQL funnel. An in-person event is an opportunity to scan badges in exchange for a t-shirt or a stress ball or some other piece of swag, and those scans convert to a data point on some KPI dashboard that rolls up to a VP of Marketing, and the theory goes that some percentage of those scans result in someone opening an e-mail for a reason other than to click ‘unsubscribe’, and some percentage of those people will click a link to a whitepaper or get a demo of your product or service, and some percentage of those people will eventually buy something from you. In a virtual event, it remains to be seen if this basic calculus applies. How valuable, exactly, is the human interaction you have at a booth in terms of you eventually opening that email? Do you feel compelled to sign up for a demo just because you got a t-shirt? We don’t really have the data to say one way or another at this point, but my grander thesis is that, beyond lead capture, the real thing a sponsor buys at an event (and the thing that you sell, as an attendee) is attention. When you’re trapped along with thousands of other souls on the show floor at a tech conference, your senses are under constant assault by a concentrated stream of capitalism. The “brand awareness” that a booth or sponsorship can generate is extremely high - what else are you going to pay attention to while you’re there? Even the coffee and pastries are sponsored by someone. The lanyard around your neck is probably branded!</p>
<p>Let’s compare this to a virtual event. You don’t have to watch the talks, or the interstitial segments. You can jump in for a talk and leave easily. Maybe you leave the Zoom (or Zoom-shaped object) on in the background, muted, during talks you don’t care about in order to browse the internet or keep doing work. Your attention is not present in the same way that it is at an actual show. That said, there’s a lot of questions we don’t have answers for yet - when you do pay attention, is that attention worth more, or less? Are you more engaged when you’re not physically “there”? Does the setting matter? I believe that it does on some level - any successful event will have an ethereality to it, something that takes you out of space and time and places you in a constructed moment (or in the work of <a href="https://en.wikipedia.org/wiki/Henri_Cartier-Bresson">Cartier-Bresson</a>, a “decisive moment”) that cements you in a world that is not your own. A chance meeting in a hallway over coffee or drinks, reconnecting with an old friend or acquaintance, a story shared that resonates just so - these moments, I believe, are what keeps us gathering for these events. How could you possibly replicate these moments when you only have a chat channel, a video feed, and the endless hellish screeching of our modern reality playing out in a Twitter feed or on CNN out of the corner of your eye?</p>
<p>So, yeah, I mostly decided to do this as a bit. As COVID-19 spread, and travel ceased, and offices emptied, I quickly pitched a format change for the on-again, off-again podcast I’d been hosting for the past year into a once (then thrice) weekly live stream on twitch.tv. My thought process was “well, people are stuck inside, they’re gonna want to watch something, and I have all this A/V junk…”, so you know, why not go for it? My colleague and co-conspirator Katy was game to ride as co-host, so off we went into the wild world of extremely minor Twitch stardom. I got a crash course in many of the more technical aspects of streaming - I knew a bit about OBS (Open Broadcaster Software) and other production tasks, but the best way to really learn more is by doing, so I did -- a lot. Throughout March, these streams helped build up not only our ability to technically produce live content, but also our rapport as co-workers. In the interest of not being boring, and because we both liked playing video games, we streamed video games on Fridays. Why not? One of our first game streams was Animal Crossing: New Horizons, and we both became enchanted with it. The art, the design, the whole package immediately captured our interest -- along with pretty much every other developer advocate I know.</p>
<p>By the end of March, it was clear that we were in for the long haul with our terrifying “new normal”, and ideas were percolating. I noticed that several Twitch streamers had started to create and run special events on their Animal Crossing islands -- casinos, game shows, things like that. I also had recently found a pattern in the game that let me build a stall, and apply a custom design to it - so I put the Lightstep logo on there and made it look like a chintzy trade show booth (complete with swag to give away!), took a screenshot, and tweeted it out. Tom McLaughlin replied “We should hold a conference in animal crossing.” and… well, that kinda counts as validation, right? At least two people replied to his reply, and before you know it I had a catchy name and what I felt was overwhelming support for the idea. One trip to Namecheap and some time in Photoshop later, we had a website and an (extremely) rough mockup.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-1.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-1-480.png 480w, https://aparker.io/images/deserted-island-devops-postmortem/image-1-960.png 960w, https://aparker.io/images/deserted-island-devops-postmortem/image-1.png 1024w" sizes="100vw" width="1024" height="577" loading="lazy" decoding="async"></figure>
<p>In all honesty, launching the page on April 1st was a bit of a hedge. If it flopped, then I could play it off as an April Fool’s joke. Instead, I had over a hundred registrations on the first day - for a conference with no announced speakers, no sponsors, and no actual idea of how to make it all work other than a vague “well, you can put things over things in OBS…” sense. At that point, I felt pretty locked-in, and started planning in earnest. Reader, I wish I could tell you that I exhaustively researched and consulted with experts in the space, but that would be a lie. I mostly went off my gut feelings and the spirit of events that I respected. I’ve been a huge fan of the DevOpsDays format and ethos - no vendor pitches, focus on community, let the speakers shine - and so I committed that this should be that. From the start, I resolved that registrations wouldn’t be shared with sponsors (or anyone else), and that I wouldn’t take a dime from anyone that wasn’t already paying me for my time. I didn’t feel like it’d be in the spirit of the event to monetize it in any way, really. After throwing together a call for proposals on Sessionize, I sat back and let the magic of the internet take hold - well, I also tweeted about it a lot and bugged a few people in person to submit talks. Katy and I formed the program committee, reviewing talks for inclusion. It was harder than expected to build a slate - there were over twice as many talks submitted as we had room for (even after expanding the speaker lineup; originally I had envisioned only 8 speakers, but we pushed it to 10) and they were all exceptionally good proposals.</p>
<p>Meanwhile, the world continued to turn, and I started to figure out how exactly this whole thing would fit together from a technical and organizational standpoint, and how to build a set inside Animal Crossing. Originally I had conceived of using an actual in-game prop as the backdrop to overlay slides against, but this proved to be unwieldy and ineffective for actually viewing what was on the slides. In addition, I quickly realized that the camera - while good at displaying my character in-game - wasn’t really suitable for the sort of camera work you’d want to do in producing an event. The default camera centers on your character in the game world, and even worse, when you’re not in your house it’s fixed on a plane facing north (so you can’t rotate it around you). There is an in-game camera feature that gives you more options (like pan, tilt, and zoom) but it introduces several visual overlays that can’t be disabled. I also had to consider, quite unexpectedly, the prospect of rain. Animal Crossing has seasonal weather, and in the spring, it rains a lot - what if it rained on the day of the event? I could “time travel” as it’s referred to by the AC community (adjusting the system clock of the Nintendo Switch to move to an arbitrary season and time of day) but this would introduce additional complications to already tightly scheduled logistics. These aren’t the only challenges I hit in-game, but it’s worth noting that they were there -- and they were part of the charm of the event, I believe. Limitations can drive creativity. This is especially true in software - the stories we remember and talk about are usually the ones where we had to overcome limitations in some creative way.</p>
<p>These challenges led to several conclusions on my part. First, I’d need to hold the event inside my house, rather than outdoors. This opened up flexibility in camerawork - indoors, you can rotate the camera 360 degrees around a fixed point - and in set decoration and design. In order to have a large enough space to fit everyone, though, I’d need a bigger room… which meant I needed to acquire about 2 million bells (in-game currency) to pay for those expansions and various furnishings (such as the podium, and the TV) needed to bring the vision together. The exact details of how I came by the in-game money aren’t really germane (there’s a system known as the ‘stalk market’ where you can buy and sell turnips; buy low, find an island online with high prices, travel there and sell them) but I did get to play Animal Crossing for my job, which was kinda neat.</p>
<p>The second major piece of the puzzle was actually producing the whole thing. Virtual events offer greater flexibility when it comes to how presentations are actually delivered compared to an in-person venue, of course, and what I’ve seen is an increasing reliance on pre-recorded talks. From the onset of planning for Deserted Island DevOps, I made a conscious decision to not rely on them -- despite the multiple nightmares I had leading up to the event about internet connections dropping out and speakers or viewers being left in limbo -- because I believe that a huge part of speaking is real-time feedback. Normally, you’d get this feedback from watching your audience. Are they paying attention? Nodding? Laughing at the jokes? This is one of the elements that I’ve found lacking in webinars and other virtual event tools. Indeed, they seem designed to remove you, the speaker, from the audience as much as possible. Zoom Webinar, GoToWebinar, etc. all remove audience cameras entirely from the screen, and hide chat away for the most part. It’s intended that you use narrow, purpose-built tools like Q&amp;A features in order for questions to be collected and responded to. This, to me, feels like such a huge departure from the way I like to talk to people that I find it difficult to be enthusiastic about it. While I was planning on streaming to Twitch anyway, I believed that this decision would reinforce the value of live talks, as Twitch has a convenient chat feature that’s integrated into the viewing experience both culturally and technically (the chat window, after all, is right there next to the video player. It invites you to participate), thus allowing speakers to get real-time encouragement and questions. In addition, I hoped having multiple speakers in a Zoom call (with some speaking, and some listening and reacting on the island) would give speakers a virtual audience they could speak to and address as humans. I’m happy to note this worked extremely well -- all the speakers loved it, so I’d recommend it to anyone trying to run their own virtual event. On that same note, I’ve had people who are in a more professional events marketing role ask me about the choice of platforms… I honestly think you should use what you have, and Twitch is a fantastic platform for live-streaming video. It handles scale flawlessly and seamlessly, you can provide closed captions, and chat moderation is fairly straightforward. I’d like to see a version of Twitch that maybe is aimed at a more midmarket audience -- like, I get it because I like watching people play video games, but I understand that the srs business crew might not be down with it, I dunno. I also believe that the event’s success can be traced, in large part, to the fact that the content wasn’t gated. You didn’t have to download an app, or log in, or sell your soul to watch the stream. You clicked a link, and it worked. This drove a lot of traffic organically, as well - while our concurrent viewer count never quite hit 1k (goals for next time, I suppose), we had over 8500 unique views and over 11k views total, so we had a pretty constant stream of people coming through and a decent number that were there the entire day. The discoverability of Twitch also made this interesting - we had enough viewers to be within the top ten live channels for our category the entire day; the best I saw us do was number three overall, but that’s pretty impressive for a DevOps conference. In a more general sense, I think things like this should just be accessible. Information wants to be free, y’know? Did we inspire someone, stuck at home, to something greater? Did someone watch this and think “huh, this could be an interesting career?” I don’t know, but I hope that we inspired people in some small way.</p>
<p>With a rough idea of the “how”, I set about building the solutions. The actual technical aspect of production was pretty straightforward, as I mentioned earlier. A pretty quick drawing of it follows, but I’ll explain it in more detail.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-2.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-2-480.png 480w, https://aparker.io/images/deserted-island-devops-postmortem/image-2.png 960w" sizes="100vw" width="960" height="720" loading="lazy" decoding="async"></figure>
<p>I use a Blackmagic ATEM Mini to capture HDMI video sources and send them to my computer - this is what drives the Sony DSLR I use as a webcam, but it can accept 4 separate HDMI inputs. It also offers convenient keying functionality on the unit itself, making it handy to take on the road for recording trainings, meetings, whatever. I originally planned to use it for my OPS Live! event at the Observability Practitioners Summit at KubeCon NA 2019, but it didn’t arrive in time, so I repurposed it as part of my home studio instead. Since Deserted Island DevOps didn’t involve any multi-source switching or compositing, it basically got to act as a fancy capture device for the Switch. The ATEM Mini is hooked up to a second monitor, which I used to watch the raw gameplay output from the Switch. ATEM acts as a video source when connected to a PC or macOS computer, so I was able to easily add it to OBS.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-3.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-3-480.png 480w, https://aparker.io/images/deserted-island-devops-postmortem/image-3-960.png 960w, https://aparker.io/images/deserted-island-devops-postmortem/image-3.png 1024w" sizes="100vw" width="1024" height="417" loading="lazy" decoding="async"></figure>
<p>A complete breakdown of what you can do in OBS would be beyond my ability or desire to detail in this piece, but this is basically what I stared at all day. Each overlay was broken down into a specific OBS scene, which I controlled through an Elgato Stream Deck -- a convenient little piece of kit that gives you 15 physical buttons that can be mapped to various actions, like ‘switch scene’ or ‘start recording’. The Stream Deck was a huge convenience, since I could easily flip between the slides taking up a corner or full screen with one press of a button without having to click around in OBS. This was rather valuable, as I needed to split my focus in a few different directions while producing - I kept my Switch controller nearby so I could adjust the in-game camera as presenters moved around during talks, while switching between scenes with the Stream Deck, while clicking between the Twitch chat mod view, various Discord channels, Twitter, the show plan in Google Docs, and so forth. I also kept a notebook next to me with big handwritten notes like “REMEMBER TO RECORD” and “MY MIC IS ON A2 IN VOICEMEETER”.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-4.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-4.png 480w" sizes="100vw" width="480" height="292" loading="lazy" decoding="async"></figure>
<p>A fancy audio routing trick with Voicemeeter and OBS let me talk to the Zoom call without my audio being routed out to Twitch, so we could effectively communicate (at least one-way) about switching slides and so forth. I’d ask the next presenter to share their screen, go into OBS and modify the crop filter on the Zoom source (because everyone’s slides were a little bit different…) and resize everything just so. One fun trick - you might notice that I don’t have separate scenes for each speaker. That’s because all of the text for the stream graphics (speaker, talk title, etc.) was actually stored in a file. Stream Deck macros allowed me to switch between speakers with a press of a button, by overwriting the text in several files which OBS would then load. This saved a lot of time in creating assets, since I only had to create them once rather than for each speaker separately.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-5.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-5-480.png 480w, https://aparker.io/images/deserted-island-devops-postmortem/image-5.png 503w" sizes="100vw" width="503" height="292" loading="lazy" decoding="async"></figure>
<p>Since I was already pretty deep into running things live, how did I make the intro? That was some special Blender stuff, right? Well, no. Since I already had the text box as a vector, I exported it as an SVG, made a web page, and added some JavaScript to mimic the ‘typewriter’ effect of how text appears in Animal Crossing. I (painstakingly) added in the animalese sound effects and synced them up through another Stream Deck macro, but literally as I write this I realize I could have just imported it into my local web site and probably synced it up that way. Note for the future, I guess. (Those ‘10ms’ delays aren’t actually all 10ms, I just got lazy with the labels). The pictures overlaying the welcome screen at the end were created by me making a bunch of different images with a new screenshot each time, and turning them into an OBS slideshow. OBS lets you start a media source as you transition into a scene, so adding in the ACNH theme was as simple as ripping it from YouTube into an mp3 (thanks, youtube-dl!) and adding it as a media source to the ‘Welcome’ scene. The macro took care of the rest! I also used it to make the short promo gif that I posted to Twitter, but with different source text. OBS has a ‘Browser Source’ that you can load any URL into, even a local one, so I simply imported the HTML file into OBS through that and we were off to the races. Fun side note 2 - I couldn’t convert the ripped TT font that I was using into a webfont, but CSS don’t care - you can give it the path to a local truetype font and it’ll load it no problemo.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-6.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-6.png 471w" sizes="100vw" width="471" height="546" loading="lazy" decoding="async"></figure>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-7.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-7-480.png 480w, https://aparker.io/images/deserted-island-devops-postmortem/image-7-960.png 960w, https://aparker.io/images/deserted-island-devops-postmortem/image-7.png 1024w" sizes="100vw" width="1024" height="746" loading="lazy" decoding="async"></figure>
<p>There’s a lot of things I’d do differently if I did this again - I’m not sure if I’d use OBS, for one. While it’s free and effective, there are some pretty basic features that it lacks - snap-to-grid for elements, foremost amongst them. It also has very limited options in terms of text overlays; You can create some basic ones, and source the text from a file, and even make the text scroll… but that’s about it. I know other tools exist like Wirecast from Telestream, and maybe I’ll look at that since I already have a license, but we’ll see. XSplit is another I’m aware of. I wanted to add in something like an overlay to display tweets or Twitch chat, but I found the options kinda frustrating to deal with in a short amount of time (realistically, I only spent a week working on the production prep side of things, not that long in all) so I went with what I had.</p>
<p>Behind the scenes, there’s a lot that happened as well. We recruited - last minute, somewhat - a small team of moderators. I made the decision about a week out to create a Discord server for the event, based on concerns we had about both moderating Twitch chat and the fact that it would be difficult to reliably get questions and answers done in there. I thought about a Slack for a minute, but I’m already in a million Slacks and hate most of them. Even the name frustrates me. It was cool when the other option was Hipchat, but now that it’s the Enterprise Communication Tool Of Choice it just reminds me of a pair of Dockers, which is actually kinda ironic. That said, when your competition goes by ‘Meet’ and ‘Teams’ I guess you kinda win by default. Katy was invaluable in helping source moderators from the early adopters of our Discord, who we quickly (as in, like, a day before) empowered to keep both the Discord and the Twitch chat clear of nuisances. I also configured a Twitch bot (Nightbot) to automatically moderate the chat, adding in a plethora of bad words and other restrictions on spamming, caps lock, emote spam, etc. It was tuned a bit too tight at first -- and honestly, maybe a bit too tight overall, because it timed out a bunch of people when the first talk ended… but, that’s a learning going forward. Setting the chat to followers-only on Twitch worked well, though - I probably overfitted for reducing the amount of human moderation we’d need to do by being too aggressive on automatic moderation and limiting things, but I was somewhat worried about the worst of the internet showing up, especially if the event went viral(er). It was nice to see that not happen! Our Discord and Twitch, for the most part, were well-behaved and polite and convivial. The watch parties that spawned out of the Discord were also absolutely delightful to see - people shared screenshots of themselves hanging out on other attendees’ islands, usually with some sort of ‘viewing space’ set up. We were also able to quickly make adjustments as needed - we added new channels as the event was going on in order to keep the main chat more clear (a ‘hallway track’ channel for discussing the current talk became very popular), and most crucially the entire experience of onboarding to Discord was completely self-serve. Our Twitch bot would mention the Discord invite link every 15 minutes (or on demand through a command), allowing people to join and get involved. Moderators and the host could look through the Q&amp;A channel in order to find questions to ask live, and speakers could head there afterward to have more in-depth conversations. Overall, I think the entire chat experience went really well.</p>
<figure><img src="https://aparker.io/images/deserted-island-devops-postmortem/image-8.png" alt="" srcset="https://aparker.io/images/deserted-island-devops-postmortem/image-8-480.png 480w, https://aparker.io/images/deserted-island-devops-postmortem/image-8-960.png 960w, https://aparker.io/images/deserted-island-devops-postmortem/image-8.png 1024w" sizes="100vw" width="1024" height="576" loading="lazy" decoding="async"></figure>
<p>Another important part of the experience was our captioning. This part was pretty easy, or at least pretty hands-off. I asked other event (in-person and virtual) organizers about what captioning tools they use once I realized that machine transcription was going to be utter trash, and found White Coat Captioning through that. The actual mechanics of it were extremely straightforward - the captioner sat in the Zoom call and… well, transcribed what people were saying. They used a service called StreamText to send the captions out to a website that I was also able to pull caption data from using an OBS plugin. This plugin sent the captions to OBS, which encoded them and sent them to Twitch along with the video feed. We got the captions afterwards, so I can add them to the chapterized video on YouTube. This was 100% worth it, and I think every virtual event should follow suit (also in-person events!).</p>
<p>Let’s see… laundry list time, because I’m tired of keeping this in narrative form. Things that worked, things that didn’t, and things that just irritated me.</p>
<ul>
<li>Nintendo, I would really be happy if I could turn off the overlay in camera mode. The little viewfinder frame ticks bug me so much, and they were impossible to completely hide.</li>
<li>One of our speakers had an interactive segment. I thought it was just a poll, so I didn’t mention it during prep, but it turns out it also had an interactive “write whatever” part. Props to the community for rapidly downvoting some of the racism, but it was on stream for a little bit, so I’ll have to clip that out of the final uploads. Be sure to check this sort of stuff in the future!</li>
<li>There’s a lot of things I didn’t think about until I needed them. Stuff like social cards to promote talks and speakers and the show, a press kit, etc. etc. Would be good to have that ready in the future.</li>
<li>Crap! I forgot to have everyone throw out party poppers at the end. In addition, there’s probably a hundred little ideas I had that didn’t happen because I forgot to write them down.</li>
<li>I think the registration process was a bit lacking, but I was optimizing for easy and non-invasive over anything else. That said, there’s a couple of little things that would have been nice that I didn’t do, like having links to the Discord on the confirmation page.</li>
<li>In terms of polish, I wish there was a way in Zoom to have fancier audio routing. I was able to keep myself muted and talk into the Zoom call to cue people up, but everyone else talking on it could be heard all the time. This wasn’t a huge issue, but it would be nice to have something where the host audio is routed independently of the rest of the Zoom participants. I dunno, I think you could do this by having a Zoom call and a Discord call going to different virtual audio cables, then muting those at the appropriate times? Something to investigate more.</li>
<li>The color grading for the output seemed pretty bad… like, the white balance was off somehow? I’m not really sure how much of it was simply Twitch’s compression versus something on my end. Screenshots that I saw seemed mostly fine.</li>
<li>It would have been nice to have two cameras in-game, but that takes a slot away from a person. I was really trying to make sure we didn’t have to do a lot of getting people in or out of the island except for breaks, so I wanted to maximize the number of actual participants that could be in game simultaneously. I do think that we did a great job here - there was only one accidental disconnect, and it was going into a break anyway, so it didn’t really mess things up much.</li>
<li>Some people have asked why the event started at 10 AM ET… well, I live on the east coast, so that’s a convenient time for me. It wasn’t just that - I wanted a time that was convenient for western Europe as well, so I figured I might as well split the difference.</li>
</ul>
<p>All in all, I think it went extremely well - better than I had anticipated, certainly. We’ve had writeups in <a href="https://www.techrepublic.com/article/5-weird-cool-things-i-learned-from-attending-deserted-island-devops-on-animal-crossing/">TechRepublic</a>, <a href="https://www.vice.com/en_us/article/z3bjga/this-tech-conference-is-being-held-on-an-animal-crossing-island">Vice</a>, <a href="https://techcrunch.com/2020/05/02/virtual-worlds-video-games-coronavirus-social-networks-fortnite-animal-crossing/">TechCrunch</a>, and <a href="https://venturebeat.com/2020/05/01/ai-weekly-animal-crossing-iclr-and-the-future-of-research-conferences-online/">VentureBeat</a>, so I’ve certainly given the content mill some grist for 15 minutes (and what greater honor than that, in tyool 2020?). The live event had 11,825 live views with 8,582 unique viewers, and as of this writing, another 1,556 views of the full video. By the time this post goes up, the individual talk videos will also be up on YouTube, so we’ll see how those do.</p>
<p>Stepping back, I wonder what’s next? I enjoyed doing this, and maybe we’ll do it again later this year, but I’m not sure. I think there’s going to be a wave of people trying to do similar things, some of which will probably be better produced, better funded, whatever - the ideas are free, and I’ll be interested to see what happens. In a longer-term sense, I’m a fan of the idea that we use games as a social space to bring people together in the way that physical events do, and I’m pretty convinced that we’re going to need <em>something</em> that isn’t just endless Zoom webinars for event spaces. This obviously isn’t a new notion - I mean, Second Life has been a thing forever (who can forget that brief moment in the 2008 US Presidential election when Newt Gingrich held town halls in SL?), and there are newer platforms like VRChat or Altspace…</p>
<p>I think the thing that makes AC work is its simplicity. In better times, a Switch and a copy of ACNH is a pretty small investment to make compared to a VR rig, but more important than that, ACNH is just… simple. You don’t need to be an expert at much of anything to move around, to dress up, to interact in ACNH. You just kinda do things, and it’s cute. Like I said earlier, limitations can be helpful - and they can be valuable, too. One of our speakers wore a tiny crown (which costs like a million bells or whatever, it’s an expensive fashion item) and Twitch chat started popping off - why? Well, because if you play AC, you know what that thing is worth, you know it’s a status symbol. Completely freeform platforms don’t really have that same sort of cachet. Like it or not, human social interaction comes with a lot of subtle cues and markers that infinitely expressive platforms can’t replicate, simply because they’re infinitely expressive. You need a smaller set of defined ‘rules’ about your platform that are easy for people to grasp and to know. I think this is somewhat important to providing a convincing proxy for real events, because it provides a way for us to be expressive and for that expressiveness to be approachable. Let’s ignore, for a moment, that it also reflects existing inequities in some ways; that’s a blog for another time.</p>
<p>What I’m most proud of, though, is the community that formed around this event. I think that, in and of itself, is enough reason to keep pushing forward and doing these kinds of events in the future. Everyone came as a stranger but left as a friend, and in our increasingly uncomfortable world, a new friend is pretty valuable indeed.</p>
<p>I hope this answers your burning questions about how this whole thing worked. If you've got more, feel free to bug me on Twitter! Want to watch the talks? They're in this playlist on <a href="https://www.youtube.com/watch?v=tb4jg06e_Vk&amp;list=PLVUQjiv8GtwL-B9AJJ-rNdiDtcU2wo7Gy">YouTube</a>.</p>
]]></content:encoded>
<category>community</category>
<category>devops</category>
</item>
<item>
<title>Mono in Debian 9 Containers</title>
<link>https://aparker.io/mono-in-debian-9-containers/</link>
<guid isPermaLink="true">https://aparker.io/mono-in-debian-9-containers/</guid>
<pubDate>Sun, 11 Nov 2018 00:00:00 +0000</pubDate>
<description>Running Debian 9 and need to install the mono repository? You'll find advice for Debian 8 that suggests using the following:</description>
<content:encoded><![CDATA[<p>Running Debian 9 and need to install the <code>mono</code> repository? You'll find advice for Debian 8 that suggests using the following:</p>
<pre><code class="language-bash">$ sudo apt install apt-transport-https dirmngrnsudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EFnecho &quot;deb https://download.mono-project.com/repo/debian stable-stretch main&quot; | sudo tee /etc/apt/sources.list.d/mono-official-stable.listnsudo apt update
</code></pre>
<p>When it comes time to <code>docker build</code>, you might see the following:</p>
<pre><code class="language-bash">Step 6/12 : RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys A6A19B38D3D831EFn---&gt; Running in abbbdefb9d15nExecuting: /tmp/apt-key-gpghome.GbZgRWnneE/gpg.1.sh --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys A6A19B38D3D831EFngpg: cannot open '/dev/tty': No such device or addressnThe command '/bin/sh -c apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys A6A19B38D3D831EF' returned a non-zero code: 2
</code></pre>
<p>Don't despair! The following line in your Dockerfile (replacing the <code>apt-key adv</code> command) will get you going:</p>
<pre><code class="language-dockerfile">RUN curl https://download.mono-project.com/repo/xamarin.gpg | apt-key add -
</code></pre>
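<p>For reference, here's a rough sketch of that as a complete Dockerfile - the <code>debian:9</code> base image and the <code>mono-devel</code> package are just examples of what I'd pick, so adjust to whatever your build actually needs:</p>
<pre><code class="language-dockerfile">FROM debian:9

# Install the tooling needed to add the repo, trust the mono signing key via
# curl (avoiding the /dev/tty problem above), add the stable-stretch repo,
# and install mono, all in a single layer.
RUN apt-get update \
    &amp;&amp; apt-get install -y --no-install-recommends apt-transport-https ca-certificates curl gnupg \
    &amp;&amp; curl https://download.mono-project.com/repo/xamarin.gpg | apt-key add - \
    &amp;&amp; echo &quot;deb https://download.mono-project.com/repo/debian stable-stretch main&quot; &gt; /etc/apt/sources.list.d/mono-official-stable.list \
    &amp;&amp; apt-get update \
    &amp;&amp; apt-get install -y --no-install-recommends mono-devel \
    &amp;&amp; rm -rf /var/lib/apt/lists/*
</code></pre>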
]]></content:encoded>
<category>tutorial</category>
</item>
<item>
<title>OpenTracing for ASP.NET MVC and WebAPI</title>
<link>https://aparker.io/opentracing-for-asp-net-mvc-and-webapi/</link>
<guid isPermaLink="true">https://aparker.io/opentracing-for-asp-net-mvc-and-webapi/</guid>
<pubDate>Sat, 13 Oct 2018 00:00:00 +0000</pubDate>
<description>Preface - I really like what Microsoft is doing with .NET Core and ASP.NET Core.</description>
<content:encoded><![CDATA[<p>Preface - I really like what Microsoft is doing with .NET Core and ASP.NET Core.</p>
<p>However, the horror they've unleashed upon the world in the form of ASP.NET MVC and WebAPI is a sin that will take more than a few moons to wash away. That said, quite a few people are still building software using this stuff and I got curious how you'd do instrumentation of it via <a href="https://opentracing.io">OpenTracing</a>. This post is the result of several hours of hacking towards that end.</p>
<h2 id="action-filters-for-fun-and-profit">Action Filters For Fun And Profit</h2>
<p>It's actually pretty straightforward, assuming you know what to Google and can handle the absolute <em>state</em> of documentation that's available. At a high level, here's how it works. ASP.NET - similar to Java Servlets - provides <a href="https://docs.microsoft.com/en-us/aspnet/mvc/overview/older-versions-1/controllers-and-routing/understanding-action-filters-cs">Action Filters</a> which are simple lifecycle hooks into the HTTP request pipeline. There's four interfaces you can target if you want to be more specific, but a fairly trivial implementation of a Logger can be done like so:</p>
<pre><code class="language-csharp">public class CustomLogger : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        Debug.WriteLine($&quot;executing controller: {filterContext.RouteData.Values[&quot;controller&quot;]}&quot;);
        // etc etc...
    }

    public override void OnResultExecuted(ResultExecutedContext filterContext)
    {
        Debug.WriteLine($&quot;result complete in controller: {filterContext.RouteData.Values[&quot;controller&quot;]}&quot;);
        // etc etc...
    }
}
</code></pre>
<p>Pretty straightforward, like I said. There's also OnActionExecuted and OnResultExecuting, which are called after a controller action and before a controller action result, respectively.</p>
<p>So you'd think it'd be pretty easy, right? OpenTracing provides a handy GlobalTracer singleton, so create a TracingFilter...</p>
<pre><code class="language-csharp">public class TracingFilter : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var routeValues = filterContext.RouteData.Values;
        var scope = GlobalTracer.Instance.BuildSpan($&quot;{routeValues[&quot;controller&quot;]}&quot;).StartActive();
        scope.Span.SetTag(&quot;action&quot;, routeValues[&quot;action&quot;].ToString());
    }

    public override void OnResultExecuted(ResultExecutedContext filterContext)
    {
        var scope = GlobalTracer.Instance.ScopeManager.Active;
        scope.Span.Finish();
    }
}
</code></pre>
<p>Then in your RegisterGlobalFilters method, do a quick filters.Add(new TracingFilter()), register a Tracer, and away you go! Right?</p>
<p>Wrong.</p>
<p>Well, half-right.</p>
<h2 id="that-sounds-like-me-yeah">That Sounds Like Me, Yeah.</h2>
<p>Assuming you're <em>only</em> using MVC, you're right. So you'll see spans for, say, GETting your index page, but not for any of your API routes. Why? Because there's <em>two</em> ActionFilterAttributes. The one we just did is System.Web.Mvc.ActionFilterAttribute. Want your WebAPI traced too? Time to create a System.Web.Http.Filters.ActionFilterAttribute. You can tell them apart by the extremely different method signatures, as seen here -</p>
<pre><code class="language-csharp">public class WebApiTracingFilter : ActionFilterAttribute
{
    public override void OnActionExecuting(HttpActionContext actionContext)
    {
        var scope = GlobalTracer.Instance.BuildSpan(actionContext.ControllerContext.ControllerDescriptor.ControllerName).StartActive();
        scope.Span.SetTag(&quot;action&quot;, actionContext.ActionDescriptor.ActionName);
    }

    public override void OnActionExecuted(HttpActionExecutedContext actionExecutedContext)
    {
        var scope = GlobalTracer.Instance.ScopeManager.Active;
        scope.Span.Finish();
    }
}
</code></pre>
<p>Yeah, that took me a few minutes and <a href="https://stackoverflow.com/a/29352433/7933630">this StackOverflow answer</a> to puzzle out. <em>c'est la vie</em>.</p>
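<p>For completeness, the registration step I hand-waved over looks something like the following. This is just a sketch - I'm using Jaeger's C# client and a made-up service name (&quot;aspnet-sample&quot;) purely for illustration, so register whatever ITracer implementation you're actually using:</p>
<pre><code class="language-csharp">// Global.asax.cs
using System.Web.Http;
using System.Web.Mvc;
using System.Web.Routing;
using Jaeger;            // example tracer implementation, swap for your own
using Jaeger.Samplers;
using OpenTracing.Util;

public class MvcApplication : System.Web.HttpApplication
{
    protected void Application_Start()
    {
        // Register a concrete tracer with the GlobalTracer singleton.
        var tracer = new Tracer.Builder(&quot;aspnet-sample&quot;)
            .WithSampler(new ConstSampler(true))
            .Build();
        GlobalTracer.Register(tracer);

        // The usual template plumbing.
        AreaRegistration.RegisterAllAreas();
        GlobalConfiguration.Configure(WebApiConfig.Register);
        RouteConfig.RegisterRoutes(RouteTable.Routes);

        // MVC and WebAPI keep separate global filter collections, so each
        // attribute has to be added to its own pipeline.
        GlobalFilters.Filters.Add(new TracingFilter());
        GlobalConfiguration.Configuration.Filters.Add(new WebApiTracingFilter());
    }
}
</code></pre>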
<p>That said, this is pretty much the hard part. Since you've got spans being automagically started and finished whenever a request hits the pipeline, you can implicitly use those parent spans inside a controller to create children:</p>
<pre><code class="language-csharp">[WebApiTracingFilter]
public class ValuesController : ApiController
{
    public IEnumerable&lt;string&gt; Get()
    {
        var returnValue = getCurrentTime();
        return new string[] { returnValue };
    }

    private string getCurrentTime()
    {
        using (var scope = GlobalTracer.Instance.BuildSpan(&quot;getCurrentTime&quot;).StartActive())
        {
            return DateTime.Now.ToShortDateString();
        }
    }

    // and so forth...
}
</code></pre>
<p>You can also get fancy with your OnActionExecuted/OnResultExecuted filters by checking for exceptions coming in and adding stack traces to your span logs.</p>
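<p>Here's a rough sketch of what that might look like in the WebAPI filter - the error tag and log fields follow the OpenTracing semantic conventions, but the rest of the shape is just my own preference:</p>
<pre><code class="language-csharp">using System.Collections.Generic;
using System.Web.Http.Filters;
using OpenTracing.Tag;
using OpenTracing.Util;

public class WebApiTracingFilter : ActionFilterAttribute
{
    // ...OnActionExecuting stays the same as before...

    public override void OnActionExecuted(HttpActionExecutedContext actionExecutedContext)
    {
        var scope = GlobalTracer.Instance.ScopeManager.Active;
        if (scope == null) return;

        if (actionExecutedContext.Exception != null)
        {
            // Mark the span as errored and attach the exception and stack trace.
            Tags.Error.Set(scope.Span, true);
            scope.Span.Log(new Dictionary&lt;string, object&gt;
            {
                { &quot;event&quot;, &quot;error&quot; },
                { &quot;error.object&quot;, actionExecutedContext.Exception },
                { &quot;stack&quot;, actionExecutedContext.Exception.StackTrace }
            });
        }

        scope.Span.Finish();
    }
}
</code></pre>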
<p>If you'd like to check out the complete sample project I made, it's <a href="https://github.com/austinlparker/OpenTracing.TracingFilter">on GitHub</a>.</p>
]]></content:encoded>
<category>opentelemetry</category>
<category>tutorial</category>
</item>
</channel>
</rss>
