It is 9:59 AM in Melbourne, 9th October, 2024. Sunlight filters through my windows, illuminating swirling motes of dust across my living room. There is a cup of tea in my hand. I take a sip and savor it.
I text the other senior engineer on the team, who unlike me is full-time: "I'm ready to start at 10", as is our custom.
The minute hand moves.
It is 10:00 AM in Melbourne, 9th October, 2024. The sun is immediately extinguished and replaced by a shrieking skull hanging low in a frigid sky. I glance down at my tea, and it is as blood. I take a sip and savor it.
I text the other senior engineer on the team: "Are you ready to enter the Pain Zone?"1, as is our custom.
The Pain Zone, coated in grass which rends those who tread upon it like a legion of upraised spears, is an enterprise data warehouse platform. At the small scale we operate at, with little loss of detail, a data warehouse platform simply means that we copy a bunch of text files from different systems into a single place every morning.
The word enterprise means that we do this in a way that makes people say "Dear God, why would anyone ever design it that way?", "But that doesn't even help with security" and "Everyone involved should be fired for the sake of all that is holy and pure."
For example, the architecture diagram which describes how we copy text files to our storage location has one hundred and four separate operations on it. When I went to count this, I was expecting to write forty and that was meant to illustrate my point. Instead, I ended up counting them up three times because there was no way it could be over a hundred. This whole thing should have ten operations in it.
Retrieve file. Validate file. Save file. Log what you did. Those could all be one point on the diagram, but I'm being generous. And you can keep the extra six points for stuff I forgot. That's ten. Why are there a hundred and four? Sweet merciful Christ, why?
At two of the four businesses I've worked at, the highest-performing engineers have resorted to something that I think of as Pain Zone navigation. It's the practice of never working except while pair programming, simply to have someone next to you, bolstering your resolve, so that you can gaze upon the horrors of the Pain Zone without immediately losing your mind. Of course, no code alone can make people this afraid of work. Code is, ultimately, characters on a screen, and software engineers do nothing but hammer that code into shapes that spark Joy and Money. The fear and dread come from a culture where people feel bad that they can't work quickly enough in the terrible codebase, where they feel judged for slowing down to hammer the code into better shapes that sadly aren't on the Jira board, and where management looks down on people who practice craftsmanship.
The last doesn't even require malicious management — it just needs people that don't respect how deep craftsmanship can go. These are the same people that do not appreciate that an expert pianist is not simply pressing keys; they are obsessively perfecting timing and the force applied to each key, alongside dozens of factors that I can't comprehend. An unthoughtful person will see something and think "It can't be that hard" or, more generously, "That looks quite hard". The respectful thought to have when viewing any competent professional in a foreign domain, in every domain that I'm aware of, is "That must be way harder than it looks."
I have now seen enough workplaces, and thanks to the blog have access to enough executives, to know that this is what most cultures degenerate into. Terrible companies are perpetual cognitohazards where everyone is bullied all day. The median companies (which some people call "good" for lack of ever having seen better) lack the outright bullying but still consist of people that are trying to convince themselves that it's fine to feel disempowered or subservient all day. There are times in my life where I have to deal with this, but it is hardly fine. The best are places where you can get at least some of the things that a person needs other than rent money2.
This place has a better culture than most, as bullying is mostly not tolerated, we have fully remote work, and it is a terrible faux pas to outright accuse someone of not working fast enough, though you can hint at it gently.
It is worse than most places at software engineering, as... oh, you'll see.
In any case, I have a deal with the team. Every morning, grab your coffee, attend your meetings, and at 10 AM we navigate the Pain Zone together for at least three to four hours. Management is blissfully unaware that this force of camaraderie and mutual psychotherapy is the only way that things continue to limp along.
We have one simple job today. The organization wants to know: is the data we're supposed to be ingesting every morning actually landing in the platform?
Someone has already landed all of the logs our system produces in the data warehouse, so we can examine them in there, alongside the actual data. Is that smart? I dunno, something feels a bit weird about it but I have no concrete objection. My co-navigator and I decide to look at the logs for one data source.
This is where, five seconds in, we begin to become lost in the Pain Zone.
Let's say the data source we picked was "Google Analytics". We search the landed logs for Google Analytics, expecting to see something like this.
| Source |
| --- |
| Google Analytics |
Here is what I actually see in the source column, and yes, it really does look exactly this bad.
| Source |
| --- |
| 6g94-8jjf-eo84757h4758z", "jobStatus": "JobStatus.Waiting", "jobExpiry":"2023-10 |
That... is not "Google Analytics". In fact, what the fuck is that? It looks like someone has dumped a random snippet of JSON into the logs, but not even the entirety of the JSON. The strings aren't terminated. We should have around fifty source systems, so how many distinct source systems appear in— FIFTY-SEVEN THOUSAND?
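For the record, the check that surfaces this is about as basic as SQL gets. Here's a sketch, reusing the log_table and source names that appear in the query later in this post.

```sql
-- Sanity check: we expect roughly fifty distinct source systems.
-- Instead this comes back with fifty-seven thousand, because half the
-- "sources" are random fragments of JSON.
select count(distinct source) as distinct_sources
from log_table;
```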
We've been writing total nonsense to half the logs for over a year and no one noticed? We only have two jobs. Get the data and log that we got the data. But the logs are nonsense, so we aren't doing the second thing, and because the logs are nonsense I don't know if we've been doing the first thing.
I take a deep breath. The plan is to submit my notice on December 2nd anyway, so this is fine. This is so fine. The other engineers already know I'm leaving, and we've all committed to do the best we can for two months for our spiritual growth. The ability to do painful things is a virtuous skill to cultivate as a responsible adult.
Okay, how is this happening? Well, it turns out that we're embedding a huge amount of metadata in filenames, and the Lambda functions that produce all of this — of course, we're serverless, because how can you hurt yourself without a cutting edge? — use lots of regex to extract that metadata. Unfortunately, because we don't have any tests, someone eventually wrote some code to download data that passed a big JSON blob instead of a filename to the logging function, and that function happily went "Great, I'll just regex out the source system from the file name!" Except it wasn't a filename, so it has instead spewed garbage into the system for months.
I find something like this every time we enter the Pain Zone. Sometimes we've laughed so hard that we've cried at the things we've seen $2,000 per day consultants do.
The issue is raised with the team, but because this critical error in our auditability is not on the board and Velocity Must Be Up, fixing the logs is judged to be less important than... parsing... the nonsense logs. Why? We have another saying on our team, which is "Stop asking questions, you're only going to hurt yourself".
I take another deep breath.
Okay, we'll continue with the work instead of fixing the critical production error. We can't query the Google Analytics stuff based on source system, so let's pick another one. We also draw data from Twitter once every two hours, and the source column isn't broken for that one. I just need to be able to associate the log events to begin working on it. That is, we'll have one log that says "I downloaded the data from Twitter" and another log that says "I checked that it had all the correct stuff in it", and I just have to tie them together.
I don't like the way this table is configured for various reasons, but I'm expecting to see:
| Event ID | Source | Event | Success |
| --- | --- | --- | --- |
| 1 | Twitter | Downloaded | True |
| 1 | Twitter | Validated | True |
Then I can just do something like:
```sql
select
    event,
    success
from
    log_table
where
    event_id = 1
    and source = 'twitter'
```
Now I can see if all the correct stuff happened.
But I cannot find an event_id column or anything that looks like one. I hit up the expert on this system, and am informed that I should use the `awslog` column. I look at it.
It looks like this:
| awslog | Event |
| --- | --- |
| converted/twitter/retweets_per_post/year=2023/month=03/day=11/retweets_per_post_fact-00045-8b3226g9.txt | Validated |
I mean, firstly, what the fuck is this? Secondly, what the fuck is this? Thirdly, well, you get it. Why not just store this in a relational format? Why are they all in one column? Why do you hate me specifically?
Stop asking questions, you're only going to hurt yourself.
I am expected to use regular expressions to construct a key in my query. As far as I can tell, the numbers and letters don't represent or uniquely identify anything; they've really just been appended for no reason. I waste a fair amount of time figuring out whether I can use them.
December 2nd, I tell myself. Of course, I could be working on a book, shipping a hobby project, and dedicating more time to the business we're committing to in January, but December 2nd was the plan. It will be a great exercise to come up with a plan to gradually refactor all of this while delivering the things we're supposed to. It will make me a better engineer. December 2nd, December 2nd, hold fast.
Okay, we can write a regular expression to identify all Twitter sources that came from 11/03/2023. This is very stupid, but compared to minimum wage in my home country, I am being compensated spectacularly to deal with this particular brand of stupidity.
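In practice, that filter looks something like the sketch below. It reuses the log_table name from the earlier query and assumes the partition-style path layout shown in the sample awslog row; the exact pattern would need adjusting per source.

```sql
-- Sketch: find every Twitter file logged for 11/03/2023 by pattern-matching
-- the metadata embedded in the file path, since that's the only place it lives.
select
    awslog,
    event
from
    log_table
where
    awslog like 'converted/twitter/%/year=2023/month=03/day=11/%';
```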
But wait, we retrieve this once every two hours, which means that while I can find all the Twitter data pulls from the 11th, I can't actually tell which rows are associated with the 8 AM run versus the 2 PM run. This perplexing `awslog` column only identifies things down to the day, not the hour. We have another column that logs, down to the second, the exact time that a Lambda function has fired, but each step happens at a slightly different time, and each source takes a different amount of time depending on file size.
I message the team. "Any ideas for how to identify specific runs that don't assume there is only one run per day?"
We take a ten minute break. We return.
I am informed that there is no way to do this. All I can think of is to create a heuristic per data source, such that I see when the file was acquired, then scan for the validation event that happens closest to the acquisition event without going so far ahead that I read the next successful validation event by mistake. I just wanted to see if data was landing in the platform. And to make things worse, I suddenly remember that I've seen this `awslog` thing before. A month after I joined the business, I saw it, and I said that it was unacceptably bad. The response was that it's okay because all the data we want is technically inside those strings, and this design is more flexible. Of course, since then we've added our first data source that is downloaded more than once a day, so it turns out, shockingly, that they should have Just Used Postgres and not tried to be excessively clever. As always.
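For completeness, the heuristic I'm describing would look roughly like the sketch below. It assumes a hypothetical event_time column holding the per-second Lambda timestamp mentioned above, and reuses the Downloaded and Validated event names from the earlier example. This is not code anyone should be proud of.

```sql
-- Sketch of the per-source heuristic: for each Downloaded event, find the
-- nearest following Validated event, stopping before the next download so
-- we don't claim someone else's validation by mistake.
with downloads as (
    select
        event_time as download_time,
        lead(event_time) over (order by event_time) as next_download_time
    from log_table
    where awslog like 'converted/twitter/%'
      and event = 'Downloaded'
)
select
    d.download_time,
    min(v.event_time) as matched_validation_time
from downloads d
left join log_table v
    on v.awslog like 'converted/twitter/%'
   and v.event = 'Validated'
   and v.event_time >= d.download_time
   and (d.next_download_time is null or v.event_time < d.next_download_time)
group by d.download_time;
```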
How have we been running things like this for two years? Millions of dollars were spent on this system. Our CTO, who has never written code themselves, gets on stages every few months and just lies to people about things they can't possibly understand, pretending that any of this works and that they're a leader in the space. Then their friends buy the same software — I know because recruiters keep calling to ask me if I'll help lead the efforts. Almost every large business in Melbourne is rushing to purchase our tooling, tools like Snowflake and Databricks, because the industry is pretending that any of this is more important than hiring competent people and treating them well. I could build something superior to this with an ancient laptop, an internet connection, and spreadsheets. It would take me a month tops.
I've known for a long time that I can't change things here. But in this moment, I realize that the organization values things that I don't value, and it's as simple as that. I could pretend to be neutral and say that my values aren't better, but you know what, my values are better. Having tested code is better. Having comprehensible logs is better. I'm wasting their money sitting around until December, which is unethical. I'm disrespecting myself waiting two more months for a measly Christmas break payout, which is unwise. I've even degraded team morale because I've convinced some of the engineers that things should be better, but not management, so now some of the engineers are upset. I'm a net negative for this team, except for that one time I saved them so much money that it continues to cover all three of our managers' salaries combined.
As an afterthought, the person who just informed us that we have no way to associate logs to their respective ingestion events adds:
"By the way, I think that there's a chance some of the logs don't actually report the right things. Like the ones that say Validated: True
are actually just hardcoded strings in the Lambda functions, and the people that wrote them may have meant to type in things like File Landed: True
but made mistakes."
I am dumbstruck. The other senior is laughing hysterically.
It is 11:30 AM in Melbourne, 9th October, 2024. The wind is a vortex of ghost-knives sending birds careening from the sky. I glance down at my tea, and it is liquid hatred. I take a sip and savor it.
"Hey, are you still there?", my pairing partner replies.
"Yeah. Yeah. Listen, I'm done. I'm out today."
"What? What about December?"
"I could get the entire terrible first draft of a whole book out by December if I wasn't wasting time on this."
"... Fair."
I briefly consider contacting my partner, but I know she'll support me. I could check in with my parents, but they'd just worry for no reason. I could chat with my co-founders, but they're just going to tell me to do what I need to do. I could sleep on it, but that would just be to give myself the illusion of responsibility even as I barrel towards wasting two more months to earn money that, thanks to five years of diligently navigating various Pain Zones, I don't even need.
I resign at 2:00 PM.
It is 3:00 PM in Melbourne, 9th October, 2024. I have called my director, who is highly competent, and explained why every engineer wants to quit, and finalized the paperwork. My last day is the 5th of November, 2024. My only job title is now director of my own consultancy, and in January my savings will start to tick down. I glance down at my tea, and it is tea. I take a sip and savor it.
Firstly, I gave a talk at GDG Melbourne which you can watch here. The audio quality is not great, so I forgive anyone who taps out. The comments are weird because I asked people to flip a coin and respond with "This guy is the next Steve Jobs" or "This guy seems like a real piece of work", which I only regret a little bit. I should not be allowed to run a business.
Secondly, I gave a webinar to US board members at the invitation of the Financial Times. Suffice it to say that while people are sincerely trying their best, our leaders are not even remotely equipped to handle the volume of people just outright lying to them about IT. Also apparently my psychotic blog does not disqualify me from Financial Times affiliation, which is wild, but is maybe a useful lesson that the world is desperate for sincerity even when it isn't dressed up as corporate maturity.