Alan's Lair

AI-assisted coding trumps vibe coding

Alan Franzoni — Thu, 19 Jun 2025 15:04:41 GMT

Being an AI evolutionist rather than a revolutionist, I have refrained from posting too much about AI - a lot of people are doing so anyway, and I very much doubt you’re missing my opinion.

But, I recently got back from Pycon Italy and I realized one thing: everybody uses AI at work. This includes me.

At the same time, in a professional software development context, nobody gives a f- about vibe coding.

The name is cool, but the idea of letting the AI do all the dirty work is just silly. And, despite the hype, this is very clear to all software professionals. AI provides a lot of value right now, but generating working software from natural language requirements seems like a worse version of the classic “just get a bunch of juniors to get this job done” disaster.

Properly using AI to enhance our coding is a deeply intellectual exercise and it’s high time we recognize it as such. Insisting on cherry-picked examples where “vibe coding created my app even if I don’t know how to program” is pretty pointless - for a professional programmer. I’ll let other people, who haven’t experienced the magic feeling of something you create coming to life, or have stopped doing so, rely on vibe coding. It’s great for prototyping, or for designers, or for examples.

I’ll keep using AI as one of the many tools I have, as a professional developer, to improve my work, just like compilers, linters, IDEs, new languages and libraries.

How to use AI to suck at user experience

Alan Franzoni — Sat, 26 Aug 2023 13:31:47 GMT

Or: the perils of blindly using AI to save some money.

IMPORTANT NOTE: while I'm pointing out some specific companies in this post, it's not meant to discredit them; instead, I hope to trigger a bit of reflection on how AI should (and should not) be used, and how it should not replace your product and user experience thinking.

I admit I was kind of waiting for it to happen; bad usage of AI was going to create some problem for me one day or another.

Full disclosure: I'm not an AI skeptic, I've got quite a bit of machine learning and data science experience, and I think the current LLM-oriented AI wave certainly has value, but I believe it's an evolution rather than the revolution that hype-riders claim it to be.

The facts

Some days ago I was having a short vacation in Lienz, Austria, a nice small city from where you can follow a bike path through the mountains to San Candido, Italy and then return by train (or vice-versa). I still recommend the trip!

We were going to take the train to San Candido at 8.50am, and we arrived at the train station parking at 8.30am, so we had quite a bit of time. But: you need to pay for the parking. No problem, right? Unless the parking meter, which looked quite modern, only accepted coins: no cards and no bills. We needed to pay a total of EUR 8,00 and hadn't got enough coins. So, what? Technology to the rescue, of course: we downloaded the official parking app, as advertised on the side of the parking meter.

parking meter Parkster ad

So, I duly scanned the QR code, which led me to the Parkster iOS app. I installed it and looked for the area code in order to pay:

parkster screenshot

I tried looking manually for the parking area; no luck. I gave location access to the app, and it worked flawlessly, the map was pointing precisely to our parking spot. And, while entering the code, I could see that some parking areas indeed existed as they were partially matching:

partial area matches

So, the app did seem to be working; it was just that our parking area wasn't showing up. By then, we were risking running out of time, so we scrambled to change some coins, and we barely made it to the train. If you have access to the app in your App Store, try it yourself - the behaviour is still the same as of August 26th, 2023.

Trying to fix the issue

But I had taken photos and screenshots, so, once I got back home, I sent an e-mail to the Lienz Town Hall, to Parkster customer service (the e-mail address was right there in the ad) and to Lienz Tourist Information (to their credit, Parkster is the only addressee that has answered so far), explaining the issue I had found with a fair amount of detail and proof:

1st support request

I got this response from Parkster:

answer 1

This is quite a generic answer, which is something that can happen with support requests; a "template answer" is sent at first to reduce operator burden, and a "real operator" gets involved only if the customer follows up. But the "you need to register" is either factually incorrect (I was using the app without registering) or misleading; so I followed up:

followup 1

and I got this:

answer 2

Now I was getting puzzled. Did they read my message? It appeared so, since the answer kind of made some sense. But the area code was available both in the screenshots from my first e-mail and, explicitly, in my followup question.

Then, I tried registering, and... the area codes would finally show up! During registration, I was asked "in which country you'd like to park", so I suspect that only related area codes are shown. Why? I don't know, but I don't actually care, there may be some reason. So, I had one more followup with the customer support just to explain that this "detail" isn't written anywhere and that this was quite a UX problem. But at this point I had some suspicion about my support agent; it seemed like he wanted to answer something to me, but he did not really care that I was communicating something which should have value to his own employer (I sincerely doubt I was the only one who had experienced such an issue, Austria is quite packed with tourists):

final interaction

that made me think "Am I really chatting with a human or are they using ChatGPT to answer?!?" - I just told you I have registered and now I can find the zone! And the other "live tickets" and invoice parts are totally unrelated!

On that last message, I spotted a different sender than before (I don't know why):

and, ThinkOwl indeed is Customer Service Software powered by AI, whatever that means.

Wrapping up

For the parking app

Some customer was trying to help you and point out a bug or a UX problem to you.

An answer like "Thank you for your message. Our development team has been informed and we'll consider possible changes to our system in the future" would have been both great and enough. And, the change could be something as stupid as adding "Please make sure to register for the correct country or zones won't be found" to the "not found" message.

AI is not really helping in this situation. It probably lets some departments meet some cost and answer time targets, but they'll miss the opportunity to learn about customer problems and improve their product and user experience.

Don't use AI this way. In fact, I expect AI to improve customer experience by empowering customer support representatives to work better and faster, answering FAQs quickly but letting them take more time to answer more complex - but legit - questions or to interact with development teams to actually fix real, underlying issues.

For the Lienz municipality

Product thinking goes end-to-end. Installing a coin-only parking meter in 2023 (or 2020, I can't know the exact date, but it looked quite recent) because you're relying on a third-party app can be a good idea, but, then, you need to test your user flow end-to-end. Your city relies a lot on tourists, and you don't want to piss them off!

How to do a bad comparison of cloud vs colocation

Alan Franzoni — Mon, 13 Mar 2023 21:49:05 GMT

There's this post by Ahrefs, an SEO company from Singapore, which is getting quite a lot of shares recently.

While nobody ever said that the cloud was cheap, it's a terrible comparison. Basically, the article enumerates what they have bought (a number of Dell servers + network equipment and colocation costs) and roughly translates those to AWS cloud services.

Some problems:

Apparently, all of Ahrefs servers lie in a single datacenter. Good luck with disaster recovery and availability.
The cloud is nice because, you know, it's a cloud. Scalability. Flexibility. On-demand resources. Nobody forces any customer to always rent 850 EC2 servers. Ahrefs could have used EC2 machines and cloud databases with autoscaling to pay less without getting performance problems. They could have used completely managed solutions (e.g. Lambda) to totally forget about the need to manage a server, and to get a dynamic pricing.
No operational cost is included. It's true, cloud engineers aren't cheap; but a few of them can automate and manage quite a large fleet of cloud resources. On the contrary, deployment and maintenance of such a large hardware fleet is probably more complex. What was the setup cost?

Don't do those comparisons. Choose to compare a scenario, instead, and put some numbers to estimate the risk of problems, the cost in case of downtime (which is quite easy to prevent on the cloud), and the availability of relevant resources (are you keeping your datacenter technicians on payroll even when they don't have any work to do?).

The cloud will almost inevitably come out more expensive than raw server/colocation costs, but quite often it's going to be cheaper than a lot of "enterprise" managed servers that I've seen in my life (like, some agency managing servers in a DC for a customer).

Towards the end of the article, Ahrefs acknowledges that there are tradeoffs for cloud vs on premise, so I think the first part of the article is even more insincere. They know what they're doing: it's a hybrid approach where the predictable workload is kept in a private DC, but probably there's an underlying cloud infrastructure that, if everything goes wrong (e.g. the DC burns), would be able to take 100% of the required workload. Quite a different story.

Password requirements: myths and madness

Alan Franzoni — Thu, 22 Dec 2022 20:41:14 GMT

More than 11 years have passed since the venerable XKCD Password Strength strip:

And yet this still happens nowadays on a variety of websites. Maybe it's even more frequent than 10 years ago:

The hilarious part for this website is that the embedded strength checker properly recognizes a good password; but then an additional policy is bolted on, just in case users were too happy. In this case, whitespace is not accepted, and only some non-alphanumeric chars are considered special.

I tried taking a look at those idiosyncratic password requirements, and for most policies I couldn't find a direct relation to any recommendation or compliance requirement. So, here I'm trying to dig into what the underlying reasons are for this folly.

Please use a mix of uppercase and lowercase letters, digits, special chars in your password

Usual explanation: for security reasons you must choose a strong password!

My verdict: Misunderstanding

The usual explanation is correct, but does not imply that a password should contain a lot of strange, hard-to-type and hard-to-remember characters. That's exactly what xkcd passwords are about: a longer password with plain ASCII lowercase letters can have the same entropy as a shorter password from a larger charset.
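To put rough numbers on that claim, here is a minimal sketch. It assumes every character is picked uniformly at random, which is true for generated passwords and optimistic for human-chosen ones; the charset sizes (95 printable ASCII characters, 26 lowercase letters) and the function names are my own illustration.

```python
import math


def entropy_bits(charset_size: int, length: int) -> float:
    """Entropy, in bits, of `length` symbols drawn uniformly at random
    from a set of `charset_size` symbols."""
    return length * math.log2(charset_size)


def length_for_bits(charset_size: int, target_bits: float) -> int:
    """Smallest length whose entropy reaches at least `target_bits`."""
    return math.ceil(target_bits / math.log2(charset_size))


# A short password using the whole printable ASCII range...
short_complex = entropy_bits(95, 8)
# ...and the number of plain lowercase letters needed to match it.
lowercase_needed = length_for_bits(26, short_complex)

print(f"8 printable-ASCII chars: {short_complex:.1f} bits")
print(f"lowercase letters needed to match: {lowercase_needed}")
print(f"{lowercase_needed} lowercase letters: "
      f"{entropy_bits(26, lowercase_needed):.1f} bits")
```

Under these assumptions, twelve random lowercase letters are already at least as strong as eight random characters from the full printable ASCII set, and they are much easier to type.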

If you host a web application: enforce a password's strength, not strange user-hostile policies. There are libraries for that (e.g. https://github.com/dropbox/zxcvbn) that work both in backend and frontend.

You cannot use whitespace, accented letters, hyphens, [...] in your password

Usual explanation: Very often there's no explanation for this, or it's something like "your password contains unsafe characters".

My verdict: No real motivation beyond, possibly, incompetence

This is a terrible one. You waste a lot of time generating a "strong password" that complies with a website's policy, and then you discover that you cannot use "-" in it. Or just whitespace for XKCD-style passwords.

There's no technical reason for restricting any character from appearing in any password. The usual approach for handling a password is:

take the password as a string, as the user typed it;
prepare it (see links at the end of the paragraph);
encode it with a specific, well-defined encoding (e.g. utf-8);
add a salt (probably your favourite library will require it in its input) and use a password-derivation algorithm (pbkdf2, Argon2, whatever) to get a "hashed" password, then save that salt+hash in your storage of choice.
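The steps above can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not a vetted implementation: the "prepare" step is reduced to plain NFC normalization (a full RFC 8265 profile does more), PBKDF2-HMAC-SHA256 stands in for whatever derivation algorithm you prefer, and the function names and the iteration count are my own choices.

```python
import hashlib
import hmac
import os
import unicodedata


def prepare(password: str) -> bytes:
    """Normalize the string, then encode it with a well-defined encoding.

    NFC normalization is a simplification of proper string preparation:
    it only makes sure that equivalent Unicode sequences yield the same bytes.
    """
    return unicodedata.normalize("NFC", password).encode("utf-8")


def hash_password(password: str, *, iterations: int = 600_000):
    """Return (salt, digest); both must be stored alongside the user record."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", prepare(password), salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    *, iterations: int = 600_000) -> bool:
    """Recompute the digest and compare it in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", prepare(password), salt, iterations)
    return hmac.compare_digest(digest, expected)


# Whitespace, hyphens and accented letters need no special treatment.
salt, digest = hash_password("correct horse-battery staplé")
print(verify_password("correct horse-battery staplé", salt, digest))
print(verify_password("something else", salt, digest))
```

Note that no character is ever rejected: the password is only normalized, encoded and fed to the derivation function as bytes.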

What I suspect is that improper escaping and/or encoding techniques are being used in a website imposing that restriction. It's a red flag.

Caveats:
In some cases there can be UX reasons for limiting the input charset, especially regarding special chars. Maybe you don't want people to use chars that cannot be found on some keyboards, because then, if people move across the world and use different computers, they're unable to enter their password. Or maybe the keyboard layout isn't yet set when the password is entered (example: full disk encryption password at boot) and you risk that what a user typed at install time is actually different from what they type at boot. Maybe newline or other control chars are a hassle.

If you host a web application: make sure you're correctly processing the string coming from the user, rather than adding arbitrary requirements that force the users back-and-forth from a password manager to your application.

EDIT:
Commenters on HN correctly noticed that I forgot a step when handling passwords, the so-called "string preparation". It doesn't really matter too much for what we're dealing with here, but here are some links:
https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
https://www.rfc-editor.org/rfc/rfc8265
http://unicode.org/reports/tr15/#Stabilized_Strings

So, it appears (Unicode is a terribly large and complicated standard, I may have misunderstood something) there could be good reasons to reject some input chars, since they may have no stable encoding under any normalization process. But that doesn't seem to be the case in the scenarios I've seen.

You cannot enter a password with fewer than X chars

EDIT: it seems that the message in this paragraph was widely misunderstood; I've changed the wording. Dec 23 2022, 17:15 CET

Usual explanation: security! I want your password to be veeeery strong!

My verdict: Makes sense, but could be better, and if X is too large, it's just annoying.

This is the exact opposite of the first point. Sometimes you get a 16 or 20 minimum length for the password. Why? If I'm autogenerating a password with all-possible-and-incredible-chars, eight should be enough. NIST standards even allow just 6 chars in some situations (check 5.1.1.1 Memorized Secret Authenticators).

This is connected with password strength. A short alphanumeric password is usually easy to bruteforce with adequate money and GPU resources.

But, still, a website should aim at password strength, not at password length. If I use four unicode code points out of the more-than-one-million available, by picking those outside the BMP, my password is probably very safe already, since there are (10**6)**4 possible passwords, many more than (127**8), which is more-or-less the number of possible ascii-printable passwords with 8 chars.
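The comparison above can be checked with plain integer arithmetic. The sketch below uses the round figures from the text, plus the exact number of code points outside the BMP (U+10000 through U+10FFFF); that exact count includes unassigned and private-use code points, so treat it as an upper bound. The variable names are my own.

```python
import math

# Code points outside the Basic Multilingual Plane: U+10000 .. U+10FFFF.
NON_BMP_CODE_POINTS = 0x110000 - 0x10000

rough_unicode = (10 ** 6) ** 4            # the round figure used in the text
exact_unicode = NON_BMP_CODE_POINTS ** 4  # four code points outside the BMP
ascii_eight = 127 ** 8                    # eight ASCII chars, as in the text

for label, space in [
    ("4 code points (round figure)", rough_unicode),
    ("4 code points outside the BMP", exact_unicode),
    ("8 ASCII chars", ascii_eight),
]:
    print(f"{label}: about {math.log2(space):.1f} bits")
```

Either way, the four-code-point password has a search space millions of times larger than the eight-character ASCII one, provided the code points are really chosen at random.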

If you host a web application: your password strength checking library is probably not so good at understanding strange Unicode code points, so you'd better go for a minimum length, but don't go wild: 8 or 10 with a password strength checker is the way to go. And: always clearly state the minimum length which is required for a password.

You cannot enter a password with more than X chars

Usual explanation: whew. It's too large we cannot handle it.

My verdict: Largely unmotivated unless X is quite large.

Once hashed, passwords take up the same amount of space. Restricting the maximum number of chars to a low number makes no sense and then forces people to use strange chars to get complexity right.

This is especially troublesome when the browser field is short, you can't see the password that you're pasting because it's masked, and you end up setting a password that's not the one you expect. Then, when trying to login, the "enter your password" field doesn't have the same limitations, and you cannot login because you're using the wrong (too-long) password.

Caveats:
An upper length limit for passwords ~~must~~ should exist to prevent DoS to an app, but should be rather high, like, not lower than 64 chars. Also, some algorithms that were (and maybe still are) quite common for password verification, like the Blowfish-based bcrypt, don't properly work beyond a certain length (bcrypt only considers the first 72 bytes of the input). Still, that's not the far-too-common 12 or 16 chars that can be found on plenty of websites.

EDIT:
A reader pointed out that properly-implemented password hashing algorithms aren't really subject to DoS. But, you may not be 100% sure about your implementation quality, and a longer password doesn't mean it's necessarily more secure.

If you host a web application: please allow passwords at least 64 chars long; even 256 should pose no DoS risk whatsoever. Also, make sure that the HTML field itself is larger than the allowed password size, so you can tell the user "password too long" if they're pasting into it.

You cannot paste into this field

Usual explanation: Security! I want you to actually write this and know what you're writing and write it twice in two boxes! Don't copy-paste from somewhere else!

My verdict: Largely unmotivated right now, and dangerous.

Somebody thinks that disallowing paste is a good idea so people actually need to write a password twice and they can't make typos. But this prevents using password managers, which are a Good Idea. This is a very old thing that some people thought was a good idea in the past, but I think it hasn't been considered ideal since... 2005?

If you host a web application: please allow pasting into password fields.

You cannot show or copy this field

Usual explanation: security! your password is For Your Hands Only. We cannot even trust your eyes.

My verdict: ~~Security through obscurity~~ bad idea.

Don't do this. While setting a password, there must be an icon to show it and make sure I'm not making mistakes. It won't be saved in plaintext anywhere else! This is especially true if you're not allowing pasting a password. Then there's no way to make sure I've written what I wanted to write! The default masking of the password is just designed to prevent shoulder surfing.

If you host a web application: please allow showing what I entered when setting or entering a password anyway. Otherwise I can keep making mistakes!

EDIT:
This entry was edited after some comments. I know that's not what "security through obscurity" means, it was a kind of joke about "obscuring" the password.

Your password has expired

Usual explanation: used to be widespread, and even required and recommended by widespread operating systems and applications.

My verdict: Old idea

This was once recommended, because a leak of the password database was considered possible, passwords were stored in plaintext or with weak hashes, and reuse was frequent. Nowadays, it's not recommended any more: unless there has been a known leak, if the password is properly salted and hashed, has a reasonable complexity, and hasn't been reused, there's no need for rotation.

Too many failures - your account is locked.

Usual explanation: we don't want your account to be bruteforced

My verdict: Bad idea especially if "too many" is low, like 3 or 5, and my password is complex.

The idea is that you "protect" a user by preventing their account from being bruteforced. But the CIA triad includes Availability, and you're basically opening a giant door for a DoS this way; just know somebody's username and you can lock their account. Who knows what it takes to unlock it.

Account locking could be used in 2FA contexts (requires the other factor to provoke a DoS - that's why in some cases you enter a wrong password and you're still asked for your second factor, and only at the end you fail the authentication if either was wrong), but it's still usually pointless to lock an account after a few attempts - they're too few to guess a password.

Caveats:
Some compliance policies require locking user accounts.

If you host a web application: consider using temporary locking with exponential, kind-of-stochastic backoff (like 10s wait the first time before being able to enter the password again, 22s the second...) after 10-20 attempts instead. If you can, it's probably useful to limit this measure to specific IPs that are attempting the bruteforce (but can be hard if it's a distributed attack).
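A minimal sketch of that kind of temporary locking follows. All the numbers are assumptions chosen to resemble the example above (10 free attempts, a 10 second base delay that doubles each time, a one hour cap, and a random factor between 0.9 and 1.3), and `lockout_delay` is a hypothetical helper, not part of any framework.

```python
import random


def lockout_delay(failed_attempts, *, free_attempts=10, base_seconds=10.0,
                  cap_seconds=3600.0, rng=None):
    """Seconds to wait before another password attempt is accepted.

    No delay is applied for the first `free_attempts` failures; after that
    the delay doubles at every further failure, up to `cap_seconds`, and is
    multiplied by a random factor so the wait is not exactly predictable.
    """
    if failed_attempts < free_attempts:
        return 0.0
    if rng is None:
        rng = random.Random()
    # Bound the exponent so the power never grows unreasonably large.
    doublings = min(failed_attempts - free_attempts, 20)
    delay = min(cap_seconds, base_seconds * 2 ** doublings)
    # The jitter is applied after the cap, so the result can exceed the
    # cap by up to 30%.
    return delay * rng.uniform(0.9, 1.3)


for attempts in (9, 10, 11, 12, 30):
    print(attempts, round(lockout_delay(attempts), 1))
```

The counter of failed attempts could be kept per account, per source IP, or both; that choice is exactly the tradeoff mentioned above and is left out of the sketch.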

Final thoughts

I really hope that passwordless login will make some steps forward in the coming years (e.g. https://passkeys.dev/, https://www.yubico.com/authentication-standards/fido2/). In the meantime, I'd love a "password complexity API" so that my browser (or password manager) can already generate the right password for a website.

References

https://www.auditboard.com/blog/nist-password-guidelines/
https://pages.nist.gov/800-63-3/sp800-63-3.html

Ransomware-resistant backups with duplicity and AWS S3

Alan Franzoni — Thu, 27 Jan 2022 19:50:24 GMT

Ransomware and backups

Article updated: February 5th, 2022

Ransomware is changing the security scenario. Once upon a time, attackers who entered your systems could pull some data to sell it somewhere; they could deface your website for kudos; but, unless they had some compelling reason (disgruntled former employees), they probably wouldn't destroy your data and your backups.

Now they would, though. Cryptocurrencies make it very easy to ask for a ransom and never get caught. Hence, cryptolocker-like malware is spreading. You're not safe just because "why should anybody hack into my little server?": if a widespread exploit is found, cybercriminals will perform a mass scan on the whole internet, exploit every single vulnerable system and implant malware on it, then ask for some amount of cryptocurrency to give your data back.

So: people should revise their own threat models. And your backup strategy is likely one of the first things you should revise. It's quite likely it was designed for a failure scenario (hardware failure, accidental deletion), not for an attack scenario. But even half-serious cybercriminals, once they get access to your server, will delete all backup sets they can access, so that their encrypt-your-data-and-ask-for-money threat actually works.

Once upon a time

You had backups on a tape library. A tape drive is usually append-only, and the backup is handled through a dedicated system so it was hard (if possible at all) and clunky to wipe an older backup from a compromised system. And tapes would get physically rotated - getting hold of an old tape was not feasible for a remote attacker, so at least you got an older backup which was safe.

Then approaches like bacula were common - the bacula server (where the backups are saved) would contact the bacula clients (systems to be backed up) and ask for data to be saved. The bacula server deleted old data using its own schedule; there was no way for the bacula client to tell it "wipe everything", so again, barring any bacula server exploit, the bacula server kept backups safe.

More recently, cloud approaches like rsync.net leverage ZFS snapshots to make remote data immutable - they can be removed only according to your retention schedule (details). BorgBase works in a similar way and has a good writeup about backup strategies. Such providers usually work well since it's their very job, and should be considered when choosing a backup strategy. I won't recommend any of them since it's outside the scope of this post (and I haven't tested all of them!); just remember to consider restore costs and speed as well as backup costs for those providers since, just like raw cold-storage providers (e.g. AWS Glacier, Backblaze B2, et cetera), they sometimes charge extra whenever you need to download your data. At the time of this writing BorgBase doesn't seem to charge for downloads, but they reserve the right to terminate your account if they see excessive usage, so the topic is definitely of concern for backup providers.

But, many modern backup systems don't take the ransomware scenario into account, since the backup software needs full access to a local or remote filesystem. Some actually try: restic offers rest-server, which has an append-only option to prevent malicious deletion of the server's content. But you need an additional system just to host the server.

The cloud to the rescue

So, here we'll see how to use the good old duplicity backup software to perform backups to AWS S3 in a ransomware-resistant way. You can use any cloud storage that supports properly fine-grained permissions (see later) and is supported by duplicity.

For the sake of this article, I suppose you've got access to two distinct machines: your workstation and your server, and you'd like to have your server backed up. You could actually employ a single machine to perform all the steps, but you'll need to make sure that your master access to AWS S3 is never compromised, otherwise say goodbye to your ransomware resistance.

So, do yourself a favour and use two separate machines.

Create your S3 bucket and credentials

These steps must be taken from your workstation

These examples deliberately avoid the aws cli, to prevent accidents with API key leakage; if you use the AWS S3 Console, and you've properly configured 2FA for your account, it's less likely you can be attacked that way.

Create the bucket

Enter the AWS Console, choose the S3 service, create a bucket (we'll call it sample-duplicity-backup for the sake of this article) in your preferred region (I will use eu-central-1), and use these settings:

Disable ACLs

Block public access

Enable bucket versioning

Enable object lock

Configure object lock

Then, select the bucket you just created, go to Properties, choose Object Lock, and enable Default Retention. For the purpose we have, Governance mode is OK; choose a 40 days default retention period, and save your changes.

Create an IAM user

Now you need to create a suitable IAM user to be used with duplicity. Let's call this sample-duplicity-user; it should only have programmatic access, and you should attach only this policy to such user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::sample-duplicity-backup",
                "arn:aws:s3:::sample-duplicity-backup/*"
            ]
        }
    ]
}
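What matters in this policy is what it leaves out: nothing in it allows deleting objects or versions, bypassing retention, or changing versioning. The sketch below is one way to double-check that before attaching a policy to the backup user. The list of "dangerous" actions is my own illustrative selection, not an exhaustive one, and the check only looks at Allow statements and their Action patterns (it ignores NotAction, Deny statements and conditions).

```python
import json
from fnmatch import fnmatchcase

# Actions that would let the holder of the backup credentials destroy or
# unlock existing backup versions. Illustrative selection, not exhaustive.
DANGEROUS_ACTIONS = (
    "s3:DeleteObject",
    "s3:DeleteObjectVersion",
    "s3:BypassGovernanceRetention",
    "s3:PutObjectRetention",
    "s3:PutBucketVersioning",
)

BACKUP_POLICY = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::sample-duplicity-backup",
                "arn:aws:s3:::sample-duplicity-backup/*"
            ]
        }
    ]
}
""")


def as_list(value):
    """IAM accepts either a single value or a list in most places."""
    return value if isinstance(value, list) else [value]


def dangerous_grants(policy):
    """Return the dangerous actions matched by any Allow statement."""
    found = set()
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in as_list(statement.get("Action", [])):
            for action in DANGEROUS_ACTIONS:
                # Action names are case-insensitive and may contain wildcards.
                if fnmatchcase(action.lower(), pattern.lower()):
                    found.add(action)
    return found


print(dangerous_grants(BACKUP_POLICY))
```

For the policy above the result is an empty set; a policy using a wildcard such as s3:* would be flagged.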

Download the access key ID and the secret access key for such user.

Install duplicity

The following steps must be performed on the server

Duplicity may be available in your distribution, but I suggest you pick a recent version (0.8.x) in order to make sure the s3 remote backend is properly supported - check the homepage for further info. Also make sure the boto3 library and gnupg are available on your system.

As an example, for recent Ubuntu versions (tested on 20.04):

Add the stable duplicity PPA
apt -y install duplicity python3-boto3 gnupg

Create a backup script

Put some data in /root/data for the purpose of this test.

Then, add the following snippet to a script; set the passphrase as you wish (duplicity encrypts the backups using symmetric cryptography - support for public key crypto is available but I haven't had great success with it so far), and fill in the AWS credentials you got from the previous step:

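#!/bin/bash -e
# Passphrase used by duplicity to encrypt the backup (symmetric encryption).
export PASSPHRASE="XYZXYZXYZ"
# Credentials of the restricted sample-duplicity-user IAM user.
export AWS_ACCESS_KEY_ID="XXXXXXX"
export AWS_SECRET_ACCESS_KEY="YYYYYYYYY"
# Incremental backup of /root/data, with a new full backup every 30 days.
/usr/bin/duplicity \
  --s3-european-buckets \
  --s3-use-new-style --asynchronous-upload -v 4 \
  incr --full-if-older-than 30D \
  /root/data \
  "boto3+s3://sample-duplicity-backup/data"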

This command will back up the contents of /root/data to your s3 bucket, using a data prefix.

One detail to note: --full-if-older-than 30D means: create an incremental backup since the previous one, or create a new full backup if more than 30 days have passed since our last full backup. It is essential that the number of days that we set here is smaller than the object lock days.

So, great! Now, from your workstation, you can check that your S3 bucket contains some data; check the Objects tab of your bucket, and enter the data directory:

If you like, you can run the script again. You'll notice that duplicity only adds new data to the s3 bucket, it never removes or changes existing files once they have been uploaded.

Attack scenario

Now a Bad Guy exploits a vulnerability and takes total, root control of your server. He gets all your data; you can't help that, it's compromised. He even gets access to the credentials for sample-duplicity-user, and, being Bad, after deleting your data directory he tries to delete the content of your backup bucket.

For the sake of simplicity, we'll configure an aws cli account (do it either on your workstation or on your server, it doesn't really make a difference. Remember to delete such credentials afterwards) using the same sample-duplicity-user credentials and pretend you're the badguy:

aws configure --profile badguy

Now, try acting as the Bad Guy:

$ aws --profile badguy s3 ls s3://sample-duplicity-backup/data/
2022-01-25 21:41:36       3228 duplicity-full-signatures.20220125T204132Z.sigtar.gpg
2022-01-25 21:41:37        267 duplicity-full.20220125T204132Z.manifest.gpg
2022-01-25 21:41:36      27298 duplicity-full.20220125T204132Z.vol1.difftar.gpg

He can read the backup. This is expected. But can he delete the backup?

$ aws --profile badguy s3 rm s3://sample-duplicity-backup/data/duplicity-full.20220125T204132Z.manifest.gpg
delete failed: s3://sample-duplicity-backup/data/duplicity-full.20220125T204132Z.manifest.gpg An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied

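The attacker cannot delete files: our IAM policy doesn't grant any delete permission and, even if it did, Object Lock would still prevent the locked versions from being permanently removed!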

But, there's still one thing that the attacker could do: overwrite existing files. In fact, AWS S3 Put actions (and permissions) don't make a distinction between "add an object" or "overwrite an existing object" (and it would be quite difficult and slow to implement in a distributed storage system):

$ aws --profile badguy s3 cp temp.txt s3://sample-duplicity-backup/data/duplicity-full.20220125T204132Z.manifest.gpg
upload: ./temp.txt to s3://sample-duplicity-backup/data/duplicity-full.20220125T204132Z.manifest.gpg

Ouch! This (apparently) succeeded:

But the reality is that the original data is still there: the attacker has uploaded a new file with same name as ours, but we can still retrieve the first version; Object Lock and Bucket Versioning work together. Check the "Versions" tab for our modified file:

Choose the older version and download it: it's the original file! So, if you get hit by ransomware, you can still retrieve all your original files. Accessing those in a programmatic way requires the s3:ListBucketVersions permission (check the API call as well) - we don't even have it for our backup IAM user.

Retrieving the original files

A full demonstration of how to retrieve the original files is trivial and beyond the scope of this article; but I'll leave some breadcrumbs here. With a user that has the ListBucketVersions permission, call this command:

$ aws s3api list-object-versions --bucket sample-duplicity-backup
{
    "Versions": [
        ...
        {
            "ETag": "\"500e2a10137f805dba21f4bb7bf3678a\"",
            "Size": 50,
            "StorageClass": "STANDARD",
            "Key": "data/duplicity-full.20220125T204132Z.manifest.gpg",
            "VersionId": "w_ZOOamRomLYpZO5Gc.Vy_nrl0cP6bWq",
            "IsLatest": true,
            "LastModified": "2022-01-25T20:50:15+00:00",
            "Owner": {
                "ID": "132149dd72a1af36909b73ae719ccba0096cd23aa62158308ef4b9619f3b63ed"
            }
        },
        {
            "ETag": "\"2dd325e03ac11eb6edaf0e9a7b177064\"",
            "Size": 267,
            "StorageClass": "STANDARD",
            "Key": "data/duplicity-full.20220125T204132Z.manifest.gpg",
            "VersionId": "ORkLIhukxVZKTeQruI9hPDWXUzAJoYgb",
            "IsLatest": false,
            "LastModified": "2022-01-25T20:41:37+00:00",
            "Owner": {
                "ID": "132149dd72a1af36909b73ae719ccba0096cd23aa62158308ef4b9619f3b63ed"
            }
        },
        ...
    ]
}

Then retrieve this file in its older version:

$ aws s3api get-object --bucket sample-duplicity-backup --key data/duplicity-full.20220125T204132Z.manifest.gpg duplicity-full.20220125T204132Z.manifest.gpg  --version-id "ORkLIhukxVZKTeQruI9hPDWXUzAJoYgb"
{
    "AcceptRanges": "bytes",
    "LastModified": "2022-01-25T20:41:37+00:00",
    "ContentLength": 267,
    "ETag": "\"2dd325e03ac11eb6edaf0e9a7b177064\"",
    "VersionId": "ORkLIhukxVZKTeQruI9hPDWXUzAJoYgb",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}
$ ls *.gpg
duplicity-full.20220125T204132Z.manifest.gpg

Repeat this for every file that actually needs it. In practice, it's quite likely that the attacker will have noticed the bucket is versioned and object locked, so he won't waste time overwriting all files.
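If many files were overwritten, picking the right version by hand gets tedious. Here is a minimal sketch that takes the parsed JSON printed by list-object-versions and maps each key to its oldest version. The function name is mine, and it relies on two assumptions: the oldest version of every key is the legitimate one, and the listing is complete (a large bucket may need pagination) with timestamps formatted as in the output above.

```python
from datetime import datetime
from typing import Dict


def oldest_version_ids(listing: dict) -> Dict[str, str]:
    """Map every object key to the VersionId of its oldest version.

    `listing` is the parsed JSON output of `aws s3api list-object-versions`.
    Assumption: the oldest version of each key is the original, untampered one.
    """
    oldest: Dict[str, dict] = {}
    for version in listing.get("Versions", []):
        key = version["Key"]
        modified = datetime.fromisoformat(version["LastModified"])
        if key not in oldest or modified < oldest[key]["modified"]:
            oldest[key] = {"modified": modified, "id": version["VersionId"]}
    return {key: entry["id"] for key, entry in oldest.items()}


# Trimmed-down sample with the two versions shown in the listing above
sample = {
    "Versions": [
        {
            "Key": "data/duplicity-full.20220125T204132Z.manifest.gpg",
            "VersionId": "w_ZOOamRomLYpZO5Gc.Vy_nrl0cP6bWq",
            "IsLatest": True,
            "LastModified": "2022-01-25T20:50:15+00:00",
        },
        {
            "Key": "data/duplicity-full.20220125T204132Z.manifest.gpg",
            "VersionId": "ORkLIhukxVZKTeQruI9hPDWXUzAJoYgb",
            "IsLatest": False,
            "LastModified": "2022-01-25T20:41:37+00:00",
        },
    ]
}

for key, version_id in oldest_version_ids(sample).items():
    print(f"{key} -> {version_id}")
```

For the sample, this selects version ORkLIhukxVZKTeQruI9hPDWXUzAJoYgb, the same one retrieved manually above; each pair can then be passed to get-object with --version-id.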

Wrapping up

I hope this article can be useful from an implementation standpoint, but the most important takeaway is: always think about your threat model every time you make a decision. Reason in terms of what an attacker could do after entering your system; security is never binary.

And, please: don't run your applications as root, but run your backup script as root, and make sure its ownership and permissions are properly set, like -rwx------ 1 root root 365 Dec 9 21:52 duplicity_run. It's great to have a ransomware resistant backup, but you shouldn't make things too easy for an attacker!

An exercise

An attacker enters your server, but he realizes you're using object lock and bucket versioning. He still really wants to find a way to get some money from you, since he knows your wallets are full of XMR. Hypothesize how he could proceed, then go on and think about what you could do to prevent damage to your system and to your backups.

To my readers: since this is sort of security-related and a topic with a potentially high impact, please let me know if I got anything wrong. Contact me by commenting or directly via e-mail.

Updates & footnotes

I received a decent amount of feedback about this article. I think this means that the topic is at least interesting. Thanks to the HN crowd as usual for commenting.

Even though this should be a basic functionality, it seems incredibly hard to get done right unless you have a dedicated SRE team.

So, some additional, interesting points (I meant for some of those to be covered by my 'exercise' paragraph, but I'm now jotting them down):

You should remember to periodically verify your backups, usually by restoring them - this is true in general for backups, not just for ransomware resistant ones. For the ransomware situation, you must make sure to verify your backups more often than their expiration period on S3, otherwise an attacker could a) tamper with your backup script, making it ineffective, and b) wait for the governance lock time to expire, and hope old data gets deleted by some schedule (they can't actually delete data because they don't have such a right). So, make sure you verify backups before deleting old data.
Use a proper account for backups; the risk, sometimes, is that backups get deleted/attacked because of accidental exposure of the backup account. I recommend using a separate AWS account with highly restricted IAM users for the backup scenario; you can even replicate buckets between accounts.
Somebody stealing your S3 credentials could just use them for their own purposes and save arbitrary data there, with you incurring the costs. But you don't want a hard limit after which S3 stops working (I'm not even sure it's possible), otherwise you'll enable the attack to escalate into a ransomware. So I recommend setting a proper budget alert for your AWS account and maybe an alert for excessive S3 data usage.
Some people report that duplicity, when used with a great amount of data, can start using an excessive amount of memory and ultimately gets killed because the server runs out of RAM. I can neither confirm nor deny this; just make sure, again, to monitor your backup job - so you get notified if it fails - and to periodically verify your backups.
If you've got a lot of data and not a lot of bandwidth - hence the initial backup may take you a lot of time - make sure you pick a good, reliable provider that won't be gone in one or two years. That's the reason I used S3: I suppose it will still be there in 10 or 20 years.

Log4j haters: just STFU

Alan Franzoni — Thu, 16 Dec 2021 21:25:01 GMT

I think the behaviour of many people towards log4j developers and towards the project is simply ridiculous. I understand the memes; it's the internet, after all. But I can read posts and tweets by many IT professionals - developers, managers, security engineers - that treat the log4j project and the people that work on it as absolute shit.

This library has been used by (probably) thousands of projects for ~20 years, and it's available for free. The impact of this vulnerability is a testament to Log4j's success.

So, my dear haters, I need to inform you that Log4j comes with a license that says "You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License". You shouldn't blame Log4j developers; you should blame all the users and developers who chose it and used it, for free. Because they accepted that license. And you should blame all those people that didn't realize the first fix was incomplete. Log4j developers had absolutely no obligation.

If I ever release some open source project that achieves the same success as Log4j, and, at any time, gets a similarly dangerous vulnerability, I'll make sure not to release the hotfix for free. You'll be able to get it with a special, one-time, paid-for commercial license. The community won't love me anymore, but I'll probably be able to retire in some nice place with all the money that I'll get for sure (because you won't rely on a patch from a random stranger on the internet, right? You want the fix from me, and you don't want to check what's in it).

P.S.

I've read very few articles about how bad the disclosure process for Log4shell was. The Apache Software Foundation had less than three weeks to patch before the disclosure went completely public (Google Project Zero usually has a 90-day deadline), and there are articles indicating that some vendors detected active exploitation just a few days after the first private report. I'd speculate somebody tried to get rich here by selling a very interesting 0-day.

Consistent Hashing for Dummies

Alan Franzoni — Thu, 28 Oct 2021 12:28:02 GMT

Today I'll discuss an interesting concept: consistent hashing. It's a widely employed technique to properly perform sharding in distributed storage systems. I'm not aiming at a rigorous explanation (please don't use the raw snippets I provide in production code!), but I hope I can make the concept simple enough.

The problem

What problem are we aiming to solve?
Let's suppose we need to handle more data than it's possible to store in a single server. What's the problem? Just create a number of shards and distribute the data!

But then clients need to know where to connect in order to store or retrieve a value. So we pick a primary key and feed it into a function, a hashing function in fact, since it maps our data to a fixed size output; this function will tell us where to connect, by returning a shard index so that 0 <= shard_index < shard_count, e.g.:

def get_shard(primary_key: int, shard_count: int) -> int:
    ...  # must return a shard_index so that 0 <= shard_index < shard_count

Let's suppose our primary key is a 64-bit unsigned integer, and that we want to create 10 shards. Our first thought could be: let's split the keyspace in ten consecutive parts of (almost) identical size, check where our primary key belongs, and return the position of such part as the shard index, that is:

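MAX_UINT64 = 2**64 - 1
def get_shard_linear(primary_key: int, shard_count: int) -> int:
    # The keyspace holds MAX_UINT64 + 1 values (0 is included). We use ceil division:
    # we don't want the shard size to be accidentally too small, otherwise the
    # largest keys would get an index equal to shard_count
    shard_size = -(-(MAX_UINT64 + 1) // shard_count)
    return primary_key // shard_size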

This would work. But there's a problem:
What happens if there's a lot of data with a primary key around a certain value, and very little elsewhere? We'll have one or two shards doing far too much work, while the others sit idle.

Modulo to the rescue! What if we just do primary_key modulo shard_count?

def get_shard_mod(primary_key: int, shard_count: int):
    return primary_key % shard_count

Looks nice! Unless the primary keys are sorted modulo-wise, it will probably distribute our load evenly.
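To make the difference concrete, here's a small sketch comparing the load produced by the two functions. The workload is hypothetical (10,000 consecutive IDs, as an auto-increment column could produce), and both functions are restated so the snippet runs on its own.

```python
from collections import Counter

KEYSPACE_SIZE = 2**64  # number of possible 64-bit unsigned keys


def get_shard_linear(primary_key: int, shard_count: int) -> int:
    shard_size = -(-KEYSPACE_SIZE // shard_count)  # ceil division
    return primary_key // shard_size


def get_shard_mod(primary_key: int, shard_count: int) -> int:
    return primary_key % shard_count


# Hypothetical workload: keys clustered in a tiny part of the keyspace
keys = range(1_000_000, 1_010_000)

linear_load = Counter(get_shard_linear(k, 10) for k in keys)
mod_load = Counter(get_shard_mod(k, 10) for k in keys)

print("linear:", dict(linear_load))
print("modulo:", dict(sorted(mod_load.items())))
```

With the linear split all 10,000 keys land on shard 0, while the modulo split gives each of the ten shards exactly 1,000 keys.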

But then, what happens if we discover that our system gets overloaded, and we'd like to add another shard? Our indexes will change, and we'll need to move some data around. But how much data, and which?

With the linear sharding, the shard size will shrink, we'll need to recalculate the shard index, and probably move a lot of objects around for most shards; many objects that were towards the end of shard 0 will be shifted to shard 1, and so on; that's quite a lot of work:

ten_shard_size = -(-MAX_UINT64 // 10)
for n in range(ten_shard_size-3, ten_shard_size):
    print("10-shard: {} -> 11-shard: {}".format(get_shard_linear(n, 10), get_shard_linear(n, 11)))

for n in range(ten_shard_size*2-3, ten_shard_size*2):
    print("10-shard: {} -> 11-shard: {}".format(get_shard_linear(n, 10), get_shard_linear(n, 11)))

Remember, when using ten shards, indexes go from 0 to 9; when using eleven, indexes go from 0 to 10.

Output:

10-shard: 0 -> 11-shard: 1
10-shard: 0 -> 11-shard: 1
10-shard: 0 -> 11-shard: 1
10-shard: 1 -> 11-shard: 2
10-shard: 1 -> 11-shard: 2
10-shard: 1 -> 11-shard: 2

But even the modulo sharding won't help us: the shard index will change for most values.

for n in range(20, 30):
    print("10-shard: {} -> 11-shard: {}".format(get_shard_mod(n, 10), get_shard_mod(n, 11)))

Output:

10-shard: 0 -> 11-shard: 9
10-shard: 1 -> 11-shard: 10
10-shard: 2 -> 11-shard: 0
10-shard: 3 -> 11-shard: 1
10-shard: 4 -> 11-shard: 2
10-shard: 5 -> 11-shard: 3
10-shard: 6 -> 11-shard: 4
10-shard: 7 -> 11-shard: 5
10-shard: 8 -> 11-shard: 6
10-shard: 9 -> 11-shard: 7
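How bad is "most values"? A quick way to quantify it is to count the keys whose index changes over one full cycle of 110 keys (110 being the least common multiple of 10 and 11). This is just an illustrative sketch; the modulo function is restated so it runs standalone.

```python
def get_shard_mod(primary_key: int, shard_count: int) -> int:
    return primary_key % shard_count


# 110 = lcm(10, 11): after that many keys the pattern of index pairs repeats
keys = range(110)
moved = sum(1 for k in keys if get_shard_mod(k, 10) != get_shard_mod(k, 11))

print(f"{moved} of {len(keys)} keys change shard ({moved / len(keys):.1%})")
```

Only keys 0 to 9 keep their index, so 100 keys out of 110 (about 91%) would have to be moved. Ideally, adding an eleventh shard should move only about 1/11 of the data.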

The solution

So, the properties we'd like to get from our ideal hashing function are balance - objects should be distributed evenly across our shards - and monotonicity - if a shard is added, objects should flow only from existing shards to the new one; there should be no need for internal reshuffling.

And, guess what? That's exactly what consistent hashing does!

A simple implementation of this algorithm goes like this:

# Python implementation by Peter Lithammer
def get_shard_lamping_veach(primary_key: int, shard_count: int) -> int:
    b, j = -1, 0.0

    if shard_count < 1:
        raise ValueError(
            f"'shard_count' must be a positive number, got {shard_count}"
        )

    # b is the last candidate index that fits, j is the next "jump" destination
    while j < shard_count:
        b = int(j)
        # 64-bit linear congruential generator, seeded with the key itself
        primary_key = ((primary_key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF
        j = float(b + 1) * (float(1 << 31) / float((primary_key >> 33) + 1))

    return int(b)

See it in action:

import random
random.seed(1024910)
moved_to_new = 0
for x in range(0, 10000):
    n = random.randrange(0, 2**64)
    ten_shard = get_shard_lamping_veach(n, 10)
    eleven_shard = get_shard_lamping_veach(n, 11)
    if ten_shard != eleven_shard:
        if eleven_shard != 10:
            raise ValueError("object flows to non-new index")
        moved_to_new += 1

print(f"{moved_to_new} objects changed index to 10")

Output:

898 objects changed index to 10

We picked 10,000 random indexes. As you can see, there were no objects whose shard index changed to a value different than 10, the newly-added shard index, and a reasonable amount of objects (quite close to 1/11 of 10,000 objects, in fact) were moved to the new shard index.

Explaining how this algorithm works is beyond the scope of this post; take a look at the last paper in the references if you're interested! But I hope you now understand what consistent hashing does: a consistent hashing function maps its input to evenly distributed outputs, and if the number of shards changes slightly, the output location changes only slightly. I haven't tested how the Lamping-Veach algorithm behaves if you wildly modify the number of shards (e.g. go from 10 to 20).
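For readers who want to try the 10-to-20 experiment, here's a minimal sketch. It restates the same function under a shorter name (jump_hash is my naming, as are resharding_report and the seed), and counts where randomly picked keys end up. Since the sequence of jumps does not depend on the shard count, keys should only ever stay put or land on one of the new indexes.

```python
import random


def jump_hash(key: int, shard_count: int) -> int:
    """Lamping-Veach jump consistent hash, same logic as the snippet above."""
    if shard_count < 1:
        raise ValueError(f"'shard_count' must be a positive number, got {shard_count}")
    b, j = -1, 0.0
    while j < shard_count:
        b = int(j)
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = float(b + 1) * (float(1 << 31) / float((key >> 33) + 1))
    return b


def resharding_report(old_count: int, new_count: int, samples: int = 10_000, seed: int = 42):
    """Count keys that stay, move to a new shard, or move between old shards."""
    rng = random.Random(seed)
    stayed = moved_to_new = moved_to_old = 0
    for _ in range(samples):
        key = rng.randrange(0, 2**64)
        old, new = jump_hash(key, old_count), jump_hash(key, new_count)
        if old == new:
            stayed += 1
        elif new >= old_count:
            moved_to_new += 1
        else:
            moved_to_old += 1  # this would be a monotonicity violation
    return stayed, moved_to_new, moved_to_old


stayed, to_new, to_old = resharding_report(10, 20)
print(f"stayed: {stayed}, moved to shards 10-19: {to_new}, moved between old shards: {to_old}")
```

If the function is balanced, roughly half of the keys should move, since the new shards should end up holding half of the data, and none should move between the old shards.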

References

Source for the Python implementation for Lamping-Veach consistent hashing
Karger, Lehman, Leighton, et al: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. This is the original paper which introduced consistent hashing to the world. The original algorithm is somewhat more complex but exhibits additional properties, since it was originally designed for situations where each client could have a different view of what working shards were.
Lamping, Veach: A Fast, Minimal Memory, Consistent Hash Algorithm. This paper introduces the original C version of the aforementioned consistent hashing algorithm, along with an explanation of how it works.

The curse of the downvote

Alan Franzoni — Sat, 21 Sep 2019 12:07:46 GMT

I don't have strong opinions on Facebook - I'm not even a user anymore - but I think that the "like/dislike" mania is going a bit too far. I read yesterday that an engineer was fired from Facebook for having a YouTube channel, but that's beyond the scope of my post today. In one of his videos, he says that FB culture is driven by likes - you need to be popular, not to be good.

Well, I actually think that may be appropriate for Facebook. Eat your own dogfood - isn't that one of the golden rules for just any product?

But, what I hate nowadays is the curse of the downvote (or the upvote). It's everywhere: on Facebook, on Reddit, even on Hacker News - you get downvoted, then probably your comment is hidden from most people. The most upvoted comment (or article) gets more impressions.

But what does an upvote, or a like, or a share, or a retweet, or a downvote, or a flag, actually mean? I think we've lost the context.

When I was still on Facebook, and some of my friends shared a hoax or just some falsehood, I was usually quick to point that out - with due references. Then they just told me "hey, I shared the article. I'm not the one who wrote that. I may not even agree with that". So... what does "share" mean?

I think we need clearer meanings for our actions. What's an upvote (or a retweet or a reshare)? Does it mean "I agree"? Does it mean "I have reasons to believe the poster is factually correct"? Does it mean "I like it for no particular reason"?

What's a downvote? Does it mean "I think the poster is factually incorrect"? Or "I don't like this opinion"?

What's a flag (on HN) or a report on other sites? Does it mean "The poster suggests something illegal"? Does it mean "I think it's a hoax, fake news, just trolling"? Or does it just mean "I wouldn't like to see it there, throw it away"? What does that REALLY imply?

Yes, maybe that's complex. "Upvote" or "Downvote" seems easier. But that's not just our opinion anymore. Upvotes and downvotes do shape conversations and discussions, and using them unwisely just perpetuates echo chambers and unhealthy ideological silos.

I think we can - and we should - do better. Much better.

Maybe we should show comments / articles in a random fashion, and let people "grade" them by quality of the opinion and/or correctness. And only after a while, when commenting stops, we should create a "top ten" for our comments; or something like that; we should make sure that more than one idea gets exposure, rather than going from the start with a "winner takes it all" mentality. It's not good for anybody.


Machine Learning: a sound primer

Alan Franzoni — Wed, 31 Jul 2019 08:40:54 GMT

I see many people who would like to take a glimpse at machine learning, and try to understand a bit how it works. Very often, they can either get pre-baked examples with very specific (and possibly too advanced) approaches - like deep learning - or math-oriented explanations that can be dry or just uninteresting.

I recently discovered a rather famous free textbook that I hadn't touched before: An Introduction to Statistical Learning. As you may infer from the non-glamorous title, that's a book that doesn't try to sell you something fancy about machine learning. It's quite a practical and not math-heavy introduction to the most useful machine learning topics, which will lead the reader to develop an intuition for what ML methods do. SPOILER: neural networks aren't covered! So, if you're just running after the hype, that's not the book for you.

The only real drawback of the original book is that most examples and demos are coded in R. I don't especially like the language, as it is highly specialized and, most probably, you'll need to know another language besides it for general-purpose processing.

So, I'm happy to link a couple of repositories that offer most examples from the book, but coded in Python; those should be more accessible to most people, as the language is very widespread:

https://github.com/tdpetrou/Machine-Learning-Books-With-Python/tree/master/Introduction%20to%20Statistical%20Learning

https://github.com/JWarmenhoven/ISLR-python

Happy StatLearning!

EDIT:

There's also a MOOC covering most of the topics from the book, by the same original authors.


Standalone, single-file, editable Python scripts WITH DEPENDENCIES

Alan Franzoni — Tue, 19 Feb 2019 20:58:00 GMT

How badly did I want something like this?

The problem: Python for scripting

Besides programming and data science, I find Python to be a very useful glue language; I think it's a great shell replacement when bash/zsh scripts get too complex, but there's one caveat: as long as you can work with its standard library, you're in the sweet spot. As soon as you'd like to use an external dependency, things get harder: if you don't want to contaminate your system with external dependencies, you'll either a) have to hope that your system packages a proper version of that library or b) start needing virtualenv and so on.

Both options are ok for manual development, a bit less ok if you're willing to deliver such scripts to multiple servers for automating some kind of process.

For sure, there are many options to fully package a Python executable - PyInstaller comes to mind, but others exist. But then you've got a kind of "build process" for your script, and you cannot edit it directly on a server. And I find that, very often, for internal tasks and scripts, my process is exactly that: I do edit the script on the server, then, when I get it right, I copy it to my version control system and deliver it to other machines. Yes, I wouldn't do the same for "real" software, but as I said, those are often internal scripts, used for reporting, cron jobs, and other small automated tasks.

The solution: editable python scripts with isolated dependencies

So what? That's what I baked. Not a perfect solution, but a decent one. Just have python and pip on your system, add a REQUIREMENTS string (equivalent to the content from requirements.txt), then import everything.

This will install the dependencies in a separate location inside a temporary directory at first use, then reuse them when necessary.

So: just copy & paste the following snippets, edit the two USER SERVICEABLE sections, then start writing your desired code at the bottom. The snippet here includes an example of how to run requests, so you can just delete the requirements, imports and requests call at the bottom if you don't need that.

#!/usr/bin/python3
import os
import sys
from tempfile import gettempdir, NamedTemporaryFile
import hashlib

# USER SERVICEABLE: paste here your requirements.txt
# the recommendation is to create a development virtualenv,
# install the deps with pip inside it, then do a `pip freeze`
# and paste the output here
REQUIREMENTS = """
certifi==2018.11.29
chardet==3.0.4
idna==2.8
requests==2.21.0
urllib3==1.24.1
"""
# USER SERVICEABLE end

def add_custom_site_packages_directory(raise_if_failure=True):
    digest = hashlib.sha256(REQUIREMENTS.encode("utf8")).hexdigest()
    dep_root = os.path.join(gettempdir(), "pyallinone_{}".format(digest))
    os.makedirs(dep_root, exist_ok=True)

    for dirpath, dirnames, filenames in os.walk(dep_root):
        if dirpath.endswith(os.path.sep + "site-packages"):
            # that's our dir!
            sys.path.insert(0, os.path.abspath(dirpath))
            return dep_root

    if raise_if_failure:
        raise ValueError("could not find our site-packages dir")

    return dep_root

dep_root = add_custom_site_packages_directory(False)

deps_installed = False

while True:
    try:
        # USER SERVICEABLE: import all your required deps in this block! and keep the break at the end!
        import requests
        # USER SERVICEABLE end

        break
    except ImportError:
        if deps_installed:
            raise ValueError("Something was broken, could not install dependencies")
        # pip offers no supported in-process API, so we run it as a subprocess
        # of the current interpreter instead of importing its main() function
        import subprocess

        with NamedTemporaryFile() as req:
            req.write(REQUIREMENTS.encode("utf-8"))
            req.flush()
            subprocess.check_call([
                sys.executable, "-m", "pip", "install",
                "--prefix", dep_root, "--upgrade", "--no-cache-dir", "--no-deps",
                "-r", req.name,
            ])

        add_custom_site_packages_directory()
        deps_installed = True

# HERE you can start writing the actual code of your script

r = requests.get("https://www.google.com")
print(r.status_code)

How does this work?

It creates a subdir in the directory where temporary files are held on your filesystem, and downloads the dependencies there using pip. The subdir name is derived from a hash of your requirements - so, if your requirements change, a new directory is used.
Then, it adds the site-packages directory found there to your sys.path, allowing Python to find modules and packages in it.
When you restart the script, it first tries to find the libraries that were previously downloaded, and only if that fails does it download the libraries again.

CAVEATS:

Of course the target system must have internet access, at least to pypi or github or other vcs (depending on your requirements format), and python and pip must be installed.
If your requirements change, nothing deletes the old files in your temp directory. But that's usually swept at system boot or by cronjobs, so it's not a real problem. BUT: if your cronjob only partially sweeps files in the subdir, it could break something (check if anything like that exists on your system; I can remember some older Redhat/Centos doing that).
If your packages have binary dependencies and/or require to build extensions, you still need the shared libs (for runtime) AND the proper header files / -dev packages. There's no silver bullet for that in this recipe.

Alternative approaches:

If you already use uv and you assume it's already installed on your systems, there's a great article here on how to use uv for python scripts with dependencies


Application authors: please don't force users into your language or packaging details

Alan Franzoni — Thu, 10 Jan 2019 17:10:00 GMT

This story has been boiling in my head for a long time; today I chose to (finally) publish it.

Long story short: in order to use a certain application, I should not need to understand how to use the language or its packaging ecosystem. Delivery and distribution is a relevant part of your app.

This does not apply to libraries, frameworks, or tools that are highly contextual for a certain language/environment ecosystem, and would be used only by a developer, in any case.

What do I mean? I'll start with an example. There's nothing special about it; it's just the latest case I ran into: b2, the command line tool to access Backblaze B2 backup repositories. Its installation instructions rely on pip, which raises a few questions:

Q: What is pip?
A: Python's package manager. It makes the b2 library available, and that library ships the so-called b2 script, which exposes the b2 CLI executable.

Q: Is such a command safe to use?
A: Possibly safe, but it will install b2 within your global/per-user Python dependencies, and may alter or install additional packages as dependencies, and such deps may be picked up by other, unrelated software in the system.

Q: Will it work in all situations?
A: It may require root to run correctly.

Should every user of Backblaze B2 CLI know how to code in Python and understand how the Python environment works? I think that should not be necessary.

b2 is just an example, other famous tools - awscli and fpm come to my mind - just follow suit.

If the tool you're downloading is written in C or C++, you'll probably expect to find a compiled binary, at least for some OSes and architectures; the same applies to Go, which makes binary creation for multiple platforms exceptionally easy. You should strive to provide some kind of equivalent piece of software for your users.

But, even for tools written in Python or JavaScript (they look to be the ones that suffer most from the problem I'm describing), in most situations, the "right thing" to do is to provide a standalone binary, regardless of how it's written. You can provide standalone binaries for some archs (this is what borg does - it's a great backup utility written in Python), or you can rely on native or add-on packaging (additional repositories for Linux, choco for Windows, homebrew for Mac), like httpie. And, when you create such binary or package, you should make sure it's totally standalone (i.e. it doesn't alter global system state).

Why? Because leveraging packaging tools that were originally meant for developers of a language is brittle and risky, and can be hard to understand. You may end up compromising the system integrity if the wrong dependency slips into globally installed packages - this is especially true for Python, which is often used by many, many system tools. And, if all third-party apps did it that way, we could have continuous breakages. You can't assume yours is the only application installed in a certain system!

And, if you're writing an app, you can have full control of your execution environment and dependencies. You can choose a single Python or NodeJS version, and all of its dependencies. You don't need to support tons of variations!

In some situations, a CLI evolved out of a library. This is the case for pygments, a very nice syntax highlighting library. It's meant to be used as a library from inside python projects, but it also exposes a widely used pygmentize binary that is employed by many other tools. I'd love to have that CLI tool available as an independent, standalone package!

Final considerations:

As a software author, think whether you're building a library, a framework, or an application. If you're writing an application, think how you can deliver it to your final users, without the need for them to understand how you've written it.
If you're a maintainer for a package, think about what you're maintaining. If something is both a library and a command line tool, consider creating multiple (possibly independent) packages.

In the meantime:
If you find a Python CLI tool that you'd like to use, it uses the pip way, and you don't want to mess with system dependencies, I recommend pipsi. I don't know if something similar exists for Node or other environments.

FAQ:
Q: But I'm a solo open source developer! Packaging this way would steal too much of my time!
A: That's totally fine. Just make sure you don't suggest dangerous things to people. If you're using a developer-only approach, just state that as your target; but the tools I've written about above (b2 and awscli) certainly are of a different class.

Q: Any suggestion for easy packaging?
A: I find homebrew very nice for Mac packaging and distribution. For Linux, I use fpm with docker - I've actually created a specific integration project fpm-within-docker. For Windows... no idea! I know about chocolatey and that's about it.


Misaligned Expectations: investigating the expectations gap

Alan Franzoni — Thu, 05 Jul 2018 21:12:20 GMT

As some of my followers already know, I'm enrolled in the great Master's program at Georgia Tech, the OMSCS.

As a part of my studies, I'm doing some research to investigate the expectations gap between the higher education and the industry sectors; why does the university teach students this way? What do students expect? And what do employers and professionals expect?

Help us by answering a small survey, or just subscribe if you'd like to receive the results when the research is completed:

https://www.misalignedtech.com/

SCP taming: stop local silliness

Alan Franzoni — Thu, 26 Apr 2018 10:38:49 GMT

Every ~~day~~ now and then, I get an scp command wrong. Scp is designed after commands like rcp and works totally fine for local-to-local file copy.

While this can (or could) be useful in some contexts, it's not what I like to do these days; very often, if neither host is remote, it's a typo on my part, and it results in spurious "somebody@host.example.com" files in various directories on my disk.

So, what do I do? I put a small script like this in my PATH, before the actual scp executable, in order to pre-check arguments and prevent accidental local-only copies.

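#!/bin/bash
# Wrapper around scp: refuse to run unless at least one argument
# looks like a remote spec (i.e. it contains a colon).
ORIG="/usr/bin/scp"

for var in "$@"
do
    if [[ $var = *":"* ]]; then
        # Found a remote spec: hand over to the real scp, keeping its exit code
        exec "$ORIG" "$@"
    fi
done

echo "ERROR: Missing colon. You need to pass at least one remote host specifier in source or dest" >&2
exit 1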
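The rule itself is tiny, and it can be handy to try it in isolation. Here's a sketch of the same check in Python; has_remote_spec is a name I made up for illustration, and just like the bash version it treats any argument containing a colon as a remote specifier (so an option value with a colon in it would also pass).

```python
from typing import Sequence


def has_remote_spec(args: Sequence[str]) -> bool:
    """Return True if at least one argument contains a colon, as in host:path."""
    return any(":" in arg for arg in args)


# Forgot the trailing colon: scp would silently create a local file
print(has_remote_spec(["notes.txt", "somebody@host.example.com"]))   # False
# Proper remote destination
print(has_remote_spec(["notes.txt", "somebody@host.example.com:"]))  # True
```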

Productivity, the office, and the open floor plan

Alan Franzoni — Wed, 18 Apr 2018 12:43:37 GMT

UPDATE: this article was written at a time when remote work wasn't so widespread. After having worked remotely myself, I don't think that the office is so critical anymore, even though I believe it's still useful to have a nice office somewhere where you can meet people in person, every now and then. But a very nice office is even more important nowadays; if your office sucks, there's zero chance that people will want to go there!

There's one pattern that, nowadays, I find amusing; the productivity mantra is repeated everywhere. Everybody wants to get more productive, every company is trying to make their employees more productive. Robotics, AI: everything calls for it.

From Wikipedia:

Productivity describes various measures of the efficiency of production. A productivity measure is expressed as the ratio of output to inputs used in a production process, i.e. output per unit of input

So: productivity is a ratio.

That's a great idea! Let's do more work in less time, and let's use the rest of our time in other activities - be it study, research, exercise, leisure time, or whatever.

Or, let's do more work in the same time as before! That's a good idea as well; we'll achieve more, and maybe, if our company recognizes such additional value, get paid more.

But then. Open floor plans. Shared desks.

How can those things go together?

Focusing on software development/tech firms, there seems to be widespread discontent around open floor plans, and yet the number of companies adopting them seems to be growing - I don't have exact data, but it seems to be a new trend. Traditional corporations that want to start behaving like startups start ditching private offices. The usual reason: an open office workspace promotes collaboration. Yes, maybe it does. But do all the employees in a company just collaborate with each other all the time?

Most probably, they won't. Most probably, they'll need plenty of time to focus and do deep work. The risk is to overoptimize for a single aim, and forget the whole picture.

More often than not, an open office plan is a marketing synonym for "let's cram as many people as we can, quite at random, in the smallest possible space".

Incidentally, I think that open office workspaces are one of the reasons^[1] that actually make 100% remote work positions effective: the productivity drop of open workspaces is larger than the drop caused by collaboration/communication difficulties in remote work situations.

Personally, I think that a good office trumps any other work environment/style in terms of productivity. Of course, life is a multi-objective optimization; can you find all the employees you need around a specific location? Are you paying high enough for those employees to live comfortably around that office? If any of those answers is not a definite yes, hybrid or full remote makes a lot of sense!

But, how should a good office be organized? I have experienced a variety of work environments and, well, in my opinion sort-of open-space rooms work fine, even great, provided some guidelines are followed:

Don't make them too large. Around 100/150m² (1000-1500 sqft) should be ok to accommodate enough people working together.
Reserve enough space for each worker. Usually 7-10m² (70-100 sqft) for each person is a good guideline.
Make sure desks are personal, and they're deep and wide enough. No desk should be narrower than 150cm (60 inches), and ideally you should aim at 180/200cm (70-80 inches). It should be possible for two people to sit at the same desk at any time, without messing with table legs, other people's legs, or any other item. Pursue no friction for collaboration.
Provide some place (lockers, cabinets, whatever) to let people put their personal and work belongings in, so the desktop area can be tidy and clear.
Provide good tools. Large screens, powerful workstations, if needed.
Make sure that both visual and auditory noise is minimal. While at their desk, no worker should be able to see their peers' monitors. Phones should be silenced. People who need to make constant noise (e.g. salesmen, tech support) should not be put in the same room as deep-working people (e.g. software developers). Carpeted floors and doors that can actually be closed can be very useful.
Respect privacy and safety. People should not fear shoulder surfing or that anybody could approach them without being seen, because that prevents many workers from getting relaxed enough to enter a flow state of mind.
Offices should not be hallways. There should be no need for somebody to cross a room just to reach another place.
Enforce a behavioural code. Talking is totally permitted, albeit in a low voice and not shouting around the room. I find pair programming 100% fine in such environments.
Provide other community space, with whiteboards, large screens, coffee, snacks, where people can meet & discuss when they need to collaborate rather than focus.

Using this kind of office space, you can accommodate 10-15 developers in a room, and change their position when they need to constantly work with somebody else. Collaboration turns out to work out properly, but distractions are limited. A decent trade-off.
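
The space figures above lend themselves to a quick back-of-the-envelope check. Here is a minimal sketch in Python; the function names are made up for this example, and the only inputs are the room sizes and the 7-10m² per person suggested in the list.

```python
# Rough capacity check based on the figures in the list above.
# room_capacity and is_too_crammed are hypothetical helper names.

def room_capacity(room_m2, m2_per_person):
    """How many people fit in a room at a given space allowance."""
    return int(room_m2 // m2_per_person)


def is_too_crammed(room_m2, people, min_m2_per_person=7):
    """True when each person gets less than the minimum allowance."""
    return room_m2 / people < min_m2_per_person


for room in (100, 150):
    # 10 m2 per person is the comfortable end, 7 m2 the tight end.
    print(room, "m2:", room_capacity(room, 10), "to", room_capacity(room, 7), "people")
```

With the comfortable 10m² allowance this gives 10 people for a 100m² room and 15 for a 150m² one, which is where the 10-15 developers figure comes from.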

What I finally argue is: if you're using an open floor plan just to cram more people in the same space, you're not improving productivity. You're (possibly) just improving total production. Because more people working in a bad environment could still produce more than fewer people in a good environment.

But, take some time to do some calculations. Your engineers are possibly the most expensive part of your budget. Are you totally sure that you'd want to spend a conspicuous amount of money on their wages, and then let them work at a fraction of their potential efficiency? This is especially relevant within the context of remote or hybrid work. Try comparing hybrid/full remote expenses to the ones of having everybody in a good office, not in a terrible one.

Sure, real estate and rents are a cost. But a distracting work environment won't reduce their abilities by just 1% or 2%: I'd speculate that the productivity drop can easily reach 20%-50%. Take note of your "developer density" and when it's approaching the limit, start looking for new office space. Don't wait to be Too Crammed To Do Anything Useful.
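
To make that calculation concrete, here is a small sketch. Every number in it is a hypothetical placeholder (salary, office cost per developer, productivity drop), and the function name is invented for the example; plug in your own figures.

```python
# All figures below are made-up placeholders, not real data.

def cost_per_effective_dev(salary, office_cost_per_dev, productivity):
    """Yearly cost of one "fully productive" developer-equivalent.

    productivity is a fraction of potential output: 1.0 means no loss,
    0.7 means a 30% drop.
    """
    return (salary + office_cost_per_dev) / productivity


# Cheap, crammed office with a 30% productivity drop (assumed).
crammed = cost_per_effective_dev(salary=60_000, office_cost_per_dev=3_000, productivity=0.7)
# More expensive, comfortable office with no drop (assumed).
roomy = cost_per_effective_dev(salary=60_000, office_cost_per_dev=8_000, productivity=1.0)

print(round(crammed), round(roomy))
```

With these invented numbers the crammed office ends up costing about 90,000 per effective developer against 68,000 for the roomy one: the rent saving is dwarfed by the lost output.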

Sometimes I hear somebody say that "hey, that's how it works nowadays, adapt or be ejected, I can work with noise and distractions". Let's suppose that such people actually exist and do a lot of great work; how many of them can you recruit in the current state of talent crunch? And, are you really sacrificing a lot of people that could do great work on the altar of smaller offices? Remember: when creating an office space for a team, you should think about the whole team, a lot of different people. Even though it works for you, it may not work for them. That's true for hybrid/remote as well; you need to select the right people to work at the office, and those may not be the best remote workers; and vice-versa.

UPDATE July, 20th 2018:
There are more and more people who agree with me that open floor plans are detrimental to productivity and conversation:

https://theconversation.com/a-new-study-should-be-the-final-nail-for-open-plan-offices-99756
https://code.likeagirl.io/a-research-roundup-to-show-that-your-office-layout-is-toxic-and-some-tips-for-making-it-better-8434864b0ab2
https://m.signalvnoise.com/the-open-plan-office-is-a-terrible-horrible-no-good-very-bad-idea-42bd9cd294e3
https://joshuatdean.com/wp-content/uploads/2020/02/NoiseCognitiveFunctionandWorkerProductivity.pdf

Photos by Breather and Annie Spratt

[1] The other one being long commutes that waste time and destroy workers' morale.

Command line data crunching with Python

Alan Franzoni — Wed, 14 Feb 2018 14:44:51 GMT

Every time I'm doing some data crunching on the command line, I find myself juggling between sed, awk, sort, uniq, etc. While I like the UNIX way of having one tool doing one thing well, I sometimes find it slightly boring to put all the tools together, sometimes stretching their features a bit too much.

I know that Perl and Ruby support implicit loops / prints - see this and that. Those switches make it easy to work with data on the command line, but I don't use those languages a lot anymore, so I always need to look something up in online manuals before performing something useful. And I never took my time to learn awk properly, so maybe I wouldn't need all that.

On the contrary, I still use Python quite a lot, and it's becoming the de-facto standard for data science purposes. Using it on the command line by piping something in & out of it, however, isn't always so easy - the -c switch allows passing a command in, but it's not always obvious whether a char is being interpreted by bash or by the Python interpreter, and Python is whitespace-sensitive, too. So a command line like:

$ python -c 'import sys;for x in sys.stdin:print x'

won't "just work":

  File "<string>", line 1
    import sys;for x in [1,5]:    print x;print x
                 ^
SyntaxError: invalid syntax

But: bash's $'...' syntax (ANSI-C quoting) interprets escape sequences such as \n inside a single-quoted string, so real newlines and indentation reach the Python interpreter and this will work fine:

$ echo -e "hello\nworld\nthis\nis\nme" | python -c $'import sys\nfor x in sys.stdin:\n    print(x.strip())'
hello
world
this
is
me

I find Python string manipulation to be great and usually fast-enough for not-so-large datasets, so you can do very interesting things and shell out to standard unix commands only if and when you actually need to. As long as you rely on the standard lib only, you're quite safe about portability, too.
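
As an example of the kind of thing I mean, here is a minimal sketch of a standard-lib-only replacement for the classic sort | uniq -c | sort -rn pipeline. The names top_lines and main are made up for this example; the sample input at the bottom just stands in for sys.stdin.

```python
import sys
from collections import Counter


def top_lines(lines, n=10):
    """Return the n most frequent non-empty lines with their counts."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)


def main(stream=sys.stdin):
    # Any iterable of lines works: a file, sys.stdin, or a plain list.
    for value, count in top_lines(stream):
        print(count, value)


# Small demo with in-memory data instead of a real pipe.
main(["a\n", "b\n", "a\n"])
```

Saved as a script, with the demo line replaced by a bare main() call, it can sit at the end of a pipe just like uniq would.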

AN IMPORTANT NOTE: if you're treating non-ASCII data, I suggest you set the PYTHONIOENCODING variable, especially if you're using Python 3, since that interpreter version converts to unicode objects wherever possible:

$ echo -e "ààà\nworld\nthis\nis\nme" | PYTHONIOENCODING='utf-8' python3 -c $'import sys\nfor x in sys.stdin:\n    print(x.strip())'
ààà
world
this
is
me
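
If you'd rather not depend on the environment variable, a possible alternative for longer scripts is to decode standard input explicitly. This is only a sketch under the assumption that the input is UTF-8; utf8_lines and main are names invented here, and it covers the input side only, while PYTHONIOENCODING also affects stdout.

```python
import io
import sys


def utf8_lines(binary_stream):
    """Decode a binary stream as UTF-8 and yield stripped lines."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8")
    for line in text:
        yield line.strip()


def main():
    # sys.stdin.buffer is the raw bytes stream behind sys.stdin (Python 3).
    for line in utf8_lines(sys.stdin.buffer):
        print(line)


# Demo with in-memory bytes: three "à" characters (6 bytes) and "world".
sample = io.BytesIO("\u00e0\u00e0\u00e0\nworld\n".encode("utf-8"))
print([len(s) for s in utf8_lines(sample)])
```

The demo prints character counts rather than the strings themselves, so it doesn't depend on how your terminal encodes output: the first line decodes to 3 characters, not 6 bytes.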

Enjoy your command line! And if you want to become a command line data processing guru, I cannot recommend this book enough.

Photo by Daniel Cheung on Unsplash