The other explanation is that often these are just mistakes: they happen when a team of experts in their own field, but not in data management, is doing a lot of manual work with data and has no budget for building a more robust system. It's so easy to copy and paste something into the wrong place, to sort by one field and get things out of order, and all kinds of issues like that.
On the other hand, any time a hypothesis appears significant, the first reaction should be to verify that all the data going into the calculation is correct, rather than just assuming it is. In my day-to-day industry experience, significant results come far more often from incorrect data than from an actual discovery.
Even years after moving away from raw data work, way too much of my brain is still dedicated to "ways of dealing with CSV from random places".
I can already hear people who like CSV coming in now, so to get some of my bottled-up anger about CSV out, and to forestall the responses I've seen before:
* It's not standardised
* Yes, I know you found an RFC written long after many generators and parsers already existed. It's not a standard, it's regularly not followed, and it doesn't specify UTF-8 (lmao, in 2005 no less) or any other character set; files are just bytes. I have learned about many new character sets from data submitted by real users. I have had to split up files written in multiple different character sets because users concatenated files.
* "You can edit it in a text editor", which feels like a monkey's-paw wish: "I want to edit the file easily." "Granted: your users can now edit the files easily." Users editing the files in text editors produces broken CSV, because your text editor isn't checking that the result is standards-compliant or correctly typed, and couldn't even if it wanted to.
* Errors are not even detectable in many cases.
* Parsers are often either strict, and so fail on real-world cases, or lenient enough for real-world cases, and so let broken files through.
* Literally no types. Nice date field you have there; shame if someone were to put a mixture of dd/mm/yy and mm/dd/yy into it.
* You can blame Excel for being Excel, but if that CSV file ever leaves an automated data-handling system and a user can do something to it, it's getting loaded into Excel and rewritten. Say goodbye to leading zeros, a variety of gene names, dates and more, in a fully unrecoverable fashion.
* "Ah, just use tabs." No, your users will put tabs in. "That's why I use pipes." Yes, pipes too. I have written code to use the actual unit and record separators that exist in ASCII, and users still found some way of getting those into the middle of a word in some arbitrary data. The only three places I've ever seen these characters are: 1. lists of ASCII characters, where I found out about them; 2. my code; 3. this user's data. It must have been crafted deliberately to break things.
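For anyone who hasn't met them, the ASCII separator idea is trivial to sketch (this is illustrative Python, not my actual code):

```python
# ASCII defines dedicated delimiters that "never" appear in data:
# 0x1E is the record separator, 0x1F is the unit (field) separator.
RS = "\x1e"
US = "\x1f"

def write_records(records):
    """Join fields with the unit separator and records with the
    record separator -- no quoting or escaping needed, in theory."""
    return RS.join(US.join(fields) for fields in records)

def read_records(blob):
    """Inverse of write_records."""
    return [record.split(US) for record in blob.split(RS)]

data = [["id", "name"], ["001", "alpha"], ["002", "beta"]]
blob = write_records(data)
assert read_records(blob) == data
# ...which works perfectly until, as above, a user somehow gets
# 0x1F into a field and shifts every column after it.
```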
This, Excel, and other things are enormous issues. The fact that there are any manual steps along the path introduces so many places for errors. People write things down and then enter them into Excel or whatever. Data gets moved between files. You ran some analysis and got graphs; are those the ones in the paper? Are they based on the same datasets? You later updated something; are all the downstream things updated?
This occurs in all kinds of papers. I've seen clear and obvious issues in datasets covering many billions in spending, trillions in aggregate. I can only assume the same is true in many other fields, since the same processes exist there too.
There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are, often working late hours in personal time, so it's rarely addressed. My wife was still working on papers for a research position she had left, and was no longer being paid for, years afterwards, because the whole research-to-publication process is so slow. What time is there, then, for learning and designing a better way of tracking and recording data, and for teaching all the other people how to update it and generate stats? I built things that helped, but there's only so much of the workflow I could manage.
While I appreciate a good rant just as much as the next person, most of these points have nothing to do with CSV. They are a general problem of underspecified data, which is exactly what you get when you move data between systems.
The number of hours I have wasted unifying character sets across single database tables is horrifying to even think about. And the months it took before an important national dataset, one that supposedly many people use across several types of business, was usable were staggering. The fact that that XML came with a DTD was apparently no hindrance to doing unspeakable horrors with both attributes and CDATA constructs.
Sure, you can specify MM/DD/YY in a table, but if people put DD/MM/YY in there, what are you going to do about it? That's exactly what happens in the real world when people move data across systems. That's why mojibake is still a thing in 2026.
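A quick sketch of why you can't fix this after the fact (illustrative Python, not any particular library):

```python
from datetime import datetime

def possible_parses(s):
    """Return every way s parses as DD/MM/YY or MM/DD/YY."""
    results = []
    for fmt in ("%d/%m/%y", "%m/%d/%y"):
        try:
            results.append((fmt, datetime.strptime(s, fmt).date()))
        except ValueError:
            pass  # this format doesn't fit
    return results

# "13/05/25" can only be DD/MM/YY, so it parses exactly one way.
print(possible_parses("13/05/25"))
# "05/06/25" parses under both formats, to two different dates:
# once the data is mixed, there is no way to tell what was meant.
print(possible_parses("05/06/25"))
```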
You're blaming a lot of normal ETL problems on DSVs.
Like, specifying date as the type of a field in JSON isn't going to ensure that people format it correctly and uniformly. You still have parsing issues, except now you're duplicating the ignored schema for every data point. The benefit you get for all that overhead is more useful for network issues than for ensuring a file is well formed before sending it. The people who send garbage will be just as likely to send garbage when the format isn't tabular.
There are types and there is a spec WHEN YOU DEFINE IT.
You define a spec. You deal with garbage that doesn't match the spec. You adjust your tools if the garbage-sending account is big. You warn or fire them if they're small. You shit-talk the garbage senders after hours to blow off steam. That's what ETL is.
DSVs aren't the problem. Or maybe they are for you because you're unable to address problems in your process, so you need a heavy unreadable format that enforces things that could be handled elsewhere.
We are talking here in the context of scientific datasets. Of course ETL plays a part. But here it is really the interplay of Excel with CSV, which is often output by scientific instruments or scientific assistants.
You get your raw sensor data as a CSV and just want to take a look in Excel. It understandably mangles the data in an attempt to infer column types; because of course it does, it's CSV! Then you mistakenly hit save and boom, all your data on disk is now an unrecoverably mangled mess.
Of course this is also the fault of not having good, clean data practices, but with CSV and Excel it is just so, so easy to hold it wrong, simply because there is no right way.
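A toy sketch of the kind of inference that causes this; it's a caricature of what Excel does on open, not its actual rules:

```python
import re

# Month-name prefixes a naive inferencer might recognize (toy list).
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "MARCH": 3, "APR": 4,
          "MAY": 5, "JUN": 6, "JUL": 7, "AUG": 8, "SEP": 9,
          "SEPT": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def infer(cell):
    # Looks numeric? Coerce it, silently dropping leading zeros.
    try:
        return int(cell)
    except ValueError:
        pass
    # Month name plus a number? Treat it as a date.
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", cell)
    if m and m.group(1).upper() in MONTHS:
        return f"{int(m.group(2)):02d}-{m.group(1)[:3].title()}"
    return cell

row = ["007", "SEPT2", "MARCH1", "plain text"]
print([infer(c) for c in row])
# -> [7, '02-Sep', '01-Mar', 'plain text']
# The ID "007" became 7 and the gene names became "dates";
# once the file is saved, the originals are gone for good.
```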
> so you need a heavy unreadable format
I prefer human unreadable if it means I get machine readable without any guesswork.
If you get an .xls which doesn't have very esoteric functions, I expect it to open about the same way in any Excel program and any other office suite.
With CSV I do not have that expectation. I know that for some random user-submitted CSVs, I will have to fiddle. Even if that means finding the one row in thousand rows which has some null value placeholder, messing up the whole automatic inference.
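A sketch of why one placeholder row poisons a whole column (a deliberately naive inferencer, just for illustration):

```python
import csv
import io

def infer_column_types(csv_text):
    """Naive whole-column inference: a column counts as numeric
    only if every value in it parses as a number."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    types = []
    for i in range(len(header)):
        try:
            for row in body:
                float(row[i])
            types.append("float")
        except ValueError:
            types.append("str")
    return types

clean = "id,reading\n1,0.5\n2,0.7\n"
dirty = "id,reading\n1,0.5\n2,N/A\n"   # one placeholder value

print(infer_column_types(clean))  # ['float', 'float']
print(infer_column_types(dirty))  # ['float', 'str'] - one N/A flips the column
```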
> "You can edit it in a text editor" which feels like a monkeys-paw wish
Yes :) Although I will note that some editors are good enough to maintain the structure as the user edits. Consider Emacs with `csv-mode`, for example. Of course most users don’t have Emacs so they’ll just end up using notepad (or worse, Word).
Systems that review pull requests have been caught out; that's a simple and clear case. The more obvious one, for most people, is anything that interacts with your email without an explicit allow-list of emails to read.
Yes, but none of this applies to the local codex agent that runs when I tell it to and has access to my computer. Like: "scan this folder of PDFs and create an Excel file with all expenses. Then enter them into my tax software." This needs access to very sensitive data and involves quite complex handling of data. But the only attack vector I see is someone injecting prompts into my invoice files.
The overall speed rather than TTFT might start to be more relevant as the caller moves from being a human to another model.
However, quality is really important. I tried that site and clicked one of their examples, "create a javascript animation". The response was fast, but while it starts like this:

```
Below is a self‑contained HTML + CSS + JavaScript example that creates a simple, smooth animation: a colorful ball bounces around the browser window while leaving a fading trail behind it.
```

...it then degrades into corrupted text.
Weird; I clicked through out of curiosity and didn't get any corruption of the sort in the end result.
I also asked it some technical details about how diffusion LLMs could work and it provided grammatically-correct plausible answers in a very short time (I don't know the tech to say if it's correct or not).
I got the exact same thing, but trying out another few prompts I couldn't get it to happen again. I wonder if it's a bug with the caching/website? I can't imagine they actually run inference each time you use one of the sample prompts.
These are vastly different scales though. “If North Korea wanted to, they could spend a lot of money and get into your system” is wildly different to “anyone with a few bucks who can ask ‘please find an exploit for Y’ can get in”
To be fair, the recent Axios supply chain attack was North Korea-based and probably cost them very little money. So it illustrates that you don't have to "spend a lot of money" to get into our systems.
You’re not getting a worthwhile SLA on a subscription at this rate. What are you going to get, a few dollars? An SLA isn’t useful unless it actually bites for the provider and actually compensates the customer. And it costs money: how much are you willing to spend on this insurance?
I think it’s always good to dig a bit deeper on these things.
This seems ridiculous to you, compared to a very obvious win with a Lego sorting vacuum.
Lego isn’t niche, and the explanation isn’t some weird technical thing whose importance or value only experts would understand.
Yet it’s not being done.
Is there nobody who has realised this gap but you? Has nobody managed to convince people with money that it’s worthwhile? Have you tried but failed?
Or is it not many, many thousands of people who are wrong, but you?
Is the problem harder than you think? I’ve worked with robotics, though not for a long time, and I think the core manipulation either isn’t really solved or wasn’t until recently. I don’t know about yours, but my kids don’t fully dismantle their Lego creations either, so would the robot need to take them apart too? That’s a lot of force. And some pieces are special.
How people want Lego sorted is pretty broad. Kids don’t even need it sorted that much. And the volume can be huge for smaller buckets of things.
Is the market not as big as you think? Is it big enough for the cost? I’d buy one for £100, but £1,000? £10,000?
How does it compare for most people against having the kids play on a blanket and then tipping it into a bucket? Or those ones that are a circle of cloth with a drawstring so it’s a play area and storage all in one? I 3d printed some sieves and that’s most of the issue right there done.
People are solving actual problems, but lots of problems are hard, and not all of them are profitable.
As a gut feeling, the overlap between engineers, large Lego collections, and willingness to spend lots of money and time to save some time sorting Lego is so large that the small number of implementations, usually spread over many years, says a lot about the difficulty.
In principle, anonymized case studies do not require consent, and historically they were often published without it. Without personally identifiable information, this is and always has been 100% legal. But in modern practice, many journals acknowledge that making a case fully anonymous in the age of the internet might not even be possible without taking away everything noteworthy, so they require some form of consent nowadays.
That's not so easy, especially for clinical case studies. If any data points are irrelevant, they should not be stated at all; because they might not be irrelevant after all, arbitrarily changing them could confound results. On the other hand, it has been shown that three or more indirect data points can already be enough to unmask you in an anonymized report, and most reports contain many more than that. So it's not surprising that journals would cover their backs by requiring consent, even if the law does not explicitly demand it.
It’s been known since at least the 90s that it’s really hard to fully anonymize patient records. You can’t be certain but you can infer probabilities from very little information.
I don’t know how typical it is, but HIPAA explicitly doesn’t cover patient data after anonymization, and anecdotally I’ve had an anonymous case study published about me without my consent (although I was notified after).
> Today’s backyard AI looks like AI. It is not AI.
Getting real tired of people new to AI thinking only recent LLMs are AI somehow. Bag-of-words (BoW) was a pretty solid technique, and that only requires you to learn how to count to one.