Hacker News | petalmind's comments

I think I understand this "Cartesian product" reasoning behind 4NF/5NF; I just find it irrelevant, I guess.

Cartesian product is explained in Kent: case (3) in https://www.bkent.net/Doc/simple5.htm#label4.1 ("A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages")

I do not explicitly mention this Cartesian product even though it is present in both posts ("sports / languages" in 4NF, and "brands / flavours" in 5NF).

> it demonstrates [one of] the precise problem[s] the normal form sets out to solve: a combinatorial explosion of rows.

I just don't understand this wording of "a combinatorial explosion of rows" — what's so dramatic here? I don't need four iterations of algebra-dense papers to explain this concept, I think it's pretty simple frankly.

And my implicit argument is, I guess, exactly that you could design tables that handle both problems without invoking 4NF and 5NF — people are doing that all the time.
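To make that concrete, here's a minimal sketch (Python with sqlite3 as a stand-in, made-up employee data) of Kent's cross-product form next to the two independent tables people reach for naturally, without ever invoking 4NF by name:

```python
# Hypothetical illustration: Kent's "cross-product" form stores one row per
# (skill, language) pairing, versus the two-independent-tables design.
import sqlite3
import itertools

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

skills = ["cooking", "typing"]
languages = ["French", "German", "Greek"]

# Cross-product form: every skill paired with every language.
cur.execute("CREATE TABLE emp_cross (emp TEXT, skill TEXT, language TEXT)")
cur.executemany(
    "INSERT INTO emp_cross VALUES ('Smith', ?, ?)",
    itertools.product(skills, languages),
)

# The "obvious" design: skills and languages stored independently.
cur.execute("CREATE TABLE emp_skill (emp TEXT, skill TEXT)")
cur.execute("CREATE TABLE emp_language (emp TEXT, language TEXT)")
cur.executemany("INSERT INTO emp_skill VALUES ('Smith', ?)",
                [(s,) for s in skills])
cur.executemany("INSERT INTO emp_language VALUES ('Smith', ?)",
                [(l,) for l in languages])

# 2 skills x 3 languages = 6 rows in the cross-product form...
print(cur.execute("SELECT count(*) FROM emp_cross").fetchone()[0])  # 6
# ...but only 2 + 3 = 5 rows total across the two independent tables.
print(cur.execute(
    "SELECT (SELECT count(*) FROM emp_skill)"
    " + (SELECT count(*) FROM emp_language)"
).fetchone()[0])  # 5
```

The row counts diverge fast as the lists grow, which is the whole "combinatorial explosion" in two print statements.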


Could you share an example of a writing style that you enjoy?

I have so many questions about that. Should that normal form basically replace 5NF for the purposes of teaching?

Why do they hate us and not provide any illustrative real-life example without algebraic notation? Is it even possible?

I just want to see a CREATE TABLE statement, and some illustrative SELECT statements. The standard examples always give just the dataset, but dataset examples are often ambiguous.

> (in its joins)

Do you understand what "its" joins are? What even is "it" here?

I'm super frustrated. This paper is 14 years old.



Chris Date has a course on this using his suppliers-and-parts example. I don't have time to find it, but maybe AI can.

https://www.oreilly.com/videos/c-j-dates-database/9781449336...

https://www.amazon.ca/Database-Design-Relational-Theory-Norm...


Imperative mood "normalize" assumes that you had something not-normalized before you received that instruction. It's not useful when your table design strategy is already normalization-preserving, such as the most basic textbook strategy (a table per anchor, a column per attribute or 1:N link, a 2-column table per M:N link).
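For instance, a minimal sketch of that textbook strategy (hypothetical student/course anchors, sqlite3 as a convenient stand-in):

```python
# A table per anchor, a column per attribute, a 2-column table per M:N link.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one table per anchor, one column per attribute
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
    -- one 2-column table per M:N link
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(id),
        course_id  INTEGER REFERENCES course(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ada');
    INSERT INTO course  VALUES (10, 'Databases');
    INSERT INTO enrollment VALUES (1, 10);
""")
# Nothing here ever needed a "normalize" pass: the design starts normalized.
print(conn.execute("""
    SELECT s.name, c.title
    FROM enrollment e
    JOIN student s ON s.id = e.student_id
    JOIN course  c ON c.id = e.course_id
""").fetchall())  # [('Ada', 'Databases')]
```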

And this is basically the main point of my critique of 4NF and 5NF. They both traditionally present an unexplained table that is supposed to be normalized. But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably quite simple) concept of normalization?

It's like saying that in order to implement an algorithm you have to remove bugs from its original implementation — but where does that implementation come from?

The other side of this coin is that lots of real-world designs have a lot of denormalized representations that are often reasonably well engineered.

Because of that, if you, as a novice, look at a typical production schema with this "thou shalt normalize" instruction in mind, you'll be confused.

This is my big teaching pet peeve.


> But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably quite simple) concept of normalization?

I find the bafflement expressed in this article, as well as in the one linked, extremely attractive. It made both a joy to read.

Were I to hazard a guess: might it be a consequence of the lack of disk space in those early decades, resulting in developers being cautious about defining new tables and failing to realise that the duplication in their tragic designs would waste even more space?

> The other side of this coin is that lots of real-world designs have a lot of denormalized representations that are often reasonably well engineered.

Agreed, but as the OP comment stated, they usually started out normalised and then pushed out denormalised representations for nice contiguous reads.

As a victim of maintaining a stack on top of an EAV schema once upon a time, I have great appreciation for contiguous reads.


> Might it be a consequence of lack of disk space in those early decades

A plausible explanation of "normalization as a process" was actually found in https://www.cargocultcode.com/normalization-is-not-a-process... ("So where did it begin?").

I hope someday to find a technical report, from around that time, of migrating to a relational database.


> Normalization-as-process makes sense in a specific scenario: When converting a hierarchical database model into a relational model.

That makes much more sense as reasoning.

If I can also hazard a second guess: I used to work in embedded in the 2000s, and it was absolutely insane how almost all of the elderly architects and developers would readily accept some fixed-width file format for data storage over a sensible solution that offered out-of-the-box transactionality and relational modelling, like SQLite. This creates a mindset where each datastore is effectively siloed and must contain all the information needed to perform the operation, potentially leading to these denormalised designs.

A bit weird, given that this was the waterfall era, implying that "Big Design Up Front" wasn't actually doing any real thinking about modelling up front. But I've been in that room, and I think a lot of it was cargo cult. To deal with the insanity of simple file I/O as a data store, I had to write a rudimentary atomicity system from scratch to fix the dumb corruption issues of their design, when I would have got that for free with SQLite.


Yeah, no NULL is ever equal to any other NULL, so they are basically unique.

You are also guaranteed to be able to retrieve your data, just query for '... is null'. No complicated logic needed!
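For what it's worth, both behaviors are easy to demonstrate (SQLite shown here; note that some engines, e.g. SQL Server, treat NULLs under UNIQUE differently):

```python
# Two NULLs coexist under a UNIQUE constraint, because no NULL equals another.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val TEXT UNIQUE)")
conn.execute("INSERT INTO t VALUES (NULL)")
conn.execute("INSERT INTO t VALUES (NULL)")  # no UNIQUE violation

# NULL = NULL is not true, so an equality match finds nothing...
print(conn.execute(
    "SELECT count(*) FROM t WHERE val = NULL").fetchone()[0])   # 0
# ...but IS NULL finds both rows.
print(conn.execute(
    "SELECT count(*) FROM t WHERE val IS NULL").fetchone()[0])  # 2
```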

> Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM)

One problem is that normal forms are underspecified even by the academy.

E.g., Millist W. Vincent "A corrected 5NF definition for relational database design" (1997) (!) shows that the traditional definition of 5NF was deficient. 5NF was introduced in 1979 (I was one year old then).

2NF and 3NF should basically be merged into BCNF, if I understand correctly, and treated like a general case (as per Darwen).

Also, the numeric sequence is not very useful because there are at least four non-numeric forms (https://andreipall.github.io/sql/database-normalization/).

Also, personally I think that 6NF should be foundational, but that's a separate matter.
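As a rough sketch of what that could look like (made-up names, sqlite3 as a stand-in): in 6NF you get one narrow table per non-key attribute, and the wide row comes back via joins:

```python
# Wide form: person(id, name, birth_year).
# 6NF form: a key table plus one (key, attribute) table per attribute;
# a missing attribute is simply a missing row, no NULL needed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person       (id INTEGER PRIMARY KEY);
    CREATE TABLE person_name  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_birth (id INTEGER PRIMARY KEY, birth_year INTEGER);

    INSERT INTO person VALUES (1);
    INSERT INTO person_name VALUES (1, 'Grace');
    INSERT INTO person_birth VALUES (1, 1906);
""")
# The wide row is reconstructed by joining the narrow tables.
print(conn.execute("""
    SELECT p.id, n.name, b.birth_year
    FROM person p
    JOIN person_name n  ON n.id = p.id
    JOIN person_birth b ON b.id = p.id
""").fetchall())  # [(1, 'Grace', 1906)]
```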


"1979 (I was one year old then)."

Well, we are roughly the same age then. Ours is a cynical generation.

"One problem is that normal forms are underspecified even by the academy."

The cynic in me would say they were doing their job by the example I gave, which is just to provide easy test answers, after which there wasn't much reason to iterate on them. I imagine waving around normal forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.


> I imagine waving around normal forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.

Real talk: those consultants are absolutely essential - the unsung heroes of so many "organic" database projects that would have started as an Excel spreadsheet on a nontechnical middle-manager's workgroup-networked desktop, grown over time into a dBase file, then MSAccess/JET, then MSDE or MSSQL Express if they thought they knew what they were doing - and then, if it's the mid-2000s, maybe moved onto a dedicated on-prem Oracle or MSSQL box, but still an RDBMS. I remember in 2014 all the talk was about moving data out of on-prem RDBMS siloes and onto Cloud(TM)-y OLAP clusters (trying to hide the fact they were running stock Postgres), which acted as a source for a Hadoop cluster - all to produce dashboards and visualizations made with the $100k Tableau license your company purchased after their sales guys showed your org's procurement people a good time in Cancun.

None of the evolution and progress described above could have happened if not for the awful DB designs in that initial Access DB - the anti-patterns would be carried through the DB whenever it ascended to the next tier of serious-business-ness, and each and every design decision made out of innocent ignorance gets gradually massaged out of the model by the regular and recurring visits of DBA consultants - because (and goddamnit, it's true) a depressingly tiny proportion of software people (let alone computer-people) know anything about DB design and theory, or the vendor-specific gotchas.

What I still don't understand is how in 2026 - after 30 years of scolding beginners online - we've successfully gotten greenhorn software-dev people to move away from VBA/VB6's dead end, PHP's unintentional fractal of bad design, and MySQL's meh-ness, and onto sane and capable platforms like TypeScript, Node, and Postgres - all good stuff; and yet on my home turf of StackOverflow, I still see people writing old comma-style (pre-SQL-92) JOINs and CREATE TABLE statements covered in more backticks than my late grandmother's labrador. I honestly have no idea where/when/how all those people are still learning that obsolete JOIN syntax today.

So in conclusion: the evidence suggests that not enough people today truly understand databases well-enough to render expensive DBA consultants irrelevant.


Thank you, your take on the evolution from spreadsheets to proper databases is enlightening.

yep. born 1960.

> Also, personally I think that 6NF should be foundational, but that's a separate matter.

I share your ideal, but there exists a slight problem: no RDBMS I'm aware of really facilitates 6NF or DKNF (or even Codd's full relational concept; or newfound essentials like relational-division, and so on...).

There are also genuine ergonomic issues to contend with: pretty much every RDBMS design and/or administration tool I've used in the past 20 years (SSMS, SSDT, DBeaver, MSAccess (lol), phpMyAdmin, etc.) will present the database as a long, flat list of tables - often only in alphabetical order. If you're lucky, the tooling might let you group the tables into logical subfolders based on some kind of 2-part name scheme baked into the RDBMS (e.g. "schemas" in MSSQL).

...which starts being counterproductive when 6NF means you have a large number of tables that absolutely need to exist, but aren't really that significant by themselves; yet they always need to remain accessible to the user of the tool (so they can't be completely hidden). So you'll turn to the diagramming feature in your DB GUI, which gives you a broader 2D view of your DB where you can group related objects together by proximity - instead of endlessly scrolling a long alphabetical list - and you can actually see FKs represented by physical connections, which aids intuitive grokking when you're mentally onboarding onto a huge legacy production DB design.

...but DB diagrams are just too slow to load (the tooling needs to read the entire DB's schema and all its objects before it can give you a useful view of everything), which is just so incredibly grating, whereas that alphabetical list loads instantly.

Sorry, I'm just rambling now, but anyway, my point is: 6NF is great, but our tooling sucks, and the RDBMSs they connect to suck even more. For example, SQL-92 defined the 4 main CONSTRAINT types seen in practically all RDBMSs today (CHECK, FOREIGN KEY, UNIQUE, and DEFAULT); over 30 years later we still have the same anaemic set of primitive constraints, and only Postgres went further (with its `EXCLUDE` constraint). As of 2026, almost 40 years after it was defined, no RDBMS supports ASSERTION constraints; whither DOMAIN constraints and a unified type-system that elegantly mediates between named scalars, relations (unordered sets of tuples), queries, DOMAINs, and the rest?

...this situation is maddening to me because so many data-modelling problems exist _because_ of how unevolved our RDBMS are.


I broadly agree with you, so I want to pick your brain a bit:

What would your ideal RDBMS / tooling look like, one that facilitates 6NF effectively? Do you think it's more a limitation of the query/storage engine, or the query language (SQL), or the user interface? Do you think founding it on Datalog (or similar), which kinda lends itself to "narrow" relations, instead of SQL, which kinda lends itself to "wide" relations, would help here?

(I ask as one of my personal hobby-horses is trying to design better query languages and tooling, and 6NF/Datalog maintains a kinda special place in my heart)


Is there a good way to contact you - ideally something like IM or Teams/Slack/Zoom?

sure! You can email me at notches_hoses_58@icloud.com (throwaway email), and we could go from there?

Reply sent just now.

DBeaver can show a relationship diagram between tables. It's the main reason I've used it at all.

https://dbeaver.com/docs/dbeaver/ER-Diagrams/


I could have worded my post a bit better - I didn't mean to imply DBeaver only shows a flat list of tables/objects, and DBeaver is hardly unique in having DB diagrams. My point was that every DB-diagram feature/tool/workspace in a DB admin tool or IDE (like DBeaver, SSMS, SSDT, etc.) is necessarily performance-constrained, because it needs to load _so much_ metadata before it can show an accurate - and therefore useful - picture of the DB, even if it's just a subset of all tables/objects.

Frankly, I don't think overcounting is solved by normalizing, because it's easy to write an overcounting SQL query over perfectly normalized data.

I tried to explain the real cause of overcounting in my "Modern Guide to SQL JOINs":

https://kb.databasedesignbook.com/posts/sql-joins/#understan...
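A minimal reproduction of the fan-out effect, on perfectly normalized (made-up) data:

```python
# One order, two shipment rows: the join duplicates the order's amount
# once per shipment, so an aggregate over the joined result overcounts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE shipments (order_id INTEGER, box TEXT);

    INSERT INTO orders VALUES (1, 100);
    INSERT INTO shipments VALUES (1, 'box A'), (1, 'box B');
""")
# The amount is repeated on every matching shipment row, so SUM overcounts:
print(conn.execute("""
    SELECT sum(o.amount)
    FROM orders o JOIN shipments s ON s.order_id = o.id
""").fetchone()[0])  # 200, not 100
# Aggregating before (or without) the fan-out join gives the right answer:
print(conn.execute("SELECT sum(amount) FROM orders").fetchone()[0])  # 100
```

Both tables here are as normalized as you like; the bug lives entirely in the query.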


I'll go one further and say that if you're reaching for DISTINCT and you have joins, you may have joined the data the wrong way. It's not a RULE, but it's ALWAYS a 'smell' when I see a query that uses DISTINCT to shove away duplicate matches. I always add a comment for the exceptions.
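One concrete instance of that smell, with made-up tables: JOIN + DISTINCT to answer "which customers have at least one order", versus an EXISTS that never produces the duplicates in the first place:

```python
# Ada has two orders, Bob has none; both queries below return the same
# answer, but only the first has to shove duplicates away with DISTINCT.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER);

    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")
# The join fans out to one row per order, then DISTINCT hides the damage:
print(conn.execute("""
    SELECT DISTINCT c.name
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchall())  # [('Ada',)]
# EXISTS asks the actual question ("is there at least one order?") directly:
print(conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall())  # [('Ada',)]
```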

Great read, thank you!

One day I hope to write about denormalization, explained explicitly via JOINs.

Please do, your content is great!

Do you know if Redex could be a tool that would be useful in my situation?

Claude suggests it, but I need to learn a lot of Redex to understand how to apply it.

I have a general understanding of what operational semantics is. Or maybe, are there any Redex implementations for common programming languages? \lambda_v is a bit too abstract.


Even the Ruby `parser` gem uses S-expressions in its documentation: https://github.com/whitequark/parser/blob/master/doc/AST_FOR...
