HN Reader

The principles of database design, or, the Truth is out there

121

7 months agoby b-man

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and natural key almost certainly isn’t ideal. Off the top my head, IDs can be used for:

- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.

- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)

- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).

There is not one solution that handles all of these well. But using natural keys is one of the least good options.

Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.

So ideally I’d put a unique index on the national ID column. But realistically, it would be no unique constraint and instead form validation + a warning on anytime someone opened a screen for a user with a non-unique ID.

Then maybe a BIGINT for database ID, and a UUID4/7 for exposing to the world.

EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.

7 months agoby adamcharnock

This takes an overly simple view of what domains can look like. There are data models that necessarily violate these principles, and they aren’t all that rare.

Some examples:

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity

In some domains there is no natural key because the identity is literally an inference problem and relations are probabilistic. The objective of the data model is to aggregate enough records to discover and attribute natural keys with some level of confidence. A common class of data models with this property are entity resolution data models.

> All information in the database is represented explicitly and in exactly one way

Some data models have famously dual natures. Cartographic data models, for example, must be represented both as a graph models (for routing and reachability relationships) and as geometric models (for spatial relationships). The “one true representation” has been a perennial argument in mapping for my entire life and both sides are demonstrably correct.

> Every base relation should be in its highest normal form (3, 5 or 6th normal form).

This is one of those things that sounds attractive because it ignores that it requires no ambiguities about domain boundaries or semantics, which doesn’t exist in practice. I bought into this idea too when I was a young and naive data modeler. Trying to tamp out these ambiguities adds an unbounded number of data model epicycles that add a lot of complexity and performance loss. At some point, strict normalization is not worth the cost in several aspects.

In almost all cases, it is far more important that the data model be efficient to work with than it be the abstract platonic ideal of a domain model. All of these principles have to work on real hardware in real operational environments with all of the messy limitations that implies.

7 months agoby jandrewrogers

> Principle of Essential Denotation (PED): A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

  create table citizen (
    national_id national_id primary key,
    full_name text);

Is national_id really a natural key, or is it someone else's synthetic key? If so, should the owner of that database have opted for a natural key rather than a synthetic key?

More arguments for synthetic over natural keys: https://blog.ploeh.dk/2024/06/03/youll-regret-using-natural-...

7 months agoby mrkeen

I have a joke in context that I often like to tell:

Devil captured Physicist, Engineer and Mathematician. He gave each of them big can of spam and locked them in the empty room saying „you will be here for 2 weeks - open the can and survive or die to starvation”. After 2 weeks Devil opens Physicist cell. It’s covered floor to the ceiling in complex scribbles. One piece of wall is clean of etching but small dent is visible. Can of span is opened and eaten clean, Physicist sits in corner visibly annoyed. Next one is Engineer. Cell walls are covered in multiple dents and pieces of spam. Engineer is bruised almost as much as the can, but it is ultimately opened and engineer is alive.

Finally the Devil opens Mathematician cell and find him dead. Only „given the cylinder” is etched on the wall.

—-

Puent isn’t about engineering but it always helped me to set limits between software engineering and computer science.

7 months agoby xlii

I really don't like using natural keys as primary keys.

Natural keys sometimes need to change for unforeseen reasons, such as identity theft, and this is really tricky to manage if those keys are cascaded into many tables as foreign keys.

Natural keys are often not unique either. Using the national ID example, there are millions of duplicate SSNs issued within USA. https://www.computerworld.com/article/1687803/not-so-unique....

So, don't use natural keys as primary keys. Put them in as surrogate keys, ideally with a unique constraint.

7 months agoby pretoriusdre

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

That's a very nice example of Theory VS Practice !

I agree, in Theory, PK should be identified by the natural unique key identifier of the domain. Your example with national_id is proper.

In practice :

- due to their varying underlying type, manipulating natural key identifier throught an application (passing values) is not always easy. UUID or INT are reliable and always works.

- sometimes, we want a "poor" anonymization when passing and calculting data. As the PK is the identifier of the tuple, having a personal data for PK can be problematic.

- last but not least : using PK that is subject to change is a terrible idea. An address mail, for example, anybody can be subject to change. When it's the PK, you are fuck to do or propagate the change.

And in this age of gods war, Practice always win.

7 months agoby QuiCasseRien

"Databases are representations of reality"

The national ID example is funny. Let me give you a dose of reality concerning national IDs.

- I've seen cases with my country's national ID numbers containing duplicates due to human error.

- National ID's for people can change. In my country, children being adopted, get a different ID after adoption.

- There exist people without a national ID. Or people that don't want you to know their national ID.

7 months agoby photios

Seems like this article places too much emphasis on normalization, which is appropriate for many cases, but may be a huge cost and performance issue for requirements like reporting. You may probably need different kinds of schema and data storage structures for different requirements in the same application, which in turn may result in duplicated data, but results in acceptable trade offs.

7 months agoby AnonHP

The example works if and only if there's one National ID per person.

That's not true for SSNs. It's not true in that it is false. My statement that it is false is, in point of fact, true, and therefore not up for debate.

The government even acknowledges this:

https://www.ssa.gov/OP_Home/handbook/handbook.14/handbook-14...

> 1401.7 Can a person have more than one SSN?

> Most persons have only one SSN. In certain limited situations, SSA can assign you a new number. If you receive a new SSN, you should use the new number. However, your old and new number will remain linked in our records to ensure that your earnings are credited properly. This could affect your benefits.

Maybe there are countries where it is the case that nobody ever gets multiple National IDs. Maybe there are countries without fraud and where everyone can and will update their records when the government does. Maybe there is a veritable Utopia on Earth, a Cockaigne of validated data and reasonable deadlines.

7 months agoby msla

I’m going to have to disagree. What happened what national ids are expanded to include letters? Or extra digits? I’ve lived through this mess a few times in the migrations are painful and take a ton of time away from solving actual business problems. For example: us telephone numbers. The exchange (middle 3 digits), used to physically Geolocate a consumer, because everybody had land lines. I worked on a project and an investment firm that was using it as a primary key to identify a users’s location. Holy old macro was that an unbelievably expensive and painful migration. Don’t do this.

Instead, follow this principle: never ever use an externally assigned identifier as a primary key in your database.

Instead: link tables with simple integers. Do not use these integers for any other purpose and assume unordered. Never let these integers leave the app. When exchanging information between systems or third parties generate a suitable external identifier with a alphabetical prefix.

7 months agoby exabrial

The "natural ID" for people design reminds me of a story from a state department of education: They had two students, both named John Smith Jr. They were identical twins and attending the same class.

They had the same birth date, school, parents, phone number, street address, first name, last name, school, teachers, everything...

The story was that their dad was John Smith Sr in a long line of John Smiths going back a dozen generations. It was "a thing" for the family line, and there was no way he was going to break centuries of tradition just because he happened to have twins.

Note: In very junior grades the kids aren't expected to memorise and use a student ID because they haven't (officially) learned to read and write yet! (I didn't use one until University.)

7 months agoby jiggawatts

Just a side note about the historical anecdote at the bottom of the post, which is related to Notre-Dame de Paris:

> 28 statues that portrayed the biblical Kings of Judah. [...] They didn’t portray French kings

That's wrong. Several texts from the revolution and before still exist that prove that these kings were identified as both Judea kings and France kings. For instance, David was Pépin le bref. On one of the gates of the cathedral, the List of the French kings was engraved, starting with Clovis. That glorification of the monarchy, with parallels to the bible, was common at the time: other French medieval cathedrals show the same analogies.

7 months agoby idoubtit

No. Real life rarely has natural keys that are unique and do not change. For example the national id number in several countries can change in some circumstances...and that is already a synthetic key.

7 months agoby peanut-walrus

Ohhhhh absolutely not, thank you. I want my IDs to have absolutely no meaning whatsoever.

7 months agoby traches

Lots of peole got hung up on the example, which I thought would be be helpful on the discussion, but certainly should not replace the main point, which is:

Relations, attributes and tuples are logical. A PK is a combination of one or more attributes representing a name that uniquely identifies tuples and, thus, is logical too[2], while performance is determined exclusively at the physical level, by implementation.

So generating SKs for performance reasons (see, for example, Natural versus Surrogate Keys: Performance and Usability, Performance of Surrogate Key vs Composite Keys) is logical-physical confusion (LPC)[3]. Performance can be considered in PK choice only when there is no logical reason for choosing one key over another.

https://www.dbdebunk.com/2018/04/a-new-understanding-of-keys...

7 months agoby b-man

> When a collection of such propositions is stored in a computer system, we call it a database.

Is it? A database is a place data is stored and retrieved. It is literally a data base. No more, no less. Whether it logically models a domain may be completely irrelevant to store and retrieve data.

I would argue that a database should not logically model a domain. Why? Because every database must store and retrieve data. Therefore data should be modeled to store and retrieve data as efficiently as possible. And the structure that most efficiently stores and retrieves data most likely does not logically model a domain.

7 months agoby thuanao

All of these "rules" will change once they hit the reality of utilization. It reminds me of this:

"We need to normalize the database for better performance."

"We need to denormalize the database for better performance."

7 months agoby mannyv

Many of the principles and also the example provided for PED cannot be mapped easily through an ORM library and AFAIK Java JPA doesn't handle it too.

Why does it matter? I have seen that many developers rely totally only on the code to manage entities on the database, instead of relying on prepared statements and pure SQL queries. This obviously opens a door for poor optimisation, since these Entity Management libraries don't support certain SQL capabilities.

7 months agoby madduci

Bad luck if you don’t yet have (or know) your national ID.

National id is not something issued at birth in the country I live in. It’s something applied for at a certain age.

7 months agoby stevoski

Article sounds like it was written from a purely theoretical perspective and not from what happens and the requirements in real life.

7 months agoby RedShift1

" Every base relation should be in its highest normal form (3, 5 or 6th normal form). "

If I remember my database lessons correctly there is no strictly highest normal form. It progresses from 1NF to BCNF, but above that it is more choosing different trade-offs.

Even below it is always a trade-off with performance and that is why we most of the time aim for 3NF, and sometimes BCNF.

7 months agoby weinzierl

Principle of Full Normalization is a useful guide that should not be religiously followed. Mostly full normalization is good, but some objects like names and address resist normalization and thus can introduce significantly more complexity than they save by normalizing.

7 months agoby datadrivenangel

"Databases are representations of reality"

"tell the truth that is out there"

Both truth and representation are very slippery, many-faceted concepts, encumbered with millennia of use and philosophy. Using them in this way is deceptive to the junior and useless to the senior.

7 months agoby cess11

Seems like a terrible idea, TBH - the reality is that values like "national ID" can be sensitive themselves. In the US, such a scheme would make even HTTP ACCESS LOGS into personally-identifiable-information requiring special handling.

7 months agoby al2o3cr

“ Principle of Full Normalization (POFN) : Every base relation should be in its highest normal form (3, 5 or 6th normal form)”

No it shouldnt.

7 months agoby moi2388