SQL language proposal: JOIN FOREIGN

slotrans · on Dec 30, 2021

This is not actually compatible with SQL semantics.

An important constraint on SQL is that a query must run, and produce correct results, relying only on the structure and content of tables. Indexes (can) make queries faster but must not inhibit, or be required for, correctness. The same is true of primary key constraints, foreign key constraints, check constraints, defaults, triggers, partitioning, whether a table is heap/clustered, and literally every other implementation detail of the RDBMS.

This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.

More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.

So no, let's not do this.

JoelJacobson · on Dec 31, 2021

> This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.

Actually, constraint names already do appear in some DQL statements, such as the quite recently added INSERT INTO ... ON CONFLICT in PostgreSQL [1]

    INSERT INTO ... ON CONFLICT [ conflict_target ] conflict_action

where conflict_target can be one of:

    ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
    ON CONSTRAINT constraint_name

[1] https://www.postgresql.org/docs/current/sql-insert.html

bradleyjg · on Dec 31, 2021

PostgresSQL isn’t SQL though. Vendor extensions have done all kinds of inelegant or even outright ill advised things from the beginning.

JoelJacobson · on Dec 31, 2021

The INSERT ... ON CONFLICT UPDATE patch has been documented here: https://wiki.postgresql.org/wiki/UPSERT#UPSERT_as_implemente...

There are some clear benefits possible only thanks to specifying constraint names. Quoted from the link:

* The SQL-standard MERGE doesn't provide for choosing an index, so use of a unique index would need to be conditioned on equality quals against indexed columns in order to provide the behavior being discussed for UPSERT.

* The ON expression will need to be evaluated to see whether it properly compares to a unique index on the target table. Initially this will need to be done to determine whether the MERGE is allowed at all; later it will determine which sort of plan is allowed. It would be easier to match an index name, provided the index name is known and stable.

yunohn · on Dec 31, 2021

> PostgresSQL isn’t SQL though

What is SQL acc. to you? MS-SQL, MySQL, SQLITE, Bigquery “Standard SQL”?

bradleyjg · on Dec 31, 2021

Standards published by ISO/IEC.

yunohn · on Dec 31, 2021

Sure, but where is that pure standard implemented?

bradleyjg · on Dec 31, 2021

Why does it need to be implemented somewhere?

yunohn · on Dec 31, 2021

What do you mean?

If the ISO SQL is not implemented by any SQL DB, and you denounce PostgreSQL/etc version of SQL - what is left over?

bradleyjg · on Dec 31, 2021

I’m not denouncing PostgreSQL. There’s things that are appropriate for a standard and other, sometimes far less elegant, things that are appropriate for a concrete implementation. The linked article propose a change to the standard. If it had only proposed a non-standard vendor extension for some or other database we wouldn’t be having this discussion.

cryptonector · on Dec 31, 2021

That's DML, not DQL. Still, I disagree with GP's arguments (see separate comment).

JoelJacobson · on Dec 31, 2021

> That's DML, not DQL.

Thanks. I incorrectly assumed there were only two sublanguages. But now when you say it, I've heard both DML and DDL before, but never DQL. Curious if there were even more categories, I found this Wikipedia article:

https://en.wikipedia.org/wiki/Data_query_language

Interesting to read, what a SELECT statement is, depends on if having FROM or WHERE "data manipulators". Quote from Wikipedia:

"Although often considered part of DML, the SQL SELECT statement is strictly speaking an example of DQL. When adding FROM or WHERE data manipulators to the SELECT statement the statement is then considered part of the DML."

cryptonector · on Dec 31, 2021

> and produce correct results

This doesn't interfere with that.

> relying only on the structure and content of tables

Constraints are part of the table schema, not index schema.

> his proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.

Not allowed says who? The standard? I doubt it, and anyways, it can be changed.

"That's not allowed" is not a good argument. A better argument is that this is the first time a constraint can change the meaning of a query -- that is a good argument, but it would be better if the constraint could change existing queries, which it does not do. Because this would only affect queries that refer to the constraint, this seems quite allowable to me.

> More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, ...

This is not about optimization, ergo this argument is out.

I'm not sure I want this particular extension, but I don't buy your arguments against it.

goto11 · on Dec 31, 2021

> More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.

Constraints are part of the logical model just like the types of columns are. Indexes on the other hand is part of the physical model and should be transparent to the logical model. Database engines tend to couple foreign-key constraints with indexes, since you usually want an index on a foreign-key. But in principle they are separate.

So I don't see any violation of the relational model in this proposal. I do like the general idea of extending SQL with metadata-aware abstractions, although I'm not a fan of this particular syntax.

dragonwriter · on Dec 31, 2021

> This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement

This doesn't break that constraint in an SQL RDBMS that also implements the relational model, since it is a fundamental element of the relational model that schema metadata is stored as data, and therefore constraint specifications, including names, are included within “content of tables”.

lolive · on Dec 31, 2021

To me, making the data relationships implicit (and making them explicit again in the DAO layer of your app) is a nightmare from a data design point of view. In real life, a DB is often tied to the app that was built on top of it. Making data reuse, data discovery and data understanding VERY hard.

jpz · on Dec 31, 2021

I agree with all the points you make here, we’ll said.

Smotko · on Dec 30, 2021

It might be just me, but I feel like remembering the foreign key name is more difficult than remembering the columns that you need in the ON clause. Especially since you can usually find the column names by just seeing the data in the table (select * from x) wheres seeing the foreign key names is much harder (show create table x?).

Also, if you use an ORM it will usually generate foreign key names that are almost impossible to remember.

rtpg · on Dec 30, 2021

In a universe where foreign key index names are important we would specify better names.

I think stuff like “documents_by_user” as foreign key names and explicit index usage would improve peoples awareness of how indices get used and would generally be a positive

exabrial · on Dec 30, 2021

I think this is an operator problem. You're using the wrong tool for the job.

TablePlus, SequelAce, the official MySQL client all support cntrl-space autocompletion. I wish we used Postgres, but I imagine the landscape is the same. The big box databases like Oracle, DB2 undoubtedly having this tooling as well.

That being said, here is our fk naming convention: `fk-asset_types->asset_categories` which pretty states what's going on and is easy to remember.

dolmen · on Dec 30, 2021

SQL is not only written in an SQL client. SQL is also written (and read from) embedded/mixed in an other programming language were tooling is not always available.

Having to know the names of foreign keys (in addition to the column names of the 2 tables) is adding more cognitive load. I don't think that is an improvement.

Latty · on Dec 30, 2021

At least when it comes to both JetBrains and VSCode, they can handle one language embedded in another. I'm kind of surprised there are environments that don't handle that these days.

JoelJacobson · on Dec 30, 2021

It would indeed be difficult to remember, but the proposal also suggest changing the default naming convention for foreign keys, to give them the same name as the referenced table.

If using an ORM, I would guess this proposal isn't useful, since then you wouldn't hand-write queries anyway, right? Except when you want to override the queries generated by the ORM? (I'm not an ORM user myself.)

to11mtm · on Dec 30, 2021

Speaking as someone who has used ORMs in the past and contributes to a LINQ Micro ORM...

It might make tooling 'easier', but since backwards compatibility has to be considered the actual value add is questionable IMO.

Most ORMs/MicroORMs will have tooling that sniffs out the DB Schema including foreign keys, and if you are using those bits (i.e. 'not hand written') most will do the right thing today. I suppose you could include some extra syntax for whatever DSL you're providing users....

IDK. Speaking as someone who is very comfortable in SQL, This feels more like syntactic sugar than anything else.

lolive · on Dec 31, 2021

Could you please elaborate on this FK sniffing, and which tools are actually implementing that?

to11mtm · on Jan 7, 2022

Ahhh, I know that Entity Framework would do this at one point, IDK if EF Core does or not.

Linq2Db does it via T4 Template generation, so you can play with it more if you want [0]

[0] - https://github.com/linq2db/linq2db/blob/64a0db9a9ed7787ff755...

JoelJacobson · on Dec 30, 2021

This is the same idea as already posted in this thread https://news.ycombinator.com/item?id=29687134, but with a more in-depth explanation that couldn't fit in the comment field, and with some syntax improvements, such as using "FOREIGN", which is a reserved keyword, instead of the previously suggested "->" notation.

Thanks for all the valuable comments on last proposal. Excited to hear what you think about this update.

thom · on Dec 30, 2021

Of all the myriad indignities of SQL, this isn't near the top of my list. I also don't like making the names of objects like foreign keys and indexes first class concerns in your queries, that's a whole new layer of cognitive overhead.

degenerate · on Dec 30, 2021

Agree. I'd much prefer they worked on the way MySQL determines what indexes to use on a multi-table join, so there could be less emphasis on the need to ANALYZE TABLE and FORCE INDEX when the DB is seemingly being dumb, but there's not enough information in EXPLAIN to tell you where it's actually getting tripped up. Troubleshooting full table scans on large sets of data is a nightmare.

dspillett · on Dec 31, 2021

> but there's not enough information in EXPLAIN to tell you where it's actually getting tripped up

Unfortunately I don't think providing that information is generally possible. In some specific cases there are useful details it could provide, but there would usually be a myriad of other similar details that are irrelevant and if it included all those you'd not see the wood for the trees.

EXPLAIN and query plan outputs in other DBs are “this is what I did” not really “why I did what I did”. To make the query planner bright enough to know what details would be useful to you, would probably pretty much require making it bright enough to do the optimisation job without you¹.

[1] picking better indexes without hints, even creating those that are often needed, etc.

MS's SQL Server tries to do this a bit with index suggestions. These are sometimes handy, but often at best for guidance². I've seen people blindly follow these suggestions to get a %-or-few gain from a small set of queries that could see orders of magnitude improvement with just a little tweaking elsewhere³, slowly amassing collections of indexes for very specific cases, sometimes multiple on the same key columns but each INCLUDEing a different mix of other data, that balloon their storage requirements⁴.

[2] a nudge in the direction of “Mr Dev/DBA, you might want to think about how I'd avoid scanning this large object or performing many many thousands of seeks on this other one”

[3] refactoring non-sargable predicates, index changes on other tables being referred to, getting rid of “SELECT *” particularly when referring to hideous views, ...

[4] and having the knock-on effects of slowing insert/update activity & important admin functions (particularly backups)

Tostino · on Dec 30, 2021

Slightly different group of people working on MySQL vs this proposal.

hnthrowaway0315 · on Dec 30, 2021

What I really need is something like:

SELECT -col1, -col14 FROM table LIMIT 50;

Where the minus sign means I don't want these two columns. I still don't see a way to do it easily (for Vertica and in Datagrip).

nvartolomei · on Dec 30, 2021

Some have support for this.

- BQ https://cloud.google.com/bigquery/docs/reference/standard-sq...

- CH https://clickhouse.com/docs/en/sql-reference/statements/sele...

hnthrowaway0315 · on Dec 30, 2021

Thanks, these really look neat. Unfortunately we only use Vertica :/

piaste · on Dec 30, 2021

Similarly, I'd love some form of

    GROUP BY every column except for <these>

It feels silly when you are SELECTing a ton of columns, then you add a JOIN to a many-to-one relationship which you want to aggregate. Now you need to either make it a subquery (and hope the optimizer doesn't screw up) or duplicate all your SELECT expression (not even the identifiers) into the GROUP BY.

baskethead · on Dec 30, 2021

Queries like this won’t break but will silently fail upstream by missing data if column names change or are deleted. The explicit nature of SQL ensures that it will break positively which is a better type of failure in my opinion.

Groxx · on Dec 30, 2021

whether that's desirable or not depends on the query and what kinds of changes you make. though I could see it being error-prone in the most common cases.

icambron · on Dec 30, 2021

I wish it just had a "group by all the stuff I selected without aggregation" shortcut

coderzach · on Dec 30, 2021

yeah, this is the right answer. Are there any examples where this isn't what you want? I guess GROUPING SETS and ROLLUP, but those could be special.

xyzzy_plugh · on Dec 30, 2021

Isn't this just SELECT DISTINCT ...?

orthoxerox · on Dec 30, 2021

No. What icambron and I would like is something that can get rid of this group by:

    select division_name, branch_name, dealer_id, dealer_name, quarter, month, sum(total_paid)
    from divsions, branches, dealers, transactions
    where ..... --buncha joins
    group by division_name, branch_name, dealer_id, dealer_name, quarter, month

If I didn't want to group the result by division_name, branch_name, dealer_id, dealer_name, quarter and month, why would I put them in the select clause?

Sesse__ · on Dec 30, 2021

What if you do SELECT a + b, SUM(x) FROM t1? Should the implicit GROUP BY be on (a,b) or (a+b)?

icambron · on Dec 31, 2021

On a+b. I'm suggesting it simply copies all the non-aggregate select pieces to the group by clause. If I want to group on a and b, I could either select them too or add them to the group by (I'm imagining something like "GROUP BY SELECT, a, b"). Note that selecting them is the natural thing you'd do if you wanted that grouping, because otherwise you wouldn't know what the grouping is when you got the results.

tpetry · on Dec 30, 2021

In PostgreSQL you can GROUP BY on the primary key so you can omit specifying all other columns

icambron · on Dec 31, 2021

I remember when I discovered this-- very helpful feature and I am happy it's there (and sad most DBs don't have it). But if I'm grouping I'm almost necessarily not grouping on at least one of the tables' PKs, so it only goes so far

benjiweber · on Dec 30, 2021

"group by 1,2,3,4,5,6" is at least a little more consise.

Svip · on Dec 30, 2021

You know, you can just write GROUP BY x, y, ...

    SELECT t.i+1, count(*) FROM table t GROUP BY 1

1 in this context means the first selected item (i.e. t.i+1). I know this works in PostgreSQL.

edoceo · on Dec 30, 2021

I use this. It gets messy after like 10 columns. I have a trick in my ORM that copies the non-aggrigate columns into the statement. Maybe your ORM has it?

dragonwriter · on Dec 30, 2021

What I really want is a special phrase MINIMAL GROUPING such that GROUP BY MINIMAL GROUPING includes exactly the items that it would be an error not to include in the GROUP BY.

oweiler · on Dec 30, 2021

Can't you use an alias in a group by?

icambron · on Dec 31, 2021

Some DBs let you do this and it's helpful when one of the things you're selecting is some complex scalar expression, but you still have to list each item in the group by

Sesse__ · on Dec 30, 2021

Not by the SQL standard, since projection (which creates aliases) happens after aggregation. There are many databases that allow it as an extension, though.

blondin · on Dec 30, 2021

i would rather it written as SELECT * BUT or SELECT * EXCEPT or even SELECT ALL BUT.

SQL has always been that language that is easy to read. even when you don't understand what the queries are doing. adding a cryptic syntax like "-column" would make it less readable.

polygotdomain · on Dec 30, 2021

What about SELECT * WITHOUT {columns_you_do_not_want} FROM...

EXCEPT is already a keyword and has is used for set-based operations, so I don't think it's good to overload it.

edgyquant · on Dec 30, 2021

I like this fwiw

recursive · on Dec 30, 2021

The given syntax is obviously a no-go. It's already unary minus.

colinmhayes · on Dec 30, 2021

EXCEPT is already a sql keyword

bobbyi · on Dec 30, 2021

If you have specific columns that you frequently want to ignore (which, I think, is the common case of this), you could define a view that selects all the columns except those and do your queries against that view.

Eremotherium · on Dec 30, 2021

Or create a function that returns a view of a table minus named columns in situ if that's really a common case for you. Just as an idea.

fendy3002 · on Dec 30, 2021

Iirc in some older db it won't be able to use indexes this way. Don't know if still applied nowadays.

dolmen · on Dec 30, 2021

> SELECT -col1, -col14 FROM table LIMIT 50;

This is already valid SQL. Example: SELECT -col1, -col4 FROM (SELECT 1 AS col1, 2 AS col4) AS tbl;

Do you think seriously that a new meaning could ever be attached to that syntax?

hnthrowaway0315 · on Dec 31, 2021

I just say something similar, doesn't have to be the exact form.

dspillett · on Dec 31, 2021

There are often occupancies where I'd like this in ad-hoc queries, but it would be something to avoid and discourage in production code for the same reasons “SELECT *” is discouraged (because it is “SELECT *” with a bit of sugar added)

stingraycharles · on Dec 30, 2021

Simple, effective, and really useful in a lot of situations. I like it!

jdunstan · on Dec 30, 2021

That's very close to SELECT *, which has it's own dangers. I agree that it would be nice for exploration and testing, but probably should not be used in production.

hnthrowaway0315 · on Dec 30, 2021

The problem I found out is that SELECT * LIMIT 10 is guaranteed to be needed so it's actually much more important than whatever is in production (you have to explore a lot before writing the production scripts). My idea is that exploration should be as easy as possible.

Things such as SELECT * EXCEPT col1, col2 are really a PIA to write and can build up frustration level really quickly. Certain IDEs such as Datagrip ease the process by providing "macros" but they are not enough.

Another thing is to generate useful boilerplates such as SELECT col1 FROM table GROUP BY col1 ORDER BY col1 to explore all unique values of col1.

ipaddr · on Dec 30, 2021

What dangers lay in select *...? Too much data?

wiredfool · on Dec 30, 2021

Minimal if you refer to columns by name. Risky if you rely on column ordering. Can be too much data and a performance issue if you have large columns that you’re not using. (Though, it’s not much different if you’re using an orm that loads the whole object anyway), like Django or sqlalchemy.

I’ve done it for many years (using named columns/ dictionaries as the result set) and its never been an issue.

c06n · on Dec 30, 2021

R's data.table has that.

dt[1:50, -c('col1', 'col14')]

Upitor · on Dec 30, 2021

In my opinion this proposal seems only to consider simple cases, but there are many not-so-simple relationsship types:

Consider a ‘sales’ table which includes columns [time] and [sold_by_employee_id], and a periodized ‘employee’ table which includes columns [employee_id], [valid_from] and [valid_to] columns. There is a perfectly valid relationsship between the two tables, but you cant join them using only equal-statements (you need a between-statement as well)

JoelJacobson · on Dec 30, 2021

Nice example! The join you describe would remain as a JOIN ON.

This is per design. Quote from the proposal:

"The idea is to improve the SQL language, specifically the join syntax, for the special but common case when joining on foreign key columns." ... "If the common simple joins (when joining on foreign key columns) would be written in a different syntax, the remaining joins would visually stand out and we could focus on making sure we understand them when reading a large SQL query."

So, the special non-equal based join condition you describe, would become more visible, and stand out, allowing readers to pay more attention to it.

The hypothesis is most joins are made on foreign key columns, so if we can improve such cases, a lot can be won.

DevKoala · on Dec 30, 2021

But how could you accurately tell if some queries join on the foreign key, but were written by someone without knowledge of the new specification?

JoelJacobson · on Dec 31, 2021

Not sure I understand, can please elaborate? Perhaps with an example?

Upitor · on Dec 30, 2021

I see. I should have read the article more carefully :-)

Upitor · on Dec 30, 2021

Also, consider a ‘sales’ table with multiple references to a ‘calendar’ table: [shipped_date], [order_date], [received_date]

JoelJacobson · on Dec 30, 2021

Good example too, but this one can with benefit be written using the JOIN FOREIGN syntax, you just need to give the foreign keys suitable names such as e.g. "shipped_date", "order_date", "received_date". Or, to remind you of which is the referenced table, perhaps you want to include it in the names, and the names would be "shipped_date_calendar", "order_date_calendar", "received_date_calendar", but personally I would prefer just "shipped_date" or perhaps even just "shipped".

CRConrad · on Jan 1, 2022

Are those timestamps, or actual dates, i.e. meaning a whole day encompassing (usually) 24 hours?

In many (most?) data warehousing projects I've seen, you make a "fake date" integer the primary key column of your Dates dimension. This integer consists of 10000 × YEAR_PART + 100 × MONTH_PART + 1 × DAY_PART of the date in question, so yesterday's New Year's Eve woul get 10000 × 2021 + 100 × 12 + 1 × 31 = 20211231. The date dimension itself has many more columns (often booleans, IS_WEEKEND, IS_HOLIDAY, etc; also the date parts themselves in both numeric and character form (12, 'December'), day of week (5 [or 6, depending on convention], 'Friday'), etc) that are used for BI and reporting.

But since this generated ISO-8601-date-as-integer column is the primary key of the Dates table, it is also the value of the Date foreign key in all tables that reference Dates. That makes it incredibly handy in queries -- both during development and for ad-hoc reports -- of those tables without joining to the Dates dimension at all: grouping, sorting, limiting to a more manageable date interval in the WHERE clause... And it tells the reader exactly what the actual date in question is. (Well, at least readers who are used to ISO-8601-format dates.) As jerryp would have said, recommended.

tester756 · on Dec 30, 2021

While we are at it

let's write SQL queries starting from FROM.

`FROM users SELECT *`

It'd allow tooling to provide IntelliSense better.

miohtama · on Dec 30, 2021

While this might be a joke, the world has been ripe for a replacement for SQL… for the last 20 years.

hdjrudni · on Dec 30, 2021

I started writing a new SQL parser that just transforms the query...this was one of the features. The other being a syntax for group-wise max queries which are just crazy dumb to write efficiently in MySQL.

robocat · on Dec 30, 2021

I could see some standardised SQL language extension, similar to how TypeScript extended JavaScript, that has a transformation phase into the native SQL dialect within the database access library, or via some middleware.

Too · on Dec 31, 2021

LINQ?

vosper · on Dec 30, 2021

I agree, though I'll give SQL a pass for this because it's old.

But how the Javascript world ever thought that `import { function } from 'library'` was better than `from 'library' import { function }` I'll never know. Python got this right long before anyone was even thinking about adding imports to JS!

goscript · on Dec 30, 2021

I agree but not with current JavaScript modules. i would rather work with this:

    import { functionA } from 'library';
    import { functionB } from '../utils/core/abc';
    import { functionC } from './a';

over:

    from 'library' import { functionA };
    from '../utils/core/abc' import { functionB };
    from './a' import { functionC };

Too · on Dec 31, 2021

Care to rationalize? Your first example only looks pretty because the function names happens to be aligned while the library names are not. Import a few more functions with different length name and the story will be different. In reality library names tend to align better since they usually start with same prefix spanning across several sub-components.

hdjrudni · on Dec 30, 2021

Ya..that one is particularly sad because it was recent. Don't know why they felt the need to botch that.

zoomablemind · on Dec 30, 2021

I wonder if primary objective of this proposal is to increase convenience and reduce the amount of typing that users need to do, when using SQL inline?

If indeed that's the goal, then it targets a rather specific subset of users dealing with explorative/ad-hoc analysis on a database. Once such analysis is done, the queries would usually need to be formalized for robustness and to avoid ambiguities.

Obviously, the whole train of queries would derail, should the FK (which is just an index) be dropped for one reason or the other.

The existing JOIN features are explicit at least on the level of specified table structure. I believe, any constraint details in such context will be, well, ...foreign.

Perhaps, a simple solution to verbosity problem may be to use an "intelligent" SQL client, which supports some form of autocomplete and which may as well internally use as many schema/data details as available.

In anycase, thanks for making the proposal. I was not aware of JOIN ... USING syntax. I often wanted some convenient way of specifying homonymous join columns, as some schemes are consistent in such namings. So typing JOIN on col1, col2... would translate into equality joins between the listed tables. However, again, there is ambiguity here...

branko_d · on Jan 1, 2022

> FK (which is just an index)

Correction: foreign key is not an index. Many (most?) DBMSes allow a FK to exist without an index.

lolive · on Dec 30, 2021

Coming from the graph database / semantic web area, I would propose foreign key/ primary key relationships in the DB to be detailed and named in a schema description, and then queries reference those relationships by name to define the needed joins.

criticaltinker · on Dec 30, 2021

Yup totally agree - OP is an interesting proposal and has provoked a lot of quality commentary regarding the tradeoffs of clarity vs conciseness, implicit vs explicit, FK performance, etc.

But IMO you’ve raised the important long term consideration - do graph based schemas and query languages obviate the need to model foreign keys explicitly? If this JOIN FOREIGN proposal is an incremental step forward, what’s the next big leap?

lolive · on Dec 31, 2021

The JPA annotations of the DAO entities should not be the place to look for the data model of a DB. #imho

jeff-davis · on Dec 30, 2021

I'm not sure this is a good idea. It means that a constraint on the data (a FK is a constraint) affects the semantics of the query. In SQL, those two concepts are seperate.

tapas73 · on Dec 30, 2021

I will write a letter to santa asking to get this into sql. During 10 years in data warehouse I constantly wondered why SQL does not have something like that.

Reducing verboseness is nice, but the main perk is the correctness.

Oh.. if I got a cent every time I found a bug in colleagues sql, because of join accidentally multiplying/doubling rows... :-)

slt2021 · on Dec 30, 2021

not a good idea, constaints are one thing and joins are another thing. plus I can join on conditions other than =.

the problem that author is trying to solve can be easily solved by a view:

1. Declare a view with all necessary JOINs once

2. select from view only what you need, aggregate what you want

3. Optimizer will throw out unnecessary stuff and optimize query while all JOIN logic will be declared only once and will be hidden inside the view

plus each DBMS has its own flavor of SQL and will have its own query optimizer nuances when dealing with joins, especially nested via CTEs/views/lateral queries,etc.

Sesse__ · on Dec 30, 2021

Your #3 is very optimistic. You can't just remove a join (be it inner, left, or really any other type) just because you didn't refer to any fields from one of the sides, and it's even harder when aggregation is in the mix. There are cases where you can without influencing the result, but they are special-case optimizations and not universally supported across databases.

antender · on Dec 30, 2021

I, personally, feel like this is a rather pointless addition to an already somewhat bloated language. It it isn't saving that much typing (we have NATURAL JOIN's already, nobody is using them), is kind of inconsistent with other language principles (we should be able to join anything to anything if we need to, see SQL-89) and limits JOIN flexibility (you don't have to join with "=" operator, JOIN's supports different types of conditions and are equivalent to WHERE clause)

nrabulinski · on Dec 31, 2021

Nobody is using NATURAL because usually it’s broken and doesn’t do what you want it to and for any more complex joins or joins on other than = you would still use JOIN ON, what’s the issue? JOIN was introduced when we had multiple tables in the FROM clause and conditions in WHERE for clarity and convenience, why are people so opposed to this change then?

antender · on Dec 31, 2021

The issue, as i see it, is that adding this feature complicates the language parser even more for not that much actual gains.

1) Implementation has to do some nontrivual rewriting into actual JOIN's which uses indexes, previous special cases like NATURAL were purely syntax sugar 2) All of the tooling that depends on parsing SQL need to add sensible support or else noone will even recommend to use this 3) It is useless for anyone using ORMs in the first place, they will continue to generate normal JOIN's (less actual impact) 4) For anyone using prepared statements it automatically goes to "is not recommended to use" list because it makes your JOIN's depend on existence of foreign keys and corresponding indexes, which are an optional feature DB still should work without. I've had tasks in my career when my team added or removed foreign keys, so this type of JOIN's would make migrations even harder to do. 5) So considering all of the above this is feature designed purely for REPL and for this purpose it is also kind of useless. I can imagine remembering and typing column names in normal JOIN's, but foreign keys usually have some long unintelligeble autogenerated name.

antender · on Dec 31, 2021

Basically I think this syntax addition is nonorthogonal, isn't making SQL any more powerful, but isn't solving the main usability problem, which is composability of queries in any way. Like writing JOIN's is not complicated if somewhat wordy already.

pella · on Dec 30, 2021

related: pgsql-hackers thread and context

https://www.postgresql.org/message-id/flat/1aec0dd0-dc27-40e...

btilly · on Dec 30, 2021

Please, no. A common problem in data warehousing is that a design with lots of foreign keys becomes slow to load. A standard solution is to move the checks for referential integrity elsewhere, then drop the foreign key constraint. This massively improves load performance.

This syntax change means that this solution can't be used because you have no idea what random queries out there might rely on the specific existence of a foreign key constraint for the definition of the query. Thereby meaning that if a foreign key constraint becomes a performance problem, we're stuck with it rather than having a solution.

Features have consequences. And I don't like the consequences of making business rules that are now explicit in the query, be instead implicit in the table design.

JoelJacobson · on Dec 30, 2021

Good point, but addressable:

Simply decouple the relationship definition and referential integrity check, allowing a user to drop the referential integrity check if desired, but keeping the relationship definition.

I cannot see why you would not want to at least always store the information a certain table/column(s) references some other table/column(s) in the data model. Enforcing referential integrity is probably good in general too, but I agree you might need to disable it for some FKs, in some databases, like PostgreSQL before they got FOR KEY SHARE locks.

hn_throwaway_99 · on Dec 30, 2021

At this point you should realize your proposal is a non-starter, and I didn't even realize btilly's objection originally.

I can't think of any other feature in SQL where the rules of the query are actually dependent on something not explicit to the query itself. Even USING and NATURAL are just syntactic sugar that depends on the structure of the table, not on any underlying constraints.

So, what you have proposed, "allowing a user to drop the referential integrity check if desired, but keeping the relationship definition" would be a massive change to tons of SQL tools out there as it's a huge new feature, for some minor syntactic sugar. Ain't gonna happen.

JoelJacobson · on Dec 30, 2021

> Even USING and NATURAL are just syntactic sugar that depends on the structure of the table, not on any underlying constraints.

Similar to how JOIN FOREIGN would depend on the structure of the data model, defined by tables, foreign keys, etc.

> So, what you have proposed, "allowing a user to drop the referential integrity check if desired, but keeping the relationship definition" would be a massive change to tons of SQL tools out there as it's a huge new feature, for some minor syntactic sugar. Ain't gonna happen.

Why would it be a problem from the tools perspective if the foreign key wasn't actually enforced if the DBA insists on temporarily disabling the enforcement of the FK? If the tool would e.g. be used to insert a row, and the DB would accept it, even though it would violate the FK, what do you suggest would be the problem from the tools perspective?

This is also not a new idea. It's already implemented in MSSQL, see WITH NOCHECK.

hn_throwaway_99 · on Dec 30, 2021

> Similar to how JOIN FOREIGN would depend on the structure of the data model, defined by tables, foreign keys, etc.

The point that everyone is making is there are not currently any SQL statements that depend on structural information as defined in the foreign key relationships when calculating the structure of the data. Furthermore, there are already tons of tooling and processes that depend on this fact, that your proposal would break, for a teeny bit of less typing.

Beating a dead horse at this point.

tqi · on Dec 30, 2021

The person you are responding to was more direct/harsh than was necessary, but I think it would be good to step back for a moment and reflect on the feedback from this community. You said "If someone can convince me this is a bad idea, that would help me forget about all of this, so I would greatly appreciate your thoughts, no matter how negative or positive." I think there are enough valid objections here to at least consider the idea that this is not a clear improvement?

JoelJacobson · on Dec 30, 2021

When I wrote that comment, the idea had even more flaws than currently, at that time I suggested using "WITH" and a new "->" operator. Thanks to new ideas coming from other users in the PostgreSQL and Hacker News community, those problems have been solved, and we now have less remaining problems with the proposal. I'm really grateful for all the help.

Like I said in another reply, I will put together a "Drawbacks / Remaining issues" section and update the Gist, based on all replies. Perhaps the end result will be Status Quo, but at least then we have documented the reasons why this idea is a dead end. However, thanks to all the improvements just during the last couple of days, I feel really optimistic and motivated, so I think there is a great chance we can solve the remaining issues together if we try.

To comment on the response from the direct/hash person:

The point made by the user, "not currently any SQL statements that depend on structural information as defined in the foreign key relationships", is true, but I don't see why that's an argument by itself against the idea?

I find the other argument, claiming there would be a problem with tooling and processes, much more interesting and I'm eager to fully understand it. I asked a question in hope to do so, "what do you suggest would be the problem from the tools perspective", but has so far not received any reply.

mtone · on Dec 30, 2021

I also agree that relying on database schema is a show stopper.

First counterpoint: As long as data type information isn't completely messed up, I can dump excel spreadsheets in a database, or dump database CSVs in a data lake, and start querying them right away using SQL with complex joins using auto-completion from the dataset alone.

Second counterpoint (harder to communicate): I work with a SaaS database (MS Dynamics/Dataverse) which doesn't provide direct SQL access and is not supported by common ORMs. Some of the data APIs require relationship schema information which invariably put a needle in attempts to generalize functionality, or just consume.

In this context, creating a simple in-memory test database or serializing records between modules, cannot possibly function without also knowing schema information if queries are going to make use of it. So now you need to carry schema information for everything, and load it at run-time from generated code or a live database, in back-end and front-end, just to interact with data where simple SQL would have worked just fine.

Conclusion: I dislike the proposal -- even if not breaking backward compatibility, it brings database-configuration details to SQL. SQL is imperfect and is already hurt by database-implementation details (like date functions) but it remains a beautiful expression of relational algebra and set theory - a query given the same data should return the same output, regardless of context. SQL is lingua franca for a reason.

The proposal feels like a fairly specific developer-centric extension and isn't where SQL should be headed, in my opinion.

JoelJacobson · on Dec 31, 2021

> First counterpoint: As long as data type information isn't completely messed up, I can dump excel spreadsheets in a database, or dump database CSVs in a data lake, and start querying them right away using SQL with complex joins using auto-completion from the dataset alone.

In your example, all you have is data and no foreign keys, that's the show stopper? That means you have all the relationships in your head and that's how you can write complex joins right away? Sure, if that's the case, then you can't use foreign keys since you don't have any. Don't see how this would be a counterpoint though. There is nothing forcing you to use JOIN FOREIGN, you could just do what you describe. But I'm sure you are aware many databases have foreign keys for all relationships to enforce referential integrity. I should have mentioned in the proposal, the scope is limited to such databases.

I enjoyed your example though. I want to share a similar example. It happened to me at least a few times, I've had to deal with data, shipped as multiple CSV files, but without any schema at all. What I tend to do then is to quickly write a very loose data model with mostly text columns, to accept any values. Once the CSVs are in SQL, I can then clean up the data step by step, by inspecting the tables and converting the text columns to proper data types. Next, when suspecting some column(s) in some table seem to be referencing some other column(s) in some other table, based on the content of the columns in both tables, I then try to add a FOREIGN KEY with a suitable name between such column(s). If successful, we know there is referential integrity between the columns, and we know also have a name to describe such relationship. Win-win! Otherwise if the foreign key could not be created, I investigate what rows that only appear in the referencing table that are not present in the referenced table, using a NOT EXISTS (...) query. If the extra rows can safely be deleted, such as if e.g. forgetting to handle empty string values as NULL values, I can then try to create the foreign key again.

wruza · on Dec 31, 2021

Ain’t gonna happen for whom? Features were added for decades before and will be added after us. I don’t get it, someone always says it’s not gonna happen because of size of something but then you open the tech news site after a week and the amount of happenings is overwhelming.

Too · on Dec 31, 2021

> I can't think of any other feature in SQL where the rules of the query are actually dependent on something not explicit to the query itself.

INSERT depends on constraints and AUTO_INCREMENT.

cogman10 · on Dec 30, 2021

Absolutely agree.

Further, whenever someone is adjusting table performance index tweaks is almost always the first thing to tackle.

Adding foreign keys into the query is just as bad as adding indexes into the query (which, you can do in T-SQL, but generally shouldn't). Indexes can be dropped, changed, or added and you SHOULD be relying on the SQL optimizer to use the most appropriate index.

This feature appears to only save a bit of typing in the best of scenarios. In the worst, an update/drop of a foreign key will end up breaking a bunch of queries, which is insane.

Tostino · on Dec 30, 2021

While I wish there were a way to easily refactor queries when making these changes, this is not different than it works for renaming/dropping a column..."In the worst, an re-name/drop of a column will end up breaking a bunch of queries, which is insane."

cogman10 · on Dec 30, 2021

I'd argue it's about expectations.

If I change the shape of my data, of course I'd expect a bunch of stuff that potentially needs to be updated. Adding and removing fields is dangerous.

On the flip side, changing an index or a foreign key will almost never change any query, because the data shape is exactly the same. The most it will effect is queries that change data. Even then, when you make that change you have to address, up front, what to do with existing data that might violate a new constraint.

tapas73 · on Dec 30, 2021

Lets have disabled FK then.

postgre, seems to not have it, but the proposal could include also "disabled FK" part.

teradata, oracle, sql server already have option for FK "disable/no check".

https://docs.teradata.com/r/eWpPpcMoLGQcZEoyt5AjEg/df1PvVh6e... https://docs.microsoft.com/en-us/sql/relational-databases/ta... https://docs.oracle.com/cd/B28359_01/server.111/b28310/gener...

zmmmmm · on Dec 30, 2021

Data warehouses have all sorts of different design approaches based on their different requirements, wouldn't this just be another one of them? I actually think that having a way to alter the constraint to non-enforcing sounds like it'd be better for your situation anyway because you can then have your load performance but utilise the constraints with background / delayed process that checks the integrity because the information is still there in a standard form.

Which sort of leads to .... I don't agree with your characterisation of foreign key constraints as business rules. They are genuine information about the structure of the data.

seadan83 · on Dec 30, 2021

This, the result of a query should not change depending on a constraint.

zmmmmm · on Dec 30, 2021

By that do you mean, should not vary in the data returned or should not break?

Personally I agree that changing a constraint shouldn't alter the data returned. But I'm happy enough if it breaks in a clear and verifiable manner. There are plenty of other situations where adding a constraint will cause existing SQL (if not queries) to break so its not really that much of a change.

zzzeek · on Dec 30, 2021

this was my thought too. but now you need an ALTER CONSTRAINT NON ENFORCING or something like that so that the "constraints" can be present declaratively but don't actually get used.

which then looks a whole lot like you're just introducing macros into SQL where you have some symbolic keywords that expand out into pre-fabricated ON clauses.

cryptonector · on Dec 31, 2021

Enforcement of relational integrity on bulk load should be optional -- you trust the data, right?

But also, if you want relational integrity, then you have to check it somewhere.

Lastly, this isn't about DML or bulk load but about query expressivity. You could easily have FOREIGN KEY constraints where the RDBMS is also instructed not to enforce relational integrity, either on bulk load or at all, and documenting the FK in the schema is still massively useful.

JoelJacobson · on Jan 3, 2022

Dear @Ayesh @CRConrad @DevKoala @Sesse__ @Too @Tostino @Upitor @antender @benjiweber @bradleyjg @branko_d @btilly @clintonb @cogman10 @cpill @criticaltinker @cryptonector @degenerate @dragonwriter @dspillett @dstola @gavinray @gigatexal @goto11 @jeff-davis @jimbobmcgee @jmull @jpz @lolive @magicalhippo @matteote @mjevans @mpolichette @mritchie712 @mrweasel @mtone @nicoburns @nrabulinski @nvartolomei @pella @rcoveson @ridaj @salawat @seadan83 @silon42 @slotrans @slt2021 @srcreigh @tapas73 @thom @torgard @tqi @unbanned @wruza @yunohn @zeroimpl @zmmmmm @zoomablemind,

I've updated the Gist with some alternative syntax proposals [1] that affects/addresses your comments, and therefore wanted to notify you about such changes. I would greatly appreciate if you could provide additional feedback, either by replying here, or leave comments on the Gist via Github. Thank you all.

[1]: https://gist.github.com/joelonsql/15b50b65ec343dce94db6249cf...

hn_throwaway_99 · on Dec 30, 2021

This is my "Thanks, I hate it" response.

Reason being if you use the example they gave:

SELECT f.title, f.did, d.name, f.date_prod, f.kind FROM films f JOIN FOREIGN f.films_did_fkey d

You need to implicitly know the table that films_did_fkey points to, because 'd' is just a table alias. I can't think of anywhere else in the SQL standard where you can introduce a table alias without explicitly referencing the table. In my opinion making code essentially unreadable unless you have other background information is an antipattern.

jmull · on Dec 30, 2021

Agreed. Maybe something like this instead:

    SELECT f.title, f.did, d.name, f.date_prod, f.kind
      FROM films f
      JOIN distributors d ON FOREIGN f.films_did_fkey

and

    SELECT f.title, f.did, d.name, f.date_prod, f.kind
      FROM distributors d
      JOIN films f ON FOREIGN films_did_fkey

magicalhippo · on Dec 30, 2021

I had a similar suggestion the other day[1], except I chose what to me sounded more English-sounding:

    JOIN films f USING FOREIGN KEY

Here the explicit foreign key, films_did_fkey in this case, could be specified after FOREIGN KEY. This would be similar in syntax to when you force an index for a select statement, at least in the DB we use.

[1]: https://news.ycombinator.com/item?id=29687565

zeroimpl · on Dec 30, 2021

Yeah I think something like this is the way to go. The table name needs to be listed in a consistent manner like all other joins, and I don't think the syntax should be different depending on which order the tables are listed in.

With the above proposal, it seems the foreign key name could be left out in the common case of there being only one fkey between the two tables too.

magicalhippo · on Dec 30, 2021

> With the above proposal, it seems the foreign key name could be left out in the common case of there being only one fkey between the two tables too.

Yes, this was indeed my intention. The common case with only a single matching foreign key constraint would not require being explicit, but one could be if needed in a natural way.

    JOIN films f USING FOREIGN KEY films_did_fkey

silon42 · on Dec 30, 2021

+1 ... needed at least as an option where FK is not declared

ridaj · on Dec 30, 2021

Another thing, this makes things harder to reverse.

Some engines have join-order-dependent performance, so there are instances where you would want to write

``` FROM orders LEFT JOIN customers USING (customer_id) ```

and others where you'd want to write

``` FROM customers RIGHT JOIN orders USING (customer_id) ```

Swapping join order with current syntax is relatively easy since references in the same FROM clause are interchangeable. But in this proposal, the reference to the joined table isn't written out, so it would be pretty complicated to reverse join order.

mritchie712 · on Dec 30, 2021

yeah, this saves like 10 characters of text in exchange for massive confusion.

JoelJacobson · on Dec 30, 2021

In this simple example yes, but there can be multiple columns in a join.

The syntax proposal i O(1) syntax wise compared to O(2n) for a JOIN ON where n is the number of columns.

hn_throwaway_99 · on Dec 30, 2021

But that "simple example" is like 95% plus of all my joins, where I'm joining two tables on a foreign key.

JoelJacobson · on Dec 30, 2021

Your data model must be a lot simpler than mine, congrats ;)

JoelJacobson · on Dec 31, 2021

Insightful comment, thanks.

It has been added to the "Drawbacks / Tradeoffs / Remaining issues" section of the proposal.

https://gist.github.com/joelonsql/15b50b65ec343dce94db6249cf...

JoelJacobson · on Dec 30, 2021

I think your concern is addressed by the idea further down in proposal; giving the foreign keys the same names as the referenced tables. This example was provided:

SELECT f.title, f.did, d.name, f.date_prod, f.kind FROM films f JOIN FOREIGN f.distributors d

hn_throwaway_99 · on Dec 30, 2021

No, my concern is not that you can also specify the table name that explicitly states the name of both tables, it's that you have proposed a syntax where it is possible to leave off the explicit table name.

If you got rid of the syntax that uses the underlying "magic" of needing to know which table the foreign key points to, I'd be more amenable.

mrweasel · on Dec 30, 2021

Yeah, my concern is people in the real world. If you can "magically" join on a foreign key and leave the table name out, or still call the thing anything you want, some smart ass will abuse the live crap out of it and I'll somehow be left to deal with it.

I like things to be explicit. Tell me what you're joining and how you want to join it. What is the use case for this? Other than saving some typing, and let's face it, sometimes a little extra typing now, will save you a lot of trouble later. The proposal claims: "The idea is to improve the SQL language", but is it really better?

tapas73 · on Dec 30, 2021

First, very valid concern regarding table name.

> Other than saving some typing,

I would like to point out the corectness aspect of the proposal.

Today foreign keys are enforced during inserts, updates, deletes. (you just can't violate fk. this is so good) But you can violate it in selects. (e.g. mistype column name, forget to include a column)

This proposal (or its adjustment) would allow to use fk also during select. It's like static typing for join conditions.

hn_throwaway_99 · on Dec 31, 2021

In my entire software development career, I've never had that problem, nor seen it.

Sure, I've mistyped a column name or left one out. But the failure is then nearly always so catastrophic (the query failed to compile), or the data response then so different from what I'd expect (pages of results instead on 1) that it's not something I've seen that needs fixing.

JoelJacobson · on Dec 30, 2021

There are two asymmetrical cases; one of them needs two pieces of information, the other three.

Having to always specify both tables would syntax wise be redundant, but I agree it’s worth considering it might be a good thing to improve readability, which would help especially in the case where the foreign key isn’t/cannot be given the same name as the referenced table.

tqi · on Dec 30, 2021

1. Just because you can doesn't mean people will. 2. That doesn't work if you need to have multiple keys pointing to the same table (ie owner_id and secondary_owner_id pointing to a users table)

mjevans · on Dec 30, 2021

In this example D is the target table, FK's natural relationship could be aliased (specified multiple times under different names).

srcreigh · on Dec 31, 2021

We should reject this simply because it does not provide any clear benefits, but meanwhile asks for a lot. People to learn the FK names in their DB, people to learn new SQL syntax (and tooling to support it).

It's actually difficult to list all the problems with this proposal. It's just unnecessary and the effort required to predict all the problems it will cause isn't worth doing.

unbanned · on Dec 30, 2021

Foreign keys exist for data integrity... something which is rather opaque to query onterface... So why do you believe that has place in a query?

JoelJacobson · on Dec 30, 2021

Foreign keys both define relationships between tables, and also enforce referential integrity. The discussion is about how we could potentially mine various additional value from foreign keys as an information resource, rather than just mostly being about referential integrity.

cryptonector · on Dec 31, 2021

Why do you think it's opaque? They declare relations between tables. That's structure, intent. Making intent clearer is generally a good thing.

tqi · on Dec 30, 2021

Personally, I think SQLs verbose syntax is a good thing, and increases clarity/interpretability. It reduces cognitive load to have things explicitly listed out. Not having the table and column names in the query itself makes it much harder to read and understand queries without prior knowledge of the data model.

mpolichette · on Dec 30, 2021

I'm with you on this one. I prefer it to be in your face simple to understand over the terse-ness.

In the gist example, I actually prefer the SQL-92 approach where we are joining given an explicit comparison condition. Every other implementation seems to be trying to hide details, for what gain? Less typing?

In order to use FOREIGN, you will need to know not just what columns a table has, but also their configuration. Which would also require that you have properly configured your tables. While this shouldn't be a hard ask, it does add additional dependency and makes use of this "tool" slightly less "portable" between systems.

I have unfortunately seen cases where people will only have foreign keys un-enforced by their table config. As a dev, if you're introduced to a new DB, you wont know immediately if you can use this, and if things are configured wrong, you need to make a pretty significant change to be able to use it.

I don't see a lot of harm from adding this syntax however as people are free to not use it and it relies on an existing strict convention.

JoelJacobson · on Dec 30, 2021

> I don't see a lot of harm from adding this syntax however as people are free to not use it and it relies on an existing strict convention.

This is a good argument I will add to the list.

Also interesting to read about un-enforced foreign keys. I haven't used MSSQL myself, the DB in which I heard it's possible, I've only been using PostgreSQL for the last 20 years, and before that MySQL.

I think the problems you describe is an argument against a WITH NOCHECK feature, since it could be misused. Maybe it's necessary in some databases still, but at least in PostgreSQL, the FOR KEY SHARE lock solved all the issues with concurrent updates we had at Trustly. The FOR KEY SHARE was a huge patch [1] written mainly by Alvaro Herrera. Thanks to it, Trustly has never since had any performance problems with foreign keys, and they have AFAIK not needed to drop any foreign keys up until today due to locking/performance problems.

[1] https://www.commandprompt.com/blog/fixing_foreign_key_deadlo...

rcoveson · on Dec 30, 2021

FWIW we do have `NATURAL JOIN` already, which is a lot worse than this proposal.

benjiweber · on Dec 30, 2021

NATURAL JOIN is great because it's like a relational AND. How else will you join with Table Dee and Table Dum? https://twitter.com/benjiweber/status/1476629550608101384

It's just risky if you don't design your schema with relational algebra in mind.

rcoveson · on Dec 30, 2021

That is a really interesting case. Correct me if I'm wrong, but I don't think this is a special case where you have to use `NATURAL JOIN` over `JOIN ON`. The problem is just that the Postgres SQL grammar (maybe most/all SQL grammars) require ON or USING in a non-NATURAL join, so you have to specify the (empty) equijoin predicate list yourself, i.e. the identity value: `JOIN ON TRUE`.

EDIT: I had another thought about this.

I think people not designing with the relational algebra in mind is the heart of the issue, specifically w.r.t. column names. We know that namespaces are a hard problem, and a consequence of that problem is that `NATURAL JOIN` as specified in the relational algebra seems risky, or overly magick-y. It makes what might be an unfortunate coincidence (name collision) into something algebraically impactful.

A foreign key join gets around the problem by keeping names and namespaces out of it. It's really doing exactly what `NATURAL JOIN` is supposed to do, but only in the subset of cases where name collisions are meaningful, not coincidental.

cpill · on Dec 30, 2021

I'm 20 years I have never seen anyone use relational algebra. schemas have to be self documenting albeit only in the real world.

salawat · on Dec 30, 2021

...I do... All the time. I'm cleaning up a mess of a namespace collision as we speak. I was the only one who saw it coming. I've spent the better part of a month getting people to believe it's an issue.

gavinray · on Dec 30, 2021

SQL is relational algebra, the core of relational algebra is Selection, Projection, Cross-products, Unions, and Set Differences.

SQL is a language that implements Relational Algebra/Relational Calculus

bendbro · on Dec 30, 2021

I agree with your point about explicit columns being easier to read, but I still sometimes prefer implicit columns. An example: whenever you have numerous subqueries, all using the same columns, implicit columns are easier to read since there is less text on the screen. It's also less error prone to change just one line rather than numerous lines.

This feeds into my view of metaprogramming-like situations. Whenever the code-time-view of a program differs significantly from some runtime-state-view of the program, I think there should be a code-time way to view and perhaps edit both the code-time-view and some kind of runtime-state-view. A programmer shouldn't have to waste time digging through numerous files to evaluate what implementation slots into some dependency injected class, or find out what structure ends up in a python method parameter, or what a preprocessor directive ultimately produces. I know IDEs can handle some of these things, but I think better tools can be produced.

More concisely, instead of approaching code as the single and unchanging view of the program, perhaps it would help to approach code as something more dynamic. I have no concrete ideas as to how this would work.

tapas73 · on Dec 30, 2021

I would argue that conceptually it's not too different from needing to know what tables/columns exist in datamodel.

I get that it is not common now to care about FKs when writing selects. But it could be. Tooling can be improved to help here. (show fks, autocomplete)

Btw. Everybody seems to concentrate on conciseness, but keep in mind that this helps also with query correctness.

jimbobmcgee · on Dec 30, 2021

Is anyone actually likely to implement these based on some random gist on the internet, or are you just screaming into the void?

Because, if you have that kind of clout, I've got an INSERT/SET syntax I'd like to put your way...

JoelJacobson · on Dec 30, 2021

I'm a small PostgreSQL contributor since 2010 myself. I probably can't write the whole patch myself, but if there is enough interest, and if we can work out the details and address the problems raised in this thread and elsewhere, I'm pretty confident we can do it. If we ever get there, the next step would be a reference implementation, probably in PostgreSQL, or to discuss a proposal in the SQL committee.

jimbobmcgee · on Dec 30, 2021

I'm often interested into what goes into changes to committee-driven standards.

To an outsider, proposing a change seems to require one to be part of a shady cabal of Big-5 employees, skilled in the art of hiding subtle, privacy-invading features into inscrutable, plain-text RFCs.

That or subjecting yourself to 30K+ what-abouters who deform your suggestion into something unrecognisable.

It's refreshing to see a straightforward, well-formatted proposal (even if I do slightly prefer the `FROM table1 x JOIN table2 y ON x.fk` syntax suggested in other comments).

JoelJacobson · on Dec 30, 2021

> To an outsider, proposing a change seems to require one to be part of a shady cabal of Big-5 employees, skilled in the art of hiding subtle, privacy-invading features into inscrutable, plain-text RFCs.

I thought so too. Initially I just tried to get in contact with someone at the Swedish Institute for Standards (SIS), to see if it would be possible to send a proposal to someone in the SQL committee, which I thought was nearly impossible to become a member of. But as it turns out, SIS explained I could actually join the Swedish working group, and participate directly there, I just had to send in an application and get the approval from my employer, since there is a cost involved and you have to be a member via a company. Turns out ISO is a very open and democratic organization, just like Hacker News! :)

I think this proposal could take years until it land, if it ever does, in some form, if concerns can be addressed, but SQL is here to stay for a while, so that doesn't scare me.

Tostino · on Dec 30, 2021

I was following this proposal on the -hackers thread, glad to see it getting traction here. I find it really interesting to read that, it'll be even more interesting to see how it plays out longer term. I'd love to see some more progress made with the SQL language / syntax, and having someone in the community actively engaging with the committees seems like a great way forward.

JoelJacobson · on Dec 30, 2021

> It's refreshing to see a straightforward, well-formatted proposal (even if I do slightly prefer the `FROM table1 x JOIN table2 y ON x.fk` syntax suggested in other comments).

Thank you! /me feeling happy

matteote · on Dec 30, 2021

This reminds me of KEY JOIN in Sybase which, unlike this syntax, does not require to specify the foreign key. I find Sybase's syntax to be too implicit, and challenging to port to other dialects; JOIN FOREIGN with explicit PK reference looks much better.

I wonder how it is expected to work with non-table references (views, CTEs, subqueries), especially when the columns involved in the foreign key (on either side) are not returned explicitly by the referenced object.

JoelJacobson · on Dec 30, 2021

> I wonder how it is expected to work with non-table references (views, CTEs, subqueries)

It’s not, it’s only for the special but common case of joining two tables based on a foreign key.

torgard · on Dec 30, 2021

I dig it!

I prefer defining tables like this:

    CREATE TABLE category (
        id int GENERATED ALWAYS AS IDENTITY,
        name text
    );
    
    CREATE TABLE post (
        id int GENERATED ALWAYS AS IDENTITY,
        category_id int
            REFERENCES category (id)
                ON DELETE CASCADE
    );

That is, category.id rather than category.category_id. But the USING clause doesn't work with that style, as far as I understand.

This would make my queries nicer.

thom · on Dec 30, 2021

I've been quite happy having fully prefixed column names for a long time now. Makes joins easier, big views clearer, and random exports more readable out of the box. Also in my case it's also easier to line up unique column names with things like Clojure specs but I accept that's a niche concern.

JoelJacobson · on Dec 30, 2021

Much cleaner, I agree, that's a big win. Will add that to the list of benefits.

Will also add a "Drawbacks / Remaining issues" section to the Gist, from all the valuable comments so far in this thread, thank you all, positive as well as negative comments, all very helpful.

Ayesh · on Dec 30, 2021

It looked very useful and intuitive at first, but on further thought, I think the only time that you'd truly benefit from it if you SELECT *. For other SELECT queries with explicit field name list, you'd need to know the table the key constraint links to anyway.

SQL Views serve perfectly well for queries encouraged by the schema itself, and i think they are more sophisticated and practical way.

JoelJacobson · on Dec 30, 2021

There is similar comment in the thread, suggesting both tables should always be specified.

Not sure what I prefer yet. The idea is the foreign key name is usually the same as the referenced table, so should be an infrequent problem.

zmmmmm · on Dec 30, 2021

> The idea is the foreign key name is usually the same as the referenced table

I think this is where you are hitting some of your turbulence here because lots of ORMs / schema management tools actually generate completely cryptic fk names (sometimes based on the hash of the columns or similar). Personally I think weighing in legacy baggage like that too highly is a bad thing as it creates enormous inertia.

JoelJacobson · on Dec 30, 2021

Thanks for explaining, I have no experience of ORMs, always written my queries manually. This was insightful. Thanks also for the word “inertia”, good one, will add to my vocabulary.

nicoburns · on Dec 30, 2021

I wish the SQL standard would add some basic syntax sugar, such as support for trailing commas and allowing use of aliases in more places.

halayli · on Dec 30, 2021

It's a great proposition. However, I would suggest having it as a prefix rather than postfix of the JOIN expression to reduce implementation side effects and avoid ambiguity between keyword 'FOREIGN' and a table called 'FOREIGN'.

example: FOREIGN LEFT JOIN

abhchand · on Dec 30, 2021

This feels like adding syntactic sugar in some sense. That is, a layer of convenience to accomplish something that can already be accomplished today.

It also breaks the following:

* It's better to be explicit than infer * Keep specifications simple

Guest42 · on Dec 30, 2021

Is this not something that should be written at the app level rather than dml?

Tostino · on Dec 30, 2021

Foreign keys in general? I am not sure I understand what you mean...

Guest42 · on Dec 30, 2021

Not something I’ve ever thought about but I think that it could be done via macro,

I have written tsql functions in c# and imagine that other dialects have analogous functionality.

It seems as though the point is to cut down on sql code, it’d also be possible to query the foreign keys and create a data structure that could feed a function for joins.

I haven’t thought out the specifics but think this type of approach would be more practical than changing the sql standard.

Tostino · on Dec 30, 2021

At that point you are just building a feature into an "ORM" of some sort. The entire point is to allow queries to be written in a more concise way for this simple equality join use case which is (total ballpark figure) ~half of the joins in my system.

thanatos519 · on Dec 30, 2021

I dunno. I preferred the SQL-89 syntax.

samtheprogram · on Dec 30, 2021

Same here. I didn’t even realize that was an option. As a relative beginner to SQL, is there a real reason the JOIN syntax was needed at all beyond special inner/outer joins?

Tostino · on Dec 30, 2021

Sanity for your coworkers who may have to touch your SQL.

mi_lk · on Dec 30, 2021

tangent question - what're some effective ways to learn how to write SQL?

cryptonector · on Dec 31, 2021

O'Reilly has an excellent pocket book on generic SQL titled "SQL Pocket Guide: A Guide to SQL Usage" (it has several editions). That book worked really well for me. It is surprisingly thorough for a pocket book. I highly recommend it.

Besides that, I highly recommend https://sqlite.org/ as a reference. It's syntax pages for SQL are fantastic. But to understand the language you need a better resource, and for my money that's the book mentioned above.

jdunstan · on Dec 30, 2021

w3schools and the online documentation for any database are good places to start (Postgres and SQL Server have pretty good documentation).

There are also easy-to-google SQL Puzzles, if you're looking for something more advanced.

I would also recommend learning about query plans and how to read them, they're invaluable for query optimization.

orthoxerox · on Dec 30, 2021

Basic SQL has lots of tutorials. Graduate from them to Markus Winand's use-the-index-luke.com and modern-sql.com

unbanned · on Dec 30, 2021

>This improved in SQL-92:

Did it?

JoelJacobson · on Dec 30, 2021

Yes, I think so, at least according to: https://learnsql.com/blog/history-of-sql-standards/

What makes you think otherwise?

gigatexal · on Dec 30, 2021

I like this a lot. It makes thing cleaner and more like explicit.

clintonb · on Dec 30, 2021

How is this more explicit than specifying join columns today?

tapas73 · on Dec 30, 2021

It enforces foreign keys during select.

One can say that it makes you explicitly state, which foreign key do you use.

"on a.foo = b.bar" can be a incomplete and/or incorrect join condition.

cryptonector · on Dec 31, 2021

It doesn't enforce foreign keys. It uses foreign key definitions to simplify the query. It clarifies intent, and because FOREIGN KEYs generally are required to have appropriate indices, it makes it clear that there must be at least one suitable index for the JOIN.

gigatexal · on Dec 31, 2021

Exactly this.

cryptonector · on Dec 31, 2021

The intent is more explicit. Indirection != implicit.

bawolff · on Dec 30, 2021

Seems like minor syntactic sugar. I'm not opposed but really only seems mildly useful at best.

dstola · on Dec 31, 2021

Not a fan since proposal would do an implicit join on foreign keys rather than explicit join on the columns. If you know the schemas its great, but would add an extra step to check FKs on the tables otherwise

In general explicit > implicit, IMO

cryptonector · on Dec 31, 2021

This is why we don't use `*` in production SQL. It's also why we try to avoid NATURAL JOIN. USING is OK if you only have two table sources, but you don't always get to have same-name columns on both tables.

This syntax seems pretty clear and explicit, but with an indirection. Indirection != implicit. The intent is quite explicit.

Using ON also is problematic in that you might think while reading a query that the JOIN is on FOREIGN KEY columns, but... maybe not -- without looking at the schema, you can't tell. JOIN FOREIGN has a similar problem: the intent is crystal clear, but now you have to go look at the schema if you want to know which columns that refers to. Normally one would write a comment on the ON, but comments can rot.

Now, making SQL more expressive isn't necessarily a good thing. SQL is already very expressive. But making it more expressive in ways that yield clearer queries is definitely worth considering.

One very nice aspect of JOIN FOREIGN is that because RDBMSes generally require corresponding indices on those columns to optimize ON UPDATE / ON DELETE constraint processing, seeing "JOIN FOREIGN" in a query instantly lets you know that there must be an appropriate index, while ON might be causing a full table scan or query materialization and you'd have to examine the query plan carefully.