This is not actually compatible with SQL semantics.
An important constraint on SQL is that a query must run, and produce correct results, relying only on the structure and content of tables. Indexes (can) make queries faster but must not inhibit, or be required for, correctness. The same is true of primary key constraints, foreign key constraints, check constraints, defaults, triggers, partitioning, whether a table is heap/clustered, and literally every other implementation detail of the RDBMS.
This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.
More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.
> This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.
Actually, constraint names already do appear in some DQL statements, such as the quite recently added INSERT INTO ... ON CONFLICT in PostgreSQL [1]
INSERT INTO ... ON CONFLICT [ conflict_target ] conflict_action
There are some clear benefits possible only thanks to specifying constraint names. Quoted from the link:
* The SQL-standard MERGE doesn't provide for choosing an index, so use of a unique index would need to be conditioned on equality quals against indexed columns in order to provide the behavior being discussed for UPSERT.
* The ON expression will need to be evaluated to see whether it properly compares to a unique index on the target table. Initially this will need to be done to determine whether the MERGE is allowed at all; later it will determine which sort of plan is allowed. It would be easier to match an index name, provided the index name is known and stable.
I’m not denouncing PostgreSQL. There’s things that are appropriate for a standard and other, sometimes far less elegant, things that are appropriate for a concrete implementation. The linked article propose a change to the standard. If it had only proposed a non-standard vendor extension for some or other database we wouldn’t be having this discussion.
Thanks. I incorrectly assumed there were only two sublanguages. But now when you say it, I've heard both DML and DDL before, but never DQL. Curious if there were even more categories, I found this Wikipedia article:
Interesting to read, what a SELECT statement is, depends on if having FROM or WHERE "data manipulators". Quote from Wikipedia:
"Although often considered part of DML, the SQL SELECT statement is strictly speaking an example of DQL. When adding FROM or WHERE data manipulators to the SELECT statement the statement is then considered part of the DML."
> relying only on the structure and content of tables
Constraints are part of the table schema, not index schema.
> his proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.
Not allowed says who? The standard? I doubt it, and anyways, it can be changed.
"That's not allowed" is not a good argument. A better argument is that this is the first time a constraint can change the meaning of a query -- that is a good argument, but it would be better if the constraint could change existing queries, which it does not do. Because this would only affect queries that refer to the constraint, this seems quite allowable to me.
> More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, ...
This is not about optimization, ergo this argument is out.
I'm not sure I want this particular extension, but I don't buy your arguments against it.
> More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.
Constraints are part of the logical model just like the types of columns are. Indexes on the other hand is part of the physical model and should be transparent to the logical model. Database engines tend to couple foreign-key constraints with indexes, since you usually want an index on a foreign-key. But in principle they are separate.
So I don't see any violation of the relational model in this proposal. I do like the general idea of extending SQL with metadata-aware abstractions, although I'm not a fan of this particular syntax.
> This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement
This doesn't break that constraint in an SQL RDBMS that also implements the relational model, since it is a fundamental element of the relational model that schema metadata is stored as data, and therefore constraint specifications, including names, are included within “content of tables”.
To me, making the data relationships implicit (and making them explicit again in the DAO layer of your app) is a nightmare from a data design point of view.
In real life, a DB is often tied to the app that was built on top of it. Making data reuse, data discovery and data understanding VERY hard.
It might be just me, but I feel like remembering the foreign key name is more difficult than remembering the columns that you need in the ON clause. Especially since you can usually find the column names by just seeing the data in the table (select * from x) wheres seeing the foreign key names is much harder (show create table x?).
Also, if you use an ORM it will usually generate foreign key names that are almost impossible to remember.
In a universe where foreign key index names are important we would specify better names.
I think stuff like “documents_by_user” as foreign key names and explicit index usage would improve peoples awareness of how indices get used and would generally be a positive
I think this is an operator problem. You're using the wrong tool for the job.
TablePlus, SequelAce, the official MySQL client all support cntrl-space autocompletion. I wish we used Postgres, but I imagine the landscape is the same. The big box databases like Oracle, DB2 undoubtedly having this tooling as well.
That being said, here is our fk naming convention: `fk-asset_types->asset_categories` which pretty states what's going on and is easy to remember.
SQL is not only written in an SQL client. SQL is also written (and read from) embedded/mixed in an other programming language were tooling is not always available.
Having to know the names of foreign keys (in addition to the column names of the 2 tables) is adding more cognitive load. I don't think that is an improvement.
At least when it comes to both JetBrains and VSCode, they can handle one language embedded in another. I'm kind of surprised there are environments that don't handle that these days.
It would indeed be difficult to remember, but the proposal also suggest changing the default naming convention for foreign keys, to give them the same name as the referenced table.
If using an ORM, I would guess this proposal isn't useful, since then you wouldn't hand-write queries anyway, right? Except when you want to override the queries generated by the ORM? (I'm not an ORM user myself.)
Speaking as someone who has used ORMs in the past and contributes to a LINQ Micro ORM...
It might make tooling 'easier', but since backwards compatibility has to be considered the actual value add is questionable IMO.
Most ORMs/MicroORMs will have tooling that sniffs out the DB Schema including foreign keys, and if you are using those bits (i.e. 'not hand written') most will do the right thing today. I suppose you could include some extra syntax for whatever DSL you're providing users....
IDK. Speaking as someone who is very comfortable in SQL, This feels more like syntactic sugar than anything else.
This is the same idea as already posted in this thread https://news.ycombinator.com/item?id=29687134, but with a more in-depth explanation that couldn't fit in the comment field, and with some syntax improvements, such as using "FOREIGN", which is a reserved keyword, instead of the previously suggested "->" notation.
Thanks for all the valuable comments on last proposal. Excited to hear what you think about this update.
Of all the myriad indignities of SQL, this isn't near the top of my list. I also don't like making the names of objects like foreign keys and indexes first class concerns in your queries, that's a whole new layer of cognitive overhead.
Agree. I'd much prefer they worked on the way MySQL determines what indexes to use on a multi-table join, so there could be less emphasis on the need to ANALYZE TABLE and FORCE INDEX when the DB is seemingly being dumb, but there's not enough information in EXPLAIN to tell you where it's actually getting tripped up. Troubleshooting full table scans on large sets of data is a nightmare.
> but there's not enough information in EXPLAIN to tell you where it's actually getting tripped up
Unfortunately I don't think providing that information is generally possible. In some specific cases there are useful details it could provide, but there would usually be a myriad of other similar details that are irrelevant and if it included all those you'd not see the wood for the trees.
EXPLAIN and query plan outputs in other DBs are “this is what I did” not really “why I did what I did”. To make the query planner bright enough to know what details would be useful to you, would probably pretty much require making it bright enough to do the optimisation job without you¹.
[1] picking better indexes without hints, even creating those that are often needed, etc.
MS's SQL Server tries to do this a bit with index suggestions. These are sometimes handy, but often at best for guidance². I've seen people blindly follow these suggestions to get a %-or-few gain from a small set of queries that could see orders of magnitude improvement with just a little tweaking elsewhere³, slowly amassing collections of indexes for very specific cases, sometimes multiple on the same key columns but each INCLUDEing a different mix of other data, that balloon their storage requirements⁴.
[2] a nudge in the direction of “Mr Dev/DBA, you might want to think about how I'd avoid scanning this large object or performing many many thousands of seeks on this other one”
[3] refactoring non-sargable predicates, index changes on other tables being referred to, getting rid of “SELECT *” particularly when referring to hideous views, ...
[4] and having the knock-on effects of slowing insert/update activity & important admin functions (particularly backups)
It feels silly when you are SELECTing a ton of columns, then you add a JOIN to a many-to-one relationship which you want to aggregate. Now you need to either make it a subquery (and hope the optimizer doesn't screw up) or duplicate all your SELECT expression (not even the identifiers) into the GROUP BY.
Queries like this won’t break but will silently fail upstream by missing data if column names change or are deleted. The explicit nature of SQL ensures that it will break positively which is a better type of failure in my opinion.
whether that's desirable or not depends on the query and what kinds of changes you make. though I could see it being error-prone in the most common cases.
No. What icambron and I would like is something that can get rid of this group by:
select division_name, branch_name, dealer_id, dealer_name, quarter, month, sum(total_paid)
from divsions, branches, dealers, transactions
where ..... --buncha joins
group by division_name, branch_name, dealer_id, dealer_name, quarter, month
If I didn't want to group the result by division_name, branch_name, dealer_id, dealer_name, quarter and month, why would I put them in the select clause?
On a+b. I'm suggesting it simply copies all the non-aggregate select pieces to the group by clause. If I want to group on a and b, I could either select them too or add them to the group by (I'm imagining something like "GROUP BY SELECT, a, b"). Note that selecting them is the natural thing you'd do if you wanted that grouping, because otherwise you wouldn't know what the grouping is when you got the results.
I remember when I discovered this-- very helpful feature and I am happy it's there (and sad most DBs don't have it). But if I'm grouping I'm almost necessarily not grouping on at least one of the tables' PKs, so it only goes so far
I use this. It gets messy after like 10 columns. I have a trick in my ORM that copies the non-aggrigate columns into the statement. Maybe your ORM has it?
What I really want is a special phrase MINIMAL GROUPING such that GROUP BY MINIMAL GROUPING includes exactly the items that it would be an error not to include in the GROUP BY.
Some DBs let you do this and it's helpful when one of the things you're selecting is some complex scalar expression, but you still have to list each item in the group by
Not by the SQL standard, since projection (which creates aliases) happens after aggregation. There are many databases that allow it as an extension, though.
i would rather it written as SELECT * BUT or SELECT * EXCEPT or even SELECT ALL BUT.
SQL has always been that language that is easy to read. even when you don't understand what the queries are doing. adding a cryptic syntax like "-column" would make it less readable.
If you have specific columns that you frequently want to ignore (which, I think, is the common case of this), you could define a view that selects all the columns except those and do your queries against that view.
There are often occupancies where I'd like this in ad-hoc queries, but it would be something to avoid and discourage in production code for the same reasons “SELECT *” is discouraged (because it is “SELECT *” with a bit of sugar added)
That's very close to SELECT *, which has it's own dangers.
I agree that it would be nice for exploration and testing, but probably should not be used in production.
The problem I found out is that SELECT * LIMIT 10 is guaranteed to be needed so it's actually much more important than whatever is in production (you have to explore a lot before writing the production scripts). My idea is that exploration should be as easy as possible.
Things such as SELECT * EXCEPT col1, col2 are really a PIA to write and can build up frustration level really quickly. Certain IDEs such as Datagrip ease the process by providing "macros" but they are not enough.
Another thing is to generate useful boilerplates such as SELECT col1 FROM table GROUP BY col1 ORDER BY col1 to explore all unique values of col1.
Minimal if you refer to columns by name. Risky if you rely on column ordering. Can be too much data and a performance issue if you have large columns that you’re not using. (Though, it’s not much different if you’re using an orm that loads the whole object anyway), like Django or sqlalchemy.
I’ve done it for many years (using named columns/ dictionaries as the result set) and its never been an issue.
In my opinion this proposal seems only to consider simple cases, but there are many not-so-simple relationsship types:
Consider a ‘sales’ table which includes columns [time] and [sold_by_employee_id], and a periodized ‘employee’ table which includes columns [employee_id], [valid_from] and [valid_to] columns. There is a perfectly valid relationsship between the two tables, but you cant join them using only equal-statements (you need a between-statement as well)
Nice example! The join you describe would remain as a JOIN ON.
This is per design. Quote from the proposal:
"The idea is to improve the SQL language, specifically the join syntax, for the special but common case when joining on foreign key columns."
...
"If the common simple joins (when joining on foreign key columns) would be written in a different syntax, the remaining joins would visually stand out and we could focus on making sure we understand them when reading a large SQL query."
So, the special non-equal based join condition you describe, would become more visible, and stand out, allowing readers to pay more attention to it.
The hypothesis is most joins are made on foreign key columns, so if we can improve such cases, a lot can be won.
Good example too, but this one can with benefit be written using the JOIN FOREIGN syntax, you just need to give the foreign keys suitable names such as e.g. "shipped_date", "order_date", "received_date". Or, to remind you of which is the referenced table, perhaps you want to include it in the names, and the names would be "shipped_date_calendar", "order_date_calendar", "received_date_calendar", but personally I would prefer just "shipped_date" or perhaps even just "shipped".
Are those timestamps, or actual dates, i.e. meaning a whole day encompassing (usually) 24 hours?
In many (most?) data warehousing projects I've seen, you make a "fake date" integer the primary key column of your Dates dimension. This integer consists of 10000 × YEAR_PART + 100 × MONTH_PART + 1 × DAY_PART of the date in question, so yesterday's New Year's Eve woul get 10000 × 2021 + 100 × 12 + 1 × 31 = 20211231. The date dimension itself has many more columns (often booleans, IS_WEEKEND, IS_HOLIDAY, etc; also the date parts themselves in both numeric and character form (12, 'December'), day of week (5 [or 6, depending on convention], 'Friday'), etc) that are used for BI and reporting.
But since this generated ISO-8601-date-as-integer column is the primary key of the Dates table, it is also the value of the Date foreign key in all tables that reference Dates. That makes it incredibly handy in queries -- both during development and for ad-hoc reports -- of those tables without joining to the Dates dimension at all: grouping, sorting, limiting to a more manageable date interval in the WHERE clause... And it tells the reader exactly what the actual date in question is. (Well, at least readers who are used to ISO-8601-format dates.) As jerryp would have said, recommended.
I started writing a new SQL parser that just transforms the query...this was one of the features. The other being a syntax for group-wise max queries which are just crazy dumb to write efficiently in MySQL.
I could see some standardised SQL language extension, similar to how TypeScript extended JavaScript, that has a transformation phase into the native SQL dialect within the database access library, or via some middleware.
I agree, though I'll give SQL a pass for this because it's old.
But how the Javascript world ever thought that `import { function } from 'library'` was better than `from 'library' import { function }` I'll never know. Python got this right long before anyone was even thinking about adding imports to JS!
Care to rationalize? Your first example only looks pretty because the function names happens to be aligned while the library names are not. Import a few more functions with different length name and the story will be different. In reality library names tend to align better since they usually start with same prefix spanning across several sub-components.
I wonder if primary objective of this proposal is to increase convenience and reduce the amount of typing that users need to do, when using SQL inline?
If indeed that's the goal, then it targets a rather specific subset of users dealing with explorative/ad-hoc analysis on a database. Once such analysis is done, the queries would usually need to be formalized for robustness and to avoid ambiguities.
Obviously, the whole train of queries would derail, should the FK (which is just an index) be dropped for one reason or the other.
The existing JOIN features are explicit at least on the level of specified table structure. I believe, any constraint details in such context will be, well, ...foreign.
Perhaps, a simple solution to verbosity problem may be to use an "intelligent" SQL client, which supports some form of autocomplete and which may as well internally use as many schema/data details as available.
In anycase, thanks for making the proposal. I was not aware of JOIN ... USING syntax. I often wanted some convenient way of specifying homonymous join columns, as some schemes are consistent in such namings. So typing JOIN on col1, col2... would translate into equality joins between the listed tables. However, again, there is ambiguity here...
Coming from the graph database / semantic web area, I would propose foreign key/ primary key relationships in the DB to be detailed and named in a schema description, and then queries reference those relationships by name to define the needed joins.
Yup totally agree - OP is an interesting proposal and has provoked a lot of quality commentary regarding the tradeoffs of clarity vs conciseness, implicit vs explicit, FK performance, etc.
But IMO you’ve raised the important long term consideration - do graph based schemas and query languages obviate the need to model foreign keys explicitly? If this JOIN FOREIGN proposal is an incremental step forward, what’s the next big leap?
I'm not sure this is a good idea. It means that a constraint on the data (a FK is a constraint) affects the semantics of the query. In SQL, those two concepts are seperate.
I will write a letter to santa asking to get this into sql. During 10 years in data warehouse I constantly wondered why SQL does not have something like that.
Reducing verboseness is nice, but the main perk is the correctness.
Oh.. if I got a cent every time I found a bug in colleagues sql, because of join accidentally multiplying/doubling rows... :-)
not a good idea, constaints are one thing and joins are another thing.
plus I can join on conditions other than =.
the problem that author is trying to solve can be easily solved by a view:
1. Declare a view with all necessary JOINs once
2. select from view only what you need, aggregate what you want
3. Optimizer will throw out unnecessary stuff and optimize query while all JOIN logic will be declared only once and will be hidden inside the view
plus each DBMS has its own flavor of SQL and will have its own query optimizer nuances when dealing with joins, especially nested via CTEs/views/lateral queries,etc.
Your #3 is very optimistic. You can't just remove a join (be it inner, left, or really any other type) just because you didn't refer to any fields from one of the sides, and it's even harder when aggregation is in the mix. There are cases where you can without influencing the result, but they are special-case optimizations and not universally supported across databases.
I, personally, feel like this is a rather pointless addition to an already somewhat bloated language. It it isn't saving that much typing (we have NATURAL JOIN's already, nobody is using them), is kind of inconsistent with other language principles (we should be able to join anything to anything if we need to, see SQL-89) and limits JOIN flexibility (you don't have to join with "=" operator, JOIN's supports different types of conditions and are equivalent to WHERE clause)
Nobody is using NATURAL because usually it’s broken and doesn’t do what you want it to and for any more complex joins or joins on other than = you would still use JOIN ON, what’s the issue? JOIN was introduced when we had multiple tables in the FROM clause and conditions in WHERE for clarity and convenience, why are people so opposed to this change then?
The issue, as i see it, is that adding this feature complicates the language parser even more for not that much actual gains.
1) Implementation has to do some nontrivual rewriting into actual JOIN's which uses indexes, previous special cases like NATURAL were purely syntax sugar
2) All of the tooling that depends on parsing SQL need to add sensible support or else noone will even recommend to use this
3) It is useless for anyone using ORMs in the first place, they will continue to generate normal JOIN's (less actual impact)
4) For anyone using prepared statements it automatically goes to "is not recommended to use" list because it makes your JOIN's depend on existence of foreign keys and corresponding indexes, which are an optional feature DB still should work without. I've had tasks in my career when my team added or removed foreign keys, so this type of JOIN's would make migrations even harder to do.
5) So considering all of the above this is feature designed purely for REPL and for this purpose it is also kind of useless. I can imagine remembering and typing column names in normal JOIN's, but foreign keys usually have some long unintelligeble autogenerated name.
Basically I think this syntax addition is nonorthogonal, isn't making SQL any more powerful, but isn't solving the main usability problem, which is composability of queries in any way. Like writing JOIN's is not complicated if somewhat wordy already.
Please, no. A common problem in data warehousing is that a design with lots of foreign keys becomes slow to load. A standard solution is to move the checks for referential integrity elsewhere, then drop the foreign key constraint. This massively improves load performance.
This syntax change means that this solution can't be used because you have no idea what random queries out there might rely on the specific existence of a foreign key constraint for the definition of the query. Thereby meaning that if a foreign key constraint becomes a performance problem, we're stuck with it rather than having a solution.
Features have consequences. And I don't like the consequences of making business rules that are now explicit in the query, be instead implicit in the table design.
Simply decouple the relationship definition and referential integrity check, allowing a user to drop the referential integrity check if desired, but keeping the relationship definition.
I cannot see why you would not want to at least always store the information a certain table/column(s) references some other table/column(s) in the data model. Enforcing referential integrity is probably good in general too, but I agree you might need to disable it for some FKs, in some databases, like PostgreSQL before they got FOR KEY SHARE locks.
At this point you should realize your proposal is a non-starter, and I didn't even realize btilly's objection originally.
I can't think of any other feature in SQL where the rules of the query are actually dependent on something not explicit to the query itself. Even USING and NATURAL are just syntactic sugar that depends on the structure of the table, not on any underlying constraints.
So, what you have proposed, "allowing a user to drop the referential integrity check if desired, but keeping the relationship definition" would be a massive change to tons of SQL tools out there as it's a huge new feature, for some minor syntactic sugar. Ain't gonna happen.
> Even USING and NATURAL are just syntactic sugar that depends on the structure of the table, not on any underlying constraints.
Similar to how JOIN FOREIGN would depend on the structure of the data model, defined by tables, foreign keys, etc.
> So, what you have proposed, "allowing a user to drop the referential integrity check if desired, but keeping the relationship definition" would be a massive change to tons of SQL tools out there as it's a huge new feature, for some minor syntactic sugar. Ain't gonna happen.
Why would it be a problem from the tools perspective if the foreign key wasn't actually enforced if the DBA insists on temporarily disabling the enforcement of the FK? If the tool would e.g. be used to insert a row, and the DB would accept it, even though it would violate the FK, what do you suggest would be the problem from the tools perspective?
This is also not a new idea. It's already implemented in MSSQL, see WITH NOCHECK.
> Similar to how JOIN FOREIGN would depend on the structure of the data model, defined by tables, foreign keys, etc.
The point that everyone is making is there are not currently any SQL statements that depend on structural information as defined in the foreign key relationships when calculating the structure of the data. Furthermore, there are already tons of tooling and processes that depend on this fact, that your proposal would break, for a teeny bit of less typing.
The person you are responding to was more direct/harsh than was necessary, but I think it would be good to step back for a moment and reflect on the feedback from this community. You said "If someone can convince me this is a bad idea, that would help me forget about all of this, so I would greatly appreciate your thoughts, no matter how negative or positive." I think there are enough valid objections here to at least consider the idea that this is not a clear improvement?
When I wrote that comment, the idea had even more flaws than currently, at that time I suggested using "WITH" and a new "->" operator. Thanks to new ideas coming from other users in the PostgreSQL and Hacker News community, those problems have been solved, and we now have less remaining problems with the proposal. I'm really grateful for all the help.
Like I said in another reply, I will put together a "Drawbacks / Remaining issues" section and update the Gist, based on all replies. Perhaps the end result will be Status Quo, but at least then we have documented the reasons why this idea is a dead end. However, thanks to all the improvements just during the last couple of days, I feel really optimistic and motivated, so I think there is a great chance we can solve the remaining issues together if we try.
To comment on the response from the direct/hash person:
The point made by the user, "not currently any SQL statements that depend on structural information as defined in the foreign key relationships", is true, but I don't see why that's an argument by itself against the idea?
I find the other argument, claiming there would be a problem with tooling and processes, much more interesting and I'm eager to fully understand it. I asked a question in hope to do so, "what do you suggest would be the problem from the tools perspective", but has so far not received any reply.
I also agree that relying on database schema is a show stopper.
First counterpoint: As long as data type information isn't completely messed up, I can dump excel spreadsheets in a database, or dump database CSVs in a data lake, and start querying them right away using SQL with complex joins using auto-completion from the dataset alone.
Second counterpoint (harder to communicate):
I work with a SaaS database (MS Dynamics/Dataverse) which doesn't provide direct SQL access and is not supported by common ORMs. Some of the data APIs require relationship schema information which invariably put a needle in attempts to generalize functionality, or just consume.
In this context, creating a simple in-memory test database or serializing records between modules, cannot possibly function without also knowing schema information if queries are going to make use of it. So now you need to carry schema information for everything, and load it at run-time from generated code or a live database, in back-end and front-end, just to interact with data where simple SQL would have worked just fine.
Conclusion: I dislike the proposal -- even if not breaking backward compatibility, it brings database-configuration details to SQL. SQL is imperfect and is already hurt by database-implementation details (like date functions) but it remains a beautiful expression of relational algebra and set theory - a query given the same data should return the same output, regardless of context. SQL is lingua franca for a reason.
The proposal feels like a fairly specific developer-centric extension and isn't where SQL should be headed, in my opinion.
> First counterpoint: As long as data type information isn't completely messed up, I can dump excel spreadsheets in a database, or dump database CSVs in a data lake, and start querying them right away using SQL with complex joins using auto-completion from the dataset alone.
In your example, all you have is data and no foreign keys, that's the show stopper? That means you have all the relationships in your head and that's how you can write complex joins right away? Sure, if that's the case, then you can't use foreign keys since you don't have any. Don't see how this would be a counterpoint though. There is nothing forcing you to use JOIN FOREIGN, you could just do what you describe.
But I'm sure you are aware many databases have foreign keys for all relationships to enforce referential integrity. I should have mentioned in the proposal, the scope is limited to such databases.
I enjoyed your example though. I want to share a similar example. It happened to me at least a few times, I've had to deal with data, shipped as multiple CSV files, but without any schema at all. What I tend to do then is to quickly write a very loose data model with mostly text columns, to accept any values. Once the CSVs are in SQL, I can then clean up the data step by step, by inspecting the tables and converting the text columns to proper data types.
Next, when suspecting some column(s) in some table seem to be referencing some other column(s) in some other table, based on the content of the columns in both tables, I then try to add a FOREIGN KEY with a suitable name between such column(s). If successful, we know there is referential integrity between the columns, and we know also have a name to describe such relationship. Win-win!
Otherwise if the foreign key could not be created, I investigate what rows that only appear in the referencing table that are not present in the referenced table, using a NOT EXISTS (...) query. If the extra rows can safely be deleted, such as if e.g. forgetting to handle empty string values as NULL values, I can then try to create the foreign key again.
Ain’t gonna happen for whom? Features were added for decades before and will be added after us. I don’t get it, someone always says it’s not gonna happen because of size of something but then you open the tech news site after a week and the amount of happenings is overwhelming.
Further, whenever someone is adjusting table performance index tweaks is almost always the first thing to tackle.
Adding foreign keys into the query is just as bad as adding indexes into the query (which, you can do in T-SQL, but generally shouldn't). Indexes can be dropped, changed, or added and you SHOULD be relying on the SQL optimizer to use the most appropriate index.
This feature appears to only save a bit of typing in the best of scenarios. In the worst, an update/drop of a foreign key will end up breaking a bunch of queries, which is insane.
While I wish there were a way to easily refactor queries when making these changes, this is not different than it works for renaming/dropping a column..."In the worst, an re-name/drop of a column will end up breaking a bunch of queries, which is insane."
If I change the shape of my data, of course I'd expect a bunch of stuff that potentially needs to be updated. Adding and removing fields is dangerous.
On the flip side, changing an index or a foreign key will almost never change any query, because the data shape is exactly the same. The most it will effect is queries that change data. Even then, when you make that change you have to address, up front, what to do with existing data that might violate a new constraint.
Data warehouses have all sorts of different design approaches based on their different requirements, wouldn't this just be another one of them? I actually think that having a way to alter the constraint to non-enforcing sounds like it'd be better for your situation anyway because you can then have your load performance but utilise the constraints with background / delayed process that checks the integrity because the information is still there in a standard form.
Which sort of leads to .... I don't agree with your characterisation of foreign key constraints as business rules. They are genuine information about the structure of the data.
By that do you mean, should not vary in the data returned or should not break?
Personally I agree that changing a constraint shouldn't alter the data returned. But I'm happy enough if it breaks in a clear and verifiable manner. There are plenty of other situations where adding a constraint will cause existing SQL (if not queries) to break so its not really that much of a change.
this was my thought too. but now you need an ALTER CONSTRAINT NON ENFORCING or something like that so that the "constraints" can be present declaratively but don't actually get used.
which then looks a whole lot like you're just introducing macros into SQL where you have some symbolic keywords that expand out into pre-fabricated ON clauses.
Enforcement of relational integrity on bulk load should be optional -- you trust the data, right?
But also, if you want relational integrity, then you have to check it somewhere.
Lastly, this isn't about DML or bulk load but about query expressivity. You could easily have FOREIGN KEY constraints where the RDBMS is also instructed not to enforce relational integrity, either on bulk load or at all, and documenting the FK in the schema is still massively useful.
I've updated the Gist with some alternative syntax proposals [1] that affects/addresses your comments, and therefore wanted to notify you about such changes. I would greatly appreciate if you could provide additional feedback, either by replying here, or leave comments on the Gist via Github. Thank you all.
SELECT f.title, f.did, d.name, f.date_prod, f.kind
FROM films f
JOIN FOREIGN f.films_did_fkey d
You need to implicitly know the table that films_did_fkey points to, because 'd' is just a table alias. I can't think of anywhere else in the SQL standard where you can introduce a table alias without explicitly referencing the table. In my opinion making code essentially unreadable unless you have other background information is an antipattern.
I had a similar suggestion the other day[1], except I chose what to me sounded more English-sounding:
JOIN films f USING FOREIGN KEY
Here the explicit foreign key, films_did_fkey in this case, could be specified after FOREIGN KEY. This would be similar in syntax to when you force an index for a select statement, at least in the DB we use.
Yeah I think something like this is the way to go. The table name needs to be listed in a consistent manner like all other joins, and I don't think the syntax should be different depending on which order the tables are listed in.
With the above proposal, it seems the foreign key name could be left out in the common case of there being only one fkey between the two tables too.
> With the above proposal, it seems the foreign key name could be left out in the common case of there being only one fkey between the two tables too.
Yes, this was indeed my intention. The common case with only a single matching foreign key constraint would not require being explicit, but one could be if needed in a natural way.
Another thing, this makes things harder to reverse.
Some engines have join-order-dependent performance, so there are instances where you would want to write
```
FROM orders
LEFT JOIN customers USING (customer_id)
```
and others where you'd want to write
```
FROM customers
RIGHT JOIN orders USING (customer_id)
```
Swapping join order with current syntax is relatively easy since references in the same FROM clause are interchangeable. But in this proposal, the reference to the joined table isn't written out, so it would be pretty complicated to reverse join order.
I think your concern is addressed by the idea further down in proposal; giving the foreign keys the same names as the referenced tables. This example was provided:
SELECT f.title, f.did, d.name, f.date_prod, f.kind
FROM films f
JOIN FOREIGN f.distributors d
No, my concern is not that you can also specify the table name that explicitly states the name of both tables, it's that you have proposed a syntax where it is possible to leave off the explicit table name.
If you got rid of the syntax that uses the underlying "magic" of needing to know which table the foreign key points to, I'd be more amenable.
Yeah, my concern is people in the real world. If you can "magically" join on a foreign key and leave the table name out, or still call the thing anything you want, some smart ass will abuse the live crap out of it and I'll somehow be left to deal with it.
I like things to be explicit. Tell me what you're joining and how you want to join it. What is the use case for this? Other than saving some typing, and let's face it, sometimes a little extra typing now, will save you a lot of trouble later. The proposal claims: "The idea is to improve the SQL language", but is it really better?
I would like to point out the corectness aspect of the proposal.
Today foreign keys are enforced during inserts, updates, deletes. (you just can't violate fk. this is so good)
But you can violate it in selects. (e.g. mistype column name, forget to include a column)
This proposal (or its adjustment) would allow to use fk also during select.
It's like static typing for join conditions.
In my entire software development career, I've never had that problem, nor seen it.
Sure, I've mistyped a column name or left one out. But the failure is then nearly always so catastrophic (the query failed to compile), or the data response then so different from what I'd expect (pages of results instead on 1) that it's not something I've seen that needs fixing.
There are two asymmetrical cases; one of them needs two pieces of information, the other three.
Having to always specify both tables would syntax wise be redundant, but I agree it’s worth considering it might be a good thing to improve readability, which would help especially in the case where the foreign key isn’t/cannot be given the same name as the referenced table.
1. Just because you can doesn't mean people will.
2. That doesn't work if you need to have multiple keys pointing to the same table (ie owner_id and secondary_owner_id pointing to a users table)
We should reject this simply because it does not provide any clear benefits, but meanwhile asks for a lot. People to learn the FK names in their DB, people to learn new SQL syntax (and tooling to support it).
It's actually difficult to list all the problems with this proposal. It's just unnecessary and the effort required to predict all the problems it will cause isn't worth doing.
Foreign keys both define relationships between tables, and also enforce referential integrity. The discussion is about how we could potentially mine various additional value from foreign keys as an information resource, rather than just mostly being about referential integrity.
Personally, I think SQLs verbose syntax is a good thing, and increases clarity/interpretability. It reduces cognitive load to have things explicitly listed out. Not having the table and column names in the query itself makes it much harder to read and understand queries without prior knowledge of the data model.
I'm with you on this one. I prefer it to be in your face simple to understand over the terse-ness.
In the gist example, I actually prefer the SQL-92 approach where we are joining given an explicit comparison condition. Every other implementation seems to be trying to hide details, for what gain? Less typing?
In order to use FOREIGN, you will need to know not just what columns a table has, but also their configuration. Which would also require that you have properly configured your tables. While this shouldn't be a hard ask, it does add additional dependency and makes use of this "tool" slightly less "portable" between systems.
I have unfortunately seen cases where people will only have foreign keys un-enforced by their table config. As a dev, if you're introduced to a new DB, you wont know immediately if you can use this, and if things are configured wrong, you need to make a pretty significant change to be able to use it.
I don't see a lot of harm from adding this syntax however as people are free to not use it and it relies on an existing strict convention.
> I don't see a lot of harm from adding this syntax however as people are free to not use it and it relies on an existing strict convention.
This is a good argument I will add to the list.
Also interesting to read about un-enforced foreign keys. I haven't used MSSQL myself, the DB in which I heard it's possible, I've only been using PostgreSQL for the last 20 years, and before that MySQL.
I think the problems you describe is an argument against a WITH NOCHECK feature, since it could be misused. Maybe it's necessary in some databases still, but at least in PostgreSQL, the FOR KEY SHARE lock solved all the issues with concurrent updates we had at Trustly. The FOR KEY SHARE was a huge patch [1] written mainly by Alvaro Herrera. Thanks to it, Trustly has never since had any performance problems with foreign keys, and they have AFAIK not needed to drop any foreign keys up until today due to locking/performance problems.
That is a really interesting case. Correct me if I'm wrong, but I don't think this is a special case where you have to use `NATURAL JOIN` over `JOIN ON`. The problem is just that the Postgres SQL grammar (maybe most/all SQL grammars) require ON or USING in a non-NATURAL join, so you have to specify the (empty) equijoin predicate list yourself, i.e. the identity value: `JOIN ON TRUE`.
EDIT: I had another thought about this.
I think people not designing with the relational algebra in mind is the heart of the issue, specifically w.r.t. column names. We know that namespaces are a hard problem, and a consequence of that problem is that `NATURAL JOIN` as specified in the relational algebra seems risky, or overly magick-y. It makes what might be an unfortunate coincidence (name collision) into something algebraically impactful.
A foreign key join gets around the problem by keeping names and namespaces out of it. It's really doing exactly what `NATURAL JOIN` is supposed to do, but only in the subset of cases where name collisions are meaningful, not coincidental.
...I do... All the time. I'm cleaning up a mess of a namespace collision as we speak. I was the only one who saw it coming. I've spent the better part of a month getting people to believe it's an issue.
I agree with your point about explicit columns being easier to read, but I still sometimes prefer implicit columns. An example: whenever you have numerous subqueries, all using the same columns, implicit columns are easier to read since there is less text on the screen. It's also less error prone to change just one line rather than numerous lines.
This feeds into my view of metaprogramming-like situations. Whenever the code-time-view of a program differs significantly from some runtime-state-view of the program, I think there should be a code-time way to view and perhaps edit both the code-time-view and some kind of runtime-state-view. A programmer shouldn't have to waste time digging through numerous files to evaluate what implementation slots into some dependency injected class, or find out what structure ends up in a python method parameter, or what a preprocessor directive ultimately produces. I know IDEs can handle some of these things, but I think better tools can be produced.
More concisely, instead of approaching code as the single and unchanging view of the program, perhaps it would help to approach code as something more dynamic. I have no concrete ideas as to how this would work.
I would argue that conceptually it's not too different from needing to know what tables/columns exist in datamodel.
I get that it is not common now to care about FKs when writing selects. But it could be. Tooling can be improved to help here. (show fks, autocomplete)
Btw. Everybody seems to concentrate on conciseness, but keep in mind that this helps also with query correctness.
I'm a small PostgreSQL contributor since 2010 myself. I probably can't write the whole patch myself, but if there is enough interest, and if we can work out the details and address the problems raised in this thread and elsewhere, I'm pretty confident we can do it. If we ever get there, the next step would be a reference implementation, probably in PostgreSQL, or to discuss a proposal in the SQL committee.
I'm often interested into what goes into changes to committee-driven standards.
To an outsider, proposing a change seems to require one to be part of a shady cabal of Big-5 employees, skilled in the art of hiding subtle, privacy-invading features into inscrutable, plain-text RFCs.
That or subjecting yourself to 30K+ what-abouters who deform your suggestion into something unrecognisable.
It's refreshing to see a straightforward, well-formatted proposal (even if I do slightly prefer the `FROM table1 x JOIN table2 y ON x.fk` syntax suggested in other comments).
> To an outsider, proposing a change seems to require one to be part of a shady cabal of Big-5 employees, skilled in the art of hiding subtle, privacy-invading features into inscrutable, plain-text RFCs.
I thought so too. Initially I just tried to get in contact with someone at the Swedish Institute for Standards (SIS), to see if it would be possible to send a proposal to someone in the SQL committee, which I thought was nearly impossible to become a member of. But as it turns out, SIS explained I could actually join the Swedish working group, and participate directly there, I just had to send in an application and get the approval from my employer, since there is a cost involved and you have to be a member via a company. Turns out ISO is a very open and democratic organization, just like Hacker News! :)
I think this proposal could take years until it land, if it ever does, in some form, if concerns can be addressed, but SQL is here to stay for a while, so that doesn't scare me.
I was following this proposal on the -hackers thread, glad to see it getting traction here. I find it really interesting to read that, it'll be even more interesting to see how it plays out longer term. I'd love to see some more progress made with the SQL language / syntax, and having someone in the community actively engaging with the committees seems like a great way forward.
> It's refreshing to see a straightforward, well-formatted proposal (even if I do slightly prefer the `FROM table1 x JOIN table2 y ON x.fk` syntax suggested in other comments).
This reminds me of KEY JOIN in Sybase which, unlike this syntax, does not require to specify the foreign key. I find Sybase's syntax to be too implicit, and challenging to port to other dialects; JOIN FOREIGN with explicit PK reference looks much better.
I wonder how it is expected to work with non-table references (views, CTEs, subqueries), especially when the columns involved in the foreign key (on either side) are not returned explicitly by the referenced object.
CREATE TABLE category (
id int GENERATED ALWAYS AS IDENTITY,
name text
);
CREATE TABLE post (
id int GENERATED ALWAYS AS IDENTITY,
category_id int
REFERENCES category (id)
ON DELETE CASCADE
);
That is, category.id rather than category.category_id.
But the USING clause doesn't work with that style, as far as I understand.
I've been quite happy having fully prefixed column names for a long time now. Makes joins easier, big views clearer, and random exports more readable out of the box. Also in my case it's also easier to line up unique column names with things like Clojure specs but I accept that's a niche concern.
Much cleaner, I agree, that's a big win. Will add that to the list of benefits.
Will also add a "Drawbacks / Remaining issues" section to the Gist, from all the valuable comments so far in this thread, thank you all, positive as well as negative comments, all very helpful.
It looked very useful and intuitive at first, but on further thought, I think the only time that you'd truly benefit from it if you SELECT *. For other SELECT queries with explicit field name list, you'd need to know the table the key constraint links to anyway.
SQL Views serve perfectly well for queries encouraged by the schema itself, and i think they are more sophisticated and practical way.
> The idea is the foreign key name is usually the same as the referenced table
I think this is where you are hitting some of your turbulence here because lots of ORMs / schema management tools actually generate completely cryptic fk names (sometimes based on the hash of the columns or similar). Personally I think weighing in legacy baggage like that too highly is a bad thing as it creates enormous inertia.
Thanks for explaining, I have no experience of ORMs, always written my queries manually. This was insightful. Thanks also for the word “inertia”, good one, will add to my vocabulary.
It's a great proposition. However, I would suggest having it as a prefix rather than postfix of the JOIN expression to reduce implementation side effects and avoid ambiguity between keyword 'FOREIGN' and a table called 'FOREIGN'.
Not something I’ve ever thought about but I think that it could be done via macro,
I have written tsql functions in c# and imagine that other dialects have analogous functionality.
It seems as though the point is to cut down on sql code, it’d also be possible to query the foreign keys and create a data structure that could feed a function for joins.
I haven’t thought out the specifics but think this type of approach would be more practical than changing the sql standard.
At that point you are just building a feature into an "ORM" of some sort. The entire point is to allow queries to be written in a more concise way for this simple equality join use case which is (total ballpark figure) ~half of the joins in my system.
Same here. I didn’t even realize that was an option. As a relative beginner to SQL, is there a real reason the JOIN syntax was needed at all beyond special inner/outer joins?
O'Reilly has an excellent pocket book on generic SQL titled "SQL Pocket Guide: A Guide to SQL Usage" (it has several editions). That book worked really well for me. It is surprisingly thorough for a pocket book. I highly recommend it.
Besides that, I highly recommend https://sqlite.org/ as a reference. It's syntax pages for SQL are fantastic. But to understand the language you need a better resource, and for my money that's the book mentioned above.
It doesn't enforce foreign keys. It uses foreign key definitions to simplify the query. It clarifies intent, and because FOREIGN KEYs generally are required to have appropriate indices, it makes it clear that there must be at least one suitable index for the JOIN.
Not a fan since proposal would do an implicit join on foreign keys rather than explicit join on the columns. If you know the schemas its great, but would add an extra step to check FKs on the tables otherwise
This is why we don't use `*` in production SQL. It's also why we try to avoid NATURAL JOIN. USING is OK if you only have two table sources, but you don't always get to have same-name columns on both tables.
This syntax seems pretty clear and explicit, but with an indirection. Indirection != implicit. The intent is quite explicit.
Using ON also is problematic in that you might think while reading a query that the JOIN is on FOREIGN KEY columns, but... maybe not -- without looking at the schema, you can't tell. JOIN FOREIGN has a similar problem: the intent is crystal clear, but now you have to go look at the schema if you want to know which columns that refers to. Normally one would write a comment on the ON, but comments can rot.
Now, making SQL more expressive isn't necessarily a good thing. SQL is already very expressive. But making it more expressive in ways that yield clearer queries is definitely worth considering.
One very nice aspect of JOIN FOREIGN is that because RDBMSes generally require corresponding indices on those columns to optimize ON UPDATE / ON DELETE constraint processing, seeing "JOIN FOREIGN" in a query instantly lets you know that there must be an appropriate index, while ON might be causing a full table scan or query materialization and you'd have to examine the query plan carefully.
An important constraint on SQL is that a query must run, and produce correct results, relying only on the structure and content of tables. Indexes (can) make queries faster but must not inhibit, or be required for, correctness. The same is true of primary key constraints, foreign key constraints, check constraints, defaults, triggers, partitioning, whether a table is heap/clustered, and literally every other implementation detail of the RDBMS.
This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.
More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.
So no, let's not do this.