How (not) to write a patch for PostgreSQL

What

In this quick and dirty article, we will be describing a syntax and implementing the new syntax in postgresql backend. The syntax to be implemented is CORRESPONDING clause, which is referenced in the SQL20nn standard draft as “Feature F301, CORRESPONDING in query expressions”.

CORRESPONDING clause can be seen as a modifier for set operations in SELECT statements. It filters the projected columns to either match the same column names, or match the column names with a list (CORRESPONDING BY).

Note: [] denotes optionality

query1 and query2 are select statements.

 

query1 UNION [ ALL ] [ CORRESPONDING [ BY ( column_list ) ] ] query2

query1 EXCEPT [ ALL ] [ CORRESPONDING [ BY ( column_list ) ] ] query2

query1 INTERSECT [ ALL ] [ CORRESPONDING [ BY ( column_list ) ] ] query2

  • CORRESPONDING returns all columns that are in both query1 and query2 with the same name.
  • CORRESPONDING BY returns all columns in the column_list that are also in both query1 and query2 with the same name.

We will use UNION operator, INTERSECT and EXCEPT works the same with CORRESPONDING clause. Here are some examples to clarify what CORRESPONDING means.

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING SELECT 4 a, 5 b, 6 c;
 a | b | c 
---+---+---
 1 | 2 | 3
 4 | 5 | 6
(2 rows)

Now the fun part, we have removed column b from query2:

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING SELECT 4 a, 6 c;
 a | c  
---+--- 
 1 | 3 
 4 | 6 
(2 rows)

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING SELECT 4 d; ERROR: UNION queries with a CORRESPONDING clause must have

at least one column with the same name

The idea with CORRESPONDING is, unmatched columns are removed. If no columns left to project, then it is a syntax error.

For the CORRESPONDING BY(column_list) part, one more step of filtering is done to ensure that columns in the column_list are present in query1 and query2.

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(a, b, c) SELECT 4 a, 5 b, 6 c;
 a | b | c 
---+---+---
 1 | 2 | 3
 4 | 5 | 6
(2 rows)

We remove a column name from column_list:

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(a, c) SELECT 4 a, 5 b, 6 c;
 a | c 
---+---
 1 | 3
 4 | 6
(2 rows)

Although we can suggest that CORRESPONDING BY is more useful with * queries as in:

SELECT * FROM t1 UNION CORRESPONDING BY (a, b) SELECT * FROM t2;

Why

There is a single main reason why I pursued to write a patch for CORRESPONDING in postgresql, I decided to delve into postgresql code base with a simple project and it was in the TODO list. As it turns out it is not simple, and I cannot go without the saying “If it is in the TODO list, it ain’t simple”.

Development Environment

After setting up a vmware virtual machine with Pardus Linux installed, complete all the updates and make sure you have at least the following software configuration.

  • Pardus Linux 2011.2 (kernel 2.6.37.6 i686 GNU/Linux)
  • gcc 4.5.3 (compiler)
  • flex 2.5.35 (lexer)
  • bison 2.4.3 (parser generator)
  • autotools
  • gdb
  • ccache (compiler cache)
  • git (version control)
  • eclipse CDT (IDE)

For eclipse configuration please see: http://wiki.postgresql.org/wiki/Working_with_Eclipse

For other details about setting up an environment for postgresql development, please see http://wiki.postgresql.org/wiki/Developer_FAQ

How

In this section, we will touch to the very basics of postgres query execution relating to the execution of a set operation(UNION, INTERSECT, EXCEPT)

  1. Parser
    1. Raw Parser – Generate a raw parse tree from the sql statement.
    2. Analyzer – Generate a query tree from a raw parse tree. (This is what we will modify mostly)
  2. Optimizer – Choose an efficient plan.
  3. Executor – Execute the chosen plan.

How Exactly

There are two simple paths of implementing the CORRESPONDING clause, one is modifying the query tree such that query1 and query2 are wrapped in subqueries, and the other is modifying the way that UNION works which is a big deal involving many parts of the code from analyzer to optimizer. We chose to implement it by modifying the query tree which I will describe in a moment.

  1. Modify parser to accept new syntax.
    1. Introduce keywords.
    2. Generate a relevant parse tree for the set operation.
    3. Add metadata about CORRESPONDING operation to node structures in the raw parse tree. (column_list, is it CORRESPONDING or CORRESPONDING BY, etc.)
  2. Update analyzer to process the new syntax.
    1. While processing the set operation, verify validity of the CORRESPONDING semantics.
      1. Does the columns exist in both tables
      2. Is there any non-existent column names in column_list.
      3. Is output projection non-empty, etc.
    2. If CORRESPONDING clause is valid, modify query1 and query2.
  3. Write documentation.
  4. Write regressions.
New Syntax

Modifying parser is the easiest part, you would think. What is the big deal? We are only going to add a keyword, what could go wrong? Well, keep on.

backend/parser/gram.y

gram.y is a bison grammar file. Bison produces a c file from this input. It’s syntax is intuitive and you will feel familiar if you check a few SQL statements you know by heart and inspect their gram.y counterparts. For the hasty it gram.y looks like this:

GrantStmt:	GRANT privileges ON privilege_target TO grantee_list
			opt_grant_grant_option
				{
					GrantStmt *n = makeNode(GrantStmt);
					n->is_grant = true;
					n->privileges = $2;
					n->targtype = ($4)->targtype;
					n->objtype = ($4)->objtype;
					n->objects = ($4)->objs;
					n->grantees = $6;
					n->grant_option = $7;
					$$ = (Node*)n;
				}
		;

Here, uppercase words are keywords, lower case words are further definitions. statement between {} is the raw parse tree description of GrantStmt. privileges, privilege_target, grantee_list, opt_grant_grant_option are defined below this definition of GrantStmt.

Between the curly braces, we refer to the positional arguments of the statement.

n->privileges = $2;

$2 refers to privileges in the first line for example.

Adding CORRESPONDING to gram.y

To use CORRESPONDING BY as a valid clause, we need to add CORRESPONDING and BY keywords to the parser. BY keyword is already used in GROUP BY, ORDER BY etc. so we only need to add CORRESPONDING token.

*** 498,504 ****
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
  	CLUSTER COALESCE COLLATE COLLATION COLUMN COMMENT COMMENTS COMMIT
  	COMMITTED CONCURRENTLY CONFIGURATION CONNECTION CONSTRAINT CONSTRAINTS
! 	CONTENT_P CONTINUE_P CONVERSION_P COPY COST CREATE
  	CROSS CSV CURRENT_P
  	CURRENT_CATALOG CURRENT_DATE CURRENT_ROLE CURRENT_SCHEMA
  	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
--- 499,505 ----
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
  	CLUSTER COALESCE COLLATE COLLATION COLUMN COMMENT COMMENTS COMMIT
  	COMMITTED CONCURRENTLY CONFIGURATION CONNECTION CONSTRAINT CONSTRAINTS
! 	CONTENT_P CONTINUE_P CONVERSION_P COPY CORRESPONDING COST CREATE
  	CROSS CSV CURRENT_P
  	CURRENT_CATALOG CURRENT_DATE CURRENT_ROLE CURRENT_SCHEMA
  	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE

Add CORRESPONDING to select statement as a valid clause.

*** 8489,8507 ****
  					n->fromClause = list_make1($2);
  					$$ = (Node *)n;
  				}
! 			| select_clause UNION opt_all select_clause
  				{
! 					$$ = makeSetOp(SETOP_UNION, $3, $1, $4);
  				}
! 			| select_clause INTERSECT opt_all select_clause
  				{
! 					$$ = makeSetOp(SETOP_INTERSECT, $3, $1, $4);
  				}
! 			| select_clause EXCEPT opt_all select_clause
  				{
! 					$$ = makeSetOp(SETOP_EXCEPT, $3, $1, $4);
  				}
  		;

  /*
   * SQL standard WITH clause looks like:
--- 8490,8514 ----
  					n->fromClause = list_make1($2);
  					$$ = (Node *)n;
  				}
! 			| select_clause UNION opt_all opt_corresponding_clause select_clause
  				{
! 					$$ = makeSetOp(SETOP_UNION, $3, $4, $1, $5);
  				}
! 			| select_clause INTERSECT opt_all opt_corresponding_clause select_clause
  				{
! 					$$ = makeSetOp(SETOP_INTERSECT, $3, $4, $1, $5);
  				}
! 			| select_clause EXCEPT opt_all

opt_corresponding_clause

 select_clause
  				{
! 					$$ = makeSetOp(SETOP_EXCEPT, $3, $4, $1, $5);
  				}
  		;

Now we need to define opt_corresponding_clause.

 opt_corresponding_clause:
 			CORRESPONDING BY '(' expr_list ')'		{ $$ = $4; }
 			| CORRESPONDING							{ $$ = list_make1(NIL); }
 			| /*EMPTY*/								{ $$ = NIL; }
 			;

We have modified the parameter list of makeSetOp. Let’s go over there.

*** 12642,12648 ****
  }

  static Node *

! makeSetOp(SetOperation op, bool all, Node *larg, Node *rarg)

  {
  	SelectStmt *n = makeNode(SelectStmt);

--- 12649,12655 ----
  }

  static Node *

! makeSetOp(SetOperation op, bool all, List *correspondingClause, Node *larg, Node *rarg)

  {
  	SelectStmt *n = makeNode(SelectStmt);

***************
*** 12650,12655 ****
--- 12657,12663 ----
  	n->all = all;
  	n->larg = (SelectStmt *) larg;
  	n->rarg = (SelectStmt *) rarg;

+ n->correspondingClause = correspondingClause;

  	return (Node *) n;
  }

Now we have used SelectStmt->correspondingClause, let’s add that to parsenodes.h

include/nodes/parsenodes.h
***************
*** 1006,1011 ****
--- 1006,1013 ----
  	/*
  	 * These fields are used only in "leaf" SelectStmts.
  	 */

+ List *correspondingClause; /* NULL, list of CORRESPONDING BY exprs, or */ + /* lcons(NIL, NIL) for CORRESPONDING */

Code will compile, but there is one tidbit missing. Postgres will shout an error to you about unrecognized keywords if you use a CORRESPONDING query.

You have to add all new keywords to kwlist.h

*** 94,99 ****
--- 94,100 ----
  PG_KEYWORD("continue", CONTINUE_P, UNRESERVED_KEYWORD)
  PG_KEYWORD("conversion", CONVERSION_P, UNRESERVED_KEYWORD)
  PG_KEYWORD("copy", COPY, UNRESERVED_KEYWORD)

+ PG_KEYWORD(“corresponding”, CORRESPONDING, UNRESERVED_KEYWORD)

  PG_KEYWORD("cost", COST, UNRESERVED_KEYWORD)
  PG_KEYWORD("create", CREATE, RESERVED_KEYWORD)
  PG_KEYWORD("cross", CROSS, TYPE_FUNC_NAME_KEYWORD)

[/code]


At this point we can compile postgres and see that it dismisses the CORRESPONDING or CORRESPONDING BY from a query without an error.

Adding meaning to CORRESPONDING, in analyze.c

In transformSetOperationTree function, we determine the output column names, and verify their existence in tables of both sides. Here CORRESPONDING part is given. CORRESPONDING BY is similar.

 		else if(linitial(stmt->correspondingClause) == NULL)
 		{
 			// CORRESPONDING clause, find matching column names from both tables. If there are none then it is a syntax error.

 			Query	*largQuery;
 			Query	*rargQuery;
 			List	*matchingColumns;

 			/* Analyze left query to resolve column names. */
 			largQuery = parse_sub_analyze((Node *) stmt->larg, pstate, NULL, false);

 			/* Analyze right query to resolve column names. */
 			rargQuery = parse_sub_analyze((Node *) stmt->rarg, pstate, NULL, false);

 			/* Find matching columns from both queries. */
 			matchingColumns = determineMatchingColumns(largQuery->targetList,
 													   rargQuery->targetList);

 			op->correspondingColumns = matchingColumns;
 			op->hasCorrespondingBy = false;

 			/* If matchingColumns is empty, there is an error. At least one column in the select lists must have the same name. */
 			if(list_length(matchingColumns) == 0)
 			{
 				ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR),
 								errmsg("%s queries with a CORRESPONDING clause must have at least one column with the same name",
 								context)));
 			}

 			// Create subquery for larg, selecting column names from matchingColumns.
 			stmt->larg = createSubqueryForCorresponding(matchingColumns, stmt->larg);

 			// Assign newly generated query to original left query.
 			op->larg = transformSetOperationTree(pstate, stmt->larg,
 												 false,
 												 <argetlist);

 			// Create subquery for rarg, selecting column names from matchingColumns.
 			stmt->rarg = createSubqueryForCorresponding(matchingColumns, stmt->rarg);

 			// Assign newly generated query to original right query.
 			op->rarg = transformSetOperationTree(pstate, stmt->rarg,
 												 false,
 												 &rtargetlist);
 		}

To create a subquery with given column names and a query to me used as a subquery, we use createSubqueryForCorresponding function. This function returns a SelectStmt for the following statement:

SELECT outputColumns FROM (main_arg)
  /*
  * Returns a subquery selecting outputColumns from main_arg.
  * main_arg is modified and returned.
  */
 static SelectStmt *
 createSubqueryForCorresponding(List* outputColumns, SelectStmt* main_arg)
 {
 	ColumnRef *cr;
 	ResTarget *rt;
 	SelectStmt *n;

 	RangeSubselect * rss;
 	ListCell* mctl;

 	n = makeNode(SelectStmt);
 	n->targetList = NIL;
 	foreach(mctl, outputColumns)
 	{
 		TargetEntry *mctle = (TargetEntry *) lfirst(mctl);

 		cr = makeNode(ColumnRef);
 		cr->fields = list_make1(makeString(mctle->resname));
 		cr->location = -1;

 		rt = makeNode(ResTarget);
 		rt->name = NULL;
 		rt->indirection = NIL;
 		rt->val = (Node *)cr;
 		rt->location = -1;

 		n->targetList = lappend(n->targetList, rt);
 	}

 	rss = makeNode(RangeSubselect);

 	// XXX makeAlias alias name should be empty??
 	rss->alias = makeAlias("", NULL);
	rss->subquery = (Node *)main_arg;

 	n->fromClause = list_make1(rss);

 	main_arg = n;

 	return main_arg;
 }
Documentation and Regression

For documentation, please see changes in

  • doc/src/sgml/queries.sgml
  • doc/src/sgml/sql.sgml

For regression, please see the excellent article from postgresql, and see the changes in

  • src/test/regress/sql/corresponding_union.sql
  • src/test/regress/serial_schedule
  • src/test/regress/parallel_schedule
  • src/test/regress/expected/corresponding_union.out

The (not) part

This is the main reason the patch was not accepted by postgresql mailing list community. When we modify the parse tree in the analyzer.c, we modify it for good. Optimizer, planner, executor never sees a CORRESPONDING clause, they only see the subquery we generated.

=# CREATE VIEW v1 AS

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(a, c) SELECT 4 a, 5 b, 6 c

;
CREATE VIEW
=# select * from v1;
 a | c 
---+---
 1 | 3
 4 | 6
(2 rows)

=# select * from pg_views WHERE viewname = 'v1';
 schemaname | viewname | viewowner |                                                                  definition                                                                  
------------+----------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------
 public     | v1       | kerem     |

SELECT alias.a, alias.c FROM (SELECT 1 AS a, 2 AS b, 3 AS c) alias UNION SELECT alias.a, alias.c FROM (SELECT 4 AS a, 5 AS b, 6 AS c) alias

;
(1 row)

Patch File