Apache Calcite is a query optimizer that does not have a storage system or any other aspect related to RDBMS. That way, it can be attached to engines like Cassandra and MySQL.
Today, we are going to clone it, introduce it with a CSV example, then we will try to build a basic tree.
First we used calcite’s examples, after that. we tried to create our own
The Setup
First, we clone it like(Git installation https://github.com/git-guides/install-git):
cd d:\\your_directory
#now we are in so we clone
git clone <https://github.com/apache/calcite.git>
# after that, in that directory we can build using the gradle wrapper, this is going
# to take a while
./gradlew build
After that, we will have our Calcite installed. Now, we need Java 8 - 21 to run Calcite and IntelliJ IDEA for the demo.
Now, reset git and enter this directory:
cd d:\\yourdirectory\\calcite\\example\\csv
# some of the versions might not have it so make sure you have that directory while trying
# to run this demo
# after that we need to run sqlline, which is a shell a command line that is used
# to run sql in JDBC connections here it is installed already, but if you wan the githyb
# <https://github.com/julianhyde/sqlline>
./sqlline
Now, here we can run SQL queries on CSV files using calcite as a query optimizer,
Calicte has these things called adapters, which are basically a connection to the source(database, here it is a CSV file, but I could be anything, like MySQL), so we could run queries on it.
So, these are the connections to the source, meaning we have to connect, and to do that, we need:
!connect jdbc:calcite:model=src/test/resources/model.json admin admin
The connect is a connect command line
JDBC is used to specify the kind of adapter to use, here it is calcite, and then we specify the path to the file, here it is a JSON file created by Apache, which is the model.json
Now we can run queries, like:
0: jdbc:calcite:model=src/test/resources/mode> select* from emps;
+-------+-------+--------+--------+---------------+-------+------+---------+---+
| EMPNO | NAME | DEPTNO | GENDER | CITY | EMPID | AGE | SLACKER | M |
+-------+-------+--------+--------+---------------+-------+------+---------+---+
| 100 | Fred | 10 | | | 30 | 25 | true | f |
| 110 | Eric | 20 | M | San Francisco | 3 | 80 | | f |
| 110 | John | 40 | M | Vancouver | 2 | null | false | t |
| 120 | Wilma | 20 | F | | 1 | 5 | | t |
| 130 | Alice | 40 | F | Vancouver | 2 | null | false | t |
+-------+-------+--------+--------+---------------+-------+------+---------+---+
5 rows selected (0.978 seconds)
So, we ran a regular select* on the EMPS table, and we got these, so we have tables, rows, etc.
There are a lot of SQL commands we can run, like:
0: jdbc:calcite:model=src/test/resources/mode> explain plan for select* from emps;
+---------------------------------------------+
| PLAN |
+---------------------------------------------+
| EnumerableTableScan(table=[[SALES, EMPS]])
|
+---------------------------------------------+
1 row selected (0.024 seconds)
So, here we asked for the estimated execution plan, and we got
For more tutorials like this, check out the docs.
And, for more on the basic architecture of Calcite.
And, for more on query optimizers, check out our blog here
RelBuilder: building a logical tree
Now, in the previous example, we had an engine built for us, what if we wanted to build a logical tree
One of the key blocks of that is logical operators, now RelBuilder is a simplified way of showing how it does that,
Now in calicte\core\src\test\java\org.apache.calcite\examples\RelBuilderExample
We can find the docs demo(git)
There, they did like the following:
package org.apache.calcite.examples;
//First it starts with the imports
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.test.RelBuilderTest;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.RelBuilder;
We have:
- RelOptUtil: gives us the tools to troubleshoot and see the plans
- RelNode: the class that contains the nodes of the logical operators
- JoinRelType: we need it to specify the join type like inner or outer etc.
- SqlStdOperatorTable: contains the standard operators, like equality operators, for example, “=”
- RelBuilderTest: contains the in-memory table that we need, here it is the SCOTT database
- FrameworkConfig: it is what mixes these together, the glue
- RelBuilder: a way to connect the nodes created by RelNode
After that, it declares the class for the example:
public class RelBuilderExample {
private final boolean verbose;
So, if it is true, it would output it, which is what we want here, since we want to see the relational expressions
public RelBuilderExample(boolean verbose) {
this.verbose = verbose;
}
In this constructor, it defines that the variable verbose should equal this verbose; nothing new here
public static void main(String[] args) {
new RelBuilderExample(true).runAllExamples();
}
Now, here they create a method that could use command-line arguments,
This method calls the RelBuilderExample we created in the previous line, and sets verbose = true, which means it will run all the examples.
After that, it calls another method, runAllExamples. which would pass the examples
public void runAllExamples() {
final FrameworkConfig config = RelBuilderTest.config().build();
final RelBuilder builder = RelBuilder.create(config);
for (int i = 0; i <= 4; i++) {
doExample(builder, i);
final RelNode node = builder.build();
if (verbose) {
System.out.println(RelOptUtil.toString(node));
}
}
}
Now, here it declares the method we saw before, runAllExamples()
Now here it is, calling FrameworkConfig where it configures the RelBuilder stuff using the RelBuilderTest and builds it.
Then it calls the RelBuilder, where builder = calling it using create, where the schema is coming from the config
Now, here it loops for each example0, exmaple1 which we can see as a parameters(builder,i),
After that, we see that it calls RelNode to make sure that it would create a node for each logical operator
if we have the verbose true, now here it prints out the utility we imported, specifically the method to string, so we could see the logical operators that we have built:
private RelBuilder doExample(RelBuilder builder, int i) {
switch (i) {
case 0:
return example0(builder);
case 1:
return example1(builder);
case 2:
return example2(builder);
case 3:
return example3(builder);
case 4:
return example4(builder);
default:
throw new AssertionError("unknown example" + i);
}
}
Now here it calls doExample with the parameters builder and i, based on each i value
If there is no example, the verbose gets false, and no printing
If there is an example, but the example is not defined here, then it outputs the unknown example
Caclite’s first example:
/* Creates a relational expression
* equivalent to
* <blockquote><pre>VALUES(1,TRUE),
* (NULL,FALSE)</pre></blockquote>
* this is our fix, we submitted that as a pull request
*/
private RelBuilder example0(RelBuilder builder) {
return builder
.values(new String[] {"a", "b"}, 1, true, null, false);
}
Here it calls RelBuilders, it defines the Example, inputs RelBuilder builder we defined as create previously as a parameter, it returns the builder, so whatever we create we would see as output, which would be printed if it fits the conditions we provided previously
Now here we are defining example0 as a call to the method .values then we are defining the columns to the call as a and b, which has 1 and null for a, and true and false for b
this equals
values(1,true),(null,false)
If you run the program, you can see the output as:
LogicalValues(tuples=[[{ 1, true }, { null, false }]])
So, you can see the logical operator for this,
Calcite’s example1: creating a table scan
private RelBuilder example1(RelBuilder builder) {
return builder
.scan("EMP");
}
So, it calls the RelBuilder, creates example1, inputs RelBuilder builder, it outputs the builder with a call to the scan with the table EMP as a parameter,
We could see in the output the following:
LogicalTableScan(table=[[scott, EMP]])
This equals:
select * from EMP
Calcite’s example2: the projection
The projection filters the table scan, the example:
private RelBuilder example2(RelBuilder builder) {
return builder
.scan("EMP")
.project(builder.field("DEPTNO"), builder.field("ENAME"));
}
It does the same as before, but here it selects DEPTNO and ENAME, meaning it filters the results by these, meaning:
select DEPTNO,ENAME FROM EMP
Calcite’s example3: group by, simple aggregates, and having
private RelBuilder example3(RelBuilder builder) {
return builder
.scan("EMP")
.aggregate(builder.groupKey("DEPTNO"),
builder.count(false, "C"),
builder.sum(false, "S", builder.field("SAL")))
.filter(
builder.call(SqlStdOperatorTable.GREATER_THAN, builder.field("C"),
builder.literal(10)));
}
SO:
- It builds a scan
- Two aggregates, one count and one sum, each of them had an alias, for the first is C and the Second is S
- The groupKey = group by DEPTNO
- The builder.field SAL, salaries meaning the count and the sum, where salary is grouped by deptno
- Now the Filter is the having clause, it gets from SqlStdOperatorTable > and builds based on the count or C, so here it filters by C
- where the count> 10 or build.literal(10)
So it equals:
select
deptno
count(*) c,
sum(sal) s,
from
EMP
group by deptno
having c>10
Calcite’s example4: Joins
private RelBuilder example4(RelBuilder builder) {
final RelNode left = builder
.scan("EMP")
.scan("DEPT")
.join(JoinRelType.INNER, "DEPTNO")
.build();
final RelNode right = builder
.scan("EMP")
.scan("BONUS")
.join(JoinRelType.INNER, "ENAME")
.build();
return builder
.push(left)
.push(right)
.join(JoinRelType.INNER, "ENAME");
}
}
So:
- On left, it called a node for a join
- Scans both tables
- Then an inner join between them based on the column DEPTNO
- The same goes for the right
- Then it joins both left and right in the new builder variable inside the builder
- joins both on ENAME
equals this:
select *
from (emp join dept on deptno) as left
join (emp join bonus on ename) as right
on left.ename = right.ename;
And the output was like:
LogicalJoin(condition=[=($1, $12)], joinType=[inner])
LogicalJoin(condition=[=($7, $8)], joinType=[inner])
LogicalTableScan(table=[[scott, EMP]])
LogicalTableScan(table=[[scott, DEPT]])
LogicalJoin(condition=[=($1, $8)], joinType=[inner])
LogicalTableScan(table=[[scott, EMP]])
LogicalTableScan(table=[[scott, BONUS]])
or something like that, and right and left here are internal, they are replaced by $etc. here, that we can see, but if we want to see the convention that we want, we need a new import, which complicates a complicated example
So here, for each, it substitutes with an ordinal reference indexed based on the node number for each, which is $1, and the other stuff
What if we want to create our own examples:
Re-Creating RelBuilderExample: RelBuilderExample1
Copy and paste this file(https://github.com/sulees/calcite-relbuilder-examples/blob/main/RelBuilderExample1.java) in a new class you create in the same directory as RelBuilderExample, name the class RelBuilderExample1.
Now, here we did the same:
public class RelBuilderExample1 {
private final boolean verbose;
So, we created a class that has a verbose instance variable, meaning it would output if it is true, we close it at the end
public RelBuilderExample1(boolean verbose) {
this.verbose = verbose;
}
Now, here we create a method that accepts the parameter where it is this, one not the instance variable
public static void main(String[] args) {
new RelBuilderExample1(true).runAllExamples();
}
Now, here we say that it if it is true, then call the runAllExamples method
public void runAllExamples() {
final FrameworkConfig config = RelBuilderTest.config().build();
final RelBuilder builder = RelBuilder.create(config);
for (int i = 0; i <= 4; i++) {
RelNode node = doExample(builder, i).build();
if (verbose) {
System.out.println(RelOptUtil.toString(node));
}
}
}
Now, here we are gluing the content together, We are calling the config from the FrameworkConfig to build RelBuilderTest and configure, so we could have our in-memory tables and schema to build the logical operators on
Then, we are configuring the builder that will be our parameter for the examples. This would be our logical operator. We create based on config
After that, we are defining a loop that adds one for each value, and calls the doExample method, by passing the builder and i as parameters, where the builder references the example
And, if it is verbose = true, then it outputs the logical operators for the node
private RelBuilder doExample(RelBuilder builder, int i) {
switch (i) {
case 0: return example0(builder);
case 1: return example1(builder);
case 2: return example2(builder);
case 3: return example3(builder);
case 4: return example4(builder);
default: throw new AssertionError("Unknown example: " + i);
}
}
Now, here we pass the switch, which would define the example0 or whatever that we want to build using the above-created setup
So the doExample would be called for each of these examples, otherwise it would either stop due to verbose = false, or it would throw Unknown example.
Select* from bonus
// now here we have our first example simple select * from BONUS
private RelBuilder example0(RelBuilder builder) {
return builder.scan("BONUS");
}
So, a call, for example0, that passes the builder we configured above as a parameter, which would return the logical operator table scan of bonus, if you run it you would get:
LogicalTableScan(table=[[scott, BONUS]])
So far, so good
Where clause
//here we are scanning and filtering here, as you can see SELECT * FROM EMP WHERE SAL > 2000;
//as you can see here, the filter is not based on the aggregate's column but rather the SAL column itself
private RelBuilder example1(RelBuilder builder) {
return builder
.scan("EMP")
.filter(
builder.call(
SqlStdOperatorTable.GREATER_THAN,
builder.field("SAL"),
builder.literal(2000)
)
);
}
//here we just changed the operators
private RelBuilder example2(RelBuilder builder) {
return builder
.scan("EMP")
.filter(
builder.call(
SqlStdOperatorTable.LESS_THAN,
builder.field("SAL"),
builder.literal(2000)
)
);
}
As you can see, call the builder for a scan and a filter, where in there we call the call method to specify the inequality operators, the field that we want to reference, and the literal we want to compare to
We did it with < and> so we got these
LogicalFilter(condition=[>($5, 2000)])
LogicalTableScan(table=[[scott, EMP]])
LogicalFilter(condition=[<($5, 2000)])
LogicalTableScan(table=[[scott, EMP]])
$5 is a reference to the column SAL as indexed from zero by the engine, we said before we can change the naming convention, but it would complicate the example
A join
//now here we are joining two scans, and we projecting two fields which would be like:
// SELECT ENAME, DNAME FROM EMP INNER JOIN DEPT ON DEPT.DEPTNO = EMP.DEPTNO
private RelBuilder example3(RelBuilder builder) {
return builder
.scan("EMP")
.scan("DEPT")
.join(JoinRelType.INNER, "DEPTNO")
.project(
builder.field("ENAME"),
builder.field("DNAME")
);
}
So we call the RelBuilder builder, it calls the scan twice and joins them on DEPTNO while only projecting two columns
We get the following:
LogicalProject(ENAME=[$1], DNAME=[$9])
LogicalJoin(condition=[=($7, $8)], joinType=[inner])
LogicalTableScan(table=[[scott, EMP]])
LogicalTableScan(table=[[scott, DEPT]])
group by and a having based on an aggregate
SELECT DEPTNO, COUNT(*) AS EMP_COUNT
FROM EMP
GROUP BY DEPTNO
HAVING COUNT(*) > 3;
private RelBuilder example4(RelBuilder builder) {
return builder
.scan("EMP")
.aggregate(
builder.groupKey("DEPTNO"),
builder.count(false, "EMP_COUNT")
)
.filter(
builder.call(
SqlStdOperatorTable.GREATER_THAN,
builder.field("EMP_COUNT"),
builder.literal(3)
)
);
}
So, we call the builder, that scans EMP
Then, it counts with an alias called EMP_COUNT, while grouping by the group key which is DEPTNO
Which is not selected here, but it would not change much
After that, it filters based on the alias, meaning it is a having clause, so, the where and the having have the same method as logical operators Filter, then we start specifying the details under the call method
The literal would be 3, the field is the alias, and the inequality operator is >.
And with that we finish this blog.