Implement Foreign Scan With FDW Interface API
Recently I have been tasked to familiarize myself with the Foreign Data Wrapper (FDW) interface API to build a new FDW capable of doing vertical / columnar sharding, meaning that the FDW is capable of collecting column information from multiple sources and combine them together as a result query. I will document and blog about the vertical sharding in later posts. Now, in this blog, I would like to share some of the key findings and understanding of FDW interface related to foreign scan.
2. Scan-related API Routines
FDW foreign scan API routines have some similarities to the table access method API’s sequential scan routines. You can learn more about it in the blog post here. Most noticeable difference is that the FDW routines require us to take part during the planning stage, so we have to come out with a foreign plan for the exectuor so it knows what and how to perform the foreign scan.
In order to complete a foreign scan, the following FDW routines are invoked in order:
In this routine, you have to tell PG whether or not the foreign scan can support parallel Scan by returning true to indicate it is supported, or false otherwise. Parallel scan is invokved by the planner when a scan query involves retrieving a large amount of data. Depending on the foreign server and your implementation, you have to decide whether or not parallel scan is supported.
Similar to the table access method’s
relation_estimate_size() API, this routine requires you to provide an estimate of the amount of tuples/rows would be involved during this round of scanning. This information is very important because it directly affects the planner’s decision to execute the query. For example, if you say foreign relation has 1 million tuples to be fetched, then the planner is most likely to use parallel scan to complete the scan for performance reasons if you answer “true” in the
IsForeignScanParallelSafe() function call.
In this routine, we are required to provide a list of possible paths to complete the required foreign scan. Paths simply means ways to execute. For example, you can complete a foreign scan by doing a sequential scan by fetching 1 tuple at a time, or you can complete the same scan using index to look-up a target tuple if the table has an index defined and user gives specific
WHERE clause to pinpoint a tuple. If user provides a
WHERE clause to filter the result, you also have an option to estimate whether the
WHERE can be executed by the foreign server or by the local server as it fetches a tuple.
All these are possible
pathnodes are required by the planner to decide on the optimal path to execute the scan. Each
pathnode requires you to provide the
number of rows and
path keys if any. The official documentation suggests that you should use
create_foreignscan_path routines to prepare a list of path nodes.
Having provided all the possible
pathnodes, planner will pick one that is the most optimal and pass that
pathnode to this routine and ask you to make a foreign scan plan for the executor. This is a very important function because whatever plan that we provide here directly affects how the executor is going to function, so be cautious here.
In this function, the planner will give you everything about the query such as the target attribute list and the restricting clauses (
WHERE clauses) and also the information about the foreign relation, such as its OID, and each column’s data types. Official PostgreSQL documentation suggests using
make_foreignscan() routine with the desired arguments to create the foreign plan.
In here, you also will have an option to
push down the restricting clauses or not. In other words, you can choose to send all the
WHERE clauses to the foreign server to process if it can support it. This results in much less data communication. You can also choose having all the
WHERE clauses to be processed locally, but this requires the FDW to fetch every single row from the foreign server, resulting in more data communication. This decision is controlled by the
List *qpqual argument to the
make_foreignscan() function. If this list contains the list of your restricting clauses, the local server will perform the
WHERE evaluation locally; if this list is empty, then the local server assumes the FDW will do the
WHERE evaluation on the foreign server
This routine will be called before starting the foreign scan, giving you a chance to allocate all the necessary control information required for your foreign scan. You can define your own control information, for example, cursor location, remote connection structure, current offset…etc. and store in
node->fdw_state and it will be passed down to the
IterateForeignScan as the scan is begin completed.
This is the main routine to perform the actual scan. You need to put the proper logic here depending on your use case to fetch one or more rows from the foreign server. When you have the row data, you will need to convert that to a
TupleTableSlot strucutre in which the PG internals can understand. We normally use
ExecStoreHeapTuple() routine to convert a HeapTuple into a
TupleTableSlot. This routine will be continuously called by the PG executor as long as you still have data to fetch and it will only stop once you return an empty
This routine will be called at the end of foreign scan, giving you the opportunity to clean up the control information allocated in
BeginForeignScan(). This marks the end of foreign scan via FDW.
There is a lot of information and articles out there related to PostgreSQL’s FDW API interface. This blog is merely a quick summary of what I have learned during the FDW evaluation and there is much more to what’s discussed here. For more detail, you can refer to the official documentation about the FDW callback functions here