And learn that California has 69 cities of 100k or more
comprising almost 1/2 of the state’s population:
+---------------------+----------------------+
| City Count | Pct State Population |
+---------------------+----------------------+
| 69 | 48.574217177106576 |
+---------------------+----------------------+
For simple file formats such as CSV, the file is self-describing and
you don’t even need a model.
See CSV files and model-free browsing.
A simple example
Let’s start with a simple example. First, we need a
model definition,
as follows.
Schemas are defined as a list of tables, each containing minimally a
table name and a url. If a page has more than one table, you can
include in a table definition selector and index fields to specify the
desired table. If there is no table specification, the file adapter
chooses the largest table on the page.
EMPS.html contains a single HTML table:
The model file is stored as file/src/test/resources/sales.json,
so you can connect via sqlline
as follows:
Mapping tables
Now for a more complex example. This time we connect to Wikipedia via
HTTP, read pages for US states and cities, and extract data from HTML
tables on those pages. The tables have more complex formats, and the
file adapter helps us locate and parse data in those tables.
Tables can be simply defined for immediate gratification:
And subsequently refined for better usability/querying:
Connect and execute queries, as follows.
Note that Cities is easier to consume than RawCities,
because its table definition has a field list.
Field definitions may be used to rename or skip source fields, to
select and condition the cell contents and to set a data type.
Parsing cell contents
The file adapter can select DOM nodes within a cell, replace text
within the selected element, match within the selected text, and
choose a data type for the resulting database column. Processing
steps are applied in the order described and replace and match
patterns are based on
Java regular expressions.
Further examples
There are more examples in the form of a script:
(When running webjoin.sql you will see a number of warning messages for
each query containing a join. These are expected and do not affect
query results. These messages will be suppressed in the next release.)
CSV files and model-free browsing
Some files describe their own schema, and for these files, we do not need a model. For example, DEPTS.csv has an
integer DEPTNO column and a string NAME column:
You can launch sqlline, and pointing the file adapter that directory,
and every CSV file becomes a table:
JSON files and model-free browsing
Some files describe their own schema, and for these files, we do not need a model. For example, DEPTS.json has an integer DEPTNO column and a string NAME column:
You can launch sqlline, and pointing the file adapter that directory,
and every JSON file becomes a table:
Future improvements
We are continuing to enhance the adapter, and would welcome
contributions of new parsing capabilities (for example parsing JSON
files) and being able to form URLs dynamically to push down filters.