Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Confirmed.

What can be done with the odb tools and the python interface?

We can read the content of an existing odb file and create new files too, unfortunately with some limitations...

Creating odb files in various ways... an interesting exercise


Code Block
languagepy
odb sql "select 1 as x , 2 as y, 3/0 as z" -f odb -o new.odb
odb sql "select 1 as x, 2 as y, 3/0 as z into 'new.odb'" 

this two command will create a new odb file filled with one row:

Code Block
languagepy
             x               y               z

1.00000000000000000     2.00000000000000000                inf

and the header will look like:

Code Block
languagepy
odb header new.odb

Header 1. Begin offset: 0, end offset: 322, number of rows in block: 1, byteOrder: same
0. name: x, type: REAL, codec: constant, value=1.000000
1. name: y, type: REAL, codec: constant, value=2.000000
2. name: z, type: DOUBLE, codec: constant, value=inf

the column named "z" will have infinite value... I can catch it with queries of this kind:

select * where z=1/0

Unfortunately the created columns are by default real numbers for x and y, except for z which is double (like any output of functions).

I could cast the values to be integers using something like

Code Block
languagepy
odb sql "select int(1) as x, int(2) as y, int(3/0) as z into 'new.odb'"

and what I get is:

odb sql "select * from ''new.odb'"
000 2017-10-27 15:59:43 (I)  : dual
        int(1)          int(2)     int(/(3,0))
             1               2  -9223372036854775808

In this case I have x and y correctly converted to integers, but the Inf value makes the type casting not failing, instead it produces -9223372036854775808... I would like to see an error here! (later -9....8 will became -nan)

From one row/block to two

Considering the last example, we can create a new file with an extra column

Code Block
languagepy
odb sql "select int(1) as x, int(2) as y, int(3/0) as z, 6. as k into 'new01.odb'"

and what I get is:

odb sql "select *" -i new.odb
             x               y               z               k
             1               2  -9223372036854775808    6.00000000000000000

Then we can combine the filese into one using cat:

cat new01.odb new.odb > new_mix.odb

and the result of a select will be:

odb sql "select * from 'new_mix.odb'"
000 2017-10-30 09:31:17 (I)  : dual
             x               y               z               k
             1               2  -9223372036854775808    6.00000000000000000
000 2017-10-30 09:31:17 (W) Column 'k': index has changed in new dataset.
000 2017-10-30 09:31:17 (W) Was: 3.
000 2017-10-30 09:31:17 (E) Exception: UserError: One of selected columns, 'k', does not exist in new data set. 
000 2017-10-30 09:31:17 (I) Execute: 0.001022 second elapsed, 0.001021 second cpu
000 2017-10-30 09:31:17 (E) ** UserError: One of selected columns, 'k', does not exist in new data set. Caught in  (/tmp/metabuilds/ecflow-metab_5062/leap42/GNU.53/eckit/eckit/src/eckit/runtime/Tool.cc +38 start)
000 2017-10-30 09:31:17 (E) ** Exception terminates odb

When exchanging the orders of the files instead:

cat new.odb new01.odb > new_mix_bis.odb
odb sql "select * from 'new_mix_bis.odb'"
000 2017-10-30 09:33:36 (I)  : dual
             x               y               z
             1               2             inf
             1               2            -nan
000 2017-10-30 09:33:36 (I) Matching row(s): 2 out of 2
000 2017-10-30 09:33:36 (I) Skips: 0
000 2017-10-30 09:33:36 (I) Execute: 0.000191 second elapsed, 0.00019 second cpu

Concatenating ODB files using cat is a bad practice!

I would like this method to be able to work in a more clean way, or to have an "odb cat " command that rise an error if the produced file will be corrupted... Or a more deep approach to heterogeneous files.

Handling heterogeneous files

We have some possible answers to the basic question: what is an ODB file?

  • It is a table containing data according to an header... (Mostly false)
  • It is a collections of table that can be of same type or different type and the differences can be in order and number of columns.

There are two entities that sometime coincide: the group of rows under one header, a collections of groups.

We need more tools to work with these entities.

The command "odb header" is not good enough. We need something capable of breaking a file in groups of headers, capable of producing a compact output (not N lines for group) as well as a detailed one, on request for each group.

The odb1 philosophy of groups aimed to parallel computing should be still in place, instead groups are mainly for compression purposes.

The python api

Code Block
import odb

odbf = 'foo.odb' # An existing odb file

# legacy version
for r in odb.sql("select distinct andate, antime, groupid, reportype, varno, datum_status.active, sum(counts) from '%s' where area='GLOBE' order by groupid, reportype, varno, datum_status.active;" % odbf):
	## Do something with "r"


# last version:
conn = odb.connect('')
cur = odb.cursor()

for r in cur.execute("select distinct andate, antime, groupid, reportype, varno, datum_status.active, sum(counts) from '%s' where area='GLOBE' order by groupid, reportype, varno, datum_status.active;" % odbf))
	## Do something with "r"            

In the example we have two different ways of executing and capturing the output of a query. From the user point of view the legacy one is better, because there is less code involved.

However, the legacy does not allow to create an odb file (unless we use the tricks shown above).

The way to create new files within python is not correct. We can see the online example 1:

Code Block
languagepy
import odb
  
### Example 1. Create a new ODB file.
###
  
### 1.1. Define metadata (column names and their types) 
#    of a new table associated with a physical file 
#    by passing some DDL (Data Definition Language):
#    CREATE TYPE (to define bitfields) and CREATE TABLE
#    statements to function connect.
#
#   Function connect returns a Connection object.
  
conn = odb.connect("")
  
### 1.2 Create a Cursor object. """
c = conn.cursor()
  
### 1.3 Populate table with data using SQL INSERT statement 
#    and method Cursor.executemany. 
  
c.execute('''
    CREATE TYPE bf AS (f1 bit1, f2 bit2); 
  
    CREATE TABLE foo AS 
    ( x INTEGER, y DOUBLE, v STRING, status bf) 
    ON 'new_api_example.odb';
    ''')
  
c.executemany('INSERT INTO foo (x,y,v,status) VALUES (?,?,?,?);', 
                [[1,0.1, '  one   ', 1], 
                 [2,0.2, '  two   ', 2], 
                 [3,0.3, '  three ', 3], 
                 [4,0.4, '  four  ', 4]])
### 1.4 Flush buffers and write to file associated with the table. """
conn.commit()

The first thing to notice is that it does not work. We will see an empty new_api_example.odb file. To see the content one should run the example two that consists in reading the values inserted here. When doing this the file will be properly created.

We could leave with the fact that commit does not commit?... no

However, the steps to create a file are not right:

The create table statement is wrong 100%, because the file is a table. In the example we will  have some columns: x@foo, y@foo, v@foo, status@foo, but we can not create a column called just reportype... We could do it after, renaming within a query.

It is nice to be able to create groups of tables, but it is not a common sql thing, in particular '@' is not usable as character in column names.

When referring to columns of a table in sql like we use table.column, not column@table. However here we do not have tables, but groups of columns...


we should be able to do something like this:

Code Block
languagepy
import odb
conn = odb.connect()
cur = conn.cursor()
cur.execute("create table 'foo.odb' as (x INTEGER, y INTEGER, z REAL, k DOUBLE, v STRING, any@a_group INTEGER, other@a_group A-TYPE)")
# defining default values would be a good thing, as well as having all standard data types.
cur.commit() # this should create a file called foo.odb containg an header and zero rows.

In case we want to write in the created odb file we should be able to simply:

cur.executemany("insert into 'foo.odb' (x, y, z, k, v, any@a_group, other@a_group) values (?, ?, ?, ?, ?, ?, ?)", [[...],...,[...]])

and to access the data:

cur.execute("select * from 'foo.odb'")

I the online example 1-2 the connection was opened, a table created, populated and on the same connection a query or select statement was performed. only after that the file was created.... WRONG, very wrong.


Summarizing: a create statement should be used to create the header of a group. this should happen directly on a file.

ODB  has other properties that we would like to control like the maximum number of rows for a group.

We could be able to use 'ALTER' statement to rename and add columns to a file without having to write it as a new file.

We should be able to use 'UPDATE' statement to change values

We should be able to use "DELETE FROM ... WHERE ..."

The work-flow of odb files is done in a way that always: an_odb_file→ odb operation → new_file. IT is not NICE!

Working with files in memory

The example I gave of a desired way of using odb is not complete, instead we need a key feature:

Code Block
languagepy
import odb
import StringIO # BytesIO should do too

file_in_memory = StringIO.StringIO()

cur.execute("create table '?' as (x INTEGER, y INTEGER, z REAL, k DOUBLE, v STRING, any@a_group INTEGER, other@a_group A-TYPE)", file_obj = file_in_memory)

with open('a_new_odb_file.odb', 'wb') as f:
	f.write(file_in_memory.getvalue())

Also the "select" statements should go in memory.

If using StringIO/BytesIO is too much, we should be able to pass file objects in general.



Comments

  • The odb.connect() method is used in a very general way at the moment: when pointing to ('') it seems to be anything but wrong (pointing to a folder would not be nicer?).

When pointing to ('mars//RETRIEVE...') it shows a strong character. However, distributing the odb_api as it is would create just problems: nobody has access to mars from outside.

Instead we have the web api to download mars data....

The odb_api should not have a link to mars in itself, the data handling should be separate from the data retrieval and some web_api/odb_api combination could be instead done keeping the two players in two different packages.

  • Excess in verbosity... output from odb tools and odb_api is overwealming and sometimes standard output goes into standard error...

    Code Block
    languagepy
    odb sql "select 1,2,3" >/dev/null
    000 2017-10-30 13:10:04 (I) Matching row(s): 1 out of 1
    000 2017-10-30 13:10:04 (I) Skips: 0
    000 2017-10-30 13:10:04 (I) Execute: 0.000122 second elapsed, 0.00012 second cpu

    The example is not 100% right, because when performing a select some filtering are possible, but try odb set/ odb split/ odb mdset. The output is too much and not directed to standard output.