How to build a data model
The main steps to create a data model (or schema) for the mdf_reader are:
Create a valid directory tree to hold the model (mymodel), as shown in the figure below. The correct directory path to store your schema is `~/mdf_reader/data_models/lib/`.
![_images/schema.png](_images/schema.png)
Data model directory
Create a valid schema file under `../lib/mymodel/mymodel.json`:
To create the schema file, two aspects of the schema need to be clear beforehand: i) the order and field length of each element in the data input string, and ii) whether the information in the data input needs to be organised into sections, as in the ICOADS `.imma` data format. With this in mind, all the schema file templates available within the tool can be listed via:
template_names = mdf_reader.schemas.templates()
These templates have been created to ease the generation of new valid schema files. They range from a basic schema format to more complex ones:
Fixed width or delimited: `fixed_width_` or `delimited_`
With no sections or with sections: `_basic` or `_sections`
More complex options include blocks of sections which, in the case of ICOADS data, are exclusive to certain decks (e.g. deck `td11`), or blocks of sections that are optional: `_complex_exc.json` or `_complex_opt.json`
To copy a template for editing, run the following function:
mdf_reader.schemas.copy_template(template_name,out_path=file_path)
Create valid code tables under `../lib/mymodel/code_tables/table_name[i].json` if the data model includes code tables.
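The directory layout described in the steps above can be sketched with the standard library alone. This is only an illustration: the helper name `create_model_skeleton` is hypothetical, a temporary directory stands in for `~/mdf_reader/data_models/lib/`, and the placeholder schema content is not a valid schema.

```python
import json
import tempfile
from pathlib import Path


def create_model_skeleton(lib_dir, model_name):
    """Create <lib_dir>/<model_name>/ with a schema file and a code_tables folder."""
    model_dir = Path(lib_dir) / model_name
    (model_dir / "code_tables").mkdir(parents=True, exist_ok=True)
    schema_file = model_dir / f"{model_name}.json"
    if not schema_file.exists():
        # Placeholder only: a real schema needs populated header/elements/sections blocks.
        schema_file.write_text(json.dumps({"header": {}, "elements": {}}, indent=4))
    return model_dir


# A temporary directory stands in for the real lib directory in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = create_model_skeleton(tmp, "mymodel")
    created = sorted(p.relative_to(tmp).as_posix() for p in model_dir.rglob("*"))
    print(created)  # ['mymodel/code_tables', 'mymodel/mymodel.json']
```

In practice the schema file would more likely start from one of the tool's templates rather than from an empty placeholder.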
The general structure of a schema and the attributes that belong to each block are summarised in the table below:
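| Schema block | Scope | Attribute |
|---|---|---|
| Header | common | encoding |
| | no sections | delimiter, field_layout |
| | sections | parsing_order |
| Elements | common | description, column_type, missing_value, ignore |
| | numeric | units, encoding, valid_max, valid_min, scale, offset, decimal_places |
| | object, str | disable_white_strip |
| | key | codetable, encoding, disable_white_strip |
| | datetime | datetime_format |
| | fixed_width | field_length |
| Sections (header) | common | delimiter, field_layout, disable_read |
| | fixed_width | sentinal, length |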
Schema header block
The header block is the first block of the schema file and is common to all schema types, although some of its descriptors are specific to certain model types. There is no need to declare a header block in data models for which sections are sequential (e.g. all elements in the data source appear in the same order as declared in the sections block).
Example of a header block for a `.imma` based schema:
"header": {
"parsing_order": [
{"s": ["core"]},
{"o": ["c1","c5","c6","c7","c8","c9","c95","c96","c97","c98"]},
{"s": ["c99_sentinal", "c99_data", "c99_header", "c99_qc"]}]
},
| Scope | Descriptor name |
|---|---|
| Common | encoding |
| Data models with sections (1 or multiple) | parsing_order |
| Data models with no sections | delimiter, field_layout |
delimiter
String type descriptor that defines the field delimiter for data models.
Setting this descriptor makes `field_layout` default to `delimited`.
This descriptor is mainly used when `field_layout` is `delimited`.
When used together with `field_layout` set to `fixed_width`, the code understands that the data layout is a mixture of delimited and fixed-width strings. In this case the delimiter is removed and the section is read as a `fixed_width` type of section. This case was added to work around how pandas handles the `c99` section in the `.imma1` model, e.g. the Deck 704 c99 section, which is a sequence of fixed-width elements separated by commas.
Applies to `delimited` and `fixed_width` field layouts.
It is a mandatory field only when `field_layout` is `delimited`.
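The mixed case described above (a delimiter combined with a `fixed_width` layout) can be pictured with a small sketch. It is not the reader's actual implementation; the function name and the sample string are made up for illustration.

```python
def read_mixed_section(raw, delimiter, field_lengths):
    """Drop the delimiter, then slice the remaining string by fixed field lengths."""
    stripped = raw.replace(delimiter, "")
    fields, start = [], 0
    for length in field_lengths:
        fields.append(stripped[start:start + length])
        start += length
    return fields


# Hypothetical record: three fixed-width fields separated by commas.
fields = read_mixed_section("2015,03,12", ",", [4, 2, 2])
print(fields)  # ['2015', '03', '12']
```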
encoding
String type descriptor that denotes the file encoding
Applies to all elements
It is not a mandatory field descriptor
- Options: all encodings supported by Python (see the Python documentation for the list of standard encodings).
Defaults to utf-8
field_layout
String type descriptor that defines the layout of fields in a data model with no sections
Applies to all data models with no sections
Is a mandatory descriptor (for data models with no sections)
- Options: `delimited` or `fixed_width`
Defaults to `delimited` if `delimiter` is set, but can be set to `fixed_width` together with a `delimiter` option.
parsing_order
List of dictionaries giving the order in which the tool must look for sections in a report, with the sections grouped by section block type. This field applies to data types whose reports are divided into multiple sections, e.g. ICOADS data.
Applies to all data models with multiple sections
The different section block types are:
s: sequential. Sections in this block appear as listed in all reports.
e: exclusive. Among the sections listed in the block, only one appears in every report.
o: optional. Any combination of the sections listed in the block can be present in the report, in any order and with any of them missing (repetitions are not handled).
Example:
"parsing_order": [{"s":["core"]}, {"o":["c1", "c99"]}]
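To make the three block types concrete, the sketch below checks whether the set of sections found in a report is consistent with a `parsing_order` list. It reflects one reading of the rules above ("exclusive" taken as exactly one section, ordering ignored) and is not the tool's own parser; `check_sections` is a hypothetical helper.

```python
def check_sections(parsing_order, present):
    """Return True if the sections present in a report fit the parsing_order blocks."""
    allowed = set()
    for block in parsing_order:
        for block_type, names in block.items():
            allowed.update(names)
            found = [name for name in names if name in present]
            if block_type == "s" and len(found) != len(names):
                return False  # sequential: every listed section must appear
            if block_type == "e" and len(found) != 1:
                return False  # exclusive: assumed to mean exactly one section
            # "o" (optional): any subset of the listed sections is acceptable
    # Sections that are not declared anywhere are rejected.
    return set(present) <= allowed


order = [{"s": ["core"]}, {"o": ["c1", "c99"]}]
print(check_sections(order, ["core", "c99"]))  # True
print(check_sections(order, ["c1"]))           # False, "core" is missing
```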
Schema element block
The elements block is a feature common to all data model types. It is the second and last block of data in a schema file with no sections, while it is part of each of the sections’ blocks in more complex schemas. This is an example of an element block:
"elements": {
"YR": {
"description": "year UTC",
"field_length": 4,
"column_type": "uint16",
"valid_max": 2024,
"valid_min": 1600,
"units": "year"
},
"MO": {
"description": "month UTC",
"field_length": 2,
"column_type": "uint8",
"valid_max": 12,
"valid_min": 1,
"units": "month"
},
"DY": {
"description": "day UTC",
"field_length": 2,
"column_type": "uint8",
"valid_max": 31,
"valid_min": 1,
"units": "day"
},
"HR": {
"description": "hour UTC",
"field_length": 4,
"column_type": "float32",
"valid_max": 23.99,
"valid_min": 0.0,
"scale": 0.01,
"decimal_places": 2,
"units": "hour"
}}
Elements in the data are parsed in the order in which they are declared here. The element block above would define a file / section with elements named YR, MO, DY and HR. All element attributes, some of which are data type specific, are listed and detailed in the following table:
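| Scope | Descriptor name |
|---|---|
| Common | description, column_type, missing_value, ignore |
| Fixed width types | field_length |
| Numeric types | units, encoding, valid_max, valid_min, scale, offset, decimal_places |
| Object, str types | disable_white_strip |
| Key type | codetable, encoding, disable_white_strip |
| Datetime type | datetime_format |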
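As a rough illustration of how an elements block drives the parsing of a fixed-width record, the sketch below slices a string using the `field_length` of each element, in declaration order. The record string is invented and the function is a simplified stand-in, not the reader's implementation; type conversion is left out on purpose.

```python
# Trimmed-down version of the YR/MO/DY/HR elements block, keeping only field_length.
elements = {
    "YR": {"field_length": 4},
    "MO": {"field_length": 2},
    "DY": {"field_length": 2},
    "HR": {"field_length": 4},
}


def slice_fixed_width(record, elements):
    """Cut a record into raw strings, following the order of the elements block."""
    raw_values, start = {}, 0
    for name, attrs in elements.items():
        end = start + attrs["field_length"]
        raw_values[name] = record[start:end]
        start = end
    return raw_values


raw_values = slice_fixed_width("201503121250", elements)
print(raw_values)  # {'YR': '2015', 'MO': '03', 'DY': '12', 'HR': '1250'}
```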
description
String type descriptor that describes the data element (e.g. free text describing the data element).
Applies to all elements
field_length
Numeric integer descriptor that determines the field length of the elements (number of bytes or number of characters in a report string).
Applies to the `fixed_width` schema format type, where it is a mandatory field in the element block.
It can be set to null, or omitted, if the element is the only one in a section whose length is unknown and that section is the last one in the data model (as is usually the case for the ICOADS supplemental data section c99). In that case the default is set by `mdf_reader.properties.MAX_FULL_REPORT_WIDTH()`, which sets the `field_length` to 100000.
column_type
String type descriptor that determines the element data type.
Mandatory field.
Applies to all elements
- Options:
Numeric data types: all types interpreted by numpy.
Datetimes: string or `datetime64[ns]` object that formats dates or datetimes when read in a single field. The format must be a valid datetime.datetime format. Dates can also be read via code tables and the parameter `key`.
missing_value
String type descriptor that denotes if there are additional missing values to tag for an element in a schema.
Applies to all elements
Default values are the same as pandas default missing values
ignore
Boolean type descriptor that excludes an element from the output
Options: `True` or `False`; defaults to `False`
Applies to all elements
Is not a mandatory field descriptor
units
String type descriptor that states the units of the measured data element.
Applies to elements with a numeric `column_type`
Is not a mandatory field descriptor
Defaults to `None`
encoding
String type descriptor that declares how the element value is encoded, added only if an element needs it
Is not a mandatory field
Not to be confused with the file `encoding`
Applies to elements with a numeric or key `column_type`
Defaults to `None`
- Options:
base36
signed_overpunch
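For the `base36` option, the decoding step can be pictured with Python's built-in `int`, which accepts a base argument. This is only a sketch of what base-36 decoding means; how the reader itself decodes elements, and the `signed_overpunch` option, are not covered here.

```python
def decode_base36(raw):
    """Interpret a string of digits 0-9 and letters A-Z as a base-36 integer."""
    return int(raw, 36)


print(decode_base36("Z"))   # 35
print(decode_base36("10"))  # 36
```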
valid_max
Numeric type descriptor that indicates the valid maximum value for numeric elements. This should be the valid maximum in the variable's declared units, after decoding and conversion (offset, scale…), and it is used for element validation.
Applies to elements with a numeric `column_type`
Is not a mandatory field
Defaults to +inf
valid_min
Numeric type descriptor that indicates the valid minimum value for numeric elements. This should be the valid minimum in the variable's declared units, after decoding and conversion (offset, scale…), and it is used for element validation.
Applies to elements with a numeric `column_type`
Is not a mandatory field
Defaults to -inf
scale
Numeric type of descriptor. This scale is applied to numeric elements in order to convert the original value to the declared element units.
Applies to elements with a numeric `column_type`
Is not a mandatory field
Defaults to 1
offset
Numeric type of descriptor. This offset is applied to numeric elements in order to convert the original value to the declared element units.
Applies to elements with a numeric `column_type`
Is not a mandatory field
Defaults to 0
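The interplay of `scale`, `offset`, `valid_min` and `valid_max` can be sketched as below. The text does not state the order in which scale and offset are applied, so the formula `raw * scale + offset` is an assumption, and `convert_and_validate` is a hypothetical helper rather than part of the tool.

```python
def convert_and_validate(raw, scale=1, offset=0,
                         valid_min=float("-inf"), valid_max=float("inf")):
    """Convert a raw numeric value to declared units and check it against the valid range."""
    # Assumed order of operations: scale first, then offset.
    value = raw * scale + offset
    return value, valid_min <= value <= valid_max


# HR element from the example block: scale 0.01, valid range 0.0 to 23.99.
value, is_valid = convert_and_validate(1250, scale=0.01, valid_min=0.0, valid_max=23.99)
late_value, late_is_valid = convert_and_validate(2500, scale=0.01, valid_min=0.0, valid_max=23.99)
print(round(value, 2), is_valid)            # 12.5 True
print(round(late_value, 2), late_is_valid)  # 25.0 False
```

The validation limits are compared against the converted value, which matches the statement that `valid_max` and `valid_min` are expressed in the declared units.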
decimal_places
Numeric integer descriptor that defines the number of decimal places to which the observed value is reported.
Applies to elements with a numeric float `column_type`
Is not a mandatory field
Defaults to `pandas.display.precision` = 6.
codetable
String type descriptor containing the name of the key code look-up table. It is the file basename of a code table (without the .json extension) located in the `mymodel/code_tables` directory. See Code tables for more information.
Applies to elements with `column_type` key
Is mandatory if `"column_type": "key"`.
disable_white_strip
Boolean or string type descriptor that modifies the default leading/trailing blank stripping.
Applies to elements with a key, object or str `column_type`
- Options:
`true`: do not perform any stripping
`r`: do not perform right stripping (trailing blanks)
`l`: do not perform left stripping (leading blanks)
Is not a mandatory field
Defaults to false
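The effect of each `disable_white_strip` option can be sketched with the standard string methods. The helper below is illustrative only and assumes the option arrives as `False`, `True`, `"r"` or `"l"`.

```python
def strip_value(raw, disable_white_strip=False):
    """Strip blanks from a raw string according to the disable_white_strip option."""
    if disable_white_strip is True:
        return raw            # no stripping at all
    if disable_white_strip == "r":
        return raw.lstrip()   # right stripping disabled: trailing blanks are kept
    if disable_white_strip == "l":
        return raw.rstrip()   # left stripping disabled: leading blanks are kept
    return raw.strip()        # default: strip both sides


print(repr(strip_value("  ab  ")))       # 'ab'
print(repr(strip_value("  ab  ", "r")))  # 'ab  '
print(repr(strip_value("  ab  ", "l")))  # '  ab'
```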
datetime_format
String type of descriptor that sets the format for the dates.
Applies to elements with `column_type` datetime
Is not a mandatory field
Defaults to `%Y%m%d`
All Python datetime format codes are valid.
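As a quick illustration of what a `datetime_format` string does, the sketch below parses a raw field with `datetime.strptime`, using the documented default format when none is given. The wrapper function is hypothetical; the tool may parse dates differently internally.

```python
from datetime import datetime


def parse_datetime(raw, datetime_format="%Y%m%d"):
    """Parse a raw date string with the given format (default mirrors the schema default)."""
    return datetime.strptime(raw, datetime_format)


default_parsed = parse_datetime("20150312")
custom_parsed = parse_datetime("2015-03-12 06:30", "%Y-%m-%d %H:%M")
print(default_parsed)  # 2015-03-12 00:00:00
print(custom_parsed)   # 2015-03-12 06:30:00
```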
Schema section block
If the data model is organized in sections then the schema has two main blocks: the header (see Schema header block) and the sections blocks. The sections block has a separate block per section, with the following general layout:
A section specific header (or sub-header) with info on how to access that specific section.
The section’s elements block (See Schema element block)
Example of a schema section block: “core” section of the `.imma` schema:
"sections": {
"core": {
"header": {"sentinal": null,"length": 108},
"elements": {
"YR": {
"description": "year UTC",
"field_length": 4,
"column_type": "uint16",
"valid_max": 2024,
"valid_min": 1600,
"units": "year"
},
"MO": {
"description": "month UTC",
"field_length": 2,
"column_type": "uint8",
"valid_max": 12,
"valid_min": 1,
"units": "month"
}
}
}
}
Section header
delimiter
String type descriptor that defines the field delimiter for the data model section.
Setting this descriptor makes `field_layout` default to `delimited`.
This descriptor is mainly used when `field_layout` is `delimited`.
When used together with `field_layout` set to `fixed_width`, the code understands that the data layout is a mixture of delimited and fixed-width strings. In this case the delimiter is removed and the section is read as a `fixed_width` type of section.
Applies to `delimited` and `fixed_width` field layouts.
It is a mandatory field only when `field_layout` is `delimited`.
disable_read
Boolean type descriptor that, if set to True, makes the reader ignore the elements of that section. The section is then produced in the output as a single string.
Options: `True` or `False`
Defaults to False
field_layout
String type descriptor that defines the layout of fields in the section of the data model
Applies to all sections
If the `delimiter` field is set, `field_layout` defaults to `delimited`; otherwise it defaults to `fixed_width`.
In most cases this descriptor does not need to be specified in the schema files. However, to account for mixed formats, like the c99 section in imma1 files for deck 704, the default can be overridden by specifying the `field_layout` parameter.
- Options: `delimited` or `fixed_width`
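The defaulting rule for a section's `field_layout` can be summarised in a few lines. This is a sketch of the rule as described above, with a hypothetical helper name, not code taken from the tool.

```python
def resolve_field_layout(section_header):
    """Return the effective field_layout for a section header dictionary."""
    explicit = section_header.get("field_layout")
    if explicit is not None:
        return explicit  # an explicit setting overrides the default
    # Default: delimited when a delimiter is declared, fixed_width otherwise.
    return "delimited" if section_header.get("delimiter") is not None else "fixed_width"


print(resolve_field_layout({"sentinal": None, "length": 108}))                    # fixed_width
print(resolve_field_layout({"delimiter": ","}))                                   # delimited
print(resolve_field_layout({"delimiter": ",", "field_layout": "fixed_width"}))    # fixed_width
```

The last call corresponds to the mixed case, where a delimiter is declared but the section is still read as fixed width.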
sentinal
String type of descriptor that allows the code to identify a section.
Applies to sections with the `fixed_width` format
It is a mandatory field if the section is unique, unique in a parsing_order block, or part of a sequential parsing_order block.
Elements bearing the sentinal need to be, additionally, declared in the elements block.
length
Numeric integer type of descriptor that defines the length of the section (how many bytes or characters in a string).
Applies to sections with the `fixed_width` format
It is a mandatory field
It can also be set to `null`, or omitted, if the section is the last one to be parsed and its length is unknown (like the c99 section of the .imma model).
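How `sentinal` and `length` could work together to cut a report into sections is sketched below. The section names, sentinels and lengths are invented for the example, and the logic is a simplification (sections are tried in declaration order, and a section with a sentinel is skipped when the remaining string does not start with it); the real reader also takes `parsing_order` into account.

```python
def split_sections(report, sections):
    """Cut a report string into sections using each section's sentinal and length."""
    found, pos = {}, 0
    for name, header in sections.items():
        rest = report[pos:]
        if not rest:
            break  # nothing left to read
        sentinal = header.get("sentinal")
        if sentinal is not None and not rest.startswith(sentinal):
            continue  # section not present in this report
        length = header.get("length")
        # A null length means "read to the end", as allowed for the last section.
        end = len(report) if length is None else pos + length
        found[name] = report[pos:end]
        pos = end
    return found


# Hypothetical model: names, sentinels and lengths are illustrative only.
sections = {
    "core": {"sentinal": None, "length": 8},
    "extra": {"sentinal": "X1", "length": 6},
    "tail": {"sentinal": "99", "length": None},
}
full = split_sections("20150312X1abcd99free text", sections)
partial = split_sections("2015031299free text", sections)
print(full)     # {'core': '20150312', 'extra': 'X1abcd', 'tail': '99free text'}
print(partial)  # {'core': '20150312', 'tail': '99free text'}
```

Note that the sentinel characters stay inside the section string, which is consistent with the requirement that elements bearing the sentinal are also declared in the elements block.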
Section elements
Same as Schema element block.
Code Tables
To learn about how to construct a code table, please read the Code tables section.