How to build a data model¶
The main steps to create a data model (or schema) for the mdf_reader are:
Create a valid directory tree to hold the model (mymodel) as shown in the figure below. The correct directory path to store your schema is
~/mdf_reader/data_models/lib/.
Data model directory¶
Create a valid schema file under
../lib/mymodel/mymodel.json:
To create the schema file, two aspects of the data need to be clear beforehand: i) the order and field lengths of each element in the data input string, and ii) whether the information in the data input needs to be organised into sections, as in the ICOADS .imma data format. With this in mind, one can list all the schema file templates available from within the tool via:
template_names = mdf_reader.schemas.templates()
These templates have been created to ease the generation of new valid schema files. They range from a basic schema format to more complex ones:
Fixed width or delimited: fixed_width_ or delimited_
With no sections or with sections: _basic or _sections
More complex options include blocks of sections which, in the case of ICOADS data, are exclusive to certain decks (e.g. deck td11), or blocks of sections that are optional: _complex_exc.json or _complex_opt.json
To copy a template for editing, you can run the following function:
mdf_reader.schemas.copy_template(template_name,out_path=file_path)
Create valid code tables under
../lib/mymodel/code_tables/table_name[i].json
if the data model includes code tables.
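The directory layout described in the steps above can be sketched with the standard library alone. This is a minimal sketch, assuming a temporary directory in place of the real `~/mdf_reader/data_models/lib/` path; the helper name `make_model_tree` and the empty placeholder schema are hypothetical and are not part of mdf_reader.

```python
import json
import tempfile
from pathlib import Path


def make_model_tree(lib_dir, model_name):
    """Create <lib_dir>/<model_name>/ with a placeholder schema and a code_tables folder."""
    model_dir = Path(lib_dir) / model_name
    (model_dir / "code_tables").mkdir(parents=True, exist_ok=True)
    schema_path = model_dir / f"{model_name}.json"
    if not schema_path.exists():
        # Placeholder schema: a header block and an elements block, both empty.
        schema_path.write_text(json.dumps({"header": {}, "elements": {}}, indent=4))
    return schema_path


# Assumption: a temporary directory stands in for the real lib/ directory.
lib_dir = tempfile.mkdtemp()
schema_path = make_model_tree(lib_dir, "mymodel")
print(schema_path.relative_to(lib_dir))
```

In practice the placeholder would be replaced by a copy of one of the templates, edited to match the data.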
The general structure of a schema and the description of each attribute are explained in the table below:
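| Schema block | Scope | Attribute |
|---|---|---|
| Header | common | encoding |
| | no sections | field_layout, delimiter |
| | sections | parsing_order |
| Elements | common | description, column_type, missing_value, ignore |
| | numeric | units, encoding, valid_max, valid_min, scale, offset, decimal_places |
| | object, str | disable_white_strip |
| | key | codetable, encoding, disable_white_strip |
| | datetime | datetime_format |
| | fixed_width | field_length |
| Sections (header) | common | delimiter, field_layout, disable_read |
| | fixed_width | sentinal, length |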
Schema header block¶
The header block is the first block of the schema file and is common to all schema types, although some of its descriptors are specific to certain model types. There is no need to declare a header block in data models for which sections are sequential (e.g. all elements in the data source appear in the same order as declared in the sections block).
Example of a header block for a .imma based schema:
"header": {
    "parsing_order": [
        {"s": ["core"]},
        {"o": ["c1","c5","c6","c7","c8","c9","c95","c96","c97","c98"]},
        {"s": ["c99_sentinal", "c99_data", "c99_header", "c99_qc"]}
    ]
},
| Scope | Descriptor name |
|---|---|
| Common | encoding |
| Data models with sections (1 or multiple) | parsing_order |
| Data models with no sections | field_layout, delimiter |
delimiter
String type descriptor that defines the field delimiter for data models.
Setting this descriptor makes field_layout default to delimited. It is mainly used when field_layout == delimited.
When used together with field_layout == fixed_width, the code understands that the data layout is a mixture of delimited and fixed_width strings. In this case the delimiter is removed and the section is read as a fixed_width type of section. This case has been added to overcome how pandas manages the c99 section in the .imma1 model, e.g. the Deck 704 c99 section, which is a sequence of fixed width elements separated by commas.
Applies to delimited and fixed_width field layouts.
It is a mandatory field only when field_layout == delimited.
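The mixed case described above (a delimiter combined with a fixed_width layout) can be illustrated as follows. This is only a sketch of the idea of dropping the delimiter and then slicing by width; it is not how mdf_reader implements it internally, and the function name `read_mixed` and the sample record are made up for illustration.

```python
def read_mixed(report, delimiter, field_lengths):
    """Drop the delimiter, then slice the remaining string by fixed widths."""
    stripped = report.replace(delimiter, "")
    fields, start = [], 0
    for length in field_lengths:
        fields.append(stripped[start:start + length])
        start += length
    return fields


# Made-up record: three fixed-width fields (2, 3 and 1 characters) separated by commas.
print(read_mixed("AB,123,X", ",", [2, 3, 1]))  # ['AB', '123', 'X']
```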
encoding
String type descriptor that denotes the file encoding.
Applies to all elements.
It is not a mandatory field descriptor.
Options: any encoding supported by Python (see the list of standard encodings in the Python codecs documentation).
Defaults to utf-8.
field_layout
String type descriptor that defines the layout of fields in a data model with no sections.
Applies to all data models with no sections.
Is a mandatory descriptor (for data models with no sections).
Options: delimited or fixed_width.
Defaults to delimited if delimiter is set, but can be set to fixed_width together with a delimiter option.
parsing_order
List of dictionaries giving the order in which the tool must look for sections in a report, with the sections grouped by section block type. This descriptor applies to data types whose reports are divided into multiple sections, e.g. ICOADS data.
Applies to all data models with multiple sections.
The different section block types are:
- s: sequential. Sections in this block appear as listed in all reports.
- e: exclusive. Among the sections listed in the block, only one of them appears in every report.
- o: optional. Any combination of the sections listed in the block can be present in the report, in any order and with any of them missing (repetitions are not handled).
Example:
"parsing_order": [{"s":["core"]}, {"o":["c1", "c99"]}]
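To make the three block types concrete, the sketch below checks whether the sections found in a report are compatible with a given `parsing_order`. It illustrates the rules as described above and is not mdf_reader's parser: the function name `sections_allowed` is hypothetical, and the check ignores the order of the sections and looks only at which ones are present.

```python
def sections_allowed(parsing_order, found):
    """Return True if the section names in `found` satisfy every block."""
    found = list(found)
    if len(found) != len(set(found)):
        return False  # repetitions are not handled
    declared = set()
    for block in parsing_order:
        for kind, names in block.items():
            declared.update(names)
            present = [n for n in names if n in found]
            if kind == "s" and len(present) != len(names):
                return False  # sequential: every listed section must appear
            if kind == "e" and len(present) != 1:
                return False  # exclusive: exactly one listed section appears
            # optional ("o"): any subset, including none, is acceptable
    # Sections that no block declares are not allowed either.
    return set(found) <= declared


parsing_order = [{"s": ["core"]}, {"o": ["c1", "c99"]}]
print(sections_allowed(parsing_order, ["core", "c99"]))  # True
print(sections_allowed(parsing_order, ["c1"]))           # False, "core" is missing
```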
Schema element block¶
The elements block is a feature common to all data model types. It is the second and last block of data in a schema file with no sections, while it is part of each of the sections’ blocks in more complex schemas. This is an example of an element block:
"elements": {
"YR": {
"description": "year UTC",
"field_length": 4,
"column_type": "uint16",
"valid_max": 2024,
"valid_min": 1600,
"units": "year"
},
"MO": {
"description": "month UTC",
"field_length": 2,
"column_type": "uint8",
"valid_max": 12,
"valid_min": 1,
"units": "month"
},
"DY": {
"description": "day UTC",
"field_length": 2,
"column_type": "uint8",
"valid_max": 31,
"valid_min": 1,
"units": "day"
},
"HR": {
"description": "hour UTC",
"field_length": 4,
"column_type": "float32",
"valid_max": 23.99,
"valid_min": 0.0,
"scale": 0.01,
"decimal_places": 2,
"units": "hour"
}}
Elements in the data are parsed in the order they are declared here. The element block above would define a file or section with elements named YR, MO, DY and HR. All element attributes, some of which are data type specific, are listed and detailed in the following table:
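| Scope | Descriptor name |
|---|---|
| Common | description, column_type, missing_value, ignore |
| Fixed width types | field_length |
| Numeric types | units, encoding, valid_max, valid_min, scale, offset, decimal_places |
| Object, str types | disable_white_strip |
| Key type | codetable, encoding, disable_white_strip |
| Datetime type | datetime_format |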
description
String type descriptor that describes the data element (e.g. free text describing the data element).
Applies to all elements.
field_length
Numeric integer descriptor that determines the field length of the element (number of bytes or number of characters in a report string).
Applies to the schema format type fixed_width, where it is a mandatory field in the element block.
It can be set to null, or omitted, if the element is the only one in a section whose length is unknown and that section is the last in the data model (as is usually the case for the ICOADS supplemental data section c99). In that case the default is taken from mdf_reader.properties.MAX_FULL_REPORT_WIDTH(), which sets the field_length to 100000.
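Because elements are parsed in declaration order and each one carries a field_length, a fixed-width record can be cut up by walking the elements block. The sketch below shows that idea on the YR, MO, DY and HR elements from the example; the function name `split_fixed_width` and the sample report string are made up, and this is not the parser mdf_reader itself uses.

```python
elements = {
    "YR": {"field_length": 4},
    "MO": {"field_length": 2},
    "DY": {"field_length": 2},
    "HR": {"field_length": 4},
}


def split_fixed_width(report, elements):
    """Slice `report` into raw strings, one per element, in declaration order."""
    fields, start = {}, 0
    for name, attrs in elements.items():
        end = start + attrs["field_length"]
        fields[name] = report[start:end]
        start = end
    return fields


# Made-up report: year 2015, month 3 (blank padded), day 14, hour field "1250".
raw = split_fixed_width("2015 3141250", elements)
print(raw)
```

The values are still raw strings at this point; type conversion, scaling and validation are separate steps governed by the other descriptors.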
column_type
String type descriptor that determines the element data type.
Mandatory field.
Applies to all elements.
Options:
- Numeric data types: all types interpreted by numpy.
- Datetimes: string or datetime64[ns] object that formats dates or datetimes when read in a single field. The object must be in a valid datetime.datetime format. Dates can also be read via code tables and the parameter key.
missing_value
String type descriptor that lists additional values to be tagged as missing for an element in a schema.
Applies to all elements.
Default values are the same as the pandas default missing values.
ignore
Boolean type descriptor that excludes an element from the output.
Options: True or False, defaults to False.
Applies to all elements.
Is not a mandatory field descriptor.
units
String type descriptor that states the units of the measured data element.
Applies to elements with a numeric column_type.
Is not a mandatory field descriptor.
Defaults to None.
encoding
String type descriptor, added if an element needs decoding.
Is not a mandatory field.
Not to be confused with the file encoding.
Applies to elements with a numeric column_type and to elements with column_type key.
Defaults to None.
Options: base36 or signed_overpunch.
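As an illustration of the base36 option, Python's built-in `int` can decode a base36 string directly. This is a sketch of the encoding itself, assuming the usual digits 0-9 followed by letters A-Z; whether mdf_reader decodes in exactly this way is not stated here, and `decode_base36` is a hypothetical name. The signed_overpunch option is not sketched.

```python
def decode_base36(raw):
    """Decode a base36 string (digits 0-9, letters A-Z) to an integer."""
    return int(raw.strip(), 36)


print(decode_base36("A"))   # 10
print(decode_base36("1Z"))  # 71, i.e. 1 * 36 + 35
```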
valid_max
Numeric type descriptor that indicates the valid maximum value for numeric elements. This should be the valid maximum in the declared units of the variable, after decoding and conversion (offset, scale, ...), and it is used for element validation.
Applies to elements with a numeric column_type.
Is not a mandatory field.
Defaults to +inf.
valid_min
Numeric type descriptor that indicates the valid minimum value for numeric elements. This should be the valid minimum in the declared units of the variable, after decoding and conversion (offset, scale, ...), and it is used for element validation.
Applies to elements with a numeric column_type.
Is not a mandatory field.
Defaults to -inf.
scale
Numeric type descriptor. This scale is applied to numeric elements in order to convert the original value to the declared element units.
Applies to elements with a numeric column_type.
Is not a mandatory field.
Defaults to 1.
offset
Numeric type descriptor. This offset is applied to numeric elements in order to convert the original value to the declared element units.
Applies to elements with a numeric column_type.
Is not a mandatory field.
Defaults to 0.
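The numeric descriptors above (valid_max, valid_min, scale and offset) combine roughly as sketched below. This assumes the conversion is `raw * scale + offset`; the order in which scale and offset are applied is an assumption here, not something this page confirms, and the function name `convert_numeric` is hypothetical.

```python
import math


def convert_numeric(raw, attrs):
    """Convert a raw field string to a float and flag whether it is in range."""
    scale = attrs.get("scale", 1)
    offset = attrs.get("offset", 0)
    value = float(raw) * scale + offset  # assumed order: scale first, then offset
    valid_min = attrs.get("valid_min", -math.inf)
    valid_max = attrs.get("valid_max", math.inf)
    return value, valid_min <= value <= valid_max


# Descriptors of the HR element from the example element block.
hr = {"scale": 0.01, "valid_min": 0.0, "valid_max": 23.99}
value, valid = convert_numeric("1250", hr)
print(value, valid)
```

Note that the limits are compared against the converted value, which matches the requirement that valid_max and valid_min be given in the declared units.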
decimal_places
Numeric integer descriptor that defines the number of decimal places to which the observed value is reported.
Applies to elements with a floating-point numeric column_type.
Is not a mandatory field.
Defaults to pandas.display.precision = 6.
codetable
String type descriptor containing the name of the key code look-up table. It is the file basename of a code table (with no .json extension) located in the mymodel/code_tables directory. See Code tables for more information.
Applies to elements with column_type key.
Is mandatory if "column_type": "key".
disable_white_strip
Boolean or string type descriptor that modifies the default stripping of leading and trailing blanks.
Applies to elements with column_type key, object or str.
Options:
- do not perform any stripping: true
- do not perform right stripping (trailing blanks): r
- do not perform left stripping (leading blanks): l
Is not a mandatory field.
Defaults to false.
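The effect of the three options can be mimicked with Python's string methods. This is a sketch of the behaviour as described above, not mdf_reader's own code; `apply_white_strip` is a hypothetical name.

```python
def apply_white_strip(value, disable_white_strip=False):
    """Strip blanks from `value` according to the disable_white_strip setting."""
    if disable_white_strip is True:
        return value            # no stripping at all
    if disable_white_strip == "r":
        return value.lstrip()   # right stripping disabled: only leading blanks go
    if disable_white_strip == "l":
        return value.rstrip()   # left stripping disabled: only trailing blanks go
    return value.strip()        # default: strip both sides


print(repr(apply_white_strip("  ab  ")))       # 'ab'
print(repr(apply_white_strip("  ab  ", "r")))  # 'ab  '
```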
datetime_format
String type descriptor that sets the format for dates.
Applies to elements with column_type datetime.
Is not a mandatory field.
Defaults to %Y%m%d. All Python datetime format codes are valid.
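The format strings follow Python's `strptime` conventions, so a candidate datetime_format can be tried out on a sample field before it goes into the schema. The sample values below are made up.

```python
from datetime import datetime

# Default format: year, month and day with no separators.
parsed = datetime.strptime("20150314", "%Y%m%d")
print(parsed.isoformat())  # 2015-03-14T00:00:00

# A custom format that also carries hours and minutes.
parsed_hm = datetime.strptime("201503141230", "%Y%m%d%H%M")
print(parsed_hm.isoformat())  # 2015-03-14T12:30:00
```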
Schema section block¶
If the data model is organized in sections then the schema has two main blocks: the header (see Schema header block) and the sections blocks. The sections block has a separate block per section, with the following general layout:
A section specific header (or sub-header) with info on how to access that specific section.
The section’s elements block (See Schema element block)
Example of a schema section block: “core” section of the .imma schema:
"sections": {
"core": {
"header": {"sentinal": null,"length": 108},
"elements": {
"YR": {
"description": "year UTC",
"field_length": 4,
"column_type": "uint16",
"valid_max": 2024,
"valid_min": 1600,
"units": "year"
},
"MO": {
"description": "month UTC",
"field_length": 2,
"column_type": "uint8",
"valid_max": 12,
"valid_min": 1,
"units": "month"
}
}
}
}
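When a schema has both a header and a sections block, one quick sanity check is that every section named in parsing_order has an entry under sections. The sketch below assumes that this is required (the page does not say so explicitly) and uses an abridged schema in which the reference to c1 is deliberately left undeclared; `undeclared_sections` is a hypothetical helper, not an mdf_reader function.

```python
import json

# Abridged schema: "c1" is named in parsing_order but has no sections entry.
schema = json.loads("""
{
  "header": {"parsing_order": [{"s": ["core"]}, {"o": ["c1"]}]},
  "sections": {
    "core": {"header": {"sentinal": null, "length": 108}, "elements": {}}
  }
}
""")


def undeclared_sections(schema):
    """List sections named in parsing_order but missing from the sections block."""
    ordered = [name
               for block in schema["header"]["parsing_order"]
               for names in block.values()
               for name in names]
    return [name for name in ordered if name not in schema["sections"]]


print(undeclared_sections(schema))  # ['c1']
```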
Section header¶
delimiter
String type descriptor that defines the field delimiter for the data model section.
Setting this descriptor makes field_layout default to delimited. It is mainly used when field_layout == delimited.
When used together with field_layout == fixed_width, the code understands that the data layout is a mixture of delimited and fixed_width strings. In this case the delimiter is removed and the section is read as a fixed_width type of section.
Applies to delimited and fixed_width field layouts.
It is a mandatory field only when field_layout == delimited.
disable_read
Boolean type descriptor that, if set to True, makes the reader ignore the elements of that section. The section is then produced in the output as a single string.
Options: True or False.
Defaults to False.
field_layout
String type descriptor that defines the layout of fields in the section of the data model.
Applies to all sections.
If delimiter is set, field_layout defaults to delimited; otherwise it defaults to fixed_width.
This descriptor does not need to be specified in the schema files in the majority of cases. However, to account for mixed formats, like the c99 section in imma1 files for deck 704, the default can be overridden by specifying the field_layout parameter.
Options: delimited or fixed_width.
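The defaulting rule for a section's field_layout can be written down in a few lines. This is a sketch of the rule as stated above, with a hypothetical function name `resolve_field_layout`; it is not taken from the mdf_reader source.

```python
def resolve_field_layout(section_header):
    """Return the effective field_layout for a section header dictionary."""
    explicit = section_header.get("field_layout")
    if explicit is not None:
        return explicit  # an explicit setting always wins (the mixed-format case)
    return "delimited" if section_header.get("delimiter") else "fixed_width"


print(resolve_field_layout({"length": 108}))                                       # fixed_width
print(resolve_field_layout({"delimiter": ","}))                                    # delimited
print(resolve_field_layout({"delimiter": ",", "field_layout": "fixed_width"}))     # fixed_width
```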
sentinal
String type descriptor that allows the code to identify a section.
Applies to fixed_width sections.
It is a mandatory field, unless the section is unique, unique in a parsing_order block, or part of a sequential parsing_order block (as for the core section in the example above, where it is null).
Elements bearing the sentinal need to be, additionally, declared in the elements block.
length
Numeric integer type descriptor that defines the length of the section (number of bytes or characters in a string).
Applies to fixed_width sections.
It is a mandatory field.
It can also be set to null, or omitted, if the section is the last one to be parsed and its length is unknown (like the c99 section of the .imma model).
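The role of length, including the null case for a final section, can be illustrated by splitting a report string into consecutive sections. The sketch below is a simplification that ignores sentinals and assumes all sections are present in the given order; `split_sections` and the sample report are made up for illustration.

```python
def split_sections(report, section_lengths):
    """Split `report` by section length; a length of None takes the remainder."""
    parts, start = {}, 0
    for name, length in section_lengths.items():
        end = len(report) if length is None else start + length
        parts[name] = report[start:end]
        start = end
    return parts


# Made-up report: a 5-character "core" followed by a "c99" of unknown length.
parts = split_sections("ABCDEsupplemental", {"core": 5, "c99": None})
print(parts)
```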
Section elements¶
Same as Schema element block.
Code Tables¶
To learn about how to construct a code table, please read the Code tables section.