How to build a data model

The main steps to create a data model (or schema) for the mdf_reader are:

  1. Create a valid directory tree to hold the model (mymodel) as shown in the figure below. The correct directory path to store your schema is ~/mdf_reader/data_models/lib/.

_images/schema.png

Data model directory

  1. Create a valid schema file under ../lib/mymodel/mymodel.json:

To create the schema file, two important aspects of the schema need to be clear beforehand; i) the order and field lengths of each element in the data input string, ii) do the information in the data input needs to be organised into sections, like ICOADS .imma data format. With this in mind, one can access all the schema file templates available from within the tool via:

template_names = mdf_reader.schemas.templates()

These templates have been created to ease the generation of new valid schema files, these templates cover from a basic schema format to a more complex one:

  • Fixed width or delimited: fixed_width_ or delimited_

  • With no sections or with sections: _basic or _sections

  • More complex options include blocks of sections which in the case of ICOADS data are exclusive for certain decks (e.g. deck td11) or blocks of sections that are optional: _complex_exc.json or _complex_opt.json

To copy a template to edit you can run the following functions:

mdf_reader.schemas.copy_template(template_name,out_path=file_path)
  1. Create valid code tables under ../lib/mymodel/code_tables/table_name[i].json if the data model includes code tables.

The general structure of a schema and the description of each attribute is explain in the table below:

Schema block

Scope

Attribute

Header

common

encoding

no sections

field_layout, delimiter

sections

parsing_order

Elements

common

column_type, description, ignore, missing_value

numeric

decimal_places, encoding,

offset, scale, units

valid_max, valid_min

object, str

disable_white_strip

key

code_table, encoding disable_white_strip,

datetime

datetime_format

fixed_width

field_length

Sections (header)

common

delimiter, disable_read field_layout

fixed_width

length, sentinal

Schema header block

The header block is the first block of the schema file, and is common to all schema types, but some of its descriptors are, however, specific to certain model types. There is no need to declare a header block in data models for which sections are sequential (e.g. all elements in the data source appear in the same order as declared in the sections block).

  • Example of a header block for a .imma based schema:

    "header": {
         "parsing_order": [
             {"s": ["core"]},
             {"o": ["c1","c5","c6","c7","c8","c9","c95","c96","c97","c98"]},
             {"s": ["c99_sentinal", "c99_data", "c99_header", "c99_qc"]}]
     },
    

Scope

Descriptor name

Common

encoding

Data models with sections (1 or Multiple)

parsing_order

Data models with no sections

field_layout, delimiter

  • delimiter
    • String type descriptor that defines the field delimiter for data models.

    • Setting this descriptor makes the default value of field_layout == delimited

    • Mainly this descriptor will be use if field_layout == delimited

    • When use together with field_layout == fixed_width the code understands that the data layout is a mixture of delimited and fixed_width strings. In this case the delimiter is removed and the section is read as a fixed_width type of section.

    • This case has been added to overcome how pandas managed the c99 section in .imma1 model. e.g. Deck 704 c99 section, which is a sequence of fixed width elements separated by commas.

    • Applies to delimited and fixed_width field layouts

    • It is a mandatory field only in the case that field_layout == delimited

  • encoding
    • String type descriptor that denotes the file encoding

    • Applies to all elements

    • It is not a mandatory field descriptor

    • Options:
      1. all python supported, see the following link for all possible encodings.

      2. defaults to utf-8

  • filed_layout
    • String type descriptor that defines the layout of fields in the data model with no sections

    • Applies to all data models with no sections

    • Is mandatory descriptor (for data models with no sections)

    • Options:
      1. delimited or fix_width

      2. Defaults to delimited if delimiter is set, but can be specified to fixed_width type together with a delimiter option.

  • parsing_order
    • List of dictionaries containing the order in which the tool must look for sections in a report and grouped the data by section block types. This field applies to those data types which reports are divided into multiple sections i.e. ICOADS data

    • Applies to all data models with multiple sections

    • The different section block types are:

      1. s: sequential. Sections in this block appear as listed in all reports.

      2. e: exclusive. Among the sections listed in the block, only one of them appears in every report.

      3. o: optional. Any combination of sections listed in the block can be present in the report. Any order, any missing or present (but does not handle repetitions).

    • Example:

      ``parsing_order``: [{"s":["core"]}, {"o":["c1", "c99"]}]
      

Schema element block

The elements block is a feature common to all data model types. It is the second and last block of data in a schema file with no sections, while it is part of each of the sections’ blocks in more complex schemas. This is an example of an element block:

"elements": {
             "YR": {
                 "description": "year UTC",
                 "field_length": 4,
                 "column_type": "uint16",
                 "valid_max": 2024,
                 "valid_min": 1600,
                 "units": "year"
             },
             "MO": {
                 "description": "month UTC",
                 "field_length": 2,
                 "column_type": "uint8",
                 "valid_max": 12,
                 "valid_min": 1,
                 "units": "month"
             },
             "DY": {
                 "description": "day UTC",
                 "field_length": 2,
                 "column_type": "uint8",
                 "valid_max": 31,
                 "valid_min": 1,
                 "units": "day"
             },
             "HR": {
                 "description": "hour UTC",
                 "field_length": 4,
                 "column_type": "float32",
                 "valid_max": 23.99,
                 "valid_min": 0.0,
                 "scale": 0.01,
                 "decimal_places": 2,
                 "units": "hour"
             }}

Elements in the data are parsed in the order they are declare here. The element block above would define a file / section with elements named: YR, MO, DY and HR. All elements attributes, some of which are data type specific, are listed and detailed in the following table:

Scope

Descriptor name

Common

column_type, description, ignore, missing_value

Fixed width types

field_length

Numeric types

decimal_places, encoding, offset, scale, valid_max, valid_min

Object, str types

disable_white_strip

Key type

codetable, disable_white_strip, encoding

Datetime type

datetime_format

  • description
    • String type descriptor that describes the data element (e.g. free text describing the data element).

    • Applies to all elements

  • field_length
    • Numeric integer descriptor that determines the field length of the elements (number of bytes or number of characters in a report string).

    • Applies to the schema format type: fixed_width and is a mandatory field in the element block.

    • It can be set to null, or not present; if the element is unique in a section whose length is unknown and if this section is the last in the data model (e.g. like it is usually the case for ICOADS supplemental data section c99). If this is the case and the length is unknown the default will be set by the function mdf_reader.properties.MAX_FULL_REPORT_WIDTH(), which sets the field_length to 100000.

  • column_type
    • Numeric integer descriptor that determines the element data type.

    • Mandatory field.

    • Applies to all elements

    • Options:
      1. Numeric data types: all types interpreter by numpy.

      2. Datetimes: string or datetime64[ns] object that formats dates or datetimes when read in a single field. The object must be a datetime.datetime valid format. Can be also read via code tables and the parameter key.

  • missing_value
    • String type descriptor that denotes if there are additional missing values to tag for an element in a schema.

    • Applies to all elements

    • Default values are the same as pandas default missing values

  • ignore
    • Boolean type descriptor that ignores an element on the output

    • Options: True or False, defaults to False

    • Applies to all elements

    • Is not a mandatory field descriptor

  • units
    • String type descriptor that states the units of the measured data element.

    • Applies to column_type. [numerics] elements.

    • Is not a mandatory field descriptor

    • Defaults to None

  • encoding
    • String type descriptor added if an element needs it

    • Is not a mandatory field

    • Not to be confuse with file encoding

    • Applies to column_type. [numerics] elements and column_type. [key] elements

    • Defaults to None

    • Options:
      1. base36

      2. signed_overpunch

  • valid_max
    • Numeric type of descriptor that indicates the valid maximum value for numeric elements. This should be the valid maximum in variable declared units, after decoding and conversion (offset, scale…) and it is use for element validation.

    • Applies to column_type. [numerics] elements

    • Is not a mandatory field

    • Defaults to +inf

  • valid_min
    • Numeric type of descriptor that indicates the minimum value for numeric elements. This should be the valid minimum in variable declared units, after encoding and conversion (offset, scale …) and it is use for element validation.

    • Applies to column_type. [numerics] elements

    • Is not a mandatory field

    • Defaults to -inf

  • scale
    • Numeric type of descriptor. This scale is applied to numeric elements in order to convert the original value to the declared element units.

    • Applies to column_type. [numerics] elements

    • Is not a mandatory field

    • Defaults to 1

  • offset
    • Numeric type of descriptor. This offset is applied to numeric elements in order to convert the original value to the declared element units.

    • Applies to column_type. [numerics] elements

    • Is not a mandatory field

    • Defaults to 0

  • decimal_places
    • Numeric integer descriptor that defines the number of decimal places to which the observed value is reported.

    • Applies to column_type. [numeric_floats] elements

    • Is not a mandatory field

    • Defaults to pandas.display.precision = 6.

  • codetable
    • String type of descriptor containing the key code look up table name. It is the File basename of a code table (with no .json extension) located in the mymodel/code_tables directory. See Code tables for more information.

    • Applies to column_type. [key] elements

    • Is mandatory if "column_type": "key".

  • disable_white_strip
    • Boolean or string type descriptor that modifies the default leading/trailing blank stripping.

    • Applies to column_type. [key, object, str] elements

    • Options:
      1. do not perform any stripping: true

      2. do not perform right stripping (trailing blanks): `r`

      3. do not perform left stripping (leading blanks): `l`

    • Is not a mandatory field

    • Defaults to false

  • datetime_format
    • String type of descriptor that sets the format for the dates.

    • Applies to column_type. [datetime] elements

    • Is not a mandatory field

    • Defaults to %Y%m%d

    • All python.datetime formats are valid.

Schema section block

If the data model is organized in sections then the schema has two main blocks: the header (see Schema header block) and the sections blocks. The sections block has a separate block per section, with the following general layout:

  • A section specific header (or sub-header) with info on how to access that specific section.

  • The section’s elements block (See Schema element block)

Example of a schema section block: “core” section of the .imma schema:

"sections": {
     "core": {
         "header": {"sentinal": null,"length": 108},
         "elements": {
             "YR": {
                 "description": "year UTC",
                 "field_length": 4,
                 "column_type": "uint16",
                 "valid_max": 2024,
                 "valid_min": 1600,
                 "units": "year"
             },
             "MO": {
                 "description": "month UTC",
                 "field_length": 2,
                 "column_type": "uint8",
                 "valid_max": 12,
                 "valid_min": 1,
                 "units": "month"
             }
        }
    }
}

Section header

  • delimiter
    • String type descriptor that defines the field delimiter for the data model section.

    • Setting this descriptor makes the default value of field_layout == delimited

    • Mainly this descriptor will be use if field_layout == delimited

    • When use together with field_layout == fixed_width the code understands that the data layout is a mixture of delimited and fixed_width strings. In this case the delimiter is removed and the section is read as a fixed_width type of section.

    • Applies to delimited and fixed_width field layouts

    • It is a mandatory field only in the case that field_layout == delimited

  • disable_read
    • Boolean type descriptor that if set to True will ignore the elements of that section. This section will then be produced in the output as a single string.

    • Options: True of False

    • Defaults to False

  • field_layout
    • String type descriptor that defines the layout of fields in the section of the data model

    • Applies to all sections

    • If field delimiter is set, then field_layout defaults to delimited, else to fixed_width.

    • This descriptor does not need to be specified in the schema files in the majority of the cases. However, to account for mixed formats, like c99 section in imma1 files for deck 704, this default setting can be overridden by specifying the field_layout parameter.

    • Options:
      1. delimited or fix_width

      2. Defaults to delimited if delimiter is set, else defaults to what ever is set in the fixed_width.

  • sentinal
    • String type of descriptor that allows the code to identify a section.

    • Applies to sections of format.fixed_width

    • It is a mandatory field if the section is unique, unique in a parsing_order block, or part of a sequential parsing_order block.

    • Elements bearing the sentinal need to be, additionally, declared in the elements block.

  • length
    • Numeric integer type of descriptor that defines the length of the section (how many bytes or characters in a string).

    • Applies to format.fixed_width

    • It is a mandatory field

    • Can be also set to null, or not reported, if the section is the last one to be parsed and the length is unknown (like the c99 section of the .imma model.

Section elements

Same as Schema element block.

Code Tables

To learn about how to construct a code table, please read the Code tables section.