Page tree
Skip to end of metadata
Go to start of metadata

Purpose

This file format describes how to the user can define a data dictionary to accompany their data submissions. It enables basic quality control checks on the data.

File Format

Files can be comma, or tab separated. Comma separated files should have a .csv extension. Tab separated files should have a .txt extension.

File Naming Convention

Dictionary files should have the _phen_dd suffix. The file name of the data dictionary should be the same as the phenotypic file.

If you use dates within the filename, please write them in YYYY-MM-DD format (ISO_8601 Dates).

Example: If the phenotypic file is name dipad_2013-09-24_phen.csv then the data dictionary file should be named dipad_2013-09-24_phen_dd.csv

Cardinality

The submission should only contain one dictionary file per phenotypic file.

Contents

The data in the dictionary file should be in a tabular format. The data consists of the following columns which are as explained below.

Column names are case sensitive. i.e. ind_id is not the same as Ind_id

Columns in the user-defined data dictionary should match EXACTLY to the column names in the corresponding file, in content, case, as well as order.

Column NameDescriptionExamples
nameName of the column. No spaces.ind_id, site_id, etc.
unitUnits of the data collected in the column.For a column weight, the units can be pounds, kilograms, etc.
typeData type of the information collected in the column. It can be string, integer, decimal and can be combined with fixed_set, or encoded.
typeexample of corresponding values column in dictionary

string

 
string, encoded*M=Male|F=Female|U=Unknown
string, fixed_set*schizophrenia, disorganized|schizophrenia, paranoid|unknown
integer 
integer, encoded*1=Male|2=Female|0=Unknown
integer, fixed set*1|2|0
decimal 
decimal, encoded*1.0=One or more episodes|0.0=No reported episodes
decimal, fixed_set*1.0|0.0
date1900-11-11 (ISO 8601 format)

*For fixed-set and encoded values, the corresponding values in the file must match EXACTLY, in both case and content. Using the "string, fixed_set" example above, "schizophrenia,disorganized" or "schizophrenia, Disorganized" will NOT be recognized as "schizophrenia, disorganized"

 

min

Valid for numeric data. The minimum value allowed for the data.

You may use pre-defined variables in min/max columns, which will be substituted by the system. Available variables are current_year or current_date.

nametypemin
due_datedate{current_date}
maxValid for numeric data. The maximum value allowed for the data.

You may use pre-defined variables in min/max columns, which will be substituted by the system. Available variables are current_year or current_date.

nametypemax
year_of_deathinteger{current_year}
min_length

Valid for string data. The data value must have a minimum length of min_length

This column will be ignored if the entered "type" is integer or decimal

nametypemin_length
ssnstring10
max_length

Valid for string data. The data value must have a maximum length of max_length

This column will be ignored if the entered "type" is integer or decimal

nametypemin_lengthmax_length
postal_codestring510
unique

Assign a one or more labels (separated by a pipe '|') to identify a column, or a set of columns whose values should be unique.

Note: Unique columns allow null values.

For a column ind_id which should contain unique data the value can be set to u_ind_id

For a set of columns site_id, family_id, subject_id which combined should contain unique data the value can be set to u_sfs_id. The same label u_sfs_id should be assigned for all three columns

 

nameunique
ind_idu_ind_id
site_idu_sfs_id
family_idu_sfs_id
subject_idu_sfs_id

 

Example: Desired effect (user, col_a), and (user, col_b) combinations should be unique.

nameuniqueComment
useru_ua | u_ubSeparating labels by a pipe '|'
col_au_ua 
col_bu_ub 
mandatory

Columns which do not allow null values.

ValueDescription
No value or nA value for this field is not required.
yA value for this field is always required.

<expression>

A value for this field is required if the expression evaluates to True, otherwise the value is optional.


Expression Grammar

Access value of a field using format, c["<column-name>"], i.e. c["sex"], etc.

Constant string should be enclosed in single or double quotes, i.e "a" or 'a'

Constant integers should not be quoted, i.e. 159, 35.0

Constant booleans should be either True or False

Constant date's must be reprsented as date( "<date>"), where <date> is date in IS0 8601 format


Supported OperationsDescription
Compare Operation

<left> <cmp-op> <right>

<cmp-op> can be >, >=, <, <=, ==, !=, is, is not, in, not in

Example a > 1, b != 5, x is None

Arithmetic Operation

<left> <arithmetic-op> <right>

<arithmetic-op> can be +, -, /, *, % (Modulo)

Example

1 + 1 => 2

4 % 3 => 1

"A" + "B" => "AB"

Boolean Operation

<left> <bool-op> <right>

<bool-op> can be and, or.

Example a > 1 and a < 5

Unary Operation

<unary-op> <right>

<unary-op> can be +, -, not

Example +1, -257, etc.

Supported FunctionsDescription
min(v1,.., vn)

Returns smallest element from a list

Example min(1, 2, 3) => 1

max(v1,.., vn)

Returns largest element from a list

Example max(1, 2, 3) => 3

lower/upper

For strings, returns upper/lower cased version of string

Example "A".lower() => "a"

strip, lstrip, rstrip

For string, removes leading and/or trailing whitespace

Example

" A ".lstrip() => "A "

" A ".rstrip() => " A"

" A ".strip() => "A"

str(v)

Convert int/boolean to string

Example

str(1) => "1"

str(True) => "True"

int(v)

Convert string to integer

Example int("1") => "A"

bool(v)

Convert string to boolean

Example int("1") => True

date(v)

Convert string in ISO 8601 format to date

Example date("1980") => 1980-01-01

abs(v)

For numbers, returns the absolute value

Example

abs(-100) => 100

abs(-100.67) => 100.67

range(start, stop, step)

Returns a list of all numbers in range

Example

range(1, 3) => [1, 2]

range(1, 5, 2) => [1, 3]

pow(v, n)

Returns v to power of n

Example pow(2, 2) => 4

len(v)

Return length of string or list

Example

len( "ABC" ) => 3

len( [1, 3] ) => 2


Examples

ExpressionDescription
c["subject_type"] != "D" Evaluates to true, if value for subject_type field does not equal to string D
c["yod"] > 2000Evalues to true, if value for yod field is greater than integer 2000
c["sex"] == "M" or c["sex"] == "m"Evalues to true, if value of sex field is either string M or string m
c["sex"] in ["M", "m"]Evalues to true, if value of sex field is either string M or string m
c["sex"].strip().lower() == "m"Evalues to true, if value of sex field is either string M or string m
primary_keyFor column, or set of columns which combined uniquely identify records in the phenotypic file set this to Y

If site_id, family_id, subject_id uniquely identify records in a file


nameprimary_key
site_idy
family_idy
subject_idy
resolutionValid for floating point values.

Data contained in the column will be truncated to n digits after the decimal point.

If the data is 4.1212, and resolution for column is specified to 2 the data will be truncated to 4.12.

valuesValid when the type column contains fixed_set or encoded value.

Fixed Set*

Specifies the set of values which are allowed in the column. Example: For a column Gender the valid values are Male/Female/Unknown

Then values column should contain Male|Female|Unknown

Format: <value_1>|<value_2>|..|<value_n> separated by a pipe '|'.

Encoded*

Specifies the set of values which are allowed in the column and their corresponding meaning.

Example: For a column Gender the valid values are M/F/U

Then values column should contain M=Male|F=Female|U=Unknown

Format: <abbr_1>=<value_1>|<abbr_2>=<value_2>|..|<abbr_n>=<value_n> separated by a pipe '|'.

 

*For fixed-set and encoded values, the corresponding values in the file must match EXACTLY, in both case and content. Using the "Encoded" example above, "m" will NOT be recognized as "M"

description

Detailed text describing the information contained in this column.

This field cannot be left empty.

 


Advanced Examples

 

Consider a column Age which contains integer values between 0 and 120. The column has specially designated values -1 for Missing, -2 for Not collected

 

nametypeminmaxvalues
ageinteger,encoded0120-1=Missing|-2=Not Collected

 

Consider a column Twins which can contain only the following values Monozygotic, or Dizygotic.

 

nametypevalues
Twinsstring,fixed_setMonozygotic|Dizygotic

Sample File

distribution_dd.csv

  • No labels