TwoRavens

TwoRavens Metadata Service

The TwoRavens Metadata Service (TRMS) computes the summary statistics that power the TwoRavens interface from preprocessed tabular data. It currently supports CSV, TAB, XLS and XLSX files. This documentation describes the JSON specification language and how to use TRMS in your application.

Samples/Benchmarks

Dataverse                       Size       # Rows   # Columns   Time      First 100 Rows*   Result
ajps_replication_raw_wide.tab   86.5 kB    360      66          0.23 s    .tab              .json
sectoral_value_added.csv        380.4 kB   900      26          0.25 s    .csv              .json
cces2008_primaryvote.tab        1.0 MB     3200     7           0.19 s    .tab              .json
data.tab                        972 MB     43522    1021        96.89 s   .tab              .json

Guide

Overview

The service returns a JSON string containing four main blocks: the Self section ("self"), the Dataset section ("dataset"), the Variable section ("variables") and the Variable Display section ("variableDisplay"). Below is an example of the output.


{
    "self": {
        "description": "",
        "other attributes": ""
    },
    "dataset": {
        "description": "",
        "other attributes": ""
    },
    "variables": {
        "var_1": {
            "variableName": "",
            "other attributes": ""
        },
        "var_2": {
            "variableName": "",
            "other attributes": ""
        },
        "var_n": {
            "variableName": "",
            "other attributes": ""
        }
    },
    "variableDisplay": {
        "var_1": {
            "editable": "",
            "other attributes": ""
        },
        "var_2": {
            "editable": "",
            "other attributes": ""
        },
        "var_n": {
            "editable": "",
            "other attributes": ""
        }
    }
}
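
As a minimal sketch of how a client might consume this output, the Python snippet below parses a TRMS-style JSON string and checks that the four blocks are present. The sample payload, its values and the helper name `summarize_blocks` are illustrative assumptions, not part of the service.

```python
import json

# Illustrative payload following the four-block layout shown above;
# all values are made up for this example.
sample_output = """
{
    "self": {"description": ""},
    "dataset": {"rowCount": 3, "variableCount": 2},
    "variables": {
        "var_1": {"variableName": "var_1"},
        "var_2": {"variableName": "var_2"}
    },
    "variableDisplay": {
        "var_1": {"viewable": true},
        "var_2": {"viewable": true}
    }
}
"""

def summarize_blocks(raw):
    """Return the sorted variable keys found in a TRMS-style JSON string."""
    doc = json.loads(raw)
    # Fail early if one of the four main blocks is missing.
    for block in ("self", "dataset", "variables", "variableDisplay"):
        if block not in doc:
            raise KeyError(f"missing block: {block}")
    return sorted(doc["variables"])

names = summarize_blocks(sample_output)
print(names)
```

Checking for the four blocks up front keeps later lookups simple, since the rest of the client code can assume they exist.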

Self Section

description

A brief description containing a link to the service that generated this output.

created

A timestamp showing when the task was created, in the format YYYY-MM-DD HH:MM:SS.

preprocessId

An automatically generated ID for the current task. Currently it is always ‘None’.

version

A number describing the version of the current task.

schema (Optional)

Contains the description of the schema. If this block is required, you need to specify the value of ‘SCHEMA_INFO_DICT’. This information is not provided by default.

Attribute   Description
Name        Name of the schema being followed
Version     Version of the schema that was applied
SchemaUrl   Source URL of the schema
SchemaDoc   Link to the corresponding schema

Dataset Section

description

A brief description of the input dataset. (Currently null.)

unit of analysis

Unknown definition. (Currently null.)

structure

The structure of the input dataset.

rowCount

Number of observations in the dataset.

variableCount

Number of variables in the dataset.

dataSource (Optional)

Contains extra information about the raw file. If this block is required, specify the value of ‘data_source_info’ when the process runner is created. This information is not provided by default.

Attribute   Description
Name        Name of the input file
Type        File type (.csv, .xlsx, etc.)
Format      Format of the input file
fileSize    Size of the file in bytes

citation (Optional)

Unknown definition. Default is null.

error

A message describing an error that occurred during dataset-level analysis. This entry is absent if no error occurred.

Variable Section

note - A statistic that does not apply to a variable is reported as ‘NA’, for example:

{
 "median":"NA"
}

variableName

Name of the variable (column).

description

Brief explanation of the variable.

numchar

The type of this variable (column).

nature

The type of this variable from a statistical perspective. The possible values are listed in the table below.

Name       Definition
Nominal    Just names, IDs
Ordinal    Has or represents a rank order
Interval   Has a fixed interval size between data points
Ratio      Has a true zero point (e.g. mass, length)
Percent    Namely, [0.0, 1.0] or [0, 100]%

binary

A boolean flag indicating whether this variable is binary.

interval

For a numeric variable, indicates whether it is continuous or discrete.

temporal

Whether the variable appears to represent points in time.

timeUnit

When temporal is True, the format string of the time values, if one can be determined.

Example: the date value “2002-03-11” has the format string “%Y-%m-%d”.

See check_time for details
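
The example format string looks like the strftime/strptime convention. Assuming that holds for the values TRMS reports, a client could use timeUnit to parse raw values, as in this sketch (the variable names are illustrative):

```python
from datetime import datetime

# Example value and format string taken from the text above.
value = "2002-03-11"
time_unit = "%Y-%m-%d"

# strptime raises ValueError if the value does not match the format.
parsed = datetime.strptime(value, time_unit)
print(parsed.year, parsed.month, parsed.day)
```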

geographic

Whether this variable appears to represent locations.

locationUnit

When geographic is True, one of: ‘US state’, ‘country’, ‘country subdivision’.

See check_location for details

invalidCount

Counts the number of invalid observations, including missing values, nulls, NA’s and any observation with a value enumerated in invalidSpecialCodes.

validCount

Counts the number of valid observations, i.e. those not counted in invalidCount.

uniqueCount

Count of unique values, including invalid observations.
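
The three counts can be illustrated with a small sketch. The column values, the set of missing markers and the special code below are hypothetical, and the code only mirrors the definitions given here rather than the service's actual implementation.

```python
# Hypothetical column values and special codes, for illustration only.
values = [3, 7, None, 7, -99, "NA", 3]
invalid_special_codes = {-99}
missing_markers = {None, "NA"}

def is_invalid(v):
    """An observation is invalid if it is missing or a special code."""
    return v in missing_markers or v in invalid_special_codes

invalid_count = sum(1 for v in values if is_invalid(v))
valid_count = len(values) - invalid_count
# uniqueCount includes invalid observations, so count over all values.
unique_count = len(set(values))

print(invalid_count, valid_count, unique_count)
```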

median

note - This attribute may currently hold an incorrect value; a fix is needed.

A central value in the distribution such that there are as many values equal to or above it as there are equal to or below it. It will be ‘NA’ if the data is not numerical.

mean

Average of all numeric values that are not contained in invalidSpecialCodes. It will be ‘NA’ if the data is not numerical.

max

Largest numeric value observed in the dataset that is not contained in invalidSpecialCodes. It will be ‘NA’ if the data is not numerical.

min

Smallest numeric value observed in the dataset that is not contained in invalidSpecialCodes. It will be ‘NA’ if the data is not numerical.
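
A minimal sketch of these four statistics, following the definitions above, is shown below. The function name `numeric_summary` is hypothetical, and excluding invalidSpecialCodes from the median is an assumption, since the text states that rule only for mean, max and min.

```python
import statistics

def numeric_summary(values, invalid_special_codes=()):
    """Return median/mean/max/min, or 'NA' for each if no numeric data exists."""
    nums = [
        v for v in values
        if isinstance(v, (int, float))
        and not isinstance(v, bool)          # bool is a subclass of int
        and v not in invalid_special_codes
    ]
    if not nums:
        return {"median": "NA", "mean": "NA", "max": "NA", "min": "NA"}
    return {
        "median": statistics.median(nums),
        "mean": statistics.mean(nums),
        "max": max(nums),
        "min": min(nums),
    }

# -99 is a hypothetical special code and is ignored.
summary = numeric_summary([1, 2, 3, 10, -99], invalid_special_codes={-99})
text_summary = numeric_summary(["a", "b"])
print(summary)
print(text_summary)
```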

mode

Value that occurs most frequently. Multiple values in the case of ties.

modeFreq

Number of times the mode value is observed in the variable.

fewestValues

Value that occurs least frequently. Multiple values in the case of ties.

fewestFreq

Number of times each value in fewestValues is observed in the variable.
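
The relationship between mode, modeFreq, fewestValues and fewestFreq, including the handling of ties, can be sketched with a frequency count. The helper name `frequency_extremes` and the sample data are illustrative assumptions.

```python
from collections import Counter

def frequency_extremes(values):
    """Return the most and least frequent values, keeping all ties."""
    counts = Counter(values)
    mode_freq = max(counts.values())
    fewest_freq = min(counts.values())
    return {
        "mode": sorted(v for v, c in counts.items() if c == mode_freq),
        "modeFreq": mode_freq,
        "fewestValues": sorted(v for v, c in counts.items() if c == fewest_freq),
        "fewestFreq": fewest_freq,
    }

result = frequency_extremes(["a", "b", "a", "c", "b", "d"])
print(result)
```

Here "a" and "b" tie for the most frequent value and "c" and "d" tie for the least frequent, so both lists hold two entries.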

midpoint

The value equidistant from the reported min and max values.

midpointFreq

Number of observations with value equal to midpoint.
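
As a short worked example with made-up values, the midpoint depends only on the reported min and max, and midpointFreq counts the observations that equal it:

```python
# Hypothetical numeric column.
values = [2, 4, 6, 6, 10]

# Equidistant from min (2) and max (10).
midpoint = (min(values) + max(values)) / 2
midpoint_freq = sum(1 for v in values if v == midpoint)

print(midpoint, midpoint_freq)
```

When no observation equals the midpoint exactly, this count is simply zero.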

stdDev

Standard deviation of the values, measuring the spread between them. It is computed with the population formula.
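
To make the choice of formula concrete, the sketch below contrasts the population standard deviation (divide by n) with the sample standard deviation (divide by n - 1) using Python's standard library. The data is made up, and the snippet illustrates the formula only, not the service's code.

```python
import statistics

# Hypothetical numeric column with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population formula: sqrt(sum((x - mean)^2) / n).
population_sd = statistics.pstdev(values)
# Sample formula divides by n - 1 instead, so it is slightly larger.
sample_sd = statistics.stdev(values)

print(population_sd, sample_sd)
```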

herfindahlIndex

Measure of heterogeneity of a categorical variable which gives the probability that any two randomly sampled observations have the same value.
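
Under the usual definition of the Herfindahl index, which this sketch assumes, the value is the sum of squared category shares. This equals the probability that two observations drawn with replacement share the same value. The function name and the data are illustrative.

```python
from collections import Counter

def herfindahl_index(values):
    """Sum of squared shares of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Shares are 0.5, 0.25 and 0.25.
index = herfindahl_index(["a", "a", "b", "c"])
print(index)
```

A variable with a single distinct value yields 1.0, and the index falls toward 0 as observations spread evenly over more categories.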

warning

plotValues

Contains the y-values of the plot. Available when the plot type is PLOT_BAR.

pdfPlotType

Describes the default type of plot appropriate for representing the distribution of this variable.

pdfPlotX

A list of numbers specifying the x-coordinates of the corresponding points of the probability density function.

pdfPlotY

A list of numbers specifying the y-coordinates of the corresponding points of the probability density function.

cdfPlotType

Describes the default type of plot appropriate for representing the cumulative distribution of this variable.

cdfPlotX

A list of numbers specifying the x-coordinates of the corresponding points of the cumulative distribution function.

cdfPlotY

A list of numbers specifying the y-coordinates of the corresponding points of the cumulative distribution function.
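
A client that draws these plots needs to pair the X and Y lists point by point. The sketch below does this for a hypothetical cdfPlotX/cdfPlotY pair and adds a sanity check that the cumulative values never decrease. The arrays and the helper name `to_points` are assumptions made for illustration.

```python
# Hypothetical plot arrays, as a client might read them for one variable.
cdf_plot_x = [1.0, 2.0, 3.0, 4.0]
cdf_plot_y = [0.25, 0.5, 0.75, 1.0]

def to_points(xs, ys):
    """Pair x and y coordinates, rejecting lists of different length."""
    if len(xs) != len(ys):
        raise ValueError("x and y lists must have the same length")
    return list(zip(xs, ys))

points = to_points(cdf_plot_x, cdf_plot_y)
# A cumulative distribution should never decrease from one point to the next.
is_non_decreasing = all(a <= b for a, b in zip(cdf_plot_y, cdf_plot_y[1:]))

print(points, is_non_decreasing)
```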

Variable Display Section

editable

List of all the variable features that are editable, e.g. description, numchar, etc.

viewable

A boolean property that determines whether this variable will be shown in the processed data.

omit

A list of all the features that are to be omitted for this particular variable.

images

A list containing custom/scripted images of the variable data.
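
One way a client might apply these display settings is sketched below: variables marked as not viewable are skipped, and features listed in omit are dropped. The sample blocks, the default of treating a missing viewable flag as True, and the helper name `visible_view` are all assumptions for illustration.

```python
# Hypothetical "variables" and "variableDisplay" blocks; values are made up.
variables = {
    "var_1": {"variableName": "var_1", "mean": 4, "median": 2.5},
    "var_2": {"variableName": "var_2", "mean": "NA", "median": "NA"},
}
variable_display = {
    "var_1": {"editable": ["description", "numchar"], "viewable": True,
              "omit": ["mean"], "images": []},
    "var_2": {"editable": [], "viewable": False, "omit": [], "images": []},
}

def visible_view(variables, variable_display):
    """Keep viewable variables and remove their omitted features."""
    view = {}
    for name, stats in variables.items():
        display = variable_display.get(name, {})
        if not display.get("viewable", True):
            continue  # hidden variable
        omitted = set(display.get("omit", []))
        view[name] = {k: v for k, v in stats.items() if k not in omitted}
    return view

view = visible_view(variables, variable_display)
print(view)
```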