The TwoRaven Metadata Service (TRMS) provides the summary statistics that power the TwoRaven interface, computed from preprocessed tabular data (CSV, TAB, XLS, and XLSX files are currently supported). This documentation describes the JSON specification language and how to use TRMS in your application.
Dataverse | Size | # Rows | # Columns | Time | First 100 Rows* | Result |
---|---|---|---|---|---|---|
ajps_replication_raw_wide.tab | 86.5 kB | 360 | 66 | 0.23 s | .tab | .json |
sectoral_value_added.csv | 380.4 kB | 900 | 26 | 0.25 s | .csv | .json |
cces2008_primaryvote.tab | 1.0 MB | 3200 | 7 | 0.19 s | .tab | .json |
data.tab | 972 MB | 43522 | 1021 | 96.89 s | .tab | .json |
The service returns a JSON string that contains four main blocks: the Self section, the Dataset-level section, the Variable section, and the Variable Display section. Below is an example of the output.
{
  "self": {
    "description": "",
    "other attributes": ""
  },
  "dataset": {
    "description": "",
    "other attributes": ""
  },
  "variables": {
    "var_1": {
      "variableName": "",
      "other attributes": ""
    },
    "var_2": {
      "variableName": "",
      "other attributes": ""
    },
    "var_n": {
      "variableName": "",
      "other attributes": ""
    }
  },
  "variableDisplay": {
    "var_1": {
      "editable": "",
      "other attributes": ""
    },
    "var_2": {
      "editable": "",
      "other attributes": ""
    },
    "var_n": {
      "editable": "",
      "other attributes": ""
    }
  }
}
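As a rough illustration of consuming this structure, the sketch below parses a response with Python's standard `json` module and checks that the four top-level blocks are present. The sample payload and the helper name `summarize_blocks` are made up for this example; a real response carries many more attributes.

```python
import json

# Hypothetical, heavily trimmed response used only for illustration.
SAMPLE_RESPONSE = """
{
  "self": {"description": ""},
  "dataset": {"description": ""},
  "variables": {"var_1": {"variableName": "var_1"}},
  "variableDisplay": {"var_1": {"editable": ""}}
}
"""

EXPECTED_BLOCKS = ("self", "dataset", "variables", "variableDisplay")


def summarize_blocks(raw):
    """Parse the JSON string and report how many keys each main block has."""
    doc = json.loads(raw)
    missing = [name for name in EXPECTED_BLOCKS if name not in doc]
    if missing:
        raise KeyError(f"missing blocks: {missing}")
    return {name: len(doc[name]) for name in EXPECTED_BLOCKS}


print(summarize_blocks(SAMPLE_RESPONSE))
```

Checking for the blocks up front keeps later lookups such as `doc["variables"]` from failing with a less helpful error.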
Note: The ‘self’ section contains information about the preprocessing task.
A brief description containing a link to the service that generated this output.
A timestamp showing when the task was created (YYYY-MM-DD HH:MM:SS).
An automatically generated ID for the current task (currently ‘None’).
A number describing the version of the current task.
A description of the schema. This block is not provided by default; to include it, specify the value of ‘SCHEMA_INFO_DICT’.
Attribute | Description |
---|---|
Name | Name of the schema being followed |
Version | Version of the schema that was applied |
SchemaUrl | Source URL of the schema |
SchemaDoc | Link to the corresponding schema |
Note: The ‘dataset’ section contains the key dataset-level parameters of the preprocessed file.
A brief description of the input dataset (currently null).
Definition unknown (currently null).
The structure of the input dataset.
Number of observations in the dataset.
Number of variables in the dataset.
Extra information about the raw file. This block is not provided by default; to include it, specify the value of ‘data_source_info’ when the process runner is created.
Attribute | Description |
---|---|
Name | Name of input file |
Type | File type (.csv, .xlsx, etc.) |
Format | Format of the input file |
fileSize | Size of the file in bytes |
Definition unknown (default is null).
A message describing the error that occurred during dataset-level analysis. This entry is absent if no error occurred.
Note: For all numeric calculations except ‘invalid’, missing values are ignored.
For non-numeric values, summary statistics such as median and mean are set to “NA”. For example:
{ "median":"NA" }
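When reading these statistics in client code, the “NA” placeholder has to be told apart from real numbers. The following is a minimal sketch of one way to do that; the helper name `stat_or_none` and the sample dictionaries are hypothetical and not part of TRMS.

```python
def stat_or_none(variable, key):
    """Return a statistic as a float, or None when it is absent or "NA"."""
    value = variable.get(key)
    if value is None or value == "NA":
        return None
    return float(value)


# Made-up variable entries for illustration only.
numeric_var = {"median": 4.5, "mean": 5.25}
text_var = {"median": "NA", "mean": "NA"}

print(stat_or_none(numeric_var, "median"))
print(stat_or_none(text_var, "median"))
```

Mapping “NA” to `None` lets downstream code use a single `is None` check instead of comparing against a string.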
Name of the variable (column).
Brief explanation of the variable.
The type of this variable (column).
The type of this variable from a statistical perspective. The table below lists the possible values.
Name | Definition |
---|---|
Nominal | Just names, IDs |
Ordinal | Have/Represent rank order |
Interval | Has a fixed interval size between data points |
Ratio | Has a true zero point (e.g. mass, length) |
Percent | Values in [0.0, 1.0] or [0, 100]% |
A boolean flag indicating whether this variable is binary.
For a numeric variable, indicates whether it is continuous or discrete.
Whether the variable appears to represent points in time.
When temporal is True, the format string, if one can be determined.
Example: the date value “2002-03-11” has the format string “%Y-%m-%d”.
See check_time for details
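The format string in the example above uses the same directives as Python's `datetime.strptime`. The sketch below shows how a client might test a value against a reported format string; `matches_time_format` is a hypothetical helper and does not reproduce what check_time does internally.

```python
from datetime import datetime


def matches_time_format(value, fmt):
    """Return True if `value` parses cleanly with the strptime format `fmt`."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


print(matches_time_format("2002-03-11", "%Y-%m-%d"))
print(matches_time_format("11/03/2002", "%Y-%m-%d"))
```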
Whether this variable appears to represent locations
When geographic is True, one of: ‘US state’, ‘country’, ‘country subdivision’
See check_location for details
The number of invalid observations, including missing values, nulls, NAs, and any observation with a value listed in invalidSpecialCodes.
The number of valid observations.
The number of unique values, including invalid observations.
Note: This attribute may currently hold an incorrect value; a fix is needed.
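To make the three counts concrete, here is a minimal sketch that follows the definitions above. It is not the service's implementation: the function names, the result keys, and the choice to treat empty strings as missing are assumptions made for this example.

```python
import math


def is_invalid(value, invalid_special_codes):
    """True for missing values, nulls, NA strings, and special codes."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Assumption: blank strings and the literal "NA" count as missing.
    if isinstance(value, str) and value.strip() in ("", "NA"):
        return True
    return value in invalid_special_codes


def count_observations(values, invalid_special_codes=()):
    """Count invalid, valid, and unique observations in one column."""
    invalid = sum(1 for v in values if is_invalid(v, invalid_special_codes))
    return {
        "invalid_count": invalid,
        "valid_count": len(values) - invalid,
        # Unique values are counted over the raw column, invalid ones included.
        "unique_count": len(set(values)),
    }


column = [1, 2, 2, None, "NA", -999, 3]  # made-up data
print(count_observations(column, invalid_special_codes={-999}))
```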
A central value in the distribution such that as many values are equal to or above it as are equal to or below it. It is ‘NA’ if the data is not numeric.
Average of all numeric values that are not contained in invalidSpecialCodes. It is ‘NA’ if the data is not numeric.
Largest numeric value observed in the dataset that is not contained in invalidSpecialCodes. It is ‘NA’ if the data is not numeric.
Smallest numeric value observed in the dataset that is not contained in invalidSpecialCodes. It is ‘NA’ if the data is not numeric.
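The four statistics above can be sketched with Python's standard `statistics` module. This is an illustration of the stated rules (skip missing values and special codes, report ‘NA’ for non-numeric data), not the service's own code; the function name `numeric_summary` is hypothetical.

```python
import statistics


def numeric_summary(values, invalid_special_codes=()):
    """Compute median, mean, max, and min, or "NA" for non-numeric data."""
    # Drop missing values and anything listed as a special code.
    usable = [
        v for v in values
        if v is not None and v not in invalid_special_codes
    ]
    is_numeric = bool(usable) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in usable
    )
    if not is_numeric:
        return {"median": "NA", "mean": "NA", "max": "NA", "min": "NA"}
    return {
        "median": statistics.median(usable),
        "mean": statistics.mean(usable),
        "max": max(usable),
        "min": min(usable),
    }


print(numeric_summary([3, 1, None, -999, 8], invalid_special_codes={-999}))
print(numeric_summary(["a", "b"]))
```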
Value that occurs most frequently. Multiple values in the case of ties.
Number of times the mode value is observed in the variable.
Value that occurs least frequently. Multiple values in the case of ties.
Number of times the fewestValues value is observed in the variable.
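The tie handling described above can be sketched with `collections.Counter`. The function name and return shape are assumptions for illustration, and the input is assumed to be non-empty with mutually comparable values.

```python
from collections import Counter


def most_and_least_frequent(values):
    """Return (mode values, mode count, fewest values, fewest count)."""
    counts = Counter(values)
    highest = max(counts.values())
    lowest = min(counts.values())
    # Keep every value that ties for the highest or lowest frequency.
    mode = sorted(v for v, c in counts.items() if c == highest)
    fewest = sorted(v for v, c in counts.items() if c == lowest)
    return mode, highest, fewest, lowest


print(most_and_least_frequent(["a", "b", "b", "c", "c", "d"]))
```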
The value equidistant from the reported min and max values.
Number of observations with a value equal to the midpoint.
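A short sketch of the midpoint and its frequency, assuming a non-empty list of numeric values; the function name is hypothetical.

```python
def midpoint_and_frequency(values):
    """Return the value halfway between min and max, and how often it occurs."""
    midpoint = (min(values) + max(values)) / 2
    frequency = sum(1 for v in values if v == midpoint)
    return midpoint, frequency


print(midpoint_and_frequency([2, 4, 6, 6, 10]))
```

The midpoint need not appear in the data at all, in which case the frequency is zero.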
Standard deviation of the values, measuring the spread between values, calculated with the population formula.
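The population formula divides by the number of observations n, whereas the sample formula divides by n - 1. In Python's standard library these correspond to `statistics.pstdev` and `statistics.stdev`; the data below is made up to show the difference.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

# Population formula: sqrt(sum((x - mean)^2) / n)
population_sd = statistics.pstdev(values)

# Sample formula, shown only for contrast: sqrt(sum((x - mean)^2) / (n - 1))
sample_sd = statistics.stdev(values)

print(population_sd, sample_sd)
```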
A measure of the heterogeneity of a categorical variable, given as the probability that two randomly sampled observations have the same value.
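One common way to compute such a probability is to sum the squared share of each category, which corresponds to drawing two observations with replacement. The sketch below uses that definition; whether the service uses exactly this formula is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def same_value_probability(values):
    """Sum of squared category shares for a non-empty list of values."""
    total = len(values)
    counts = Counter(values)
    return sum((count / total) ** 2 for count in counts.values())


print(same_value_probability(["a", "a", "b", "c"]))
```

A variable with a single category yields 1.0, and the result approaches 0 as observations spread evenly over many categories.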
Warning: The following attributes may be moved to the Variable Display section in the future.
Contains the y-values of the plot; available when plot_type is PLOT_BAR.
Describes the default plot type appropriate for representing the distribution of this variable.
A list of numbers specifying the x-coordinates of the corresponding points of the probability density function.
A list of numbers specifying the y-coordinates of the corresponding points of the probability density function.
Describes the default plot type appropriate for representing the cumulative distribution of this variable.
A list of numbers specifying the x-coordinates of the corresponding points of the cumulative distribution function.
A list of numbers specifying the y-coordinates of the corresponding points of the cumulative distribution function.
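To show how paired coordinate lists describe a cumulative distribution, the sketch below builds an empirical CDF from raw values. How the service chooses or samples its points is not specified here, so this is only an illustration; `empirical_cdf` is a hypothetical name and the input is assumed to be non-empty.

```python
def empirical_cdf(values):
    """Return (xs, ys) where ys[i] is the share of values <= xs[i]."""
    ordered = sorted(values)
    n = len(ordered)
    xs, ys = [], []
    for rank, value in enumerate(ordered, start=1):
        if xs and xs[-1] == value:
            # Repeated value: raise the step height instead of adding a point.
            ys[-1] = rank / n
        else:
            xs.append(value)
            ys.append(rank / n)
    return xs, ys


print(empirical_cdf([1, 2, 2, 4]))
```

The two lists always have the same length, and the final y-value is 1.0.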
Note: The ‘variableDisplay’ section contains additional parameters that control the behavior of the final plot.
A list of all the variable features that are editable, e.g. description, numchar, etc.
A boolean property, set for this variable, that decides whether this attribute is shown in the processed data.
A list of all the features to be omitted for this variable.
A list containing custom/scripted images of the variable data.