Data preparation guide

This guide provides detailed instructions for preparing your geospatial data for use with GEM, including code examples, validation techniques, and best practices.

Data format requirements

GEM requires input data in Apache Parquet format with a specific schema.

Required schema

  • id (integer): Unique identifier for each road segment. Example: 5707295
  • is_navigable (boolean): Whether the road is navigable by vehicles. Example: true
  • geometry (string): Road geometry in WKT LineString format. Example: "LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)"

Geometry format

The geometry field must contain valid Well-Known Text (WKT) LineString geometries:

LINESTRING (longitude1 latitude1, longitude2 latitude2, ...)

Valid examples:

LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)
LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)

Invalid examples:

POINT (145.18156 -37.87340)      # Wrong geometry type
LINESTRING (145.18156)           # Single point, not a line
LINESTRING ((145.18 -37.87))     # Extra parentheses
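
These format rules can be checked programmatically. The helper below is a minimal regex-based sketch (pure Python, no dependencies, and the `is_gem_linestring` name is ours): it accepts plain decimal coordinates only and rejects anything that is not a LINESTRING with at least two points.

```python
import re

# One coordinate pair: "lon lat", plain decimals (scientific notation not handled)
_COORD = r'-?\d+(?:\.\d+)?\s+-?\d+(?:\.\d+)?'

# LINESTRING with at least two comma-separated coordinate pairs
_LINESTRING_RE = re.compile(
    rf'^LINESTRING\s*\(\s*{_COORD}(?:\s*,\s*{_COORD})+\s*\)$',
    re.IGNORECASE,
)

def is_gem_linestring(wkt: str) -> bool:
    """Return True if wkt looks like a valid GEM LineString geometry."""
    return bool(_LINESTRING_RE.match(wkt.strip()))

print(is_gem_linestring('LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)'))  # True
print(is_gem_linestring('POINT (145.18156 -37.87340)'))                            # False
```

For production use a full WKT parser such as shapely's wkt.loads is more robust; this sketch is only meant to catch the common errors shown above.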

Creating Parquet files

Using Python (pandas + pyarrow)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, False, True, True],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define schema with correct types
schema = pa.schema([
    ('id', pa.int64()),
    ('is_navigable', pa.bool_()),
    ('geometry', pa.string())
])

# Convert to PyArrow Table with schema
table = pa.Table.from_pandas(df, schema=schema)

# Write to Parquet
pq.write_table(table, 'my_road_data.parquet')

print(f"Created Parquet file with {len(df)} records")

Using Python (GeoPandas)

If your data is already in a geospatial format (Shapefile, GeoJSON, etc.):

import geopandas as gpd
import pandas as pd

# Read source data
gdf = gpd.read_file('roads.shp')

# Prepare for GEM
gem_data = pd.DataFrame({
    'id': range(1, len(gdf) + 1),                    # Generate unique IDs
    'is_navigable': gdf['navigable'].fillna(True),   # Default to True
    'geometry': gdf.geometry.apply(lambda g: g.wkt)  # Convert to WKT
})

# Filter to LineStrings only
gem_data = gem_data[gem_data['geometry'].str.startswith('LINESTRING')]

# Save as Parquet
gem_data.to_parquet('gem_input.parquet', index=False)

print(f"Exported {len(gem_data)} road segments")

Using PySpark

For large datasets:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, BooleanType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("GEM Data Prep").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", LongType(), False),
    StructField("is_navigable", BooleanType(), False),
    StructField("geometry", StringType(), False)
])

# Read your source data
source_df = spark.read.format("your_format").load("your_data")

# Transform to GEM schema
gem_df = source_df.select(
    source_df["road_id"].alias("id"),
    source_df["navigable"].alias("is_navigable"),
    source_df["wkt_geometry"].alias("geometry")
)

# Write as Parquet
gem_df.write.parquet("gem_input.parquet")

Data validation

Always validate your data before uploading to GEM.

Python validation script

import pandas as pd
import re

def validate_gem_data(filepath):
    """Validate a Parquet file for GEM compatibility."""

    print(f"Validating: {filepath}")
    errors = []
    warnings = []

    # Read the file
    try:
        df = pd.read_parquet(filepath)
    except Exception as e:
        return [f"Cannot read file: {e}"], []

    print(f"Total records: {len(df)}")

    # Check required columns
    required_cols = ['id', 'is_navigable', 'geometry']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")
        return errors, warnings

    # Check for null values
    for col in required_cols:
        null_count = df[col].isnull().sum()
        if null_count > 0:
            errors.append(f"Column '{col}' has {null_count} null values")

    # Check ID uniqueness
    duplicate_ids = df['id'].duplicated().sum()
    if duplicate_ids > 0:
        errors.append(f"Found {duplicate_ids} duplicate IDs")

    # Check data types
    if not pd.api.types.is_integer_dtype(df['id']):
        errors.append(f"Column 'id' should be integer, got {df['id'].dtype}")

    if not pd.api.types.is_bool_dtype(df['is_navigable']):
        errors.append(f"Column 'is_navigable' should be boolean, got {df['is_navigable'].dtype}")

    # Validate geometries
    linestring_pattern = r'^LINESTRING\s*\([^)]+\)$'
    invalid_geom = 0
    for idx, geom in df['geometry'].items():
        if not isinstance(geom, str):
            invalid_geom += 1
        elif not re.match(linestring_pattern, geom.strip(), re.IGNORECASE):
            invalid_geom += 1

    if invalid_geom > 0:
        errors.append(f"Found {invalid_geom} invalid geometries (must be WKT LINESTRING)")

    # Check for empty geometries
    empty_geom = df['geometry'].str.contains(r'LINESTRING\s*\(\s*\)', case=False, regex=True).sum()
    if empty_geom > 0:
        warnings.append(f"Found {empty_geom} empty geometries")

    # Summary
    print("\nValidation Results:")
    print(f"  Errors: {len(errors)}")
    print(f"  Warnings: {len(warnings)}")

    if errors:
        print("\nErrors:")
        for e in errors:
            print(f"  ❌ {e}")

    if warnings:
        print("\nWarnings:")
        for w in warnings:
            print(f"  ⚠️ {w}")

    if not errors:
        print("\n✅ File is valid for GEM!")

    return errors, warnings

# Usage
errors, warnings = validate_gem_data('my_road_data.parquet')

Quick validation with pandas

import pandas as pd

df = pd.read_parquet('my_data.parquet')

# Quick checks
print("Schema:")
print(df.dtypes)
print(f"\nTotal records: {len(df)}")
print(f"Null values:\n{df.isnull().sum()}")
print(f"Duplicate IDs: {df['id'].duplicated().sum()}")
print("\nSample records:")
print(df.head())

Common data quality issues

Issue 1: Invalid geometry format

Problem: Geometries not in WKT LineString format.

Solution:

from shapely import wkt
from shapely.geometry import LineString

def fix_geometry(geom):
    """Keep valid WKT LineStrings; drop everything else."""
    try:
        # If it's already a valid WKT string
        parsed = wkt.loads(geom)
        if isinstance(parsed, LineString):
            return geom
        else:
            return None  # Not a LineString
    except Exception:
        return None

df['geometry'] = df['geometry'].apply(fix_geometry)
df = df.dropna(subset=['geometry'])

Issue 2: Duplicate IDs

Problem: Multiple records share the same ID.

Solution:

# Option 1: Keep first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')

# Option 2: Regenerate IDs
df['id'] = range(1, len(df) + 1)

Issue 3: Mixed geometry types

Problem: Dataset contains Points, Polygons, etc. alongside LineStrings.

Solution:

# Filter to LineStrings only
df = df[df['geometry'].str.upper().str.startswith('LINESTRING')]

Issue 4: Coordinate system issues

Problem: Coordinates in wrong order or projection.

Solution:

import geopandas as gpd
import pandas as pd

# Read and reproject
gdf = gpd.read_file('roads.shp')
gdf = gdf.to_crs('EPSG:4326')  # Convert to WGS84 (longitude, latitude)

# Extract WKT into a plain DataFrame
df = pd.DataFrame(gdf.drop(columns='geometry'))
df['geometry'] = gdf.geometry.apply(lambda g: g.wkt)

Best practices

Before uploading

  1. Start small: Test with a subset (1,000-10,000 records) before processing the full dataset
  2. Validate thoroughly: Run the validation script on every file
  3. Check file size: Large files may take longer to upload; plan accordingly
  4. Use descriptive filenames: city_roads_2024_v1.parquet rather than data.parquet
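
As an extra pre-upload sanity check, you can verify that a file is at least structurally a Parquet file: the Parquet format places the 4-byte magic number PAR1 at both the start and the end of the file. A stdlib-only sketch (the looks_like_parquet name is ours, not part of any GEM tooling):

```python
import os

def looks_like_parquet(path: str) -> bool:
    """Cheap sanity check: Parquet files start and end with the magic bytes b'PAR1'."""
    if os.path.getsize(path) < 8:  # too small to hold both magic markers
        return False
    with open(path, 'rb') as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b'PAR1' and tail == b'PAR1'
```

This only confirms the file container, not the schema; run the full validation script for the remaining checks.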

Data quality tips

  1. Clean geometries: Remove self-intersections and invalid geometries
  2. Ensure connectivity: Connected road networks match better than isolated segments
  3. Include all segments: Don't filter out small roads—they help with context
  4. Accurate navigability: Set is_navigable correctly for better matching
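
The connectivity tip can be spot-checked without a full GIS stack. The sketch below flags segments whose endpoints touch no other segment; it assumes shared nodes have exactly identical coordinates (a simplification; real-world data may need coordinate rounding first), and both helper names are ours:

```python
from collections import Counter

def endpoints(wkt: str):
    """First and last coordinate pair of a WKT LINESTRING, as (lon, lat) tuples."""
    coords = wkt.strip()[len('LINESTRING'):].strip(' ()').split(',')
    first = tuple(float(v) for v in coords[0].split())
    last = tuple(float(v) for v in coords[-1].split())
    return first, last

def isolated_segments(wkts):
    """Indices of segments that share no endpoint with any other segment."""
    counts = Counter()
    for w in wkts:
        a, b = endpoints(w)
        counts[a] += 1
        counts[b] += 1
    return [
        i for i, w in enumerate(wkts)
        if all(counts[p] == 1 for p in endpoints(w))
    ]
```

A high proportion of isolated segments usually signals either a fragmented source network or coordinate precision mismatches at junctions.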

File naming conventions

Recommended naming pattern:

{region}_{data_type}_{date}_{version}.parquet

Examples:

  • netherlands_roads_20240115_v1.parquet
  • california_highways_20240120_v2.parquet
  • tokyo_streets_20240118_final.parquet
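
To apply the pattern consistently, a small helper can generate and check names; gem_filename and NAME_RE are hypothetical conveniences, not part of GEM itself:

```python
import re
from datetime import date
from typing import Optional

def gem_filename(region: str, data_type: str, version: str = 'v1',
                 when: Optional[date] = None) -> str:
    """Build a filename following {region}_{data_type}_{date}_{version}.parquet."""
    when = when or date.today()
    return f"{region}_{data_type}_{when:%Y%m%d}_{version}.parquet"

# Matches the recommended pattern, e.g. netherlands_roads_20240115_v1.parquet
NAME_RE = re.compile(r'^[a-z0-9]+_[a-z0-9]+_\d{8}_\w+\.parquet$')

print(gem_filename('netherlands', 'roads', 'v1', date(2024, 1, 15)))
# netherlands_roads_20240115_v1.parquet
```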

Sample data

Here's a minimal sample file you can use for testing:

import pandas as pd

# Sample Amsterdam road segments
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, True, True, False],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

df = pd.DataFrame(sample_data)
df.to_parquet('sample_gem_input.parquet', index=False)
print("Sample file created: sample_gem_input.parquet")

Next steps

Once your data is prepared and validated: