Data preparation guide

This guide provides detailed instructions for preparing your geospatial data for use with GEM, including code examples, validation techniques, and best practices.

Data format requirements

GEM requires input data in Apache Parquet format with a specific schema.

Required schema

  • id (integer): Unique identifier for each road segment. Example: 5707295
  • is_navigable (boolean): Whether the road is navigable by vehicles. Example: true
  • geometry (string): Road geometry in WKT LineString format. Example: "LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)"

Geometry format

The geometry field must contain valid Well-Known Text (WKT) LineString geometries:

LINESTRING (longitude1 latitude1, longitude2 latitude2, ...)

Valid examples:

LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)
LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)

Invalid examples:

POINT (145.18156 -37.87340)      # Wrong geometry type
LINESTRING (145.18156)           # Single point, not a line
LINESTRING ((145.18 -37.87))     # Extra parentheses
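
These format rules can be checked programmatically. The helper below is a minimal regex-based sketch (pure Python, no dependencies, and the `is_gem_linestring` name is ours): it accepts plain decimal coordinates only and rejects anything that is not a LINESTRING with at least two points.

```python
import re

# One coordinate pair: "lon lat", plain decimals (scientific notation not handled)
_COORD = r'-?\d+(?:\.\d+)?\s+-?\d+(?:\.\d+)?'

# LINESTRING with at least two comma-separated coordinate pairs
_LINESTRING_RE = re.compile(
    rf'^LINESTRING\s*\(\s*{_COORD}(?:\s*,\s*{_COORD})+\s*\)$',
    re.IGNORECASE,
)

def is_gem_linestring(wkt: str) -> bool:
    """Return True if wkt looks like a valid GEM LineString geometry."""
    return bool(_LINESTRING_RE.match(wkt.strip()))

print(is_gem_linestring('LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)'))  # True
print(is_gem_linestring('POINT (145.18156 -37.87340)'))                            # False
```

For production use a full WKT parser such as shapely's wkt.loads is more robust; this sketch is only meant to catch the common errors shown above.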

Creating Parquet files

Using Python (pandas + pyarrow)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, False, True, True],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define schema with correct types
schema = pa.schema([
    ('id', pa.int64()),
    ('is_navigable', pa.bool_()),
    ('geometry', pa.string())
])

# Convert to PyArrow Table with schema
table = pa.Table.from_pandas(df, schema=schema)

# Write to Parquet
pq.write_table(table, 'my_road_data.parquet')

print(f"Created Parquet file with {len(df)} records")

Using Python (GeoPandas)

If your data is already in a geospatial format (Shapefile, GeoJSON, etc.):

import geopandas as gpd
import pandas as pd

# Read source data
gdf = gpd.read_file('roads.shp')

# Prepare for GEM
gem_data = pd.DataFrame({
    'id': range(1, len(gdf) + 1),                    # Generate unique IDs
    'is_navigable': gdf['navigable'].fillna(True),   # Default to True
    'geometry': gdf.geometry.apply(lambda g: g.wkt)  # Convert to WKT
})

# Filter to LineStrings only
gem_data = gem_data[gem_data['geometry'].str.startswith('LINESTRING')]

# Save as Parquet
gem_data.to_parquet('gem_input.parquet', index=False)

print(f"Exported {len(gem_data)} road segments")

Using PySpark

For large datasets:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, BooleanType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("GEM Data Prep").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", LongType(), False),
    StructField("is_navigable", BooleanType(), False),
    StructField("geometry", StringType(), False)
])

# Read your source data
source_df = spark.read.format("your_format").load("your_data")

# Transform to GEM schema
gem_df = source_df.select(
    source_df["road_id"].alias("id"),
    source_df["navigable"].alias("is_navigable"),
    source_df["wkt_geometry"].alias("geometry")
)

# Write as Parquet
gem_df.write.parquet("gem_input.parquet")

Data validation

Always validate your data before uploading to GEM.

Python validation script

import pandas as pd
import re

def validate_gem_data(filepath):
    """Validate a Parquet file for GEM compatibility."""

    print(f"Validating: {filepath}")
    errors = []
    warnings = []

    # Read the file
    try:
        df = pd.read_parquet(filepath)
    except Exception as e:
        return [f"Cannot read file: {e}"], []

    print(f"Total records: {len(df)}")

    # Check required columns
    required_cols = ['id', 'is_navigable', 'geometry']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")
        return errors, warnings

    # Check for null values
    for col in required_cols:
        null_count = df[col].isnull().sum()
        if null_count > 0:
            errors.append(f"Column '{col}' has {null_count} null values")

    # Check ID uniqueness
    duplicate_ids = df['id'].duplicated().sum()
    if duplicate_ids > 0:
        errors.append(f"Found {duplicate_ids} duplicate IDs")

    # Check data types
    if not pd.api.types.is_integer_dtype(df['id']):
        errors.append(f"Column 'id' should be integer, got {df['id'].dtype}")

    if not pd.api.types.is_bool_dtype(df['is_navigable']):
        errors.append(f"Column 'is_navigable' should be boolean, got {df['is_navigable'].dtype}")

    # Validate geometries
    linestring_pattern = r'^LINESTRING\s*\([^)]+\)$'
    invalid_geom = 0
    for idx, geom in df['geometry'].items():
        if not isinstance(geom, str):
            invalid_geom += 1
        elif not re.match(linestring_pattern, geom.strip(), re.IGNORECASE):
            invalid_geom += 1

    if invalid_geom > 0:
        errors.append(f"Found {invalid_geom} invalid geometries (must be WKT LINESTRING)")

    # Check for empty geometries
    empty_geom = df['geometry'].str.contains(r'LINESTRING\s*\(\s*\)', case=False, regex=True).sum()
    if empty_geom > 0:
        warnings.append(f"Found {empty_geom} empty geometries")

    # Summary
    print("\nValidation Results:")
    print(f"  Errors: {len(errors)}")
    print(f"  Warnings: {len(warnings)}")

    if errors:
        print("\nErrors:")
        for e in errors:
            print(f"  ❌ {e}")

    if warnings:
        print("\nWarnings:")
        for w in warnings:
            print(f"  ⚠️ {w}")

    if not errors:
        print("\n✅ File is valid for GEM!")

    return errors, warnings

# Usage
errors, warnings = validate_gem_data('my_road_data.parquet')

Quick validation with pandas

import pandas as pd

df = pd.read_parquet('my_data.parquet')

# Quick checks
print("Schema:")
print(df.dtypes)
print(f"\nTotal records: {len(df)}")
print(f"Null values:\n{df.isnull().sum()}")
print(f"Duplicate IDs: {df['id'].duplicated().sum()}")
print("\nSample records:")
print(df.head())

Common data quality issues

Issue 1: Invalid geometry format

Problem: Geometries not in WKT LineString format.

Solution:

from shapely import wkt
from shapely.geometry import LineString

def fix_geometry(geom):
    """Keep valid WKT LineStrings; drop everything else."""
    try:
        # If it's already a valid WKT string
        parsed = wkt.loads(geom)
        if isinstance(parsed, LineString):
            return geom
        else:
            return None  # Not a LineString
    except Exception:
        return None

df['geometry'] = df['geometry'].apply(fix_geometry)
df = df.dropna(subset=['geometry'])

Issue 2: Duplicate IDs

Problem: Multiple records share the same ID.

Solution:

# Option 1: Keep first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')

# Option 2: Regenerate IDs
df['id'] = range(1, len(df) + 1)

Issue 3: Mixed geometry types

Problem: Dataset contains Points, Polygons, etc. alongside LineStrings.

Solution:

# Filter to LineStrings only
df = df[df['geometry'].str.upper().str.startswith('LINESTRING')]

Issue 4: Coordinate system issues

Problem: Coordinates in wrong order or projection.

Solution:

import geopandas as gpd
import pandas as pd

# Read and reproject
gdf = gpd.read_file('roads.shp')
gdf = gdf.to_crs('EPSG:4326')  # Convert to WGS84 (longitude, latitude)

# Extract WKT into a plain DataFrame
df = pd.DataFrame(gdf.drop(columns='geometry'))
df['geometry'] = gdf.geometry.apply(lambda g: g.wkt)

Best practices

Before uploading

  1. Start small: Test with a subset (1,000-10,000 records) before processing the full dataset
  2. Validate thoroughly: Run the validation script on every file
  3. Check file size: Large files may take longer to upload; plan accordingly
  4. Use descriptive filenames: city_roads_2024_v1.parquet rather than data.parquet
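
As an extra pre-upload sanity check, you can verify that a file is at least structurally a Parquet file: the Parquet format places the 4-byte magic number PAR1 at both the start and the end of the file. A stdlib-only sketch (the looks_like_parquet name is ours, not part of any GEM tooling):

```python
import os

def looks_like_parquet(path: str) -> bool:
    """Cheap sanity check: Parquet files start and end with the magic bytes b'PAR1'."""
    if os.path.getsize(path) < 8:  # too small to hold both magic markers
        return False
    with open(path, 'rb') as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b'PAR1' and tail == b'PAR1'
```

This only confirms the file container, not the schema; run the full validation script for the remaining checks.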

Data quality tips

  1. Clean geometries: Remove self-intersections and invalid geometries
  2. Ensure connectivity: Connected road networks match better than isolated segments
  3. Include all segments: Don't filter out small roads—they help with context
  4. Accurate navigability: Set is_navigable correctly for better matching
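
The connectivity tip can be spot-checked without a full GIS stack. The sketch below flags segments whose endpoints touch no other segment; it assumes shared nodes have exactly identical coordinates (a simplification; real-world data may need coordinate rounding first), and both helper names are ours:

```python
from collections import Counter

def endpoints(wkt: str):
    """First and last coordinate pair of a WKT LINESTRING, as (lon, lat) tuples."""
    coords = wkt.strip()[len('LINESTRING'):].strip(' ()').split(',')
    first = tuple(float(v) for v in coords[0].split())
    last = tuple(float(v) for v in coords[-1].split())
    return first, last

def isolated_segments(wkts):
    """Indices of segments that share no endpoint with any other segment."""
    counts = Counter()
    for w in wkts:
        a, b = endpoints(w)
        counts[a] += 1
        counts[b] += 1
    return [
        i for i, w in enumerate(wkts)
        if all(counts[p] == 1 for p in endpoints(w))
    ]
```

A high proportion of isolated segments usually signals either a fragmented source network or coordinate precision mismatches at junctions.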

File naming conventions

Recommended naming pattern:

{region}_{data_type}_{date}_{version}.parquet

Examples:

  • netherlands_roads_20240115_v1.parquet
  • california_highways_20240120_v2.parquet
  • tokyo_streets_20240118_final.parquet
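
To apply the pattern consistently, a small helper can generate and check names; gem_filename and NAME_RE are hypothetical conveniences, not part of GEM itself:

```python
import re
from datetime import date
from typing import Optional

def gem_filename(region: str, data_type: str, version: str = 'v1',
                 when: Optional[date] = None) -> str:
    """Build a filename following {region}_{data_type}_{date}_{version}.parquet."""
    when = when or date.today()
    return f"{region}_{data_type}_{when:%Y%m%d}_{version}.parquet"

# Matches the recommended pattern, e.g. netherlands_roads_20240115_v1.parquet
NAME_RE = re.compile(r'^[a-z0-9]+_[a-z0-9]+_\d{8}_\w+\.parquet$')

print(gem_filename('netherlands', 'roads', 'v1', date(2024, 1, 15)))
# netherlands_roads_20240115_v1.parquet
```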

Sample data

Here's a minimal sample file you can use for testing:

import pandas as pd

# Sample Amsterdam road segments
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, True, True, False],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

df = pd.DataFrame(sample_data)
df.to_parquet('sample_gem_input.parquet', index=False)
print("Sample file created: sample_gem_input.parquet")

Next steps

Once your data is prepared and validated: