Data preparation guide
This guide provides detailed instructions for preparing your geospatial data for use with GEM, including code examples, validation techniques, and best practices.
Data format requirements
GEM requires input data in Apache Parquet format with a specific schema.
Required schema
| Field | Type | Description | Example |
|---|---|---|---|
| id | integer | Unique identifier for each road segment | 5707295 |
| is_navigable | boolean | Whether the road is navigable by vehicles | true |
| geometry | string | Road geometry in WKT LineString format | "LINESTRING (145.18 -37.87, 145.18 -37.87)" |
Geometry format
The geometry field must contain valid Well-Known Text (WKT) LineString geometries:
```
LINESTRING (longitude1 latitude1, longitude2 latitude2, ...)
```
Valid examples:
```
LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)
LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)
```
Invalid examples:
```
POINT (145.18156 -37.87340)      # Wrong geometry type
LINESTRING (145.18156)           # Single point, not a line
LINESTRING ((145.18 -37.87))     # Extra parentheses
```
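The rules illustrated by these examples can be sketched as a standard-library check. The function name `is_linestring_wkt` is hypothetical, and the pattern assumes 2D coordinates only; whether GEM accepts Z or M values is not stated here, so this sketch rejects them.

```python
import re

# A single number, optionally signed, with optional decimals and exponent.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# One coordinate pair: "<longitude> <latitude>".
_PAIR = rf"{_NUM}\s+{_NUM}"
# LINESTRING with at least two comma-separated pairs inside one set of parentheses.
_LINESTRING = re.compile(
    rf"^\s*LINESTRING\s*\(\s*{_PAIR}(?:\s*,\s*{_PAIR})+\s*\)\s*$",
    re.IGNORECASE,
)

def is_linestring_wkt(text) -> bool:
    """Return True if text looks like a 2D WKT LineString with two or more points."""
    return isinstance(text, str) and _LINESTRING.match(text) is not None

examples = [
    "LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)",
    "LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)",
    "POINT (145.18156 -37.87340)",
    "LINESTRING (145.18156)",
    "LINESTRING ((145.18 -37.87))",
]
for wkt_text in examples:
    print(is_linestring_wkt(wkt_text), wkt_text)
```

This is a syntactic check only: it does not confirm that the coordinates are plausible or that the line is geometrically valid.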
Creating Parquet files
Using Python (pandas + pyarrow)
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, False, True, True],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define schema with correct types
schema = pa.schema([
    ('id', pa.int64()),
    ('is_navigable', pa.bool_()),
    ('geometry', pa.string())
])

# Convert to PyArrow Table with schema
table = pa.Table.from_pandas(df, schema=schema)

# Write to Parquet
pq.write_table(table, 'my_road_data.parquet')

print(f"Created Parquet file with {len(df)} records")
```
Using Python (GeoPandas)
If your data is already in a geospatial format (Shapefile, GeoJSON, etc.):
```python
import geopandas as gpd
import pandas as pd

# Read source data
gdf = gpd.read_file('roads.shp')

# Prepare for GEM (assumes the source has a 'navigable' attribute column)
gem_data = pd.DataFrame({
    'id': range(1, len(gdf) + 1),  # Generate unique IDs
    # Default missing values to True, then force a true boolean dtype
    'is_navigable': gdf['navigable'].fillna(True).astype(bool),
    'geometry': gdf.geometry.apply(lambda g: g.wkt)  # Convert to WKT
})

# Filter to LineStrings only (this also drops MULTILINESTRING rows)
gem_data = gem_data[gem_data['geometry'].str.startswith('LINESTRING')]

# Save as Parquet
gem_data.to_parquet('gem_input.parquet', index=False)

print(f"Exported {len(gem_data)} road segments")
```
Using PySpark
For large datasets:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, BooleanType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("GEM Data Prep").getOrCreate()

# Define the target GEM schema
schema = StructType([
    StructField("id", LongType(), False),
    StructField("is_navigable", BooleanType(), False),
    StructField("geometry", StringType(), False)
])

# Read your source data
source_df = spark.read.format("your_format").load("your_data")

# Transform to GEM schema, casting each column to its target type
gem_df = source_df.select(
    source_df["road_id"].cast(schema["id"].dataType).alias("id"),
    source_df["navigable"].cast(schema["is_navigable"].dataType).alias("is_navigable"),
    source_df["wkt_geometry"].cast(schema["geometry"].dataType).alias("geometry")
)

# Write as Parquet (Spark creates a directory of part files at this path)
gem_df.write.parquet("gem_input.parquet")
```
Data validation
Always validate your data before uploading to GEM.
Python validation script
```python
import pandas as pd
import re

def validate_gem_data(filepath):
    """Validate a Parquet file for GEM compatibility."""

    print(f"Validating: {filepath}")
    errors = []
    warnings = []

    # Read the file
    try:
        df = pd.read_parquet(filepath)
    except Exception as e:
        return [f"Cannot read file: {e}"], []

    print(f"Total records: {len(df)}")

    # Check required columns
    required_cols = ['id', 'is_navigable', 'geometry']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")
        return errors, warnings

    # Check for null values
    for col in required_cols:
        null_count = df[col].isnull().sum()
        if null_count > 0:
            errors.append(f"Column '{col}' has {null_count} null values")

    # Check ID uniqueness
    duplicate_ids = df['id'].duplicated().sum()
    if duplicate_ids > 0:
        errors.append(f"Found {duplicate_ids} duplicate IDs")

    # Check data types
    if not pd.api.types.is_integer_dtype(df['id']):
        errors.append(f"Column 'id' should be integer, got {df['id'].dtype}")

    if not pd.api.types.is_bool_dtype(df['is_navigable']):
        errors.append(f"Column 'is_navigable' should be boolean, got {df['is_navigable'].dtype}")

    # Validate geometries
    linestring_pattern = r'^LINESTRING\s*\([^)]+\)$'
    invalid_geom = 0
    for idx, geom in df['geometry'].items():
        if not isinstance(geom, str):
            invalid_geom += 1
        elif not re.match(linestring_pattern, geom.strip(), re.IGNORECASE):
            invalid_geom += 1

    if invalid_geom > 0:
        errors.append(f"Found {invalid_geom} invalid geometries (must be WKT LINESTRING)")

    # Check for empty geometries
    empty_geom = df['geometry'].str.contains(r'LINESTRING\s*\(\s*\)', case=False, regex=True).sum()
    if empty_geom > 0:
        warnings.append(f"Found {empty_geom} empty geometries")

    # Summary
    print(f"\nValidation Results:")
    print(f"  Errors: {len(errors)}")
    print(f"  Warnings: {len(warnings)}")

    if errors:
        print("\nErrors:")
        for e in errors:
            print(f"  ❌ {e}")

    if warnings:
        print("\nWarnings:")
        for w in warnings:
            print(f"  ⚠️ {w}")

    if not errors:
        print("\n✅ File is valid for GEM!")

    return errors, warnings

# Usage
errors, warnings = validate_gem_data('my_road_data.parquet')
```
Quick validation with pandas
```python
import pandas as pd

df = pd.read_parquet('my_data.parquet')

# Quick checks
print("Schema:")
print(df.dtypes)
print(f"\nTotal records: {len(df)}")
print(f"Null values:\n{df.isnull().sum()}")
print(f"Duplicate IDs: {df['id'].duplicated().sum()}")
print(f"\nSample records:")
print(df.head())
```
Common data quality issues
Issue 1: Invalid geometry format
Problem: Geometries not in WKT LineString format.
Solution:
```python
from shapely import wkt
from shapely.geometry import LineString

def fix_geometry(geom):
    """Return geom if it is a valid WKT LineString, otherwise None."""
    try:
        # Parse the WKT string; non-string or malformed input raises
        parsed = wkt.loads(geom)
        if isinstance(parsed, LineString):
            return geom
        else:
            return None  # Not a LineString
    except Exception:
        return None  # Not parseable as WKT

# df is the DataFrame holding your GEM input data
df['geometry'] = df['geometry'].apply(fix_geometry)
df = df.dropna(subset=['geometry'])
```
Issue 2: Duplicate IDs
Problem: Multiple records share the same ID.
Solution:
```python
# Option 1: Keep first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')

# Option 2: Regenerate IDs
df['id'] = range(1, len(df) + 1)
```
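Before picking one of these options, it can help to see whether the duplicated IDs are exact row copies or different segments that happen to share an ID. The sketch below uses made-up rows, and the variable names (`dupes`, `conflicting_ids`) are illustrative only.

```python
import pandas as pd

# Made-up data: id 2 is reused for two different segments, id 5 is an exact copy.
df = pd.DataFrame({
    "id": [1, 2, 2, 5, 5],
    "is_navigable": [True, True, False, True, True],
    "geometry": [
        "LINESTRING (4.8952 52.3702, 4.8960 52.3710)",
        "LINESTRING (4.8960 52.3710, 4.8975 52.3725)",
        "LINESTRING (4.8975 52.3725, 4.8990 52.3740)",
        "LINESTRING (4.8990 52.3740, 4.9005 52.3755)",
        "LINESTRING (4.8990 52.3740, 4.9005 52.3755)",
    ],
})

# Every row whose id appears more than once.
dupes = df[df["id"].duplicated(keep=False)]

# Drop exact row copies; any id still repeated has conflicting content.
distinct = dupes.drop_duplicates()
conflicting_ids = sorted(distinct.loc[distinct["id"].duplicated(), "id"].unique().tolist())

print(f"Rows involved in duplicates: {len(dupes)}")
print(f"IDs with conflicting content: {conflicting_ids}")
```

Exact copies are safe to drop with Option 1. For conflicting IDs, Option 1 silently discards a distinct segment, while Option 2 keeps every row but loses the link back to the source identifiers.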
Issue 3: Mixed geometry types
Problem: Dataset contains Points, Polygons, etc. alongside LineStrings.
Solution:
```python
# Filter to LineStrings only
df = df[df['geometry'].str.upper().str.startswith('LINESTRING')]
```
Issue 4: Coordinate system issues
Problem: Coordinates in wrong order or projection.
Solution:
```python
import geopandas as gpd

# Read and reproject (to_crs fails if the source file declares no CRS)
gdf = gpd.read_file('roads.shp')
gdf = gdf.to_crs('EPSG:4326')  # Convert to WGS84

# Extract WKT; df is your GEM DataFrame, assumed to share gdf's row index
df['geometry'] = gdf.geometry.apply(lambda g: g.wkt)
```
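A quick range check can catch many order and projection problems before reprojecting. This is a minimal sketch that assumes GEM expects WGS84 longitude/latitude in that order, as the format section describes; the function name `coordinate_problems` is hypothetical.

```python
def coordinate_problems(wkt_text: str) -> list:
    """Flag coordinate pairs that cannot be WGS84 longitude/latitude values."""
    # Take the text between the outermost parentheses and split it into pairs.
    body = wkt_text[wkt_text.index("(") + 1 : wkt_text.rindex(")")]
    problems = []
    for pair in body.split(","):
        lon, lat = (float(v) for v in pair.split()[:2])
        if not -180 <= lon <= 180:
            problems.append(f"longitude out of range: {lon}")
        if not -90 <= lat <= 90:
            problems.append(f"latitude out of range: {lat}")
    return problems

# Correct order (longitude latitude): no problems reported.
print(coordinate_problems("LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)"))

# Swapped order: 145.18 cannot be a latitude.
print(coordinate_problems("LINESTRING (-37.87340 145.18156, -37.87356 145.18092)"))

# Projected coordinates in metres: far outside degree ranges.
print(coordinate_problems("LINESTRING (121000 487000, 121100 487100)"))
```

The check has a blind spot: swapped pairs where both values fall within ±90 (for example, locations in Western Europe) pass unnoticed, so spot-checking a few segments on a map is still worthwhile.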
Best practices
Before uploading
- Start small: Test with a subset (1,000-10,000 records) before processing the full dataset
- Validate thoroughly: Run the validation script on every file
- Check file size: Large files may take longer to upload; plan accordingly
- Use descriptive filenames: city_roads_2024_v1.parquet, not data.parquet
Data quality tips
- Clean geometries: Remove self-intersections and invalid geometries
- Ensure connectivity: Connected road networks match better than isolated segments
- Include all segments: Don't filter out small roads—they help with context
- Accurate navigability: Set is_navigable correctly for better matching
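The connectivity tip can be checked roughly by looking for segments whose endpoints touch no other segment. The sketch below is a simplification under stated assumptions: it compares endpoints by exact coordinate equality and ignores junctions in the middle of a segment. The function names and the sixth sample segment are made up for illustration.

```python
from collections import Counter

def endpoints(wkt_text: str):
    """Return the first and last coordinate of a WKT LineString as float tuples."""
    body = wkt_text[wkt_text.index("(") + 1 : wkt_text.rindex(")")]
    points = [tuple(float(v) for v in p.split()) for p in body.split(",")]
    return points[0], points[-1]

def isolated_segments(segments: dict) -> list:
    """Return ids of segments whose endpoints are shared with no other segment."""
    ends = {seg_id: endpoints(wkt_text) for seg_id, wkt_text in segments.items()}
    # Count how many segments use each endpoint (a closed loop counts once).
    counts = Counter(pt for pair in ends.values() for pt in set(pair))
    return [seg_id for seg_id, pair in ends.items() if all(counts[pt] == 1 for pt in pair)]

segments = {
    1: "LINESTRING (4.8952 52.3702, 4.8960 52.3710)",
    2: "LINESTRING (4.8960 52.3710, 4.8975 52.3725)",
    3: "LINESTRING (4.8975 52.3725, 4.8990 52.3740)",
    6: "LINESTRING (5.1000 52.1000, 5.1010 52.1010)",  # made-up isolated segment
}

print(isolated_segments(segments))
```

Because matching is exact, endpoints that differ only by rounding are reported as disconnected; rounding coordinates to a fixed precision before comparison is one way to relax this.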
File naming conventions
Recommended naming pattern:
```
{region}_{data_type}_{date}_{version}.parquet
```
Examples:
```
netherlands_roads_20240115_v1.parquet
california_highways_20240120_v2.parquet
tokyo_streets_20240118_final.parquet
```
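The naming pattern can be applied consistently with a small helper. This sketch assumes lowercase names, a YYYYMMDD date as in the examples, and no underscores inside individual parts; `build_filename` and `parse_filename` are hypothetical names, not part of GEM.

```python
import re
from datetime import date

# {region}_{data_type}_{date}_{version}.parquet, with the date as YYYYMMDD.
_PATTERN = re.compile(
    r"^(?P<region>[a-z0-9-]+)_(?P<data_type>[a-z0-9-]+)"
    r"_(?P<date>\d{8})_(?P<version>[a-z0-9]+)\.parquet$"
)

def build_filename(region: str, data_type: str, day: date, version: str) -> str:
    """Build a filename following the recommended pattern, or raise ValueError."""
    name = f"{region}_{data_type}_{day:%Y%m%d}_{version}.parquet".lower()
    if not _PATTERN.match(name):
        raise ValueError(f"parts do not fit the naming pattern: {name}")
    return name

def parse_filename(name: str):
    """Split a filename into its parts, or return None if it does not match."""
    match = _PATTERN.match(name)
    return match.groupdict() if match else None

print(build_filename("Netherlands", "roads", date(2024, 1, 15), "v1"))
print(parse_filename("tokyo_streets_20240118_final.parquet"))
print(parse_filename("data.parquet"))
```

Rejecting underscores inside a part keeps the name unambiguous to parse; a region such as "new york" would need a hyphen ("new-york") under this convention.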
Sample data
Here's a minimal sample file you can use for testing:
```python
import pandas as pd

# Sample Amsterdam road segments
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, True, True, False],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

df = pd.DataFrame(sample_data)
df.to_parquet('sample_gem_input.parquet', index=False)
print("Sample file created: sample_gem_input.parquet")
```
Next steps
Once your data is prepared and validated:
- UI Workflow Guide - Upload via the dashboard
- API Documentation - Upload programmatically
- Quick Reference - Command cheat sheet