Merging Data#

Summary

This section explains three ways to merge geospatial data:

  • Merging similar datasets using the concat() function

  • Merging datasets by common attributes using the merge() function

  • Merging datasets based on spatial relationships using the sjoin() function

Merging similar datasets using concat()#

If the attributes of the input datasets are identical, they can be merged vertically using the concat() function in pandas.

Import and load required libraries:

from pathlib import Path
import geopandas as gp
import pandas as pd

Define input and output paths:

INPUT = Path.cwd().parents[0] / "00_data"
OUTPUT = Path.cwd().parents[0] / "out"

Load input datasets:

gdb_path = INPUT / "Biotopwerte_Dresden_2018.gdb"
input_data = gp.read_file(gdb_path, layer="Biotopwerte_Dresden_2018")[:1]
second_data = gp.read_file(gdb_path, layer="Biotopwerte_Dresden_2018")[1:2]
/opt/conda/envs/worker_env/lib/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: organizePolygons() received a polygon with more than 100 parts. The processing may be really slow.  You can skip the processing by setting METHOD=SKIP, or only make it analyze counter-clock wise parts by setting METHOD=ONLY_CCW if you can assume that the outline of holes is counter-clock wise defined
  return ogr_read(

The concat() function is used to concatenate two datasets. The ignore_index parameter determines whether a new index is created or if the original indexes are preserved.

  • ignore_index=True → Creates a new index for the features in the merged dataset.

  • ignore_index=False → Preserves original indexes (which may lead to duplicate indexes).

merged_dataset = pd.concat(
    [input_data, second_data], 
    ignore_index=True)

Check the results:

merged_dataset
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry
0 122 5.271487 210.523801 3371.947771 MULTIPOLYGON (((415775.635 5650481.473, 415776...
1 122 5.271487 31.935928 50.075513 MULTIPOLYGON (((417850.525 5650376.33, 417846....
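
For comparison, a minimal sketch of the ignore_index=False variant with the same two inputs: the original index labels are preserved, so if both inputs used the same labels (for example, both starting at 0) the merged dataset would contain duplicate index values.

merged_keep_index = pd.concat(
    [input_data, second_data],
    ignore_index=False)  # keep the original index labels (here 0 and 1)
merged_keep_index.index  # inspect the preserved index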

Merging based on common attributes using merge()#

If two datasets share a common attribute, they can be merged horizontally into one dataset. By default (an inner join), the output includes only rows with matching values in the common attribute.

For example, if both datasets contain an ID column, only the rows with the same ID will be merged horizontally, combining all attributes for that ID.

How the how parameter works in the merge() function

The how parameter controls how datasets are merged:

  • inner: The result includes only rows whose common-attribute value appears in both datasets.

  • outer: The result includes all rows from both datasets. Missing values are filled with NaN.

  • left: The result includes all rows from the first dataset and matching data from the second. Non-matching rows get NaN.

  • right: The result includes all rows from the second dataset and matching data from the first. Non-matching rows get NaN.

For more information, check the GeoPandas documentation.

Examples of how the different how options affect merging

The following examples show how two datasets are merged depending on the value of the how parameter.

Dataset 1:

ID  Type
1   A
3   C
6   D

Dataset 2:

ID  Time
1   15 min
3   16 min
4   17 min

  • inner: Keeps only the rows with matching ID values in both datasets.

ID  Type  Time
1   A     15 min
3   C     16 min

  • outer: Includes all rows from both datasets; missing values are filled with NaN.

ID  Type  Time
1   A     15 min
3   C     16 min
6   D     NaN
4   NaN   17 min

  • left: Keeps all rows from Dataset 1 and adds matching values from Dataset 2; unmatched values get NaN.

ID  Type  Time
1   A     15 min
3   C     16 min
6   D     NaN

  • right: Keeps all rows from Dataset 2 and adds matching values from Dataset 1; unmatched values get NaN.

ID  Type  Time
1   A     15 min
3   C     16 min
4   NaN   17 min
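
These outcomes can be reproduced with two small pandas DataFrames built from the example tables above (a minimal sketch; the variable names dataset_1 and dataset_2 are made up for the example):

import pandas as pd

# Rebuild the two example tables as DataFrames
dataset_1 = pd.DataFrame({'ID': [1, 3, 6], 'Type': ['A', 'C', 'D']})
dataset_2 = pd.DataFrame({'ID': [1, 3, 4], 'Time': ['15 min', '16 min', '17 min']})

print(dataset_1.merge(dataset_2, on='ID', how='inner'))  # IDs 1 and 3 only
print(dataset_1.merge(dataset_2, on='ID', how='outer'))  # IDs 1, 3, 6 and 4, NaN where missing
print(dataset_1.merge(dataset_2, on='ID', how='left'))   # IDs 1, 3 and 6
print(dataset_1.merge(dataset_2, on='ID', how='right'))  # IDs 1, 3 and 4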

Code example: Using the merge() function in Python

  • If the common attribute (column) has the same name in both datasets:

Merge based on the column CLC_st1, which is common to both datasets:

  • on: Defines the common attribute (column) to merge on.

  • how: Controls how datasets are merged.

merged_dataset = input_data.merge(
    second_data, 
    on='CLC_st1',     
    how='inner')       

Then check the result, transposed with .T for readability:

merged_dataset.T
0
CLC_st1 122
Biotpkt2018_x 5.271487
Shape_Length_x 210.523801
Shape_Area_x 3371.947771
geometry_x MULTIPOLYGON (((415775.6353000002 5650481.4728...
Biotpkt2018_y 5.271487
Shape_Length_y 31.935928
Shape_Area_y 50.075513
geometry_y MULTIPOLYGON (((417850.5251000002 5650376.3303...

Handling duplicate column names

If both datasets have columns with the same name (other than the common attribute), GeoPandas adds suffixes _x and _y by default.

Example: If both datasets contain a column named area, the merged dataset will have area_x (from the first dataset) and area_y (from the second dataset).

Customizing suffixes#

To rename these suffixes within the merge() function, use the suffixes parameter:

In the example below, _FIRST is appended to the columns of the first dataset and _SECOND to the columns of the second dataset.

merged_dataset = input_data.merge(
    second_data,
    on='CLC_st1',
    how='inner',
    suffixes=('_FIRST', '_SECOND')) # rename suffixes

Check the result:

merged_dataset.T
0
CLC_st1 122
Biotpkt2018_FIRST 5.271487
Shape_Length_FIRST 210.523801
Shape_Area_FIRST 3371.947771
geometry_FIRST MULTIPOLYGON (((415775.6353000002 5650481.4728...
Biotpkt2018_SECOND 5.271487
Shape_Length_SECOND 31.935928
Shape_Area_SECOND 50.075513
geometry_SECOND MULTIPOLYGON (((417850.5251000002 5650376.3303...

The following output includes only the CLC_st1 column, which is common to both datasets, and the geometry column of the second dataset.

merged_dataset = input_data[['CLC_st1']].merge(
    second_data[['CLC_st1','geometry']],
    on='CLC_st1',
    how='inner')
merged_dataset # Check the result
CLC_st1 geometry
0 122 MULTIPOLYGON (((417850.525 5650376.33, 417846....
  • If the common attribute (column) has different names in both datasets:

Load the datasets:

input_data = gp.read_file(INPUT / "Biotopwerte_Dresden_2018.gdb")
second_data = gp.read_file(INPUT / "clc_legend.csv")

Check the columns:

The output shows that both datasets include the CLC code but with different names:

  • CLC_st1 in the first dataset

  • CLC_CODE in the second dataset

input_data.columns
Index(['CLC_st1', 'Biotpkt2018', 'Shape_Length', 'Shape_Area', 'geometry'], dtype='object')
second_data.columns
Index(['GRID_CODE', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'], dtype='object')

Merge using the merge() function:

In the example below, the CLC codes in the input dataset are labelled using the LABEL3 column from the second dataset.

merged_dataset = input_data.merge(
    second_data[["CLC_CODE", "LABEL3"]],
    left_on='CLC_st1',
    right_on='CLC_CODE',
    how='inner')

merged_dataset # Check the result
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry CLC_CODE LABEL3
0 122 5.271487 210.523801 3371.947771 MULTIPOLYGON (((415775.635 5650481.473, 415776... 122 Road and rail networks and associated land
1 122 5.271487 31.935928 50.075513 MULTIPOLYGON (((417850.525 5650376.33, 417846.... 122 Road and rail networks and associated land
2 122 5.271487 810.640513 1543.310127 MULTIPOLYGON (((417886.917 5650544.364, 417909... 122 Road and rail networks and associated land
3 122 5.271487 24.509066 36.443441 MULTIPOLYGON (((423453.146 5650332.06, 423453.... 122 Road and rail networks and associated land
4 122 5.271487 29.937138 40.494155 MULTIPOLYGON (((417331.434 5650889.039, 417330... 122 Road and rail networks and associated land
... ... ... ... ... ... ... ...
33918 124 8.000000 9.072443 4.947409 MULTIPOLYGON (((414814.645 5666810.533, 414814... 124 Airports
33919 124 8.000000 1369.670301 63201.087919 MULTIPOLYGON (((414791.962 5666543.765, 414803... 124 Airports
33920 124 8.000000 395.094767 708.068118 MULTIPOLYGON (((415006.509 5666816.796, 415004... 124 Airports
33921 231 10.981298 110.373766 99.282910 MULTIPOLYGON (((417478.532 5665012.465, 417477... 231 Pastures
33922 231 10.981298 1401.832280 38939.551849 MULTIPOLYGON (((417482.897 5665014.048, 417475... 231 Pastures

33923 rows × 7 columns

Merging datasets based on spatial relationships using sjoin()#

Another way to merge datasets is by using spatial relationships instead of common attributes.

The sjoin() function can be used to join two datasets based on their spatial relationships.

Key parameters:

  • predicate - Defines the type of spatial relationship.

  • how - Specifies how the datasets are combined.

Available operations with the predicate parameter

The following list explains the available operations (a short sketch with toy geometries follows after the figure):

  • contains: Object A completely encloses object B (no boundary touch).

  • covers: Object A completely contains object B (boundaries may touch).

  • within: Object A is completely inside object B (no boundary touch).

  • covered_by: Object A is completely within object B (boundaries may touch).

  • touches: Objects A and B meet only at boundaries.

  • overlaps: Objects A and B share part of their area, but neither is completely contained in the other.

  • crosses: Objects A and B intersect, but the intersection has a lower dimension than the objects themselves (for example, two lines crossing at a point).

  • intersects: Objects A and B touch, cross, or share an area.

Fig. 15 Available operations with the predicate parameter#

Check the ArcGIS documentation to learn more about spatial relationships.
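
A small sketch with toy shapely geometries (made up for this example) illustrates the difference between contains and covers; the same predicate names are accepted by sjoin():

from shapely.geometry import Point, Polygon

square = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
inside_point = Point(2, 2)    # strictly inside the square
boundary_point = Point(0, 2)  # exactly on the square's boundary

print(square.contains(inside_point))    # True
print(square.covers(inside_point))      # True
print(square.contains(boundary_point))  # False - contains excludes pure boundary contact
print(square.covers(boundary_point))    # True  - covers allows boundary contact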

Available operations with the how parameter

The how parameter determines how the two datasets are merged:

  • left: Keeps all records from the left dataset (first dataset) and adds matching records from the right dataset. The index comes from the left dataset.

  • right: Keeps all records from the right dataset (second dataset) and adds matching records from the left dataset. The index comes from the right dataset.

  • inner: Keeps only matching records from both datasets (where the spatial condition is met). The index comes from the left dataset.

Check the GeoPandas documentation to learn more about the sjoin() function.
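
As a minimal sketch of this index behaviour (the point and polygon layers below are made up for the example):

import geopandas as gp
from shapely.geometry import Point, Polygon

points = gp.GeoDataFrame(
    {'name': ['a', 'b']},
    geometry=[Point(1, 1), Point(10, 10)],
    crs='EPSG:32633')
zones = gp.GeoDataFrame(
    {'zone': ['Z1']},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
    crs='EPSG:32633')

print(points.sjoin(zones, predicate='within', how='left'))   # both points; 'b' gets NaN, index from points
print(points.sjoin(zones, predicate='within', how='inner'))  # only point 'a', index from points
print(points.sjoin(zones, predicate='within', how='right'))  # zone Z1 with its polygon geometry, index from zones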

Code example: Using the sjoin() function in Python

Import and load required libraries:

from pathlib import Path
import geopandas as gp
import pandas as pd

Load the input and join datasets:

input_data = gp.read_file(gdb_path, layer="Biotopwerte_Dresden_2018")

# District data from the Dresden portal (already used in the Clipping Data chapter)
import requests

geojson_url = "https://kommisdd.dresden.de/net4/public/ogcapi/collections/L137/items"
response = requests.get(geojson_url)
if response.status_code == 200:
    join_data = gp.read_file(geojson_url)
    join_data = join_data[join_data['id'] == '99']
else:
    print("Error:", response.text)

Set the same coordinate system:

join_data = join_data.to_crs(input_data.crs)

Perform a spatial join with the sjoin() function and the default suffixes:

The output includes the land cover and biodiversity values for id 99, which corresponds to the Altfranken/Gompitz area in Dresden.

join_result = input_data.sjoin(
    join_data, 
    predicate='intersects', 
    how='inner')

Check the columns in the output:

list(join_result.columns)
['CLC_st1',
 'Biotpkt2018',
 'Shape_Length',
 'Shape_Area',
 'geometry',
 'index_right',
 'id',
 'bez',
 'bez_lang',
 'flaeche_km2',
 'sst',
 'sst_klar',
 'historie',
 'aend']

Display the first 3 rows:

join_result.head(3)
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry index_right id bez bez_lang flaeche_km2 sst sst_klar historie aend
204 122 5.271487 487783.286284 2.869516e+06 MULTIPOLYGON (((401752.251 5661561.055, 401808... 40 99 Altfranken/Gompitz Altfranken/Gompitz mit Ockerwitz, Pennrich, Ro... 13.0048 None None akt 20.03.2025 00:00:00
261 122 5.271487 1765.048621 6.984237e+03 MULTIPOLYGON (((406601.829 5652484.777, 406604... 40 99 Altfranken/Gompitz Altfranken/Gompitz mit Ockerwitz, Pennrich, Ro... 13.0048 None None akt 20.03.2025 00:00:00
270 122 5.271487 720.099258 2.558749e+03 MULTIPOLYGON (((402096.709 5654422.947, 402077... 40 99 Altfranken/Gompitz Altfranken/Gompitz mit Ockerwitz, Pennrich, Ro... 13.0048 None None akt 20.03.2025 00:00:00

Perform a spatial join with the sjoin() function, customizing the suffix for the right dataset:

join_result = input_data.sjoin(
    join_data, 
    predicate='intersects', 
    how='inner', 
    rsuffix='_border') # Defines a suffix for the right dataset
list(join_result.columns) # Checks the columns in the output
['CLC_st1',
 'Biotpkt2018',
 'Shape_Length',
 'Shape_Area',
 'geometry',
 'index__border',
 'id',
 'bez',
 'bez_lang',
 'flaeche_km2',
 'sst',
 'sst_klar',
 'historie',
 'aend']
  • Selecting specific columns

Instead of including all attributes from both datasets, you can specify which columns should be included in the spatial join by using double brackets [[ ]].

Since the spatial join relies on the geometry attributes, the geometry column must be included among the selected columns of both datasets.

join_result = input_data[['CLC_st1','geometry']] \
    .sjoin(
        join_data[['id','geometry']],
        predicate='intersects', how='right')

list(join_result.columns)
['index_left', 'CLC_st1', 'id', 'geometry']

Using the gp.sjoin() function

Another way of performing a spatial join is to use the gp.sjoin() function from GeoPandas. The key parameters are exactly the same as for the sjoin() method; the only difference is that both datasets are passed as arguments to the function instead of calling the method on the left dataset.

joined_result = gp.sjoin(
    input_data, # First dataset / left dataset
    join_data, # Second dataset / right dataset
    how='right',
    predicate='intersects',
    lsuffix='_left')
list(joined_result.columns)
['index__left',
 'CLC_st1',
 'Biotpkt2018',
 'Shape_Length',
 'Shape_Area',
 'id',
 'bez',
 'bez_lang',
 'flaeche_km2',
 'sst',
 'sst_klar',
 'historie',
 'aend',
 'geometry']
  • Selecting specific columns

joined_result = gp.sjoin(
    input_data[['CLC_st1','geometry']],
    join_data[['id','geometry']],
    how='right',
    predicate='intersects',
    lsuffix='_left')
list(joined_result.columns)
['index__left', 'CLC_st1', 'id', 'geometry']