Clipping Data#

Clipping is a process used to extract specific parts of data based on a predefined dataset that represents the area of interest. The geopandas and matplotlib packages are used for this process.

import geopandas as gp
import matplotlib.pyplot as plt
from pathlib import Path

INPUT = Path.cwd().parents[0] / "00_data"
OUTPUT = Path.cwd().parents[0] / "out"
OUTPUT.mkdir(exist_ok=True)

gdb_path = INPUT / "Biotopwerte_Dresden_2018.gdb"

Clipping Based on Zone (Dataset)#

The areas of interest in this chapter are districts in Dresden.

For retrieving the data, from the Geojson section of the portal, based on the interested format, the URL is chosen.

geojson_url = "https://kommisdd.dresden.de/net4/public/ogcapi/collections/L137/items"

Then the requests library is imported and a request is sent to the portal to extract the data.

import requests

response = requests.get(geojson_url)

Then if the request is successful (code 200), it will load and plot the data.

if response.status_code == 200:
    gdf = gp.read_file(geojson_url)
    print(gdf.head())
    ax=gdf.plot()
    ax.set_axis_off()
else:
    print("Error:", response.text)
Hide code cell output
   id                        bez  \
0  02        Pirnaische Vorstadt   
1  53               Striesen-Süd   
2  97  Gorbitz-Nord/Neu-Omsewitz   
3  22                    Mickten   
4  95                Gorbitz-Süd   

                                         bez_lang  flaeche_km2   sst sst_klar  \
0                             Pirnaische Vorstadt       0.9281  None     None   
1             Striesen-Süd mit Johannstadt-Südost       1.3276  None     None   
2                       Gorbitz-Nord/Neu-Omsewitz       0.8557  None     None   
3  Mickten mit Trachau-Süd, Übigau und Kaditz-Süd       4.2435  None     None   
4                                     Gorbitz-Süd       1.2604  None     None   

  historie                 aend  \
0      akt  20.03.2025 00:00:00   
1      akt  20.03.2025 00:00:00   
2      akt  20.03.2025 00:00:00   
3      akt  20.03.2025 00:00:00   
4      akt  20.03.2025 00:00:00   

                                            geometry  
0  POLYGON ((13.75017 51.04619, 13.75036 51.04612...  
1  POLYGON ((13.77558 51.04538, 13.77544 51.04543...  
2  POLYGON ((13.6723 51.05024, 13.67231 51.0503, ...  
3  POLYGON ((13.70028 51.08324, 13.70015 51.08322...  
4  POLYGON ((13.66572 51.04003, 13.66609 51.04002...  
../_images/4e7f7b66e1057d9a4b23dad437251f402f32d2e460e1e6af38c458fd2b377ec7.png

To add the labels, first, the position of the labels should be defined.

The centroid of the polygons would be the position of the labels and is generated using the geometry.centroid method and stored in a new column.

gdf["centroid"] = gdf.geometry.centroid 
/tmp/ipykernel_2419/1565410799.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  gdf["centroid"] = gdf.geometry.centroid

For adding the labels the text function is used.

The coordinates of the centroid are the position of the labels and the label text is defined with str. In the following example, the IDs of the districts are used.

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, color='white', edgecolor='navy')
ax.set_axis_off()

for idx, row in gdf.iterrows():
    plt.text(row.centroid.x, row.centroid.y, str(row['id']),
             fontsize=8, 
             color='red')

plt.title("Dresden Districts with ID Labels")
plt.show()
../_images/5b5004adf0c77ef19cc547f47700b10b3a745b3a1f6305cdea3a0a5779e47b01.png

Now using the labels the interested area is selected.

clipping_dataset= gdf[gdf['id'].isin(['90','99'])]
ax=clipping_dataset.plot()
ax.set_axis_off()
../_images/f705df3d240a140f5cf9af9ce0fdb06dda19158f8f90b5d5a018663c922dc4ce.png

The path to the original dataset and the clipping dataset, based on which the original dataset should be clipped, will be defined.

original_dataset = gp.read_file(gdb_path, layer="Biotopwerte_Dresden_2018")
/opt/conda/envs/worker_env/lib/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: organizePolygons() received a polygon with more than 100 parts. The processing may be really slow.  You can skip the processing by setting METHOD=SKIP, or only make it analyze counter-clock wise parts by setting METHOD=ONLY_CCW if you can assume that the outline of holes is counter-clock wise defined
  return ogr_read(

The coordinate system of the datasets is checked. If the coordinate systems do not match, the clipping process will not proceed.

original_dataset.crs
<Projected CRS: EPSG:25833>
Name: ETRS89 / UTM zone 33N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Europe between 12°E and 18°E: Austria; Croatia; Denmark - offshore and offshore; Germany - onshore and offshore; Italy - onshore and offshore; Norway including Svalbard - onshore and offshore.
- bounds: (12.0, 34.79, 18.01, 84.01)
Coordinate Operation:
- name: UTM zone 33N
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
clipping_dataset.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

If the coordinate systems are not the same, they are transformed to match.

clipping_dataset = clipping_dataset.to_crs('EPSG:25833')

The dataset is then clipped by defining the original dataset and the clipping dataset using the clip function in the geopandas package.

clipped_dataset = gp.clip(original_dataset, clipping_dataset)

And the result of the clip operation will be printed out.

clipped_dataset.head(3)
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry
261 122 5.271487 1765.048621 6984.236836 POLYGON ((404092.96 5654379.53, 404098.204 565...
5928 322 18.055116 204.848619 582.329680 POLYGON ((404867.05 5654106.356, 404862.483 56...
13027 142 7.000000 481.434416 11886.564349 POLYGON ((405339.13 5654136.481, 405172.058 56...
ax = clipped_dataset.plot(color="green", linewidth=0.5, alpha=0.5, edgecolor="black")
ax.set_title("Clipped dataset")
ax.set_axis_off()
../_images/f3d9a34602661ff58c8db5baf432b0941d0c8060aa8938b2a789112d1b0a13d6.png

The result can be saved in various formats. All you need to do is define the output path and export based on the format you are interested in.

  • Shapefile Format

output_shp = OUTPUT / "clipped.shp"
clipped_dataset.to_file(output_shp)
/tmp/ipykernel_2419/1343647297.py:2: UserWarning: Column names longer than 10 characters will be truncated when saved to ESRI Shapefile.
  clipped_dataset.to_file(output_shp)
/opt/conda/envs/worker_env/lib/python3.13/site-packages/pyogrio/raw.py:723: RuntimeWarning: Normalized/laundered field name: 'Biotpkt2018' to 'Biotpkt201'
  ogr_write(
/opt/conda/envs/worker_env/lib/python3.13/site-packages/pyogrio/raw.py:723: RuntimeWarning: Normalized/laundered field name: 'Shape_Length' to 'Shape_Leng'
  ogr_write(
  • GeoPackage Format

output_gdb = OUTPUT / "clipped.gpkg"
clipped_dataset.to_file(
    output_gdb,
    layer="clipped_dataset",
    driver="GPKG")
  • CSV Format or WKT Format

Have a look at the geometry column, which commonly exists for GeoDataFrames

clipped_dataset.geometry
Hide code cell output
261      POLYGON ((404092.96 5654379.53, 404098.204 565...
5928     POLYGON ((404867.05 5654106.356, 404862.483 56...
13027    POLYGON ((405339.13 5654136.481, 405172.058 56...
15142    POLYGON ((405337.399 5654208.987, 405336.192 5...
3572     POLYGON ((405336.192 5654214.027, 405335.975 5...
                               ...                        
24741    POLYGON ((403517.379 5661823.326, 403517.089 5...
11147    POLYGON ((403506.01 5661746.878, 403512.79 566...
33830    MULTIPOLYGON (((403834.523 5661937.043, 403787...
33829    POLYGON ((403103.13 5662019.63, 403184.07 5662...
33831    POLYGON ((403506.092 5662087.066, 403484.425 5...
Name: geometry, Length: 2923, dtype: geometry

First, ensure that the geometry column contains valid geometries. We need to check if all values in clipped_dataset["geometry"] are actual shapely.geometry objects.

Check if there are any missing geometries (.isna()).

print(clipped_dataset["geometry"].isna().sum())  # Count missing values
0

Check the types of geometries.

print(clipped_dataset.geom_type.value_counts())
Polygon         2833
MultiPolygon      90
Name: count, dtype: int64

Explicitly store WKT geometries in a new column. This uses the apply function with a lambda expression to convert each geometry in the column to a string format.

from shapely.geometry import Polygon, MultiPolygon

clipped_dataset["geometry_wkt"] = clipped_dataset["geometry"].apply(
    lambda geom: geom.wkt if isinstance(geom, (Polygon, MultiPolygon)) else None
)

The output path is defined, and the data is saved as a CSV file. If the index column, which represents unique values for rows, is not needed, the index parameter can be set to False.

Parameters for CSV file

To learn how to customize your CSV data, visit the pandas DataFrame to CSV page.

csv_path = OUTPUT / "clipped_dataset.csv"
clipped_dataset.to_csv(csv_path, index=True)

Clipping Based on Bounding Box#

To show a specific area of the dataset, the data can be clipped to a defined bounding box and then visualized.

For this reason, the required part of the dataset is selected and the bounding box is extracted for this rows.

Bounding Box

The bounding box extraction explained more in Creating Maps Chapter.

required_part= original_dataset[80:120].total_bounds

Then the dataset clipped for plotting:

gdf_clipped = gp.clip(original_dataset, required_part)
fig, ax = plt.subplots(
    nrows=1, ncols=1, figsize=(12, 18), dpi=300)

ax = gdf_clipped.plot(
    ax=ax,
    column='Biotpkt2018', 
    cmap='viridis_r',
    legend=True,
    legend_kwds={
        "label": "Biodiversity value", "orientation": "horizontal",
        "shrink": 0.3, "pad": 0.01})

ax.set_title('Biodiversity Values in Dresden Center',
             backgroundcolor='#70A401',
             color='white')
ax.set_axis_off()
../_images/d9b5a54be62b258b482f2917b1947d0d78dcdc8eefed894c8a495ff7b1378fcc.png