Selecting and Filtering Data#

Summary

When working with large datasets, processing and visualizing all the data in Jupyter can be time-consuming or may fail entirely. To overcome this, it’s practical to limit the data by selecting only the portions relevant to the analysis.

Load the data

import geopandas as gp
import pandas as pd
from pathlib import Path
INPUT = Path.cwd().parents[0] / "00_data"
gdb_path = INPUT / "Biotopwerte_Dresden_2018.gdb"
gdf = gp.read_file(gdb_path, layer="Biotopwerte_Dresden_2018")
/opt/conda/envs/worker_env/lib/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: organizePolygons() received a polygon with more than 100 parts. The processing may be really slow.  You can skip the processing by setting METHOD=SKIP, or only make it analyze counter-clock wise parts by setting METHOD=ONLY_CCW if you can assume that the outline of holes is counter-clock wise defined
  return ogr_read(

Understanding the data#

Before selecting or filtering, it’s crucial to understand the structure and attributes of the loaded GeoDataFrame.

Examining attributes (columns)

To see the names of the columns (attributes) in the GeoDataFrame:

list(gdf.columns)
['CLC_st1', 'Biotpkt2018', 'Shape_Length', 'Shape_Area', 'geometry']

Understanding attribute data types

Knowing the data type of each attribute is essential for correct selection and filtering. Use the .dtypes method:

gdf.dtypes
CLC_st1           object
Biotpkt2018      float64
Shape_Length     float64
Shape_Area       float64
geometry        geometry
dtype: object
../_images/22-2.png

Fig. 12 Explainations of the attributes in the dataset#

Attribute formatting basics

When working with attributes:

  • Use quotes (' ') for selecting text or numbers stored as strings (object dtype).

  • For numeric values (integers or floats), use them directly without quotes.

Selecting unique values#

Identifying unique values within a column is often a first step in understanding the categories or ranges present in your data.

Identifying unique values

The .unique() function returns an array of all distinct values in a specified column:

unique_groups = gdf['CLC_st1'].unique()
print("Unique land use codes:")
print(unique_groups)
Unique land use codes:
['122' '124' '511' '332' '322' '112' '111' '121' '123' '142' '132' '211'
 '133' '231' '221' '222' '141' '324' '311' '312' '313' '321' '131' '333'
 '411' '512']

Explaination: The dataset Biotopwerte_Dresden_2018 uses the land use codes from CORINE Land Cover (CLC), which are listed in the CLC_st1 column.

../_images/19.png

Fig. 13 Corine Land Cover (CLC) classes and their corresponding land use codes (CLC_Code). For more details, see the description of the CLC nomenclature#

Counting unique values

The .nunique() function returns the number of unique values in a column:

unique_numbers = gdf['CLC_st1'].nunique()
print(f"Number of unique land use codes: {unique_numbers}")

# Including None values in the count
unique_numbers_with_na = gdf['CLC_st1'].nunique(dropna=False)
print(f"Number of unique land use codes (including NaN): {unique_numbers_with_na}")
Number of unique land use codes: 26
Number of unique land use codes (including NaN): 26

Counting value frequencies

To see how many times each unique value appears in a column, use the .groupby() and .size() methods or the .value_counts() method.

Using .groupby() and .size():

groupcounts = gdf.groupby('CLC_st1').size()
print("\nFrequency of land use codes (using groupby):")
print(pd.DataFrame(groupcounts).T)
Frequency of land use codes (using groupby):
CLC_st1   111   112   121   122  123  124  131  132  133   141  ...   312  \
0        1115  7187  2732  5454   12  133   16   47    4  1627  ...  1546   

CLC_st1   313  321  322  324  332  333  411   511  512  
0        1594    8  667  525   36   10    3  2563  234  

[1 rows x 26 columns]

Using value_counts:

groupcounts = gdf["CLC_st1"].value_counts()
print("\nFrequency of land use codes (using value_counts):")
pd.DataFrame(groupcounts).T
Frequency of land use codes (using value_counts):
CLC_st1 112 122 231 121 511 142 311 141 313 312 ... 222 132 332 131 221 123 333 321 133 411
count 7187 5454 3293 2732 2563 2413 1787 1627 1594 1546 ... 56 47 36 16 14 12 10 8 4 3

1 rows × 26 columns

Accessing counts for specific values

You can access the frequency of a particular value directly, similar to accessing a dictionary element:

# Retrieves the count for a specific value (string format)
sv_count = groupcounts['142']
print(f"\nNumber of features with code '142': {sv_count}")
Number of features with code '142': 2413
# To group by a float-formatted column
group_2 = gdf.groupby('Biotpkt2018').size()
print("\nFrequency of biodiversity indicator values (head):")
group_2[1.000000]
Frequency of biodiversity indicator values (head):
np.int64(52)

Note: Float values must match precisely, including decimal points.

Selecting data by attribute values#

You can filter your GeoDataFrame to select only the rows that meet specific conditions based on their attribute values.

Selecting rows with a single specific value

To select rows where a column has a particular value, use boolean indexing.

Selecting by string value:

filtered_data = gdf[gdf['CLC_st1'] == '133']
print("\nFirst 5 features with CLC_st1 == '133':")
filtered_data.head(5)
Hide code cell output
First 5 features with CLC_st1 == '133':
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228...
24343 133 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664....
24344 133 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562...
24347 133 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588...

Selecting by numeric value:

filtered_data = gdf[gdf['Biotpkt2018'] == 1.000000]
print("\nFirst 5 features with Biotpkt2018 == 1.0:")
filtered_data.head(5)
Hide code cell output
First 5 features with Biotpkt2018 == 1.0:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry
9492 111 1.0 411.855290 2480.458787 MULTIPOLYGON (((406985.474 5655597.694, 406984...
9504 111 1.0 89.433492 466.944726 MULTIPOLYGON (((416782.224 5652196.728, 416782...
9514 111 1.0 172.927119 1621.632535 MULTIPOLYGON (((419300.05 5657313.72, 419299.9...
9515 111 1.0 320.578260 3691.824728 MULTIPOLYGON (((408302.234 5655785.262, 408296...
9519 111 1.0 121.899927 920.751286 MULTIPOLYGON (((414687.929 5655817.21, 414656....

Selecting rows with multiple specific values

To select rows where a column’s value is one of several specific values, use the .isin() method:

filtered_data = gdf[gdf['CLC_st1'].isin(['133', '321', '411'])]
print("\nFirst 5 features with CLC_st1 in ['133', '321', '411']:")
filtered_data.head(5)
Hide code cell output
First 5 features with CLC_st1 in ['133', '321', '411']:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228...
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579...
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618....
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516...
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151...

Examining dataset dimensions#

The .shape attribute returns a tuple representing the number of rows and columns in your GeoDataFrame:

print(f"\nShape of filtered_data_isin: {filtered_data.shape}")
Shape of filtered_data_isin: (15, 5)

Adding new attributes (columns)#

You can add new columns to your GeoDataFrame with calculated values or constant values.

In the following example, the output of the previous step is used as the dataset, and a new column named test is added.

First, we make sure that our original dataframe is not modified by explicipty creating a copy of the dataframe.

# Ensure we are working on a copy of the data, not a view
filtered_data = filtered_data.copy()

Since the dataset contains 15 rows, 15 values are assigned using the loc method. (The values could be from different formats: string, null, negative, float, integer, …)

filtered_data.loc[:, 'test']=[
    'Hello', None, -7, 4.5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

filtered_data.head()
Hide code cell output
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... Hello
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... None
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... -7
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... 4.5
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 5

Adding new attributes with missing values

You can initially add a column with None values and then update specific rows:

filtered_data.loc[:,'test'] = None

To update specific rows by their index:

print(filtered_data.index)
Index([11831, 23859, 23860, 23861, 23862, 23863, 23864, 24343, 24344, 24347,
       24444, 24445, 32673, 32674, 33881],
      dtype='int64')

In the following example, index 11831, 23859, and 23862 updated with their new values.

filtered_data.loc[[11831, 23859, 23862], 'test'] = \
                               [100, '200.5', 300]
filtered_data.head()
Hide code cell output
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300

Adding new attributes from another dataset

Adding new attributes from another dataset is typically done through merging operations, which will be explained in the Merging Data section.

Adding new features (rows)#

You can add new rows to your GeoDataFrame using loc and providing values for all columns.

Here the index of the row is defined and for all the columns (CLC_st1, Biotpkt2018, …) the values should be assigned.

Example below creates a row with the index of sample and as the dataset includes 6 columns, 6 values defined for them.

filtered_data.loc['sample'] = ['Hello', 800, -7, 4.5, None, 9]
filtered_data.head(16)
Hide code cell output
/tmp/ipykernel_2215/1904651694.py:1: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  filtered_data.loc['sample'] = ['Hello', 800, -7, 4.5, None, 9]
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300
23863 321 20.164493 1251.411517 28498.211647 MULTIPOLYGON (((401726.203 5658676.434, 401724... None
23864 321 20.164493 1055.657824 15517.253917 MULTIPOLYGON (((403428.145 5659215.144, 403416... None
24343 133 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664.... None
24344 133 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562... None
24347 133 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588... None
24444 411 17.079406 389.878192 3208.873688 MULTIPOLYGON (((413977.74 5649257.41, 413976.8... None
24445 411 17.079406 786.310517 17091.353830 MULTIPOLYGON (((401901.386 5658342.569, 401914... None
32673 321 20.164493 755.991947 15304.813349 MULTIPOLYGON (((414609.258 5667182.283, 414600... None
32674 321 20.164493 806.236339 24159.996929 MULTIPOLYGON (((414627.453 5667492.653, 414633... None
33881 411 17.079406 555.046133 10593.540640 MULTIPOLYGON (((411687.325 5665387.9, 411686.9... None
sample Hello 800.000000 -7.000000 4.500000 None 9

Adding new features from another dataset

Adding new features from another dataset is typically done through concatenation operations, which will be explained in the Merging Data section.

Coping data#

You can create copies of columns or rows.

Copying a column

filtered_data.loc[:,'test_2'] = filtered_data.loc[:,'test']
print("\nGeoDataFrame with a copied 'test_copied' column:")
filtered_data.head(17)
Hide code cell output
GeoDataFrame with a copied 'test_copied' column:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test test_2
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100 100
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5 200.5
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None None
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None None
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300 300
23863 321 20.164493 1251.411517 28498.211647 MULTIPOLYGON (((401726.203 5658676.434, 401724... None None
23864 321 20.164493 1055.657824 15517.253917 MULTIPOLYGON (((403428.145 5659215.144, 403416... None None
24343 133 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664.... None None
24344 133 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562... None None
24347 133 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588... None None
24444 411 17.079406 389.878192 3208.873688 MULTIPOLYGON (((413977.74 5649257.41, 413976.8... None None
24445 411 17.079406 786.310517 17091.353830 MULTIPOLYGON (((401901.386 5658342.569, 401914... None None
32673 321 20.164493 755.991947 15304.813349 MULTIPOLYGON (((414609.258 5667182.283, 414600... None None
32674 321 20.164493 806.236339 24159.996929 MULTIPOLYGON (((414627.453 5667492.653, 414633... None None
33881 411 17.079406 555.046133 10593.540640 MULTIPOLYGON (((411687.325 5665387.9, 411686.9... None None
sample Hello 800.000000 -7.000000 4.500000 None 9 9

Copying a row

filtered_data.loc['sample_2'] = filtered_data.loc['sample']
print("\nGeoDataFrame with a copied 'sample_2' row:")
filtered_data.head(18)
Hide code cell output
GeoDataFrame with a copied 'sample_2' row:
/tmp/ipykernel_2215/922061175.py:1: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  filtered_data.loc['sample_2'] = filtered_data.loc['sample']
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test test_2
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100 100
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5 200.5
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None None
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None None
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300 300
23863 321 20.164493 1251.411517 28498.211647 MULTIPOLYGON (((401726.203 5658676.434, 401724... None None
23864 321 20.164493 1055.657824 15517.253917 MULTIPOLYGON (((403428.145 5659215.144, 403416... None None
24343 133 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664.... None None
24344 133 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562... None None
24347 133 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588... None None
24444 411 17.079406 389.878192 3208.873688 MULTIPOLYGON (((413977.74 5649257.41, 413976.8... None None
24445 411 17.079406 786.310517 17091.353830 MULTIPOLYGON (((401901.386 5658342.569, 401914... None None
32673 321 20.164493 755.991947 15304.813349 MULTIPOLYGON (((414609.258 5667182.283, 414600... None None
32674 321 20.164493 806.236339 24159.996929 MULTIPOLYGON (((414627.453 5667492.653, 414633... None None
33881 411 17.079406 555.046133 10593.540640 MULTIPOLYGON (((411687.325 5665387.9, 411686.9... None None
sample Hello 800.000000 -7.000000 4.500000 None 9 9
sample_2 Hello 800.000000 -7.000000 4.500000 None 9 9

Finding and handling null values#

Missing or null values are common in datasets. It’s important to identify and handle them appropriately.

Identifying columns with null values

The .isnull() method returns a boolean DataFrame indicating the presence of null values. Combining it with .any() checks if any null values exist in each column:

print("\nColumns with null values:")
filtered_data.isnull().any()
Columns with null values:
CLC_st1         False
Biotpkt2018     False
Shape_Length    False
Shape_Area      False
geometry         True
test             True
test_2           True
dtype: bool

Identifying indices of null values in a column

print("\nIndices with null values in 'test_2' column:")
filtered_data[filtered_data['test'].isnull()].index
Indices with null values in 'test_2' column:
Index([23860, 23861, 23863, 23864, 24343, 24344, 24347, 24444, 24445, 32673,
       32674, 33881],
      dtype='object')

Selecting rows with any null values

To select all rows that contain at least one null value:

print("\nRows with at least one null value:")
filtered_data[filtered_data.isnull().any(axis=1)]
Hide code cell output
Rows with at least one null value:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test test_2
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None None
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None None
23863 321 20.164493 1251.411517 28498.211647 MULTIPOLYGON (((401726.203 5658676.434, 401724... None None
23864 321 20.164493 1055.657824 15517.253917 MULTIPOLYGON (((403428.145 5659215.144, 403416... None None
24343 133 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664.... None None
24344 133 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562... None None
24347 133 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588... None None
24444 411 17.079406 389.878192 3208.873688 MULTIPOLYGON (((413977.74 5649257.41, 413976.8... None None
24445 411 17.079406 786.310517 17091.353830 MULTIPOLYGON (((401901.386 5658342.569, 401914... None None
32673 321 20.164493 755.991947 15304.813349 MULTIPOLYGON (((414609.258 5667182.283, 414600... None None
32674 321 20.164493 806.236339 24159.996929 MULTIPOLYGON (((414627.453 5667492.653, 414633... None None
33881 411 17.079406 555.046133 10593.540640 MULTIPOLYGON (((411687.325 5665387.9, 411686.9... None None
sample Hello 800.000000 -7.000000 4.500000 None 9 9
sample_2 Hello 800.000000 -7.000000 4.500000 None 9 9

Removing null values

The .dropna() method removes rows or columns containing null values.

Removing rows with any null value:

print("\nGeoDataFrame after dropping rows with any null value:")
filtered_data.dropna()
Hide code cell output
GeoDataFrame after dropping rows with any null value:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test test_2
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100 100
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5 200.5
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300 300

Removing rows with null values in specific columns

print("\nGeoDataFrame after dropping rows with null in 'test' column:")
filtered_data.dropna(subset=['test'])
Hide code cell output
GeoDataFrame after dropping rows with null in 'test' column:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test test_2
11831 133 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100 100
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5 200.5
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300 300
sample Hello 800.000000 -7.000000 4.500000 None 9 9
sample_2 Hello 800.000000 -7.000000 4.500000 None 9 9

Removing columns with any null value:

print("\nGeoDataFrame after dropping columns with any null value:")
filtered_data.dropna(axis=1)
Hide code cell output
GeoDataFrame after dropping columns with any null value:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area
11831 133 6.035751 17.828953 0.432276
23859 321 20.164493 618.702965 18317.038350
23860 321 20.164493 1099.972947 10894.472486
23861 321 20.164493 1023.775893 22918.685826
23862 321 20.164493 1126.678849 17980.357862
23863 321 20.164493 1251.411517 28498.211647
23864 321 20.164493 1055.657824 15517.253917
24343 133 6.035751 287.703701 4278.338534
24344 133 6.035751 183.544225 2066.507754
24347 133 6.035751 213.015432 2684.440542
24444 411 17.079406 389.878192 3208.873688
24445 411 17.079406 786.310517 17091.353830
32673 321 20.164493 755.991947 15304.813349
32674 321 20.164493 806.236339 24159.996929
33881 411 17.079406 555.046133 10593.540640
sample Hello 800.000000 -7.000000 4.500000
sample_2 Hello 800.000000 -7.000000 4.500000

Removing data#

You can remove unwanted columns or rows from your GeoDataFrame.

Removing attributes (columns)

The .drop() method removes specified columns (axis=1):

print("\nGeoDataFrame after dropping columns 'test_2' and 'CLC_st1':")
filtered_data.drop(['test_2','CLC_st1'], axis=1)
GeoDataFrame after dropping columns 'test_2' and 'CLC_st1':
Biotpkt2018 Shape_Length Shape_Area geometry test
11831 6.035751 17.828953 0.432276 MULTIPOLYGON (((411237.019 5656164.663, 411228... 100
23859 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5
23860 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None
23861 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None
23862 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300
23863 20.164493 1251.411517 28498.211647 MULTIPOLYGON (((401726.203 5658676.434, 401724... None
23864 20.164493 1055.657824 15517.253917 MULTIPOLYGON (((403428.145 5659215.144, 403416... None
24343 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664.... None
24344 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562... None
24347 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588... None
24444 17.079406 389.878192 3208.873688 MULTIPOLYGON (((413977.74 5649257.41, 413976.8... None
24445 17.079406 786.310517 17091.353830 MULTIPOLYGON (((401901.386 5658342.569, 401914... None
32673 20.164493 755.991947 15304.813349 MULTIPOLYGON (((414609.258 5667182.283, 414600... None
32674 20.164493 806.236339 24159.996929 MULTIPOLYGON (((414627.453 5667492.653, 414633... None
33881 17.079406 555.046133 10593.540640 MULTIPOLYGON (((411687.325 5665387.9, 411686.9... None
sample 800.000000 -7.000000 4.500000 None 9
sample_2 800.000000 -7.000000 4.500000 None 9

Removing features (rows)#

The .drop() method also removes specified rows by their index:

print("\nGeoDataFrame after dropping the first row and 'sample' row:")
filtered_data.drop([11831,'sample'])
Hide code cell output
GeoDataFrame after dropping the first row and 'sample' row:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry test test_2
23859 321 20.164493 618.702965 18317.038350 MULTIPOLYGON (((415627.211 5652243.292, 415579... 200.5 200.5
23860 321 20.164493 1099.972947 10894.472486 MULTIPOLYGON (((418611.74 5653542.485, 418618.... None None
23861 321 20.164493 1023.775893 22918.685826 MULTIPOLYGON (((418506.559 5653953.337, 418516... None None
23862 321 20.164493 1126.678849 17980.357862 MULTIPOLYGON (((418145.759 5654361.049, 418151... 300 300
23863 321 20.164493 1251.411517 28498.211647 MULTIPOLYGON (((401726.203 5658676.434, 401724... None None
23864 321 20.164493 1055.657824 15517.253917 MULTIPOLYGON (((403428.145 5659215.144, 403416... None None
24343 133 6.035751 287.703701 4278.338534 MULTIPOLYGON (((411656.913 5656302.92, 411664.... None None
24344 133 6.035751 183.544225 2066.507754 MULTIPOLYGON (((411567.287 5656337.438, 411562... None None
24347 133 6.035751 213.015432 2684.440542 MULTIPOLYGON (((411587.489 5655959.618, 411588... None None
24444 411 17.079406 389.878192 3208.873688 MULTIPOLYGON (((413977.74 5649257.41, 413976.8... None None
24445 411 17.079406 786.310517 17091.353830 MULTIPOLYGON (((401901.386 5658342.569, 401914... None None
32673 321 20.164493 755.991947 15304.813349 MULTIPOLYGON (((414609.258 5667182.283, 414600... None None
32674 321 20.164493 806.236339 24159.996929 MULTIPOLYGON (((414627.453 5667492.653, 414633... None None
33881 411 17.079406 555.046133 10593.540640 MULTIPOLYGON (((411687.325 5665387.9, 411686.9... None None
sample_2 Hello 800.000000 -7.000000 4.500000 None 9 9

Filtering data based on ranges#

For numeric attributes, you can filter data based on specific ranges.

To find the minimum and maximum values of the features, the following code is used:

  • min: minimum value in the specified attribute

  • max: maximum value in the specified attribute

min_value = gdf['Shape_Area'].min()
max_value = gdf['Shape_Area'].max()
print(f"minimum: {min_value:0.8f}") # 0.8f round the number to 8 decimals
print(f"maximum: {max_value:0.2f}") # 0.2f round the number to 2 decimals
minimum: 0.00000562
maximum: 3249895.12

For example, it can be decided to work only with the features having areas less than 1000.

# Filter features with Shape_Area less than 1000
filter_db = gdf[gdf['Shape_Area'] < 1000]
print("\nFirst 5 features with Shape_Area less than 1000:")
filter_db.head()
Hide code cell output
First 5 features with Shape_Area less than 1000:
CLC_st1 Biotpkt2018 Shape_Length Shape_Area geometry
1 122 5.271487 31.935928 50.075513 MULTIPOLYGON (((417850.525 5650376.33, 417846....
3 122 5.271487 24.509066 36.443441 MULTIPOLYGON (((423453.146 5650332.06, 423453....
4 122 5.271487 29.937138 40.494155 MULTIPOLYGON (((417331.434 5650889.039, 417330...
6 122 5.271487 390.293053 764.349833 MULTIPOLYGON (((407613.154 5651943.2, 407618.7...
7 122 5.271487 34.931619 45.486260 MULTIPOLYGON (((416066.508 5651463.834, 416066...

The visualization of this filtered data will be covered in the Creating Maps section.