Selecting and Filtering Data#
Summary
When working with large datasets, processing and visualizing all the data in Jupyter can be time-consuming or may fail entirely. To overcome this, it’s practical to limit the data by selecting only the portions relevant to the analysis.
Load the data
import geopandas as gp
import pandas as pd
from pathlib import Path
INPUT = Path.cwd().parents[0] / "00_data"
gdb_path = INPUT / "Biotopwerte_Dresden_2018.gdb"
gdf = gp.read_file(gdb_path, layer="Biotopwerte_Dresden_2018")
/opt/conda/envs/worker_env/lib/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: organizePolygons() received a polygon with more than 100 parts. The processing may be really slow. You can skip the processing by setting METHOD=SKIP, or only make it analyze counter-clock wise parts by setting METHOD=ONLY_CCW if you can assume that the outline of holes is counter-clock wise defined
return ogr_read(
Understanding the data#
Before selecting or filtering, it’s crucial to understand the structure and attributes of the loaded GeoDataFrame.
Examining attributes (columns)
To see the names of the columns (attributes) in the GeoDataFrame:
list(gdf.columns)
['CLC_st1', 'Biotpkt2018', 'Shape_Length', 'Shape_Area', 'geometry']
Understanding attribute data types
Knowing the data type of each attribute is essential for correct selection and filtering. Use the .dtypes attribute:
gdf.dtypes
CLC_st1 object
Biotpkt2018 float64
Shape_Length float64
Shape_Area float64
geometry geometry
dtype: object

Fig. 12 Explanations of the attributes in the dataset#
Attribute formatting basics
When working with attributes:
Use quotes (' ') for selecting text or numbers stored as strings (object dtype).
For numeric values (integers or floats), use them directly without quotes, as shown in the example below.
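A minimal sketch, using the gdf loaded above (both columns exist in this dataset; the variable names are only illustrative):
# CLC_st1 is stored as text (object dtype), so the value is quoted even though it looks numeric
roads = gdf[gdf['CLC_st1'] == '122']
# Biotpkt2018 is a float column, so the value is used without quotes
low_value = gdf[gdf['Biotpkt2018'] == 1.0]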
Understanding the dataset “Biotopwerte_Dresden_2018”
Germany has a wide variety of habitats, also known as biotopes. These biotopes can be of natural origin, such as those found in marine and coastal areas, inland waters, terrestrial and semi-terrestrial zones, and mountainous regions such as the Alps. However, there are also near-natural biotopes that are the result of technical or infrastructural conditions, such as small green spaces and open areas in urban environments.
Due to their specific characteristics, these biotopes can be classified into different biotope types according to uniform and distinguishable criteria.
The biodiversity indicator assesses the quality and diversity of biotope types as habitats for different species. This assessment is carried out using biotope value points ranging from 0-27, in accordance with the Federal Compensation Ordinance. These points are combined with spatial land cover data and intermittently available specialised data on ecosystem condition. This combination of data allows a comparative assessment of Germany’s ecosystem inventory in terms of both area and quality.
By applying biotope value points, a monetary valuation can also be carried out. This valuation is based on the average costs associated with the creation, development and maintenance of higher quality biotopes.
For more information, see the description on the IOER FDZ website.
Recommended citation
Syrbe, Ralf-Uwe, 2025, “Biotopwerte Dresden 2018”, https://doi.org/10.82617/ioer-fdz/HPXYJD, ioerDATA, V1
Selecting unique values#
Identifying unique values within a column is often a first step in understanding the categories or ranges present in your data.
Identifying unique values
The .unique() method returns an array of all distinct values in a specified column:
unique_groups = gdf['CLC_st1'].unique()
print("Unique land use codes:")
print(unique_groups)
Unique land use codes:
['122' '124' '511' '332' '322' '112' '111' '121' '123' '142' '132' '211'
'133' '231' '221' '222' '141' '324' '311' '312' '313' '321' '131' '333'
'411' '512']
Explanation: The dataset Biotopwerte_Dresden_2018 uses the land use codes from CORINE Land Cover (CLC), which are listed in the CLC_st1 column.

Fig. 13 Corine Land Cover (CLC) classes and their corresponding land use codes (CLC_Code). For more details, see the description of the CLC nomenclature#
Counting unique values
The .nunique() method returns the number of unique values in a column:
unique_numbers = gdf['CLC_st1'].nunique()
print(f"Number of unique land use codes: {unique_numbers}")
# Including None values in the count
unique_numbers_with_na = gdf['CLC_st1'].nunique(dropna=False)
print(f"Number of unique land use codes (including NaN): {unique_numbers_with_na}")
Number of unique land use codes: 26
Number of unique land use codes (including NaN): 26
Counting value frequencies
To see how many times each unique value appears in a column, use the .groupby() and .size() methods or the .value_counts() method.
Using .groupby() and .size():
groupcounts = gdf.groupby('CLC_st1').size()
print("\nFrequency of land use codes (using groupby):")
print(pd.DataFrame(groupcounts).T)
Frequency of land use codes (using groupby):
CLC_st1 111 112 121 122 123 124 131 132 133 141 ... 312 \
0 1115 7187 2732 5454 12 133 16 47 4 1627 ... 1546
CLC_st1 313 321 322 324 332 333 411 511 512
0 1594 8 667 525 36 10 3 2563 234
[1 rows x 26 columns]
Using .value_counts():
groupcounts = gdf["CLC_st1"].value_counts()
print("\nFrequency of land use codes (using value_counts):")
pd.DataFrame(groupcounts).T
Frequency of land use codes (using value_counts):
CLC_st1 | 112 | 122 | 231 | 121 | 511 | 142 | 311 | 141 | 313 | 312 | ... | 222 | 132 | 332 | 131 | 221 | 123 | 333 | 321 | 133 | 411 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7187 | 5454 | 3293 | 2732 | 2563 | 2413 | 1787 | 1627 | 1594 | 1546 | ... | 56 | 47 | 36 | 16 | 14 | 12 | 10 | 8 | 4 | 3 |
1 rows × 26 columns
Accessing counts for specific values
You can access the frequency of a particular value directly, similar to accessing a dictionary element:
# Retrieves the count for a specific value (string format)
sv_count = groupcounts['142']
print(f"\nNumber of features with code '142': {sv_count}")
Number of features with code '142': 2413
# To group by a float-formatted column
group_2 = gdf.groupby('Biotpkt2018').size()
print("\nFrequency of biodiversity indicator values (head):")
group_2[1.000000]
Frequency of biodiversity indicator values (head):
np.int64(52)
Note: Float values must match precisely, including decimal points.
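If an exact float match is too brittle, the count can also be retrieved with a small tolerance (a minimal sketch using numpy, which is installed alongside pandas; the variable approx_count is only illustrative):
import numpy as np
# Sum the group sizes whose index value is approximately 1.0
approx_count = group_2[np.isclose(group_2.index, 1.0)].sum()
print(f"Features with Biotpkt2018 close to 1.0: {approx_count}")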
Selecting data by attribute values#
You can filter your GeoDataFrame to select only the rows that meet specific conditions based on their attribute values.
Selecting rows with a single specific value
To select rows where a column has a particular value, use boolean indexing.
Selecting by string value:
filtered_data = gdf[gdf['CLC_st1'] == '133']
print("\nFirst 5 features with CLC_st1 == '133':")
filtered_data.head(5)
First 5 features with CLC_st1 == '133':
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | |
---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... |
Selecting by numeric value:
filtered_data = gdf[gdf['Biotpkt2018'] == 1.000000]
print("\nFirst 5 features with Biotpkt2018 == 1.0:")
filtered_data.head(5)
First 5 features with Biotpkt2018 == 1.0:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | |
---|---|---|---|---|---|
9492 | 111 | 1.0 | 411.855290 | 2480.458787 | MULTIPOLYGON (((406985.474 5655597.694, 406984... |
9504 | 111 | 1.0 | 89.433492 | 466.944726 | MULTIPOLYGON (((416782.224 5652196.728, 416782... |
9514 | 111 | 1.0 | 172.927119 | 1621.632535 | MULTIPOLYGON (((419300.05 5657313.72, 419299.9... |
9515 | 111 | 1.0 | 320.578260 | 3691.824728 | MULTIPOLYGON (((408302.234 5655785.262, 408296... |
9519 | 111 | 1.0 | 121.899927 | 920.751286 | MULTIPOLYGON (((414687.929 5655817.21, 414656.... |
Selecting rows with multiple specific values
To select rows where a column's value is one of several specific values, use the .isin() method:
filtered_data = gdf[gdf['CLC_st1'].isin(['133', '321', '411'])]
print("\nFirst 5 features with CLC_st1 in ['133', '321', '411']:")
filtered_data.head(5)
First 5 features with CLC_st1 in ['133', '321', '411']:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | |
---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... |
Examining dataset dimensions#
The .shape attribute returns a tuple representing the number of rows and columns in your GeoDataFrame:
print(f"\nShape of filtered_data_isin: {filtered_data.shape}")
Shape of filtered_data_isin: (15, 5)
Adding new attributes (columns)#
You can add new columns to your GeoDataFrame with calculated values or constant values.
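For instance, a calculated column can be derived from existing attributes (a minimal sketch; the column name area_ha is only illustrative and assumes Shape_Area is given in square metres):
# Work on a copy so the original gdf stays unchanged
gdf_ha = gdf.copy()
# Derive a new column from an existing one: area in hectares
gdf_ha['area_ha'] = gdf_ha['Shape_Area'] / 10_000
gdf_ha[['Shape_Area', 'area_ha']].head()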
In the following example, the filtered_data output from the previous .isin() selection is used as the dataset, and a new column named test is added.
First, we make sure that the original DataFrame is not modified by explicitly creating a copy.
# Ensure we are working on a copy of the data, not a view
filtered_data = filtered_data.copy()
Working on a view or a copy of the DataFrame?
In pandas, when working with a slice of a DataFrame, you may be working on a view rather than a copy of the data. A view is simply a reference to the original DataFrame, meaning changes made to the view can affect the original data. A copy, on the other hand, is a distinct, independent object, so modifications will not impact the original DataFrame. This distinction is important when assigning values to a slice, because pandas will often raise a SettingWithCopyWarning to indicate that the operation might be applied to a view, leading to potential unintended side effects.
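As an illustration (a minimal sketch; the column name demo is hypothetical and only used here):
# The result of slicing may be a view or a copy; pandas cannot always tell
subset_view = gdf[gdf['CLC_st1'] == '122']
# Assigning to it can therefore raise a SettingWithCopyWarning:
# subset_view['demo'] = 1
# An explicit copy is independent of gdf, so assignments are safe
subset_copy = gdf[gdf['CLC_st1'] == '122'].copy()
subset_copy['demo'] = 1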
Since the dataset contains 15 rows, 15 values are assigned using the loc method. (The values can be of different types: string, null, negative, float, integer, …)
filtered_data.loc[:, 'test'] = [
'Hello', None, -7, 4.5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
filtered_data.head()
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | |
---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | Hello |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | None |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | -7 |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | 4.5 |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 5 |
[:, 'column_name']
The : before the comma selects all the rows, and the part after the comma specifies the column(s) of interest.
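The same pattern also works with lists of columns or with single row labels (a minimal sketch on the filtered_data used above):
# All rows, but only two selected columns
filtered_data.loc[:, ['CLC_st1', 'test']].head()
# A single row selected by its index label, all columns
filtered_data.loc[11831]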
Adding new attributes with missing values
You can initially add a column with None values and then update specific rows:
filtered_data.loc[:,'test'] = None
To update specific rows by their index:
print(filtered_data.index)
Index([11831, 23859, 23860, 23861, 23862, 23863, 23864, 24343, 24344, 24347,
24444, 24445, 32673, 32674, 33881],
dtype='int64')
In the following example, the rows with index 11831, 23859, and 23862 are updated with their new values.
filtered_data.loc[[11831, 23859, 23862], 'test'] = \
[100, '200.5', 300]
filtered_data.head()
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | |
---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 |
Adding new attributes from another dataset
Adding new attributes from another dataset is typically done through merging operations, which will be explained in the Merging Data section.
Adding new features (rows)#
You can add new rows to your GeoDataFrame using loc by providing values for all columns.
Here the index label of the new row is defined, and a value must be assigned for every column (CLC_st1, Biotpkt2018, …).
The example below creates a row with the index sample; since the dataset includes 6 columns, 6 values are defined for them.
filtered_data.loc['sample'] = ['Hello', 800, -7, 4.5, None, 9]
filtered_data.head(16)
/tmp/ipykernel_2215/1904651694.py:1: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
filtered_data.loc['sample'] = ['Hello', 800, -7, 4.5, None, 9]
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | |
---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 |
23863 | 321 | 20.164493 | 1251.411517 | 28498.211647 | MULTIPOLYGON (((401726.203 5658676.434, 401724... | None |
23864 | 321 | 20.164493 | 1055.657824 | 15517.253917 | MULTIPOLYGON (((403428.145 5659215.144, 403416... | None |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... | None |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... | None |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... | None |
24444 | 411 | 17.079406 | 389.878192 | 3208.873688 | MULTIPOLYGON (((413977.74 5649257.41, 413976.8... | None |
24445 | 411 | 17.079406 | 786.310517 | 17091.353830 | MULTIPOLYGON (((401901.386 5658342.569, 401914... | None |
32673 | 321 | 20.164493 | 755.991947 | 15304.813349 | MULTIPOLYGON (((414609.258 5667182.283, 414600... | None |
32674 | 321 | 20.164493 | 806.236339 | 24159.996929 | MULTIPOLYGON (((414627.453 5667492.653, 414633... | None |
33881 | 411 | 17.079406 | 555.046133 | 10593.540640 | MULTIPOLYGON (((411687.325 5665387.9, 411686.9... | None |
sample | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 |
Adding new features from another dataset
Adding new features from another dataset is typically done through concatenation operations, which will be explained in the Merging Data section.
Copying data#
You can create copies of columns or rows.
Copying a column
filtered_data.loc[:, 'test_2'] = filtered_data.loc[:, 'test']
print("\nGeoDataFrame with a copied 'test_2' column:")
filtered_data.head(17)
GeoDataFrame with a copied 'test_2' column:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | test_2 | |
---|---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 | 100 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 | 200.5 |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None | None |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None | None |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 | 300 |
23863 | 321 | 20.164493 | 1251.411517 | 28498.211647 | MULTIPOLYGON (((401726.203 5658676.434, 401724... | None | None |
23864 | 321 | 20.164493 | 1055.657824 | 15517.253917 | MULTIPOLYGON (((403428.145 5659215.144, 403416... | None | None |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... | None | None |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... | None | None |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... | None | None |
24444 | 411 | 17.079406 | 389.878192 | 3208.873688 | MULTIPOLYGON (((413977.74 5649257.41, 413976.8... | None | None |
24445 | 411 | 17.079406 | 786.310517 | 17091.353830 | MULTIPOLYGON (((401901.386 5658342.569, 401914... | None | None |
32673 | 321 | 20.164493 | 755.991947 | 15304.813349 | MULTIPOLYGON (((414609.258 5667182.283, 414600... | None | None |
32674 | 321 | 20.164493 | 806.236339 | 24159.996929 | MULTIPOLYGON (((414627.453 5667492.653, 414633... | None | None |
33881 | 411 | 17.079406 | 555.046133 | 10593.540640 | MULTIPOLYGON (((411687.325 5665387.9, 411686.9... | None | None |
sample | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
Copying a row
filtered_data.loc['sample_2'] = filtered_data.loc['sample']
print("\nGeoDataFrame with a copied 'sample_2' row:")
filtered_data.head(18)
GeoDataFrame with a copied 'sample_2' row:
/tmp/ipykernel_2215/922061175.py:1: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
filtered_data.loc['sample_2'] = filtered_data.loc['sample']
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | test_2 | |
---|---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 | 100 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 | 200.5 |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None | None |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None | None |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 | 300 |
23863 | 321 | 20.164493 | 1251.411517 | 28498.211647 | MULTIPOLYGON (((401726.203 5658676.434, 401724... | None | None |
23864 | 321 | 20.164493 | 1055.657824 | 15517.253917 | MULTIPOLYGON (((403428.145 5659215.144, 403416... | None | None |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... | None | None |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... | None | None |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... | None | None |
24444 | 411 | 17.079406 | 389.878192 | 3208.873688 | MULTIPOLYGON (((413977.74 5649257.41, 413976.8... | None | None |
24445 | 411 | 17.079406 | 786.310517 | 17091.353830 | MULTIPOLYGON (((401901.386 5658342.569, 401914... | None | None |
32673 | 321 | 20.164493 | 755.991947 | 15304.813349 | MULTIPOLYGON (((414609.258 5667182.283, 414600... | None | None |
32674 | 321 | 20.164493 | 806.236339 | 24159.996929 | MULTIPOLYGON (((414627.453 5667492.653, 414633... | None | None |
33881 | 411 | 17.079406 | 555.046133 | 10593.540640 | MULTIPOLYGON (((411687.325 5665387.9, 411686.9... | None | None |
sample | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
sample_2 | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
Finding and handling null values#
Missing or null values are common in datasets. It’s important to identify and handle them appropriately.
Identifying columns with null values
The .isnull() method returns a boolean DataFrame indicating the presence of null values. Combining it with .any() checks if any null values exist in each column:
print("\nColumns with null values:")
filtered_data.isnull().any()
Columns with null values:
CLC_st1 False
Biotpkt2018 False
Shape_Length False
Shape_Area False
geometry True
test True
test_2 True
dtype: bool
Identifying indices of null values in a column
print("\nIndices with null values in 'test_2' column:")
filtered_data[filtered_data['test'].isnull()].index
Indices with null values in 'test_2' column:
Index([23860, 23861, 23863, 23864, 24343, 24344, 24347, 24444, 24445, 32673,
32674, 33881],
dtype='object')
Selecting rows with any null values
To select all rows that contain at least one null value:
print("\nRows with at least one null value:")
filtered_data[filtered_data.isnull().any(axis=1)]
Rows with at least one null value:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | test_2 | |
---|---|---|---|---|---|---|---|
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None | None |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None | None |
23863 | 321 | 20.164493 | 1251.411517 | 28498.211647 | MULTIPOLYGON (((401726.203 5658676.434, 401724... | None | None |
23864 | 321 | 20.164493 | 1055.657824 | 15517.253917 | MULTIPOLYGON (((403428.145 5659215.144, 403416... | None | None |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... | None | None |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... | None | None |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... | None | None |
24444 | 411 | 17.079406 | 389.878192 | 3208.873688 | MULTIPOLYGON (((413977.74 5649257.41, 413976.8... | None | None |
24445 | 411 | 17.079406 | 786.310517 | 17091.353830 | MULTIPOLYGON (((401901.386 5658342.569, 401914... | None | None |
32673 | 321 | 20.164493 | 755.991947 | 15304.813349 | MULTIPOLYGON (((414609.258 5667182.283, 414600... | None | None |
32674 | 321 | 20.164493 | 806.236339 | 24159.996929 | MULTIPOLYGON (((414627.453 5667492.653, 414633... | None | None |
33881 | 411 | 17.079406 | 555.046133 | 10593.540640 | MULTIPOLYGON (((411687.325 5665387.9, 411686.9... | None | None |
sample | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
sample_2 | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
Removing null values
The .dropna() method removes rows or columns containing null values.
Removing rows with any null value:
print("\nGeoDataFrame after dropping rows with any null value:")
filtered_data.dropna()
GeoDataFrame after dropping rows with any null value:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | test_2 | |
---|---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 | 100 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 | 200.5 |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 | 300 |
Removing rows with null values in specific columns
print("\nGeoDataFrame after dropping rows with null in 'test' column:")
filtered_data.dropna(subset=['test'])
GeoDataFrame after dropping rows with null in 'test' column:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | test_2 | |
---|---|---|---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 | 100 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 | 200.5 |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 | 300 |
sample | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
sample_2 | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
Removing columns with any null value:
print("\nGeoDataFrame after dropping columns with any null value:")
filtered_data.dropna(axis=1)
GeoDataFrame after dropping columns with any null value:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | |
---|---|---|---|---|
11831 | 133 | 6.035751 | 17.828953 | 0.432276 |
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 |
23863 | 321 | 20.164493 | 1251.411517 | 28498.211647 |
23864 | 321 | 20.164493 | 1055.657824 | 15517.253917 |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 |
24444 | 411 | 17.079406 | 389.878192 | 3208.873688 |
24445 | 411 | 17.079406 | 786.310517 | 17091.353830 |
32673 | 321 | 20.164493 | 755.991947 | 15304.813349 |
32674 | 321 | 20.164493 | 806.236339 | 24159.996929 |
33881 | 411 | 17.079406 | 555.046133 | 10593.540640 |
sample | Hello | 800.000000 | -7.000000 | 4.500000 |
sample_2 | Hello | 800.000000 | -7.000000 | 4.500000 |
Removing data#
You can remove unwanted columns or rows from your GeoDataFrame.
Removing attributes (columns)
The .drop() method removes specified columns (axis=1):
print("\nGeoDataFrame after dropping columns 'test_2' and 'CLC_st1':")
filtered_data.drop(['test_2','CLC_st1'], axis=1)
GeoDataFrame after dropping columns 'test_2' and 'CLC_st1':
Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | |
---|---|---|---|---|---|
11831 | 6.035751 | 17.828953 | 0.432276 | MULTIPOLYGON (((411237.019 5656164.663, 411228... | 100 |
23859 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 |
23860 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None |
23861 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None |
23862 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 |
23863 | 20.164493 | 1251.411517 | 28498.211647 | MULTIPOLYGON (((401726.203 5658676.434, 401724... | None |
23864 | 20.164493 | 1055.657824 | 15517.253917 | MULTIPOLYGON (((403428.145 5659215.144, 403416... | None |
24343 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... | None |
24344 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... | None |
24347 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... | None |
24444 | 17.079406 | 389.878192 | 3208.873688 | MULTIPOLYGON (((413977.74 5649257.41, 413976.8... | None |
24445 | 17.079406 | 786.310517 | 17091.353830 | MULTIPOLYGON (((401901.386 5658342.569, 401914... | None |
32673 | 20.164493 | 755.991947 | 15304.813349 | MULTIPOLYGON (((414609.258 5667182.283, 414600... | None |
32674 | 20.164493 | 806.236339 | 24159.996929 | MULTIPOLYGON (((414627.453 5667492.653, 414633... | None |
33881 | 17.079406 | 555.046133 | 10593.540640 | MULTIPOLYGON (((411687.325 5665387.9, 411686.9... | None |
sample | 800.000000 | -7.000000 | 4.500000 | None | 9 |
sample_2 | 800.000000 | -7.000000 | 4.500000 | None | 9 |
Removing features (rows)#
The .drop() method also removes specified rows by their index:
print("\nGeoDataFrame after dropping the first row and 'sample' row:")
filtered_data.drop([11831,'sample'])
GeoDataFrame after dropping the first row and 'sample' row:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | test | test_2 | |
---|---|---|---|---|---|---|---|
23859 | 321 | 20.164493 | 618.702965 | 18317.038350 | MULTIPOLYGON (((415627.211 5652243.292, 415579... | 200.5 | 200.5 |
23860 | 321 | 20.164493 | 1099.972947 | 10894.472486 | MULTIPOLYGON (((418611.74 5653542.485, 418618.... | None | None |
23861 | 321 | 20.164493 | 1023.775893 | 22918.685826 | MULTIPOLYGON (((418506.559 5653953.337, 418516... | None | None |
23862 | 321 | 20.164493 | 1126.678849 | 17980.357862 | MULTIPOLYGON (((418145.759 5654361.049, 418151... | 300 | 300 |
23863 | 321 | 20.164493 | 1251.411517 | 28498.211647 | MULTIPOLYGON (((401726.203 5658676.434, 401724... | None | None |
23864 | 321 | 20.164493 | 1055.657824 | 15517.253917 | MULTIPOLYGON (((403428.145 5659215.144, 403416... | None | None |
24343 | 133 | 6.035751 | 287.703701 | 4278.338534 | MULTIPOLYGON (((411656.913 5656302.92, 411664.... | None | None |
24344 | 133 | 6.035751 | 183.544225 | 2066.507754 | MULTIPOLYGON (((411567.287 5656337.438, 411562... | None | None |
24347 | 133 | 6.035751 | 213.015432 | 2684.440542 | MULTIPOLYGON (((411587.489 5655959.618, 411588... | None | None |
24444 | 411 | 17.079406 | 389.878192 | 3208.873688 | MULTIPOLYGON (((413977.74 5649257.41, 413976.8... | None | None |
24445 | 411 | 17.079406 | 786.310517 | 17091.353830 | MULTIPOLYGON (((401901.386 5658342.569, 401914... | None | None |
32673 | 321 | 20.164493 | 755.991947 | 15304.813349 | MULTIPOLYGON (((414609.258 5667182.283, 414600... | None | None |
32674 | 321 | 20.164493 | 806.236339 | 24159.996929 | MULTIPOLYGON (((414627.453 5667492.653, 414633... | None | None |
33881 | 411 | 17.079406 | 555.046133 | 10593.540640 | MULTIPOLYGON (((411687.325 5665387.9, 411686.9... | None | None |
sample_2 | Hello | 800.000000 | -7.000000 | 4.500000 | None | 9 | 9 |
Filtering data based on ranges#
For numeric attributes, you can filter data based on specific ranges.
To find the minimum and maximum values of the features, the following code is used:
min: minimum value in the specified attribute
max: maximum value in the specified attribute
min_value = gdf['Shape_Area'].min()
max_value = gdf['Shape_Area'].max()
print(f"minimum: {min_value:0.8f}") # 0.8f round the number to 8 decimals
print(f"maximum: {max_value:0.2f}") # 0.2f round the number to 2 decimals
minimum: 0.00000562
maximum: 3249895.12
For example, you might decide to work only with the features having an area of less than 1000.
# Filter features with Shape_Area less than 1000
filter_db = gdf[gdf['Shape_Area'] < 1000]
print("\nFirst 5 features with Shape_Area less than 1000:")
filter_db.head()
First 5 features with Shape_Area less than 1000:
CLC_st1 | Biotpkt2018 | Shape_Length | Shape_Area | geometry | |
---|---|---|---|---|---|
1 | 122 | 5.271487 | 31.935928 | 50.075513 | MULTIPOLYGON (((417850.525 5650376.33, 417846.... |
3 | 122 | 5.271487 | 24.509066 | 36.443441 | MULTIPOLYGON (((423453.146 5650332.06, 423453.... |
4 | 122 | 5.271487 | 29.937138 | 40.494155 | MULTIPOLYGON (((417331.434 5650889.039, 417330... |
6 | 122 | 5.271487 | 390.293053 | 764.349833 | MULTIPOLYGON (((407613.154 5651943.2, 407618.7... |
7 | 122 | 5.271487 | 34.931619 | 45.486260 | MULTIPOLYGON (((416066.508 5651463.834, 416066... |
The visualization of this filtered data will be covered in the Creating Maps section.
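Conditions can also be combined to keep features within a two-sided range, either with the Series .between() method or with the & operator (a minimal sketch; the bounds 100 and 1000 are arbitrary):
# Features with Shape_Area between 100 and 1000 (inclusive)
range_db = gdf[gdf['Shape_Area'].between(100, 1000)]
# Equivalent boolean expression; each condition needs its own parentheses
range_db = gdf[(gdf['Shape_Area'] >= 100) & (gdf['Shape_Area'] <= 1000)]
print(f"Number of features in range: {len(range_db)}")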