Exploration of geosocial patterns (Germany)#

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development, Transformative Capacities & Research Data Centre (IÖR-FDZ)

Last updated: Aug-08-2024, Carto-Lab Docker Version 0.18.0

Summary

In this section we intend to share additional notebooks on various topics in spatial visualisation, prepared by colleagues for specific projects or publications.

This sample notebook on explorative cartographic visualisation of geo-social media data (Flickr, Instagram, Twitter, iNaturalist) shows a work in progress. It is shared as-is as an example of how to visualise big data in a privacy-friendly way, following the workflow described in Dunkel et al. (2020) [1]. We do not provide a dependency setup, and the underlying data has not been shared, so you cannot replicate this notebook yet. However, you can copy the methods and procedures to visualise your own data.

In this project, I explore basic geosocial media patterns for Germany.

Prepare environment#

To run this notebook, you have two options:

1. Create an environment with the packages and versions shown in the following cell.

As a starting point, you may use the latest conda environment_default.yml from our Carto-Lab Docker container.

2. If Docker is available to you, we suggest using the Carto-Lab Docker container.

Clone the repository and edit your .env file to point to the directory where this notebook can be found, e.g.:

git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/geosocial_patterns_de
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
import sys
from pathlib import Path

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    # make the helper modules in ../py importable
    sys.path.append(module_path)
from modules.base import tools

root_packages = [
    'python', 'geopandas', 'pandas', 'matplotlib', 'dask', 'datashader']
tools.package_report(root_packages)
List of package versions used in this notebook

package  python  dask      datashader  geopandas  matplotlib  pandas
version  3.12.4  2024.7.1  0.16.3      1.0.1      3.9.1       2.2.2

Load dependencies:

import os
from pathlib import Path
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML

Activate autoreload of changed python files:

%load_ext autoreload
%autoreload 2

Parameters#

Define initial parameters that affect processing.

OUTPUT = Path.cwd().parents[0] / "out"  # output directory for generated files
OUTPUT.mkdir(exist_ok=True)
(Path.cwd().parents[0] / "notebooks").mkdir(exist_ok=True)
(Path.cwd().parents[0] / "py").mkdir(exist_ok=True)
CHUNK_SIZE = 5_000_000  # number of CSV rows to process per chunk
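
As a rough guide for choosing CHUNK_SIZE: with two float64 coordinate columns, each chunk of 5,000,000 rows occupies on the order of 80 MB in memory. A back-of-envelope estimate, ignoring pandas overhead:

# Rough memory footprint per chunk: rows × columns × 8 bytes (float64)
n_cols = 2           # latitude, longitude
bytes_per_float = 8  # float64
chunk_mb = CHUNK_SIZE * n_cols * bytes_per_float / (1024 * 1024)
print(f"~{chunk_mb:,.0f} MB per chunk")
# ~76 MB per chunk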

Load and visualise data#

import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask.diagnostics as diag
import datashader.transfer_functions as tf
import datashader as ds
from datashader.utils import lnglat_to_meters
from IPython.display import clear_output
from pathlib import Path
filename = Path.cwd().parents[0] / 'data' / '2024-07-31_DE_All_exportAllLatLng.csv'
dtypes = {'latitude': float, 'longitude': float}
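
Before running the full conversion, it can help to peek at the first rows of the export to confirm the expected column layout. A sketch only, since the dataset is not shared; we merely assume latitude and longitude columns exist:

# Preview a few rows to verify the coordinate columns
pd.read_csv(filename, nrows=5, dtype=dtypes, encoding='utf-8')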
%%time
parquet_output = OUTPUT / "twitter_proj.snappy.parq"
if not parquet_output.exists():
    iter_csv = pd.read_csv(
        filename, iterator=True,
        dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)

    cnt = 0

    for ix, chunk in enumerate(iter_csv):
        # read
        cnt += len(chunk)
        clear_output(wait=True)
        print(f"Processed {cnt:,.0f} coordinates..")
        # project WGS84 lat/lng to Web Mercator (EPSG:3857) meters
        web_mercator_x, web_mercator_y = lnglat_to_meters(
            chunk['longitude'], chunk['latitude'])
        projected_coordinates = pd.concat(
            [web_mercator_x, web_mercator_y], axis=1)
        transformed = projected_coordinates.rename(
            columns={'longitude': 'x', 'latitude': 'y'})
        # store: create the parquet dataset with the first chunk, append afterwards
        dd.to_parquet(
            dd.from_pandas(transformed, npartitions=1),
            parquet_output, append=ix > 0, compression="SNAPPY")
Processed 70,000,000 coordinates..
CPU times: user 20.4 s, sys: 2.96 s, total: 23.4 s
Wall time: 23.4 s
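As a consistency check, the row count of the written Parquet dataset can be compared with the processed-coordinate counter above. A small sketch using the same dask API:

# Count rows across all parquet partitions without loading data into memory
n_rows = dd.read_parquet(parquet_output).shape[0].compute()
print(f"{n_rows:,} rows written to {parquet_output.name}")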
datasize = sum(
    f.stat().st_size for f in parquet_output.glob('**/*') if f.is_file()) / (1024**3)
print(f"Size: {datasize:,.1f} GB")
Size: 0.5 GB
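The 0.5 GB on disk is plausible: 70 million coordinate pairs stored as plain float64 would take about 1.0 GB uncompressed, so the SNAPPY-compressed Parquet achieves roughly a 2x reduction here. A back-of-envelope estimate:

# Uncompressed estimate: rows × 2 columns × 8 bytes (float64)
raw_gb = 70_000_000 * 2 * 8 / (1024**3)
print(f"Uncompressed: ~{raw_gb:.1f} GB, on disk: {datasize:.1f} GB")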
df = dd.read_parquet(parquet_output)
if datasize < 8:
    # keep the full dataset in memory if it fits comfortably
    df = df.persist()
df.columns
Index(['x', 'y'], dtype='object')
df.head()
              x             y
0  1.483521e+06  6.891437e+06
1  7.615500e+05  6.609943e+06
2  1.491849e+06  6.896486e+06
3  9.711956e+05  6.063937e+06
4  7.902371e+05  6.540994e+06
def bounds(x_range, y_range):
    """Convert lng/lat ranges to Web Mercator x/y ranges for ds.Canvas."""
    x, y = lnglat_to_meters(x_range, y_range)
    return dict(x_range=x, y_range=y)

Earth       = ((-180.00, 180.00), (-59.00, 74.00))
France      = (( -12.00,  16.00), ( 41.26, 51.27))
Berlin      = (( 12.843018,  14.149704), ( 52.274880, 52.684292))
Dresden     = (( 13.415680,  14.703827), ( 50.740090, 51.194905))
USA         = (( -126,  -64), ( 24.92, 49.35))
Paris       = ((   2.05,   2.65), ( 48.76, 48.97))
DE          = ((   4.605469,   15.372070), ( 46.697243, 55.065885))
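
The lnglat_to_meters() helper applies the standard spherical Web Mercator (EPSG:3857) forward projection. For reference, a minimal sketch of the same formula, equivalent to the datashader implementation:

import numpy as np

def lnglat_to_meters_sketch(longitude, latitude):
    """Spherical Web Mercator (EPSG:3857) forward projection."""
    origin_shift = np.pi * 6378137  # half the Earth's circumference in meters
    easting = longitude * origin_shift / 180.0
    northing = np.log(
        np.tan((90 + latitude) * np.pi / 360.0)) * origin_shift / np.pi
    return easting, northing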
plot_width = 1000   # canvas resolution for a first global aggregation
plot_height = 600   # chosen here; adjust as needed
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds(*Earth))
with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:
    agg = cvs.points(df, x='x', y='y')
[########################################] | 100% Completed | 102.74 ms
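cvs.points() returns an xarray.DataArray of per-pixel point counts, which can be inspected before shading. A quick sketch:

# The aggregate is an xarray.DataArray with dims ('y', 'x')
print(agg.dims)
print(int(agg.max()))        # point count in the densest pixel
print(int((agg > 0).sum()))  # number of non-empty pixels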
def plot(x_range, y_range, plot_width: Optional[int] = None):
    """Plot df using tf.shade().

    Calculates the plot height from the lng/lat extent ratio,
    with a fixed 1.5 factor approximating the Web Mercator distortion.
    """
    if plot_width is None:
        plot_width = 1000
    lng_width = x_range[1] - x_range[0]
    lat_height = y_range[1] - y_range[0]
    plot_height = int(((plot_width * lat_height) / lng_width) * 1.5)
    cvs = ds.Canvas(
        plot_width=plot_width, plot_height=plot_height, **bounds(x_range, y_range))
    with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:
        agg = cvs.points(df, x='x', y='y')
    return tf.shade(agg, cmap=["lightblue", "darkblue"])

def save_image(img, output_name, return_img: bool = True):
    """Saves the shaded image as PNG with a white background."""
    ds.utils.export_image(
        img=img, filename=str(OUTPUT / output_name), fmt=".png", background='white')
    if return_img:
        return img
%time save_image(plot(*DE), output_name='DE_map')
[########################################] | 100% Completed | 204.29 ms
CPU times: user 703 ms, sys: 397 ms, total: 1.1 s
Wall time: 740 ms
%time save_image(plot(*Berlin), output_name='Berlin_map')
[########################################] | 100% Completed | 103.02 ms
CPU times: user 350 ms, sys: 190 ms, total: 541 ms
Wall time: 610 ms
%time save_image(plot(*Dresden), output_name='Dresden_map')
[########################################] | 100% Completed | 102.91 ms
CPU times: user 301 ms, sys: 220 ms, total: 522 ms
Wall time: 592 ms
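
The fixed lightblue-to-darkblue colormap is one of many options. As a variant (a sketch, not part of the original workflow), log scaling with colorcet's fire colormap plus dynamic spreading can improve legibility in sparse areas; colorcet ships as a datashader dependency:

import colorcet as cc

def plot_fire(x_range, y_range, plot_width: int = 1000):
    """Variant of plot() with log scaling and dynamic point spreading."""
    lng_width = x_range[1] - x_range[0]
    lat_height = y_range[1] - y_range[0]
    plot_height = int(((plot_width * lat_height) / lng_width) * 1.5)
    cvs = ds.Canvas(
        plot_width=plot_width, plot_height=plot_height, **bounds(x_range, y_range))
    agg = cvs.points(df, x='x', y='y')
    img = tf.shade(agg, cmap=cc.fire, how='log')  # log instead of default eq_hist
    return tf.dynspread(img, threshold=0.5, max_px=2)  # thicken isolated points

save_image(plot_fire(*DE), output_name='DE_map_fire')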

References#

[1] Alexander Dunkel, Marc Löchner, and Dirk Burghardt. Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation. ISPRS International Journal of Geo-Information, 9(10):607, October 2020. URL: https://www.mdpi.com/2220-9964/9/10/607 (visited on 2023-03-28), doi:10.3390/ijgi9100607.