GO-SDPF

Online Training

Duration: 3 days
Price: on request

Classroom Training

Duration: 3 days
Price: on request

Google

Serverless Data Processing with Dataflow (SDPF) – Outline

Detailed Course Outline

Module 1: Introduction

Introduce the course objectives.
Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.

Module 2: Beam Portability

Summarize the benefits of the Beam Portability Framework.
Customize the data processing environment of your pipeline using custom containers.
Review use cases for cross-language transformations.
Enable the Portability framework for your Dataflow pipelines.

Module 3: Separating Compute and Storage with Dataflow

Enable Dataflow Shuffle for batch pipelines and Streaming Engine for streaming pipelines for maximum performance.
Enable Flexible Resource Scheduling for more cost-efficient performance.

Module 4: IAM, Quotas, and Permissions

Select the right combination of IAM permissions for your Dataflow job.
Determine your capacity needs by inspecting the relevant quotas for your Dataflow jobs.

Module 5: Security

Select your zonal data processing strategy using Dataflow, depending on your data locality needs.
Implement best practices for a secure data processing environment.

Module 6: Beam Concepts Review

Review the main Apache Beam concepts (Pipeline, PCollections, PTransforms, Runner, reading/writing, utility PTransforms, side inputs), bundles, and the DoFn lifecycle.

Module 7: Windows, Watermarks, Triggers

Implement logic to handle your late data.
Review different types of triggers.
Review core streaming concepts (unbounded PCollections, windows).

Module 8: Sources and Sinks

Write the I/O of your choice for your Dataflow pipeline.
Tune your source/sink transformation for maximum performance.
Create custom sources and sinks using Splittable DoFn (SDF).

Module 9: Schemas

Introduce schemas, which give developers a way to express structured data in their Beam pipelines.
Use schemas to simplify your Beam code and improve the performance of your pipeline.

Module 10: State and Timers

Identify use cases for state and timer API implementations.
Select the right type of state and timers for your pipeline.

Module 11: Best Practices

Implement best practices for Dataflow pipelines.

Module 12: Dataflow SQL and DataFrames

Develop a Beam pipeline using SQL and DataFrames.

Module 13: Beam Notebooks

Prototype your pipeline in Python using Beam notebooks.
Use Beam magics to control the behavior of source recording in your notebook.
Launch a job to Dataflow from a notebook.

Module 14: Monitoring

Navigate the Dataflow Job Details UI.
Interpret Job Metrics charts to diagnose pipeline regressions.
Set alerts on Dataflow jobs using Cloud Monitoring.

Module 15: Logging and Error Reporting

Use the Dataflow logs and diagnostics widgets to troubleshoot pipeline issues.

Module 16: Troubleshooting and Debugging

Use a structured approach to debug your Dataflow pipelines.
Examine common causes for pipeline failures.

Module 17: Performance

Understand performance considerations for pipelines.
Consider how the shape of your data can affect pipeline performance.

Module 18: Testing and CI/CD

Review testing approaches for your Dataflow pipeline.
Review frameworks and features available to streamline your CI/CD workflow for Dataflow pipelines.

Module 19: Reliability

Implement reliability best practices for your Dataflow pipelines.

Module 20: Flex Templates

Use Flex Templates to standardize and reuse Dataflow pipeline code.

Module 21: Summary

Review the key topics covered in the course.