Iguazio Platform Getting Started Guide and Tutorials

Platform Overview

Iguazio provides a fully integrated and secure data science PaaS that simplifies development, accelerates performance,
enables collaboration, and addresses operational challenges. The platform incorporates the following components:

• Data science workbench (Jupyter with integrated analytics engines & Python packages)
• Real-time dashboards based on Grafana
• Managed data and ML services over a scalable Kubernetes cluster
• Real-time serverless functions framework (Nuclio)
• Extremely fast and secure data layer supporting SQL, NoSQL, time series, files/objects, and streaming
• Integration with third-party data sources (S3, HDFS, SQL databases, streaming/messaging protocols, etc.)

We use Kubernetes as the baseline cluster manager and deploy various microservices on top of it to address the different data science tasks.
Most services support scaling out and GPU acceleration, and have secure, low-latency access to the Iguazio shared database and file system,
enabling high performance and scalability at maximum resource efficiency.

The platform makes extensive use of Nuclio serverless functions to automate various tasks, including data collection, ETL, custom APIs, model serving, and batch jobs.
A function describes the code together with all the resource definitions and configuration needed to run it. Functions auto-scale and can be versioned.
Functions can be generated automatically in various ways (UI, Docker, Git, and Jupyter), as demonstrated in the various tutorials.

For more details:

Data science workflow on the Iguazio platform

Iguazio enables a complete data science workflow in a single ready-to-use platform:

• Collect, explore and label data coming from various real-time or offline sources
• Run ML training and validation at scale over multiple CPUs and GPUs
• Deploy models and applications into production with serverless functions
• Log, monitor, and version all your data and services

Iguazio provides all the building blocks for creating data science applications from research to production.

Data collection and ingestion

For details, see the data collection and exploration tutorial.

There are many ways to collect or ingest data into the system from various sources:

• Real-time streaming (e.g., using Kafka, Kinesis, Azure Event Hubs, or Google Pub/Sub)
• Loading data directly from external databases in an event-driven or periodic/scheduled way
• Loading data from internal or external file/object sources such as S3 or Hadoop (using CSV, Parquet, or JSON formats)
• Importing time-series telemetry data using a Prometheus-compatible scraping API
• Pushing/ingesting data directly into the system via AWS-like object, streaming, and NoSQL REST APIs
• Implementing custom Nuclio functions that scrape data from external sources or read from external APIs (e.g., Twitter, weather services, stock-trading data)
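As a rough illustration of the file-based and API-based paths above, the sketch below normalizes CSV and JSON payloads into one list of records. It uses only the Python standard library; the sample payloads, field names (`device`, `ts`, `cpu`), and helper functions are hypothetical and do not represent a platform API.

```python
import csv
import io
import json

# Hypothetical sample payloads standing in for a file/object source and an API response.
CSV_TEXT = "device,ts,cpu\nsrv-1,2019-01-01T00:00:00,0.42\nsrv-2,2019-01-01T00:00:00,0.87\n"
JSON_TEXT = '[{"device": "srv-3", "ts": "2019-01-01T00:00:00", "cpu": 0.15}]'


def parse_csv(text):
    """Parse CSV text into a list of dicts, converting the cpu column to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["cpu"] = float(row["cpu"])
        rows.append(row)
    return rows


def parse_json(text):
    """Parse a JSON array of objects into a list of dicts with a float cpu field."""
    return [dict(item, cpu=float(item["cpu"])) for item in json.loads(text)]


def ingest(sources):
    """Merge records from several (parser, payload) pairs into one list."""
    records = []
    for parser, payload in sources:
        records.extend(parser(payload))
    return records


records = ingest([(parse_csv, CSV_TEXT), (parse_json, JSON_TEXT)])
print(len(records), "records ingested")
```

In practice the payloads would come from one of the sources listed above, and the resulting records would be written to a platform table or stream rather than kept in memory.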

Data Exploration and Processing

Iguazio provides a wide range of pre-integrated data query and exploration tools. The most common ones are:

• Apache Spark SQL, ML, R, Graph (with real-time access to the Iguazio DB and file system)
• Interactive SQL queries (using the Presto distributed processing engine over the Iguazio DB or file/object data sources)
• Python pandas DataFrames (or Dask for distributed, pandas-like processing)
• Frames: Iguazio's open-source, high-speed data access library, which provides a unified interface for NoSQL tables, time-series tables, and streaming data, with native integration with pandas and NVIDIA RAPIDS
• Built-in ML packages: scikit-learn, Pyplot, NumPy, PyTorch, and TensorFlow

All the tools are integrated with Jupyter Notebook, allowing access to the same data through multiple tools and APIs with minimal configuration overhead.
The Python environment comes with conda pre-installed, and users can install additional packages using pip or conda.

Note: To deploy add-on services such as Spark, use the platform services tab.
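To give a feel for interactive SQL exploration from a notebook, here is a minimal sketch that runs an aggregate query. It assumes an in-memory SQLite database (Python's built-in `sqlite3`) as a local stand-in; it does not connect to Presto or to the Iguazio DB, and the `metrics` table and its columns are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (device TEXT, cpu REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("srv-1", 0.5), ("srv-1", 0.25), ("srv-2", 1.0)],
)

# Aggregate per device, the kind of query you might run interactively.
query = """
    SELECT device, COUNT(*) AS samples, AVG(cpu) AS avg_cpu
    FROM metrics
    GROUP BY device
    ORDER BY device
"""
results = conn.execute(query).fetchall()
for device, samples, avg_cpu in results:
    print(device, samples, avg_cpu)
conn.close()
```

The same query shape should carry over to any of the SQL-capable tools listed above, although connection setup and table naming will differ.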

Building and training models

Models can be developed and tested in Jupyter notebooks or using external editors.
Once you have built a model, you can train it inside Jupyter or use scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes Jobs.
You can see examples of model training in the predictive infrastructure monitoring tutorial (using scikit-learn) or in the image recognition tutorial (using TensorFlow and Keras).
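The train-then-validate loop can be sketched without any ML library. The example below is only an illustration: it fits an ordinary least squares line in pure Python on synthetic data and scores it on a 20% holdout set. It is not taken from the tutorials, which use scikit-learn and TensorFlow; the data, function names, and split ratio are assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and targets."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y = 3x + 2 plus small Gaussian noise (generated for this sketch only).
rng = random.Random(0)
xs = [float(i) for i in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]

# Shuffle the indices and hold out 20% of the samples for validation.
indices = list(range(len(xs)))
rng.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, valid_idx = indices[:split], indices[split:]

model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
valid_mse = mean_squared_error(
    model, [xs[i] for i in valid_idx], [ys[i] for i in valid_idx]
)
print("slope=%.2f intercept=%.2f validation MSE=%.3f" % (model[0], model[1], valid_mse))
```

Moving the same pattern to cluster resources mostly changes where `fit_line` runs and where the data is read from; the split-fit-score structure stays the same.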

If you are a beginner, the guide Machine Learning Algorithms In Layman’s Terms can come in handy.

Models deployment to Production

With the Iguazio platform, users can easily deploy their models to production in a reproducible way using the open-source Nuclio serverless framework.
Nuclio takes code (or notebooks) coupled with resource definitions (CPU, memory, GPU, etc.), environment variables, package or software dependencies, data links, and trigger information. Nuclio automatically builds the code, generates custom container images, and wires them up to the relevant compute or data resources.
Nuclio functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs. Read more details about Nuclio.
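A Nuclio Python function is built around a handler that receives a `context` and an `event`. The sketch below shows that shape with a placeholder "model" (the mean of the inputs). The `features` payload field, the scoring logic, and the `fake_context`/`fake_event` objects are assumptions added so the example runs outside the platform; in a deployed function the Nuclio runtime supplies the real objects.

```python
import json
from types import SimpleNamespace


def handler(context, event):
    """Nuclio-style entry point: reads a JSON body and returns a score."""
    body = event.body
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    if isinstance(body, str):
        body = json.loads(body)
    features = body["features"]
    context.logger.info("scoring %d features" % len(features))
    # Placeholder "model": the mean of the inputs (hypothetical, for illustration).
    score = sum(features) / len(features)
    return context.Response(
        body=json.dumps({"score": score}),
        content_type="application/json",
        status_code=200,
    )


# Local stand-ins for the context and event objects that the Nuclio runtime
# would normally supply; they exist only so the sketch runs outside the platform.
def make_response(body=None, headers=None, content_type="text/plain", status_code=200):
    return SimpleNamespace(
        body=body, headers=headers, content_type=content_type, status_code=status_code
    )


fake_context = SimpleNamespace(
    logger=SimpleNamespace(info=lambda message: print("[info]", message)),
    Response=make_response,
)
fake_event = SimpleNamespace(
    body=json.dumps({"features": [1.0, 2.0, 3.0]}).encode("utf-8")
)

response = handler(fake_context, fake_event)
print(response.status_code, response.body)
```

Keeping the handler free of platform-specific setup makes it easy to exercise locally like this before it is built and deployed.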

Nuclio functions can be created in the platform UI or in standard code IDEs, and then deployed on the cluster.
One of the most convenient ways to develop and deploy functions is to use Jupyter and the Python tools.

Here is an overview of Nuclio and how to deploy your Python code from Jupyter as a serverless function:
https://github.com/nuclio/nuclio-jupyter/blob/master/README.md#installing
Many of the tutorials demonstrate how functions can be documented and deployed directly from a notebook, e.g., deploying the network operation model as a function.

Note that Nuclio functions are not limited to model serving; they can also automate data collection, serve custom APIs, build real-time feature vectors, drive triggers, etc.

Visualization, monitoring and logging

Collected data, internal/external telemetry and logs, output data, etc. can be visualized in different ways simultaneously.
The Iguazio platform supports multiple standard APIs (SQL, Prometheus, Grafana, pandas, etc.) that can be used to visualize data.
This includes plotting or charting data within Jupyter using matplotlib, using external BI tools such as Tableau via the SQL/JDBC APIs, or building real-time dashboards in Grafana.

The different tools and services generate telemetry and log data, which can be stored in the Iguazio time-series database or in external tools such as Elasticsearch.
Users can easily instrument their code and functions to collect various statistics or logs, and the same data is accessible for exploration in real time.
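One lightweight way to instrument code is a decorator that records call counts and latency and emits a log line per call. The sketch below assumes an in-memory `stats` dictionary as a stand-in for a time-series sink and uses only the standard `logging` and `time` modules; the `instrumented` decorator and the `normalize` function are hypothetical names, not platform APIs.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("metrics-sketch")

# In-memory store of per-function statistics (a stand-in for a time-series sink).
stats = {}


def instrumented(func):
    """Record call count and cumulative latency for the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            entry = stats.setdefault(
                func.__name__, {"calls": 0, "total_seconds": 0.0}
            )
            entry["calls"] += 1
            entry["total_seconds"] += elapsed
            logger.info("%s took %.6f s", func.__name__, elapsed)
    return wrapper


@instrumented
def normalize(values):
    """Scale values linearly to the 0..1 range."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


normalize([1, 2, 3])
normalize([10, 20, 40])
print(stats["normalize"]["calls"], "calls recorded")
```

To make the numbers available for dashboards, the `stats` entries would be written periodically to a time-series table or exposed through a scraping endpoint instead of being held in memory.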

Grafana is natively integrated into the platform. Users can create dashboards programmatically using wizard scripts and access all forms of data (tables, time series, logs, streams) from the different dashboard widgets.

For information on how to create charts in Grafana using Iguazio:
https://www.iguazio.com/docs/tutorials/latest-release/getting-started/trial-qs/grafana-dashboards/

Support

Our support team will be happy to help with any questions.
Feel free to reach out to support@iguazio.com or use the chatbox for direct communication with our experts.

Others

Sample datasets: http://iguazio-sample-data.s3.amazonaws.com/
