Iguazio Platform Getting Started Guide and Tutorials

Platform Overview

Iguazio provides a fully integrated and secure data science PaaS that simplifies development, accelerates performance,
enables collaboration, and addresses operational challenges. The platform incorporates the following components:

• Data science workbench (Jupyter with integrated analytics engines & Python packages)
• Real-time dashboards based on Grafana
• Managed data and ML services over a scalable Kubernetes cluster
• Real-time serverless functions framework (Nuclio)
• Extremely fast and secure data layer supporting SQL, NoSQL, time series, files/objects, and streaming
• Integration with third-party data sources (S3, HDFS, SQL DBs, streaming/messaging protocols, etc.)


We use Kubernetes as the baseline cluster manager and deploy various microservices on top of it to address the different data science tasks.
Most services support scaling out and GPU acceleration, and have secure, low-latency access to the Iguazio shared database and file system,
enabling high performance and scalability at maximum resource efficiency.

The platform makes extensive use of Nuclio serverless functions to automate tasks such as data collection, ETL, custom APIs, model serving, and batch jobs.
A function describes the code together with all the resource definitions and configuration needed to run it. Functions auto-scale and can be versioned.
Functions can be generated automatically in various ways (UI, Docker, Git, and Jupyter), as demonstrated in the various tutorials.
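
As a rough illustration of what such a function can look like, the sketch below follows the `handler(context, event)` entry-point convention used by Nuclio's Python runtime. It assumes the event body arrives as a JSON string (or bytes) containing a `values` list, which is a made-up payload for this example. The `make_fake_context` and `make_fake_event` helpers are hypothetical stand-ins, added only so the handler can be exercised locally without a cluster; they are not part of Nuclio.

```python
import json
import logging
from types import SimpleNamespace


def handler(context, event):
    """Entry point in the style of a Nuclio Python function.

    Assumes event.body is a JSON document such as {"values": [1, 2, 3]}
    and returns a JSON string with the count and mean of the values.
    """
    payload = json.loads(event.body)
    values = payload.get("values", [])
    if not values:
        context.logger.info("received an empty batch")
        return json.dumps({"count": 0, "mean": None})

    mean = sum(values) / len(values)
    context.logger.info("processed %d values" % len(values))
    return json.dumps({"count": len(values), "mean": mean})


# --- Hypothetical local stand-ins, not part of Nuclio ---
def make_fake_context():
    """Build an object exposing only a .logger attribute with an info() method."""
    logging.basicConfig(level=logging.INFO)
    return SimpleNamespace(logger=logging.getLogger("local-test"))


def make_fake_event(body):
    """Build an object exposing only a .body attribute."""
    return SimpleNamespace(body=body)


if __name__ == "__main__":
    ctx = make_fake_context()
    evt = make_fake_event(json.dumps({"values": [1, 2, 3, 4]}))
    print(handler(ctx, evt))
```

Keeping the handler free of platform-specific setup makes it easy to test in a notebook before deploying it.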

For more details:

Data science workflow on the Iguazio platform

Iguazio enables a complete data science workflow in a single, ready-to-use platform:

  • Collect, explore and label data coming from various real-time or offline sources
  • Run ML training and validation at scale over multiple CPUs and GPUs
  • Deploy models and applications into production with serverless functions
  • Log, monitor, and version all your data and services


Iguazio provides all the building blocks for creating data science applications from research to production.

Data collection and ingestion

For details, see the data collection and exploration tutorial.

There are many ways to collect or ingest data into the system from various sources:

  • Real-time streaming (e.g., using Kafka, Kinesis, Azure Event Hubs, Google Pub/Sub)
  • Loading data directly from external databases in an event-driven or periodic/scheduled way
  • Loading data from internal or external file/object sources such as S3 or Hadoop (using CSV, Parquet, or JSON formats)
  • Importing time-series telemetry data using a Prometheus-compatible scraping API
  • Pushing/ingesting data directly into the system via AWS-like object, streaming, and NoSQL REST APIs
  • Implementing custom Nuclio functions that scrape data from external sources or read from external APIs (e.g., Twitter, weather services, stock trading data, etc.)
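
Whatever the source, ingestion code usually ends up parsing a payload and normalizing it into uniform records before writing it to the platform. The sketch below shows that step for CSV and line-delimited JSON inputs, using only the Python standard library. The sample fields (`symbol`, `price`) and the helper names are hypothetical, and the actual fetch from the source and the write to the platform are deliberately left out.

```python
import csv
import io
import json


def parse_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_lines(text):
    """Parse line-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def normalize(record, source):
    """Coerce a raw record into a uniform shape (hypothetical schema)."""
    return {
        "symbol": str(record["symbol"]).upper(),
        "price": float(record["price"]),
        "source": source,
    }


def ingest(csv_text, jsonl_text):
    """Combine records from both formats into one normalized list."""
    records = [normalize(r, "csv") for r in parse_csv(csv_text)]
    records += [normalize(r, "json") for r in parse_json_lines(jsonl_text)]
    return records


if __name__ == "__main__":
    csv_sample = "symbol,price\naapl,101.5\nmsft,55\n"
    jsonl_sample = '{"symbol": "goog", "price": 730.25}\n'
    for row in ingest(csv_sample, jsonl_sample):
        print(row)
```

Normalizing at ingestion time means downstream tools all see the same field names and types.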

Data Exploration and Processing

Iguazio provides a wide range of pre-integrated data query and exploration tools. The most common ones are:

  • Apache Spark SQL, ML, R, and Graph (with real-time access to the Iguazio DB and file system)
  • Interactive SQL queries (using the Presto distributed processing engine over the Iguazio DB or file/object data sources)
  • Python Pandas DataFrames (or Dask for distributed, Pandas-like processing)
  • Frames - Iguazio's open-source, high-speed data access library, providing a unified interface for NoSQL tables, time-series tables, and streaming data,
    with native integration with Pandas and NVIDIA RAPIDS
  • Built-in ML packages: scikit-learn, Pyplot, NumPy, PyTorch, and TensorFlow

All the tools are integrated with Jupyter Notebook, allowing access to the same data through multiple tools and APIs with minimal configuration overhead.
The Python environment comes with conda pre-deployed, and users can install any package using pip or conda.

Note: to deploy add-on services such as Spark, use the platform services tab.

Building and training models

Models can be developed and tested in Jupyter notebooks or using external editors.
Once you have built a model, you can train it inside Jupyter or use scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes Jobs.
You can see examples of model training in the predictive infrastructure monitoring tutorial (using scikit-learn) or in the image recognition tutorial (using TensorFlow and Keras).

If you are a beginner, the guide Machine Learning Algorithms In Layman’s Terms can come in handy.

Model deployment to production

With the Iguazio platform, users can easily deploy their models to production in a reproducible way using the open-source Nuclio serverless framework.
Nuclio takes code (or notebooks) coupled with resource definitions (CPU, memory, GPU, etc.), environment variables, package or software dependencies, data links, and trigger information. It automatically builds the code, generates custom container images, and wires them up to the relevant compute or data resources.
Nuclio functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs. Read more details about Nuclio.
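
To make the idea of "code coupled with resource definitions, environment variables, dependencies, and triggers" more concrete, here is a minimal sketch of that bundle as a plain Python data structure. The `FunctionSpec` and `Trigger` classes, their field names, and the validation rules are hypothetical and written for this README; they do not mirror Nuclio's actual configuration schema, which is described in the Nuclio documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Trigger:
    """Hypothetical description of an event source for a function."""
    kind: str                                  # e.g. "http" or "cron"
    attributes: dict = field(default_factory=dict)


@dataclass
class FunctionSpec:
    """Hypothetical bundle of everything needed to build and run a function."""
    name: str
    handler: str                               # "module:function" entry point
    cpu: float = 1.0                           # requested CPU cores
    memory_mb: int = 256                       # requested memory
    gpus: int = 0                              # requested GPUs
    env: dict = field(default_factory=dict)    # environment variables
    packages: list = field(default_factory=list)   # package dependencies
    triggers: dict = field(default_factory=dict)   # trigger name -> Trigger

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if ":" not in self.handler:
            problems.append("handler should look like 'module:function'")
        if self.cpu <= 0 or self.memory_mb <= 0 or self.gpus < 0:
            problems.append("resource requests must be positive")
        return problems


if __name__ == "__main__":
    spec = FunctionSpec(
        name="model-server",
        handler="serving:handler",
        cpu=2,
        memory_mb=1024,
        env={"MODEL_PATH": "/models/latest"},
        packages=["scikit-learn"],
        triggers={
            "web": Trigger("http"),
            "nightly": Trigger("cron", {"schedule": "0 0 * * *"}),
        },
    )
    print(spec.validate())
```

Keeping the code and its runtime requirements in one versioned description is what makes a deployment reproducible.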

Nuclio functions can be created in the platform UI or with standard code IDEs, and then deployed on the cluster.
One of the most convenient ways to develop and deploy functions is to use Jupyter and the Python tools.

Here is an overview of Nuclio and of how to deploy your Python code from Jupyter as a serverless function:
https://github.com/nuclio/nuclio-jupyter/blob/master/README.md#installing
Many of the tutorials demonstrate how functions can be documented and deployed directly from a notebook, e.g., deploying the network operation model as a function.

Note that Nuclio functions are not limited to model serving. They can also automate data collection, serve custom APIs, build real-time feature vectors, drive triggers, etc.

Visualization, monitoring and logging

Collected data, internal/external telemetry and logs, output data, etc. can be visualized in different ways simultaneously.
The Iguazio platform supports multiple standard APIs (SQL, Prometheus, Grafana, Pandas, etc.) that can be used to visualize data.
This includes plotting or charting data within Jupyter using matplotlib, using external BI tools such as Tableau via the SQL/JDBC APIs, or building real-time dashboards in Grafana.

The different tools and services generate telemetry and log data, which can be stored in the Iguazio time-series database or in external tools such as Elasticsearch.
Users can easily instrument their code and functions to collect various statistics or logs, and the same data is accessible for exploration in real time.
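
One simple way to instrument code is a decorator that records call counts and latency for each function and emits a log line per call. The sketch below uses only the Python standard library. The in-memory `STATS` dictionary is an assumption made for illustration and stands in for whatever store is actually used (for example, a time-series table); writing to the platform is not shown.

```python
import functools
import logging
import time

logger = logging.getLogger("instrumentation")

# In-memory stand-in for a real metrics store (assumption for this sketch).
STATS = {}


def instrumented(func):
    """Record call count and cumulative latency for the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call succeeded or raised.
            elapsed = time.perf_counter() - start
            entry = STATS.setdefault(
                func.__name__, {"calls": 0, "total_seconds": 0.0}
            )
            entry["calls"] += 1
            entry["total_seconds"] += elapsed
            logger.info("%s took %.6f s", func.__name__, elapsed)
    return wrapper


@instrumented
def score(values):
    """Toy workload: return the mean of a list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    score([2, 4])
    score([1, 2, 3])
    print(STATS)
```

Because the bookkeeping sits in a `finally` block, failed calls are counted and timed as well.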

Grafana is natively integrated into the platform. Users can create dashboards programmatically using wizard scripts and access all forms of data (tables, time series, logs, streams) from the different dashboard widgets.

For information on how to create charts in Grafana using Iguazio, see:
https://www.iguazio.com/docs/tutorials/latest-release/getting-started/trial-qs/grafana-dashboards/

Support

Our support team will be happy to help with any questions.
Feel free to reach out to support@iguazio.com or use the chatbox for direct communication with our experts.

Others

Sample datasets: http://iguazio-sample-data.s3.amazonaws.com/
