Definition
Data Science applies the scientific method to extract insights from data and solve business problems, turning data into value. It combines statistics, computer science, and domain-specific business knowledge to analyze and interpret data.
Data scientists use a variety of techniques to find patterns, make predictions, and provide information that supports decision-making.
Use of Machine Learning in Data Science
Machine Learning allows systems to learn and improve from experience without being explicitly programmed for each task.
Types of ML
Supervised Learning
Models trained with labeled data to predict results.
- Linear regression
- Decision trees
- Neural Networks
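As a rough illustration of learning from labeled data, here is a minimal sketch of simple linear regression using ordinary least squares in plain Python. The function name `fit_line` and the ad-spend figures are made up for the example; in practice a library would normally do this.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled data: ad spend (input) and units sold (label).
ad_spend = [1.0, 2.0, 3.0, 4.0]
units_sold = [3.0, 5.0, 7.0, 9.0]  # constructed so that y = 2x + 1

slope, intercept = fit_line(ad_spend, units_sold)
prediction = slope * 5.0 + intercept  # predict the label for an unseen input
print(slope, intercept, prediction)
```

The labels are what make this supervised: the model is fitted so that its outputs match known answers, then reused on inputs it has not seen.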
Unsupervised Learning
Models that find patterns in unlabeled data.
- Clustering
- Principal Component Analysis
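To make clustering concrete, the sketch below runs k-means (Lloyd's algorithm) on one-dimensional points in plain Python. The function `kmeans_1d`, the starting centroids, and the sample points are assumptions chosen for illustration only.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = sum(members) / len(members)
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are supplied anywhere: the grouping emerges only from the distances between points.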
Reinforcement Learning
Models that learn to make decisions through trial and error.
- Q-Learning
- Deep Q-Networks
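The core of tabular Q-Learning is a single update rule, sketched below. The two-state environment, the state and action names, and the `alpha` and `gamma` values are hypothetical; a full agent would also need an exploration strategy and an environment loop.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Best value the agent currently expects from the next state.
    best_next = max(q[next_state].values())
    # Temporal-difference target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical Q-table: every state-action value starts at zero.
q = {
    "start": {"left": 0.0, "right": 0.0},
    "goal": {"left": 0.0, "right": 0.0},
}

first = q_update(q, "start", "right", reward=1.0, next_state="goal")
second = q_update(q, "start", "right", reward=1.0, next_state="goal")
print(first, second)
```

Repeating the same rewarded transition pushes the estimate for that action upward, which is the trial-and-error learning described above.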
What is Data Science?
Data Science sits at the intersection of three disciplines:
- Computer Science
- Math
- Business Experience
Different Types of Analysis
Descriptive Analytics
Answers: What is happening? It depends on accurate data collection.
Diagnostic Analytics
Answers: Why did something happen? It involves drilling down to the root cause of a problem.
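The difference between the two can be sketched with a few aggregations. The sales records below are invented for the example: the descriptive step summarizes what happened per month, and the diagnostic step drills down by region to see where the change came from.

```python
from collections import defaultdict

# Hypothetical sales records: (month, region, revenue).
sales = [
    ("Jan", "North", 100), ("Jan", "South", 100),
    ("Feb", "North", 100), ("Feb", "South", 60),
]

# Descriptive: what is happening? Total revenue per month.
revenue_by_month = defaultdict(int)
for month, region, revenue in sales:
    revenue_by_month[month] += revenue

# Diagnostic: why did it happen? Change from Jan to Feb, broken down by region.
change_by_region = defaultdict(int)
for month, region, revenue in sales:
    change_by_region[region] += revenue if month == "Feb" else -revenue

print(dict(revenue_by_month))
print(dict(change_by_region))
```

In this toy data the monthly totals show a drop, and the regional breakdown attributes all of it to one region.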
What is Data?
Data are raw values, without context or interpretation. By themselves, these data points are meaningless.
Information, on the other hand, is the result of processing and organizing that data so it becomes useful, e.g. calculating the average monthly sales.
They’re easy to tell apart. In marketing, a data point could be “500 clicks on an ad campaign”, whereas a piece of information is “The latest ad campaign generated 10% more clicks than the previous one”.
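A small sketch of that step from data to information, assuming two hypothetical click counts (the previous campaign's figure is not given in these notes and is invented here):

```python
# Raw data points: hypothetical click counts for two campaigns.
previous_clicks = 500
latest_clicks = 550

# Processing the data: relative change between the two campaigns.
change_pct = (latest_clicks - previous_clicks) / previous_clicks * 100

# Information: the same numbers, now placed in context.
message = (
    f"The latest ad campaign generated {change_pct:.0f}% more clicks "
    "than the previous one."
)
print(message)
```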
Data Science Process
- Obtain: Gather data from relevant sources
- Scrub: Clean data into formats that the machine understands
- Explore: Find significant patterns and trends using statistical methods
- Model: Construct models to predict and forecast
- Interpret: Put the results to good use
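The five stages above can be sketched as a chain of small functions. Everything here is an assumption made for illustration: the function names, the hard-coded raw values standing in for a real source, and the deliberately naive forecast used as the "model".

```python
from statistics import mean


def obtain():
    # Stand-in for a real source: hypothetical raw monthly sales as strings.
    return ["100", "110", "n/a", "130", " 140 "]


def scrub(raw):
    # Keep only the values that parse as numbers.
    cleaned = []
    for value in raw:
        try:
            cleaned.append(float(value.strip()))
        except ValueError:
            continue  # drop entries such as "n/a"
    return cleaned


def explore(values):
    # Basic summary statistics.
    return {"count": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}


def model(values):
    # Naive forecast: last value plus the average step between observations.
    steps = [b - a for a, b in zip(values, values[1:])]
    return values[-1] + mean(steps)


def interpret(summary, forecast):
    # Turn the numbers into a statement someone can act on.
    return (f"Average sales were {summary['mean']:.1f}; "
            f"next period is forecast at {forecast:.1f}.")


values = scrub(obtain())
summary = explore(values)
forecast = model(values)
report = interpret(summary, forecast)
print(report)
```

Each stage takes the previous stage's output as its input, which is why problems introduced while obtaining or scrubbing carry through to the final interpretation.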
Obtaining, scrubbing, and exploring data usually take up most of a project's time; 80% is a commonly cited rule of thumb.
Types of Data
Structured Data
It’s both tabular and standardized, e.g. a relational database table or a spreadsheet.
Unstructured Data
It’s neither tabular nor standardized, e.g. free text, images, or audio.
Semi-structured Data
It’s not tabular, but it is standardized, e.g. JSON or XML documents.
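The three types can be compared side by side using only the Python standard library. The sample CSV, JSON, and review text below are invented for the example.

```python
import csv
import io
import json

# Structured: tabular, every row has the same columns.
csv_text = "customer,clicks\nana,12\nbob,7\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: a standard format (JSON), but nested and with
# fields that may differ from one record to the next.
json_text = (
    '{"customer": "ana", "events": '
    '[{"type": "click"}, {"type": "view", "page": "/home"}]}'
)
record = json.loads(json_text)

# Unstructured: free text with no predefined fields; it needs extra
# processing before it can be analyzed.
review = "Loved the product, but shipping was slow."
word_count = len(review.split())

print(rows[0]["clicks"], record["events"][1]["page"], word_count)
```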
Data Science Tools
Python
Python is used together with libraries such as pandas and NumPy.
pandas is the core library for data manipulation and appears throughout the Data Science workflow.
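A minimal sketch of what that manipulation looks like, assuming pandas and NumPy are installed. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sales table with one missing value.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["North", "South", "North", "South"],
    "sales": [100.0, np.nan, 120.0, 80.0],
})

# Scrub: drop the rows where the sales figure is missing.
clean = df.dropna(subset=["sales"])

# Explore: average sales per month.
avg_by_month = clean.groupby("month")["sales"].mean()
print(avg_by_month)
```

`dropna` and `groupby` cover the scrub and explore steps here; NumPy appears only to supply the `nan` marker for the missing value.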
Data Sources
See Data Acquisition.
Data Processing
See Data Processing.
Data Exploring
See Data Exploring.
Workflows
See Workflows.
Modeling
See Modeling.
Interpretation
See Data Interpretation.