Google Data Analytics
Course 3 - Prepare Data For Exploration
Module 1
Data Type and Structure
A Data Exploration
01 Introduction to data exploration
- Understanding the different types of data and data structures
- What type of data is right for the question you’re answering
- Practical skills about how to extract, use, organize and protect your data
Data Analysis Process
- Ask
- Prepare
- Analyze
- Share
- Act
Learn
- How data is generated
- Different formats, types, and structures of data
- Analyze data for bias and credibility
- What “clean data” means
- Databases
- Extract your own data using spreadsheets and SQL
- The basics of data organization
- The process of protecting your data
B Collect Data
01 Data collection in our world
How data is generated
- Interviews
- Observations
- Forms
- Questionnaires
- Surveys
- Cookies
02 Determine what data to collect
Data collection considerations
- How the data will be collected
- Choose data sources
- Solving your business problem
- Decide what data to use
- How much data to collect
- Select the right data type
- Determine the time frame
First-party data: data collected by an individual or group using their own resources
Second-party data: data collected by a group directly from its audience and then sold
Third-party data: data collected from outside sources who did not collect it directly
Population: all possible data values in a certain dataset
Sample: a part of a population that is representative of the population
03 Select the right data
Following are some data-collection considerations to keep in mind for your analysis:
How the data will be collected
Decide if you will collect the data using your own resources or receive (and possibly purchase) it from another party. Data that you collect yourself is called first-party data.
Data sources
If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.
Solving your business problem
Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem question. For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates.
How much data to collect
If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs.
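As a minimal sketch of the random-sample idea above, the following draws a simple random sample with Python's standard library. The population of customer IDs and the sample size of 50 are invented for illustration; a real project would choose them from its own needs.

```python
import random

# Hypothetical population: 1,000 customer IDs (invented for illustration).
population = list(range(1, 1001))

# Fix the seed so the same sample is drawn on every run.
rng = random.Random(42)

# Draw 50 distinct IDs; every ID has the same chance of being chosen.
sample = rng.sample(population, k=50)

print(len(sample))          # number of IDs drawn
print(sorted(sample)[:5])   # a peek at the smallest sampled IDs
```

A simple random sample like this suits projects where any record is as useful as any other. Projects that need to focus on certain criteria would filter the population first.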
Time frame
If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.
C Differentiate data formats and structures
01 Discover data formats
Quantitative and qualitative data
Discrete data: data that is counted and has a limited number of values
Continuous data: data that is measured and can have almost any numeric value
Nominal data: a type of qualitative data that is categorized without a set order
Ordinal data: a type of qualitative data with a set order or scale
Internal data: data that lives within a company’s own systems
External data: data that lives and is generated outside of an organization
Structured data: data organized in a certain format, such as rows and columns (spreadsheets and relational databases)
Unstructured data: data that is not organized in any easily identifiable manner (audio and video files)
02 Data formats in practice
Data format examples
As with most things, it is easier for definitions to click when you can pair them with examples you might encounter on a daily basis. Review each data format’s definition first and then use the examples to lock in your understanding.
Primary versus secondary data
The following table highlights the differences between primary and secondary data and presents examples of each.
Data format classification | Definition | Examples |
---|---|---|
Primary data | Collected by a researcher from first-hand sources | |
Secondary data | Gathered by other people or from other research | |
Internal versus external data
The following table highlights the differences between internal and external data and presents examples of each.
Data format classification | Definition | Examples |
---|---|---|
Internal data | Data that is stored inside a company’s own systems | |
External data | Data that is stored outside of a company or organization | |
Continuous versus discrete data
The following table highlights the differences between continuous and discrete data and presents examples of each.
Data format classification | Definition | Examples |
---|---|---|
Continuous data | Data that is measured and can have almost any numeric value | |
Discrete data | Data that is counted and has a limited number of values | |
Qualitative versus quantitative data
The following table highlights the differences between qualitative and quantitative data and presents examples of each.
Data format classification | Definition | Examples |
---|---|---|
Qualitative | A subjective and explanatory measure of a quality or characteristic | |
Quantitative | A specific and objective measure, such as a number, quantity, or range | |
Nominal versus ordinal data
The following table highlights the differences between nominal and ordinal data and presents examples of each.
Data format classification | Definition | Examples |
---|---|---|
Nominal | A type of qualitative data that is categorized without a set order | |
Ordinal | A type of qualitative data with a set order or scale | |
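To make the nominal and ordinal definitions concrete, here is a small sketch in Python. The survey answers and the `SCALE` mapping are hypothetical. The point is only that an ordinal scale has an order that must be stated explicitly, while nominal categories can be counted but not ranked.

```python
# Hypothetical survey answers (invented for illustration).
favorite_colors = ["pink", "grey", "pink", "blue"]          # nominal: no inherent order
satisfaction = ["high", "low", "medium", "high", "medium"]  # ordinal: low < medium < high

# The order of an ordinal scale has to be stated explicitly,
# because alphabetical order ("high" < "low" < "medium") is not the real order.
SCALE = {"low": 0, "medium": 1, "high": 2}

ordered = sorted(satisfaction, key=SCALE.get)
highest = max(satisfaction, key=SCALE.get)

# For nominal data, counting categories is meaningful; ranking them is not.
color_counts = {color: favorite_colors.count(color) for color in set(favorite_colors)}

print(ordered)
print(highest)
print(color_counts)
```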
Structured versus unstructured data
The following table highlights the differences between structured and unstructured data and presents examples of each.
Data format classification | Definition | Examples |
---|---|---|
Structured data | Data organized in a certain format, like rows and columns | |
Unstructured data | Data that cannot be stored as columns and rows in a relational database | |
04 Continue exploring structured data
Unstructured data examples
- Audio files
- Video files
- Emails
- Photos
- Social media
Structured data
Data organized in a certain format, such as rows and columns
- Data model: a model that is used for organizing data elements and how they relate to one another
- Data elements: pieces of information, such as people’s names, account numbers, and addresses
Sources of structured data
- Spreadsheets
- Databases that store datasets
05 The effects of different structures
Data is everywhere and it can be stored in lots of ways. Two general categories of data are:
- Structured data: Organized in a certain format, such as rows and columns
  - Defined data types
  - Most often quantitative data
  - Easy to organize
  - Easy to search
  - Stored in relational databases and data warehouses
  - Contained in rows and columns
  - Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history
- Unstructured data: Not organized in any easy-to-identify way
  - Varied data types
  - Most often qualitative data
  - Difficult to search
  - Provides more freedom for analysis
  - Stored in data lakes, data warehouses, and NoSQL databases
  - Can’t be put in rows and columns
  - Examples: text messages, social media comments, phone call transcriptions, various log files, images, audio, video
For example, when you rate your favorite restaurant online, you’re creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you’re using unstructured data.
Structured data
As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
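A rough sketch of "the structure goes along with the data": the snippet below exports a few rows to CSV text and reads them back using Python's standard `csv` module. The transaction records and column names are hypothetical. The header row is what carries the structure, so the reloaded data can still be queried by column name.

```python
import csv
import io

# Hypothetical transaction records (invented for illustration).
rows = [
    {"customer_id": "C001", "amount": "19.99", "date": "2024-01-05"},
    {"customer_id": "C002", "amount": "5.00", "date": "2024-01-06"},
    {"customer_id": "C001", "amount": "42.50", "date": "2024-01-09"},
]

# Export to CSV text. The header row carries the column structure with the data.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "amount", "date"])
writer.writeheader()
writer.writerows(rows)

# Read the exported text back: the same columns are available by name.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

# A simple query: total amount spent by customer C001.
total_c001 = sum(float(r["amount"]) for r in loaded if r["customer_id"] == "C001")
print(round(total_c001, 2))
```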
Unstructured data
Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.
The fairness issue
The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others. And as you’re learning, an unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
06 Data modeling levels and techniques
This reading introduces you to data modeling and different types of data models. Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.
Important note: As a junior data analyst, you won’t be asked to design a data model. But you might come across existing data models your organization already has in place.
What is data modeling?
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.
Levels of data modeling
Each level of data modeling has a different level of detail.
Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
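To illustrate what the physical level pins down, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and data types are all hypothetical. The primary keys show the logical-level idea of uniquely identifying each record, expressed with concrete names.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Physical level: concrete table names, column names, and data types.
# All names here are hypothetical.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- uniquely identifies each record
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    )
""")

# List the tables the physical model defines.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)

# List each column of the orders table with its declared data type.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(orders)")]
print(columns)
conn.close()
```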
More information can be found in this comparison of data models here
Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but two common methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system’s entities, attributes, operations, and their relationships. As a junior data analyst, you will need to understand that there are different data modeling techniques, but in practice, you will probably be using your organization’s existing technique.
You can read more about ERD, UML, and data dictionaries in this data modeling techniques article here
Data analysis and data modeling
Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!
D Explore data types, fields, and values
01 Know the type of data you're working with
A data type
Is a specific kind of data attribute that tells what kind of value the data is
A data type in a spreadsheet can be one of three things:
- Number
- Text or string
- Boolean
Text data type, or a string data type
Is a sequence of characters and punctuation that contains textual information. Example: treats and people’s names. These can also include numbers, like phone numbers or numbers in street addresses. But these numbers wouldn’t be used for calculations. In this case they’re treated like text, not numbers
A Boolean data type
Is a data type with only two possible values: true or false
Number data type
Used to calculate and create formulas
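The three data types map loosely onto Python values, which gives a quick way to see why a phone number is treated as text. The record below is hypothetical; Python's `str`, `int`/`float`, and `bool` stand in for the spreadsheet's text, number, and Boolean types.

```python
# Hypothetical spreadsheet-style record (invented for illustration).
record = {
    "name": "Ana",               # text (string)
    "phone": "5550100",          # text: made of digits, but never used in calculations
    "treats_ordered": 3,         # number
    "price_per_treat": 2.5,      # number
    "is_member": True,           # Boolean
}

# Numbers support calculations.
order_total = record["treats_ordered"] * record["price_per_treat"]

# Adding two strings joins them instead of summing them.
joined = record["phone"] + "1"

print(order_total)
print(joined)
print(type(record["is_member"]).__name__)
```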
02 Use Boolean logic
In this reading, you will explore the basics of Boolean logic and learn how to use single and multiple conditions in a Boolean statement. These conditions are created with Boolean operators, including AND, OR, and NOT. These operators are similar to mathematical operators and can be used to create logical statements that filter your results. Data analysts use Boolean statements to do a wide range of data analysis tasks, such as writing queries for searches and checking for conditions when writing programming code
Boolean logic example
Imagine you are shopping for shoes, and are considering certain preferences:
- You will buy the shoes only if they are any combination of pink and grey
- You will buy the shoes if they are entirely pink, entirely grey, or if they are pink and grey
- You will buy the shoes if they are grey, but not if they have any pink
Use Boolean logic in statements
In queries, Boolean logic is represented in a statement written with Boolean operators. An operator is a symbol that names the operation or calculation to be performed. Read on to discover how you can convert your shoe preferences into Boolean statements
The AND operator
Your condition is “If the color of the shoe has any combination of grey and pink, you will buy them.” The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say IF (Color=“Grey”) AND (Color=“Pink”) then buy them.
The AND operator lets you stack both of your conditions.
Below is a simple truth table that outlines the Boolean logic at work in this statement. In the Color is Grey column, there are two pairs of shoes that meet the color condition. And in the Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND Pink column, only one pair of shoes meets both conditions. So, according to the Boolean logic of the statement, there is only one pair marked true. In other words, there is one pair of shoes that you would buy
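The truth table described above can be sketched in Python. The four pairs of shoes are hypothetical, chosen so the counts match the description: two pairs contain grey, two contain pink, and one contains both. One assumption to note: each pair is modeled as a set of colors, since a single color value could not equal both “Grey” and “Pink” at once.

```python
# Hypothetical shoes; each pair lists every color it contains.
shoes = {
    "pair A": {"grey"},
    "pair B": {"pink"},
    "pair C": {"grey", "pink"},
    "pair D": {"black"},
}

# Build one truth-table row per pair: (is grey, is pink, grey AND pink).
truth_table = {
    name: ("grey" in colors, "pink" in colors, "grey" in colors and "pink" in colors)
    for name, colors in shoes.items()
}

for name, row in truth_table.items():
    print(name, row)

# Only pairs where the AND column is True are bought.
to_buy = [name for name, row in truth_table.items() if row[2]]
print(to_buy)
```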
The OR operator
The OR operator lets you move forward if either one of your two conditions is met. Your condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement would be IF (Color=“Grey”) OR (Color=“Pink”) then buy them.
Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.
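A short sketch of the OR condition, using hypothetical shoes modeled as sets of colors (an assumption made for illustration). With this data, three pairs satisfy the condition, matching the count described above.

```python
# Hypothetical shoes; each pair lists every color it contains.
shoes = {
    "pair A": {"grey"},
    "pair B": {"pink"},
    "pair C": {"grey", "pink"},
    "pair D": {"black"},
}

# OR is True when at least one of the two conditions is True.
to_buy = [
    name for name, colors in shoes.items()
    if "grey" in colors or "pink" in colors
]
print(to_buy)
```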
The NOT operator
Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is “You will buy any grey shoe except for those with any traces of pink in them.” Your Boolean statement would be IF (Color=“Grey”) AND (Color=NOT “Pink”) then buy them.
Now, all of the grey shoes that aren’t pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded.
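The NOT condition can be sketched the same way. The shoes below are hypothetical and modeled as sets of colors, so the counts will not necessarily match the course's own truth table. The sketch only shows how `not` removes pairs that contain pink from the grey ones.

```python
# Hypothetical shoes; each pair lists every color it contains.
shoes = {
    "pair A": {"grey"},
    "pair B": {"pink"},
    "pair C": {"grey", "pink"},
    "pair D": {"black"},
}

# Grey AND NOT pink: keep grey pairs, then drop any that contain pink.
to_buy = [
    name for name, colors in shoes.items()
    if "grey" in colors and not "pink" in colors
]
print(to_buy)
```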
The power of multiple conditions
For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: IF ((Color = “Grey”) OR (Color = “Pink”)) AND (Waterproof = “True”).
Notice that you can use parentheses to group your conditions together.
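The following sketch shows why the grouping matters, using hypothetical shoes with a color and a waterproof flag. In Python, `and` is evaluated before `or`, so leaving out the parentheses changes which shoes pass the filter.

```python
# Hypothetical shoes with a color and a waterproof flag.
shoes = [
    {"name": "pair A", "color": "Grey", "waterproof": True},
    {"name": "pair B", "color": "Pink", "waterproof": False},
    {"name": "pair C", "color": "Grey", "waterproof": False},
    {"name": "pair D", "color": "Black", "waterproof": True},
]

# (Color = "Grey" OR Color = "Pink") AND Waterproof = True
grouped = [
    s["name"] for s in shoes
    if (s["color"] == "Grey" or s["color"] == "Pink") and s["waterproof"]
]

# Without parentheses, Python evaluates `and` before `or`, so this reads as
# Color = "Grey" OR (Color = "Pink" AND Waterproof = True).
ungrouped = [
    s["name"] for s in shoes
    if s["color"] == "Grey" or s["color"] == "Pink" and s["waterproof"]
]

print(grouped)
print(ungrouped)
```

With this data, the grouped version keeps only the waterproof grey pair, while the ungrouped version also lets the non-waterproof grey pair through.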
Key takeaways
Operators are symbols that name the operation or calculation to be performed. The operators AND, OR, and NOT can be used to write Boolean statements in programming languages. Whether you are doing a search for new shoes or applying this logic to queries, Boolean logic lets you create multiple conditions to filter your results. Now that you know a little more about Boolean logic, you can start using it!
Resources for more information
Learn about who pioneered Boolean logic in this historical article: Origins of Boolean Algebra in the Logic of Classes
Find more information about using AND, OR, and NOT from these tips for searching with Boolean operators
03 Data table components
Fields, Rows and Columns
04 Step-by-Step: Meet wide and long data
Fields, Rows and Columns
Examine wide data
- Sort
Examine long data
05 Meet wide and long data