Q1: In the data science terminology, how do you call the data that you analyze?
In data science, you analyze datasets. Datasets consists of cases, which are the entities you analyze. Cases are described by their variables, which represent the attributes of the entities. The first important question you need to answer when you start a data science project is what exactly is your case. Is this a person, a family, an order? Then you collect all of the knowledge about each case you can get and store this information in the variables.
For more information, see the article: Introduction to data science, data understanding and preparation
Q2: What are the data structures you use to store the datasets?
In SQL Server, you store the dataset you are analyzing in a table. A table can by physical or a virtual one, which is a view. SQL Server tables follow the relational model, meaning that they represent a set. Set theory is the basis for the relational model. Cases are rows, and variables columns.
R introduces the structure called data frame. A data frame is a matrix. The basis is linear algebra.
Python follows R and also introduces data frames. However, unlike with R, where the data frame structure is a part of the core engine, you need to use an additional library called pandas in Python in order to get data frames available.
For more information, see the article: Introduction to data science, data understanding and preparation
Q3: Is there a difference between a table and a data frame?
Yes, there is an important difference. An SQL Server table represents a set, a data frame a matrix. Because the order of rows and columns in a set is not defined, you cannot refer to the data values by their position. In order to read a value from a table, you need to know the table name, the key of the row, and the name of the column. In an R or Python data frame, you can use numerical positional indexes to read the data values in their position.
For more information, see the article: Introduction to data science, data understanding and preparation
Q4: Do all variables measure data values in the same way?
No, there are two basic classes of variables:
- Discrete variables have a limited pool of possible distinct values. Discrete variables can be categorical, ordinal, or dichotomous.
- Continuous variables have an infinite pool of values. Continuous variables can be open or closed intervals or true numerical variables, or true numeris.
For more information, see the article: Introduction to data science, data understanding and preparation
Q5: How do you get an overview of a distribution of a discrete variable?
You should use frequency tables, or shortly frequencies. There can many different pieces of information in a frequency table. At minimum, there must be values and counts of those values, or an absolute frequency. You can also show the absolute percentage, the cumulative frequency, and the cumulative percent. Graphically, you use bar charts and histograms.
For more information, see the article: Introduction to data science, data understanding and preparation
Q6: Do you treat ordinal discrete variables differently than categorical?
Ordinal variables, or ordinals, have an intrinsic order. You need to define this order correctly to get the appropriate results when you use them for analysis.
Categorical variables are also called nominal, because they provide only the name for each category, and nothing more.
For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies
Q7: How do you define the order of ordinals?
There are many ways how to define the order of ordinals correctly. In T-SQL, you can use the CASE expression to modify the data values. In R and Python, you have additional possibilities besides modifying the data values. In R, you define all discrete variables as factors, and distinct values are called levels. For ordinals, you also define the order of the levels. In Python, you use the astype(‘category’) method of a data frame to define discrete variables, and then the pandas cat.reorder_categories() function to define the proper order.
For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies
Q8: How do you convert discrete variables to numerics?
If the values are ordinals, you can use the numeric positional index of the categories as the numerical representation of the variable. For nominals, you cannot use a number for a category, because numbers have an intrinsic order. You create a new indicator variable for every possible value of a nominal, showing whether the value is taken or not. Such indicators are called dummies.
For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies
Q9: How do you create dummies from nominals?
In SQL Server, you can use the CASE expression again. In addition, since there are only two possibilities for a dummy, 0 or 1, you can also use the T-SQL IIF() function, which is a shortcut for CASE when you have only two possible outcomes.
In R, you can use the dummy() function from the dummies package. In Python, you can use the pandas get_dummies() function.
For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies
Q10: How do you convert numerical variables to discrete?
Binning, or discretization, is the process of creating discrete variables from numerics. Note that there are many different ways of binning. Nevertheless, since numerics have an order, you typically want to preserve the order. Thus, you create ordinals from numerics.
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q11: What are the different way of discretization?
There are many different ways of doing the binning. Each possibility has its own advantages and disadvantages. The most popular ones are:
- Equal width binning, where the width of each bin (the interval of a continuous variable) is equal
- Equal height binning, where you have equal number of cases in each bin, but the width of the bins varies
- Custom binning, where you define the bins based on the content of the data, or based on the business logic
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q12: What are the advantages of different ways of binning?
The most important advantages of different ways of binning are:
- For equal width binning, you preserve the shape of the distribution
- For equal height binning, you maximally preserve the information in the variable
- For custom binning, you follow the logic of real life
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q13: How do you do the discretization?
In T-SQL, you use the CASE expression for equal width and custom binning. You can use the NTILE() window function for equal height binning.
In Python, you use the pandas cut() function for equal width and custom binning. For equal height binning, you can use the qcut() function.
In R, you can use the cut() function from the base installation for equal width and custom binning. For equal height binning, you can search for a function is some additional package. Alternatively, it is quite simple to write your own function.
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q14: Is there a measure for the amount of information in a variable?
Yes, the measure for the information is the entropy. Entropy is defined in the information theory branch of applied mathematics. Basically, the information is the same thing as the surprise. If you already know a piece of information, you are not surprised when somebody tells it to you or when you read or see it again. Entropy is a quantified measure for the information.
For more information, see the article: Data science, data understanding and preparation – entropy of a discrete variable
Q15: Which variable has a higher entropy, a constant or an equal height binned one?
Entropy is also uncertainty. The more uncertainty, the higher possible surprise. There is no uncertainty in a variable that occupies only one value for all cases, for a constant. Entropy of a constant is zero. Constants are not useful for analyses. On the other hand, variables with equal number of cases in each class have the maximal possible uncertainty for a specific number of distinct classes.
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q16: Which variable has a higher entropy, a numeric or a discrete one?
Numerical variables can have more information than discrete ones, although this depends on the distribution. The maximal possible entropy of a discrete variable is when you have equal height distribution. The maximal possible entropy of a continuous variable for a given variance is when the distribution is normal. With more distinct classes of a discrete variable, the maximal possible entropy raises. Therefore, when you discretize a variable, you are losing some entropy.
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q17: How do you calculate the entropy of a discrete variable?
In T-SQL, you need to do your own calculation, there is no function out of the box. You need the aggregate functions, window aggregate functions, and the LOG() function.
In Python, you can use the stats.entropy() function from the scipy library. However, you need to calculate counts in advance.
In R, you can use the Entropy() function from the DescTools package. This function also expects the counts as the inputs.
For more information, see the article: Data science, data understanding and preparation – binning a continuous variable
Q18: Are R and Python data frame positional indexes the same?
No, there is a slight difference when you refer to the data in R or in python Pandas data frame. Python index is zero-based, R one-based. When you define an interval by an index, in Python the interval does not include the upper bound, while in R does. Therefore, TM[3:6, 1:4] returns the same rows and columns as TM.iloc[2:6, 0:4] in Python
For more information, see the article: Data science, data understanding and preparation – basic work with datasets
Q19: What do you use in T-SQL to read the data?
In T-SQL, you use, or course, the mighty SELECT statement to read the data. The core elements are:
- The FROM clause, where you define the source tables and how to join them
- The SELECT clause, where you define the columns, or the projection, and the computed columns
- The WHERE clause, where you define the filters for the rows
- The ORDER BY clause, where you define the order of the rows returned
For more information, see the article: Data science, data understanding and preparation – basic work with datasets
Q20: What are the basic operations on data frames in R?
In R, here are some basic operations on data frames:
- Retrieve the values by the index positions for rows and columns
- Retrieve the values by row and column names
- Join two data frames with the merge() function
- Bind data frames by columns
- Bind data frames by rows
For more information, see the article: Data science, data understanding and preparation – basic work with datasets
Q21: What are the basic operations on data frames in Python?
In pandas, there are many functions that help you operating on data frames. The most basic manipulations you can do include:
- Creating projections by using a subset of columns
- Retrieve the values by the index positions for rows and columns
- Locate the values by a predicate for the rows and columns
- Join two data frames with the pandas merge() function
- Reorder a data frame with the sort() method
For more information, see the article: Data science, data understanding and preparation – basic work with datasets
Q22: How do you aggregate data in T-SQL?
If you need to aggregate the data in T-SQL, you use the aggregate functions. There are the basic aggregate functions available, including some statistical functions; however, the set of aggregate functions is quite limited.
If you need to perform the aggregations in groups, you use the GROUP BY clause. If you need to filter on the aggregated values, you use the HAVING clause. You can create multiple groupings in a single statement with the GROUPING SETS clause.
For more information, see the article: Data science, data understanding and preparation – grouping and aggregating data
Q23: How do you aggregate data in R?
In R, you can start with the summary() function from the base installation. It gives you a quick overview of a variable distribution with descriptive statistics. It returns multiple measures. There are many individual aggregates and statistical functions already in the base package. Of course, you get countless more with additional packages. You can use the aggregate() function from the base installation to do the aggregations in groups.
For more information, see the article: Data science, data understanding and preparation – grouping and aggregating data
Q24: How do you aggregate data in Python?
In Python, you use the pandas aggregations on the data frames. A pandas data frame has the describe() method, which gives you an overview of a distribution with descriptive statistics similarly to the R summary() function. There are many more pandas data frame methods for calculating individual descriptive statistics values. You can use the groupby() data frame method to calculate the aggregations in groups.
For more information, see the article: Data science, data understanding and preparation – grouping and aggregating data
Table of contents
- Data Science in SQL Server: Unpivoting Data - October 29, 2018
- Data science in SQL Server: Data analysis and transformation – Using SQL pivot and transpose - October 11, 2018
- Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II - September 28, 2018