SQL Guide for Beginner Data Scientists and Developers
Table of Contents
- What Is SQL?
- What Are Common SQL Commands?
- What Are SQL Data Types?
- SQL Careers
- The Importance of SQL to Data Scientists and Developers
Businesses thrive by making good decisions based on relevant, up-to-date information about market and industry trends, competitors, and partners. One of the main ways companies access business intelligence to help them make those decisions is by extracting data from databases using the Structured Query Language (SQL, pronounced “sequel”).
This SQL guide presents the language as the primary tool for querying databases to generate the reports that business managers rely on to make decisions. SQL dates back to the earliest relational databases developed in the 1970s by Dr. Edgar Codd, Raymond Boyce, Donald Chamberlin, and other researchers at IBM, as Oracle-Patches describes. Relational databases allow data saved in one table to be related to data in any other table in the database, even when those tables hold thousands or millions of records.
SQL makes it easier and faster than it’s ever been to search for and retrieve data on business-related topics and output the results in the form of a report. Since its inception in the 1970s, SQL has evolved into diverse forms, some proprietary and some open source. Open-source forms can be used with no license fees, although vendors are paid for support, services, and enhancements.
What Is SQL?
SQL is a tool used to query a relational database. It takes advantage of the relational model’s ability to assign attributes or properties to data, such as a number, date, monetary unit, or specific value. The basic SQL format consists of tables, rows, columns, and fields:
- Database tables are made up of rows and columns, similar to a spreadsheet.
- Each row represents a specific record, and each column represents a field in that record.
- Properties can be assigned to the table and to each column in the table to ensure that the data in each field conforms to the database’s requirements.
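As a minimal sketch of the table, row, column, and property ideas above, the following uses Python's built-in sqlite3 module with an in-memory database. The customers table and its column names are hypothetical, chosen only for illustration; other database systems use slightly different type names and constraint syntax.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id     INTEGER PRIMARY KEY,          -- uniquely identifies each row
        name   TEXT    NOT NULL,             -- a value is required
        age    INTEGER CHECK (age >= 0),     -- negative ages are rejected
        joined TEXT    DEFAULT CURRENT_DATE  -- filled in when omitted
    )
    """
)

# Each inserted tuple becomes one row (record); each value fills one column (field).
conn.execute("INSERT INTO customers (name, age) VALUES (?, ?)", ("Ada", 36))

# A row that violates a column property is refused by the database.
try:
    conn.execute("INSERT INTO customers (name, age) VALUES (?, ?)", ("Bob", -5))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT id, name, age FROM customers").fetchall()
print(rows, rejected)
```

Only the valid row is stored, which is the sense in which column properties keep each field conforming to the database's requirements.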
Why SQL Knowledge Is Important for Data Scientists
Mastery of SQL is an important skill for data scientists and programmers. Analytics Vidhya, a community of analytics and data scientists, lists five benefits of learning SQL for data scientists:
- SQL is easy to learn and use, relying on a simple structure and English words.
- SQL helps data scientists understand their datasets by giving them tools to investigate and visualize the data.
- SQL integrates with Python, R, and other scripting languages, making it easier for data scientists to share data with others and to present it clearly.
- SQL accommodates the massive amounts of data that data scientists need to process. Relational databases are much more powerful than spreadsheets, for example.
- SQL proficiency is ranked higher than other programming languages by most firms hiring data scientists.
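To illustrate how SQL can be driven from a scripting language to investigate a dataset, here is a small sketch using Python's standard sqlite3 module. The orders table and its values are made up for the example; a real analysis would connect to an existing database instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0), ("west", 50.0), ("west", 50.0)],
)

# Summarize the data per region: row count, mean, and largest order.
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS n, AVG(amount) AS mean, MAX(amount) AS largest
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for region, n, mean, largest in summary:
    print(f"{region}: n={n}, mean={mean:.2f}, max={largest:.2f}")
```

The query results arrive as ordinary Python tuples, so they can be passed on to plotting or statistics libraries.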
What Can SQL Do?
The first step in querying a database is to create Data Definition Language (DDL) statements that the database will process to create objects, Data Manipulation Language (DML) statements that will modify data, and queries to retrieve the data. Oracle explains the stages in SQL query processing:
- Parsing includes a syntax check, semantic check, and shared pool check.
- Optimization generates multiple execution plans.
- Row source generation generates the query plan.
- Execution runs the query.
SQL has remained popular over decades because it meets the needs of businesses for a universal computer language that is powerful, accessible, and suitable for a variety of purposes.
Popular SQL Versions
- Oracle is ranked at the top of DB-Engines’ ranking of popular database management systems (DBMS). It requires a commercial license and is noted for its ability to cluster many transactions for faster processing.
- MySQL is an open-source relational DBMS, now developed by Oracle, that is designed to scale across databases of many types and sizes while limiting storage and processing overhead.
- Microsoft SQL Server is a commercial relational DBMS that is suitable for enterprise database needs. It is more expensive than MySQL, but it has many powerful features and is easier to use.
- PostgreSQL is an open-source product that offers features comparable to Oracle and MySQL in terms of power and the ability to accommodate various types and sizes of databases.
Types of SQL Statements
- Data Definition Language (DDL) statements are used to create, change, and drop schema objects. In some classifications, such as Oracle’s, DDL also covers granting and revoking privileges and roles, as well as analyzing information on a table, index, or cluster basis.
- Data Control Language (DCL) statements include GRANT to give users access privileges to the database, and REVOKE to withdraw the access privileges.
- Data Manipulation Language (DML) statements alter the data via commands such as INSERT, UPDATE, and DELETE.
SQL Query Process
- SQL query engines are typically made up of a storage engine, which creates, reads, and updates data between disk and memory, and a query processor, which accepts, parses, and executes SQL commands, according to OmniSci.
- Optimization engines are designed to find the most efficient execution plan for a specific query, as Hacker Noon describes. The optimizer estimates the cost of various execution plans and determines the least expensive one.
- Query dispatchers monitor traffic on the database network to direct multiple incoming session requests to a single queue, as Oracle explains.
- Classic query engines are teamed with standard SQL query engines to handle all non-SQL queries, according to SiteNXT.
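The optimizer's choice of execution plan can be observed directly in some systems. The sketch below uses SQLite's EXPLAIN QUERY PLAN statement through Python's sqlite3 module; the people table and the index name idx_people_age are hypothetical, and the exact wording of the plan text varies between SQLite versions, so treat the printed output as indicative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [(f"person{i}", 20 + i % 60) for i in range(1000)],
)

query = "SELECT name FROM people WHERE age = 42"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(query)   # no index yet: the engine must scan the whole table
conn.execute("CREATE INDEX idx_people_age ON people (age)")
plan_after = plan(query)    # with an index: the engine can search it instead

print(plan_before)
print(plan_after)
```

The same query yields a cheaper plan once the index exists, which is the kind of cost-based decision the optimization engine makes.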
How Do Data Scientists and Developers Use SQL?
SQL-generated reports are applied to solve many different kinds of business problems. SQL has the power and security features that the financial services sector and other large industries require, yet it is easy enough to learn that small and midsize businesses can benefit from using it.
These are among the uses of SQL by data scientists and developers.
Towards Data Science describes how to use basic SQL queries to pre-process data, which normalizes the data to make it easier and more efficient to query. Data normalization applies rules and criteria to database designs to create various normal forms, as SQLShack explains.
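As a rough sketch of what normalization looks like in practice, the example below splits a flat table that repeats customer details on every order into two related tables. All table names, column names, and values are hypothetical, and the design shown is only one simple way to remove the repetition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized input: the customer's city is repeated on every order.
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, city TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'Ada', 'London', 30.0),
        (2, 'Ada', 'London', 45.0),
        (3, 'Bob', 'Paris',  20.0);

    -- Normalized design: each customer is stored once and referenced by key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT UNIQUE,
        city TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        amount REAL
    );

    INSERT INTO customers (name, city)
        SELECT DISTINCT customer, city FROM raw_orders ORDER BY customer;
    INSERT INTO orders (order_id, customer_id, amount)
        SELECT r.order_id, c.customer_id, r.amount
        FROM raw_orders AS r JOIN customers AS c ON c.name = r.customer;
    """
)

customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# A join reassembles the original view of the data when it is needed.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM orders AS o JOIN customers AS c USING (customer_id)
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(customer_count, totals)
```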
The benefits of using SQL to build and run machine learning models are examined on Towards Data Science. They include strong data analytics tools that are easy to apply. They also include the ability to run algorithms inside the database, which avoids the performance cost of accessing external data sources.
Data Modeling and Mining
Many organizations are using SQL for data warehousing, as SQLShack explains. This allows them to use SQL for data mining to enhance the classification, estimation, sequencing, association, and cluster forecasting of data.
Benefits of Knowing SQL
Possessing SQL expertise helps boost the career potential for data analysts and developers. KDnuggets describes the importance of data to the relationship between buyers and sellers:
- Customers share personal information with sellers even when they don’t make a purchase.
- Sellers use that information to target offers to those customers based on what the seller has learned about them.
- Data scientists use SQL to extract as much business intelligence from customer data as possible to maximize its value to sellers.
Data scientists rely on SQL tools to determine the value of the data stored in today’s massive databases and to ensure the data is accessible to business decision-makers in a form they can use to support their product plans and strategies. This makes data scientists with SQL knowledge more valuable to businesses.
Knowing SQL offers many other benefits, including the following examples.
SQL Is a Universal Language
SQL Is Open Source and Easy to Learn and Use
SQL benefits from a large and knowledgeable community of developers who freely share their valuable SQL programming tips and techniques. The query language is less difficult to learn than C++ and other popular languages, and the variety of open-source SQL versions provides data scientists and programmers with options that meet the needs of different business situations.
SQL Accommodates Databases of All Sizes
Spreadsheets aren’t able to support the large number and size of records that are stored in a typical business database. SQL can be used efficiently no matter how small or large the database is, whether it holds 1,000 records or 100 million records. The SQL database model allows queries to process millions of rows of data with no fear of crashing the system.
SQL Skills Are in Great Demand Among Businesses
A study conducted by Indeed found that SQL was the most-requested technical programming skill of U.S. employers in 2019. SQL was listed as a requirement in 21.9% of job postings evaluated, followed by Java (20.8%) and Python (18%).
- SQLZoo offers dozens of free SQL tutorials and how-to guides that are suitable for beginners and veterans alike.
- W3Resource provides an in-depth history of SQL and an extensive tutorial that covers such topics as terminology, constructs, and database and table manipulation.
- Hacker Noon has compiled a list of seven SQL resources designed to introduce beginners to the language’s concepts and application.
What Are Common SQL Commands?
SQL commands are used to communicate with the database to create and modify tables, set user permissions, and perform other operations. There are five basic types of SQL commands, as Hackr.io explains:
- Data Definition Language (DDL)
- Data Manipulation Language (DML)
- Data Control Language (DCL)
- Transaction Control Language (TCL)
- Data Query Language (DQL)
Data Definition Language (DDL)
DDL commands define and change the structure of database objects such as tables by creating them, altering their columns, renaming them, or removing them. In many database systems, the commands are saved in the database automatically, or “auto-committed.”
The ALTER command is used primarily to add a new attribute to the database structure or to modify an existing attribute. Example uses include dropping an existing column from the table, renaming a column or table, or reducing a column’s size.
The CREATE command creates a new table by specifying the table name and column names, sizes, and properties.
The DROP command is used to delete an entire table, including all the data the table contains.
The RENAME command specifies a new name for an existing table.
The TRUNCATE command deletes all rows from a table and frees up the space that was used by the table.
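The DDL operations described above (creating, altering, renaming, and dropping a table) can be sketched with Python's sqlite3 module. The staff and employees table names are hypothetical. One caveat: SQLite renames tables with ALTER TABLE ... RENAME TO and has no TRUNCATE statement (a DELETE without a condition is its nearest equivalent), so the syntax here differs somewhat from systems such as Oracle or MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def table_names():
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [r[0] for r in rows]

def column_names(table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]

# CREATE: define a new table with its columns and properties.
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# ALTER: add a new attribute (column) to the existing structure.
conn.execute("ALTER TABLE staff ADD COLUMN email TEXT")
columns_after_alter = column_names("staff")

# RENAME: give the existing table a new name.
conn.execute("ALTER TABLE staff RENAME TO employees")
tables_after_rename = table_names()

# DROP: remove the table and all the data it contains.
conn.execute("DROP TABLE employees")
tables_after_drop = table_names()

print(columns_after_alter, tables_after_rename, tables_after_drop)
```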
Data Manipulation Language (DML)
DML commands are used to modify the database. Unlike DDL commands, they do not auto-commit, so it’s possible to undo DML commands.
The DELETE command removes one or many rows from the table. Specifying just the table name deletes all of its rows, while adding a condition removes only the rows that meet the condition, such as WHERE Name = 'Smith'.
The INSERT command is used to insert data in a table row by specifying the table name and the values that will apply to the new data, such as names, ages, and addresses. It can also be used to add data that is extracted from another table.
The UPDATE command modifies the value of a table column and applies the change to all rows or only those rows that meet a specified condition, such as those that contain a particular state code or ZIP code.
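A short sketch of the three DML operations, again using Python's sqlite3 module with hypothetical people and archive tables, may help make the conditions concrete. It inserts rows, copies some of them from one table into another, updates rows matching a state code, and deletes rows matching a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, state TEXT)")
conn.execute("CREATE TABLE archive (name TEXT, age INTEGER, state TEXT)")

# INSERT: add new rows by supplying values for each column.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Smith", 70, "NY"), ("Jones", 41, "CA"), ("Lee", 29, "CA")],
)

# INSERT ... SELECT: add data extracted from another table.
conn.execute("INSERT INTO archive SELECT * FROM people WHERE age > 65")

# UPDATE: change a column only in rows that meet the condition.
updated = conn.execute("UPDATE people SET state = 'WA' WHERE state = 'CA'").rowcount

# DELETE: remove only the rows that meet the condition.
deleted = conn.execute("DELETE FROM people WHERE name = 'Smith'").rowcount

remaining = conn.execute("SELECT name, state FROM people ORDER BY name").fetchall()
archived = conn.execute("SELECT name FROM archive").fetchall()
print(updated, deleted, remaining, archived)
```

Omitting the WHERE clause from the UPDATE or DELETE statement would apply the change to every row in the table.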
Data Control Language (DCL)
DCL commands regulate access to the data stored in a database. The commands apply various user access privileges to grant or disable access by specific users.
The GRANT command provides the user with access privileges and stipulates the actions the person is allowed to take, such as selecting and updating tables, and the ability to grant access privileges to other users.
The REVOKE command takes away a user’s access privileges. It can be used by anyone who has the privilege to grant access to others even when the grantor is not the person who created the table.
Transaction Control Language (TCL)
TCL commands are used in conjunction with DML commands to manage the changes those commands make to the database. Because DDL commands are auto-committed in many database systems, TCL commands generally can’t be used to undo creating or dropping tables.
The COMMIT command saves all transactions in the database, ending the current transaction and designating all changes made during the transaction as permanent. It also releases all locks the transaction acquired.
The ROLLBACK command undoes all changes that haven’t been saved to the database by ending the transaction and wiping out all changes made during the transaction. It also releases the locks the transaction acquired.
The SAVEPOINT command marks a point within a transaction, and ROLLBACK TO SAVEPOINT rolls back the database to a savepoint that was created beforehand. This allows only certain parts of the transaction to be saved. The savepoint must have been created since the last COMMIT or ROLLBACK command.
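The following sketch walks through committing, rolling back, and rolling back to a savepoint using Python's sqlite3 module. The accounts table and the savepoint name before_dee are hypothetical. The connection is opened with isolation_level=None so that the transaction statements are issued explicitly instead of being managed by the module.

```python
import sqlite3

# isolation_level=None disables the module's implicit transaction handling,
# so BEGIN, COMMIT, and ROLLBACK below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")

def owners():
    return [r[0] for r in conn.execute("SELECT owner FROM accounts ORDER BY owner")]

# COMMIT makes the changes in the transaction permanent.
conn.execute("BEGIN")
conn.execute("INSERT INTO accounts VALUES ('Ada', 100)")
conn.execute("COMMIT")
after_commit = owners()

# ROLLBACK discards every change made since the transaction began.
conn.execute("BEGIN")
conn.execute("INSERT INTO accounts VALUES ('Bob', 50)")
conn.execute("ROLLBACK")
after_rollback = owners()

# A savepoint lets part of a transaction be undone while the rest is kept.
conn.execute("BEGIN")
conn.execute("INSERT INTO accounts VALUES ('Cy', 10)")
conn.execute("SAVEPOINT before_dee")
conn.execute("INSERT INTO accounts VALUES ('Dee', 20)")
conn.execute("ROLLBACK TO SAVEPOINT before_dee")
conn.execute("COMMIT")
after_savepoint = owners()

print(after_commit, after_rollback, after_savepoint)
```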
Data Query Language (DQL)
DQL commands retrieve data from the database that meets the criteria specified in the SELECT command syntax.
SELECT is the lone DQL command and is used for all retrieval operations. The syntax signals where to search for the requested data by specifying the table name in the FROM clause. The WHERE clause indicates the conditions that the target data must meet to be retrieved, such as WHERE age > 65.
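A minimal retrieval example, assuming a hypothetical members table, shows the age condition mentioned above in a complete query. The threshold is passed as a bound parameter, which is a common way to keep values separate from the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (name TEXT, age INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [("Ada", 70), ("Bob", 65), ("Cy", 82), ("Dee", 30)],
)

# FROM names the table to search; WHERE keeps only rows that meet the condition.
min_age = 65
seniors = conn.execute(
    "SELECT name, age FROM members WHERE age > ? ORDER BY age DESC",
    (min_age,),
).fetchall()
print(seniors)
```

Because the comparison is strictly greater than, the member aged exactly 65 is not returned.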
SQL Command Resources
- Oracle provides an extensive directory of information about SQL commands that covers optimizing SQL statements, database tuning, and improving performance by identifying resource-intensive SQL components.
- Snowflake has compiled a SQL command reference that lists all DDL, DML, and SELECT commands alphabetically, including a summary of their operation.
- Bytescout lists the top 60 most important SQL queries in 2020, which include queries for retrieving tables, selecting columns from a table, and using a constraint to output data, among other operations.
What Are SQL Data Types?
A well-structured database begins with choosing the correct SQL data types for its columns. Data types stipulate the kind of value that can be stored in a table column. The exact names and limits vary by database system, and many of the limits listed here follow Microsoft SQL Server. They fall into six broad categories:
SQL Numeric Data Types
- bit: 0 or 1
- tinyint: 0 to 255
- smallint: -32,768 to 32,767
- int: -2,147,483,648 to 2,147,483,647
- bigint: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
- decimal: -10^38 + 1 to 10^38 - 1
- numeric: -10^38 + 1 to 10^38 - 1
- float: -1.79E+308 to 1.79E+308
- real: -3.40E+38 to 3.40E+38
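The integer limits follow from the storage width of each type. As a worked check, assuming SQL Server-style sizes (an unsigned 1-byte tinyint and signed 2-, 4-, and 8-byte smallint, int, and bigint stored in two's complement), a signed type with n bits covers -2^(n-1) through 2^(n-1) - 1:

```python
def signed_range(bits):
    """Return the (lowest, highest) value of a two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# Assumed storage widths in bits, following SQL Server's integer types.
widths = {"smallint": 16, "int": 32, "bigint": 64}
ranges = {name: signed_range(bits) for name, bits in widths.items()}

# tinyint is unsigned in SQL Server, so its 8 bits cover 0 through 2^8 - 1.
tinyint_range = (0, 2 ** 8 - 1)

print(f"tinyint: {tinyint_range[0]:,} to {tinyint_range[1]:,}")
for name, (low, high) in ranges.items():
    print(f"{name}: {low:,} to {high:,}")
```

The upper bound is always one less in magnitude than the lower bound because zero takes up one of the non-negative bit patterns.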
SQL Date and Time Data Types
- DATE: Date stored in YYYY-MM-DD format
- TIME: Time stored in HH:MI:SS format
- DATETIME: Date and time stored in YYYY-MM-DD HH:MI:SS format
- TIMESTAMP: Stores number of seconds elapsed since the UNIX epoch (‘1970-01-01 00:00:00’ UTC)
- YEAR: Year stored in two-digit format (values 70 through 69, representing 1970 to 2069) or four-digit format (1901 to 2155)
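The relationship between the YYYY-MM-DD HH:MI:SS text format and seconds since the UNIX epoch can be sketched with SQLite's date functions, called here through Python's sqlite3 module. This is an illustration under one assumption worth stating: SQLite has no dedicated DATE or TIMESTAMP column types and instead works with text or numeric values, so other systems expose the same ideas through different functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Convert a date-and-time string to seconds elapsed since 1970-01-01 00:00:00 UTC.
epoch_seconds = int(
    conn.execute("SELECT strftime('%s', '1970-01-02 00:00:00')").fetchone()[0]
)

# Convert epoch seconds back to the YYYY-MM-DD HH:MI:SS text format.
as_datetime = conn.execute("SELECT datetime(?, 'unixepoch')", (86400,)).fetchone()[0]

# Date arithmetic respects the calendar, including leap years.
shifted = conn.execute("SELECT date('2020-02-28', '+2 days')").fetchone()[0]

print(epoch_seconds, as_datetime, shifted)
```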
SQL Character and String Data Types
- CHAR: Fixed length up to 8,000 characters
- VARCHAR: Variable length up to 8,000 characters
- VARCHAR(max): Variable length storage that either defines the string size in bytes in a value from 1 through 8,000, or uses the “max” option to set a column constraint size up to 2^31-1 bytes (2 GB)
- TEXT: Variable length storage up to 2GB
SQL Unicode Character and String Data Types
- NCHAR: Fixed length up to 4,000 characters
- NVARCHAR: Variable length up to 4,000 characters
- NVARCHAR(max): Variable length storage that defines the string size in byte-pairs in values from 1 through 4,000; “max” option limits storage to 2^30-1 characters (2 GB)
- NTEXT: Variable length storage up to 2^30-1 characters (2 GB)
SQL Binary Data Types
- BINARY: Fixed length up to 8,000 bytes
- VARBINARY: Variable length up to 8,000 bytes
- VARBINARY(max): Variable length can be a value from 1 through 8,000; “max” option limits storage to 2^31-1 bytes (2GB)
- IMAGE: Variable length up to 2GB of binary data
SQL Miscellaneous Data Types
- CLOB: Character large objects up to 2GB; structures data either as a plain set within an XML tag or structured XML
- BLOB: Binary large objects, such as images and multimedia
- XML: Stores XML data either as XML instances in a column or as a variable of XML type
SQL Data Type Resources
- Microsoft’s documentation for SQL Server includes detailed descriptions of dozens of SQL data types for Transact-SQL and other SQL versions.
- SQLShack has compiled a detailed explanation of the XML data type that includes working with XML templates and efficient storage of user information.
- The Oracle Help Center describes the characteristics of and uses for dozens of Oracle Database data types.
SQL Careers
SQL expertise broadens the career options available to data scientists and software developers. These are just a sample of the professional positions that benefit from a background in SQL.
SQL Database Administrator (DBA)
As TechTarget explains, database administrators are responsible for maintaining the entire database environment, as well as:
- Ensuring that all applications have continuous, uninterrupted access to the data they need to operate
- Developing and implementing plans for deploying databases, including reliable storage and user access
- Preventing, preparing for, and responding to system outages, downtimes, and slowdowns
- Ensuring that all data resources are backed up regularly and that databases can be restored quickly in the event of a data loss
The U.S. Bureau of Labor Statistics (BLS) notes that SQL is the most common database language, so it is the language most database administrators will need to be proficient in. The BLS forecasts that jobs for database administrators will increase by 9% between 2018 and 2028, which is faster than the projected average growth for all occupations. As of May 2019, the median annual salary for database administrators was $93,750, according to BLS figures.
SQL Database Developer
In addition to developing SQL database applications, database developers also maintain, monitor, troubleshoot, and debug these applications, according to PayScale. Other responsibilities include:
- Working with clients and business analysts to ensure that database applications meet the organization’s requirements
- Analyzing database code and making recommendations for improving database performance
- Documenting all aspects of database applications to support future maintenance and upgrades
The BLS forecasts a 21% increase in the number of jobs for software developers between 2018 and 2028, which is much faster growth than the projected average for all occupations. PayScale notes that the most popular skills for database developers, after database development, are Microsoft SQL Server, SQL, ETL (extract, transform, load), and SQL Server Integration Services (SSIS). The median annual salary for database developers is about $76,000, according to PayScale.
SQL Data Analyst
Data analysts tap databases and other data sources to find information that supports the decisions made by business managers, as PayScale describes. Since the majority of business information is stored in SQL databases, the ability to use SQL is an important skill for data analysts. Data analysts typically:
- Collect customer requirements, identify technical issues, and determine the project’s data analytics needs
- Gather and analyze information relevant to specific business problems and situations
- Write reports and create charts that communicate their interpretations of the data to business managers
- Research new techniques and tools that will enhance data collection, analysis, and reporting
As expected, data analysis is the most popular skill for data analysts, according to PayScale. The next most popular skills are Microsoft Excel, SQL, database management and reporting, and statistical analysis. The median annual salary for data analysts is around $60,000, rising to about $73,000 late in their careers, according to PayScale.
SQL Data Application Interface Developer
Data application interface developers create and maintain the connections between SQL databases and other applications and systems. Their work links data that resides in relational databases with external business processes via application programming interfaces (APIs).
PayScale explains that data application interface developers typically require experience using many different programming languages in addition to SQL. Data application interface developers typically:
- Work in teams with designers, database administrators, and content creators
- Base their work on documentation, schematics, and templates designed by other team members
- Rely on a variety of programming tools, such as source code debuggers and visual development environments
- Create code that adheres to best practices and is easy for other team members to revise and build upon
The BLS reports that the median annual salary for systems software developers, the broad employment category that includes data application interface developers, was $110,000 as of May 2018. Jobs for systems software developers are forecast to increase by 10% between 2018 and 2028, according to the BLS.
Business Intelligence Administrator
Business intelligence administrators collect and organize the company’s information relating to its information technology infrastructure, systems design, applications, and cloud operations, as PayScale describes:
- They ensure the integrity and security of databases and other software.
- They conduct research to identify areas where business intelligence can be applied to improve business processes and system performance.
- They test new technologies and oversee the migration of data to new and upgraded systems.
- They monitor and maintain databases, including patch applications and design changes.
- They perform data analyses, create data visualizations, and report to business managers on the status of the company’s business intelligence operations.
PayScale reports the median annual salary of business intelligence administrators is around $85,000, while experienced business intelligence administrators earn an average of $98,000 annually. The most popular skills for business intelligence analysts, after business intelligence, are SQL, Tableau software, system administration, and Cognos software.
Cloud Database Expert
The increasing popularity of cloud databases in organizations creates a need for cloud expertise that can be shared throughout the company. Cloud database experts work as part of teams charged with promoting and implementing best practices and frameworks for the use of cloud services and technologies, as tech employment site Dice explains:
- They work in small teams composed of experts from various departments, such as network engineers, database administrators, developers, and business managers.
- They manage the transition to the cloud database model as it impacts all areas of the company.
- They develop and implement pilot projects that are ultimately expanded into production.
- They promote continuous improvement and innovation in the organization’s cloud initiatives.
According to PayScale, the median annual salary for cloud solutions architects is around $125,000, which increases to $146,000 annually late in their careers. After cloud computing, the most popular skills for cloud solutions architects are Amazon Web Services (AWS), Microsoft Azure, system architecture, and enterprise solutions.
SQL Career Resources
- DZone examines various SQL career paths and describes 14 SQL jobs and opportunities. They include SQL data scientist, big data expert, database migration engineer, and business intelligence solution architect.
- Web development service Squareboat describes the skills, responsibilities, and benefits of pursuing a career as a SQL developer.
- The Balance Careers looks at the SQL skills that are in greatest demand among employers and popular careers that require knowledge of SQL, such as back-end developer and data scientist.
The Importance of SQL to Data Scientists and Developers
There is no better way for data scientists and software/database developers to prepare for their careers than by learning SQL. The database language is not only among the most in-demand skills of employers in need of data professionals, it also serves as the springboard for gaining other valuable tech skills. Many lucrative careers in data science and software development are built on a foundation of SQL knowledge.