Read a Csv Into a Data Frame Python

CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.

Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.

  1. Load CSV files to Python Pandas
  2. 1. File Extensions and File Types
  3. 2. Data Representation in CSV files
    • Other Delimiters / Separators – TSV files
    • Delimiters in Text Fields – Quotechar
  4. 3. Python – Paths, Folders, Files
    • Finding your Python Path
    • File Loading: Absolute and Relative Paths
  5. 4. Pandas CSV File Loading Errors
  6. Advanced Read CSV Files
    • Specifying Data Types
    • Skipping and Picking Rows and Columns From File
    • Custom Missing Value Symbols
  7. CSV Format Advantages and Disadvantages
  8. Additional Reading

Load CSV files to Python Pandas

The basic procedure of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:

# Load the Pandas libraries with alias 'pd'
import pandas as pd

# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("filename.csv")

# Preview the first 5 lines of the loaded data
data.head()

While this code seems simple, an understanding of four fundamental concepts is required to fully grasp and debug the operation of the data loading procedure if you encounter issues:

  1. Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
  2. Understanding how data is represented inside CSV files – if you open up a CSV file, what does the data really look like?
  3. Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
  4. CSV data formats and errors – common errors with the function.

Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.

1. File Extensions and File Types

The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.

  1. Data is stored on your computer in individual "files", or containers, each with a different name.
  2. Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
  3. Computers determine how to read files using the "file extension", that is, the code that follows the dot (".") in the filename.
  4. So, a filename is typically in the form "<random name>.<file extension>". Examples:
    • project1.DOCX – a Microsoft Word file called Project1.
    • shanes_file.TXT – a simple text file called shanes_file
    • IMG_5673.JPG – An image file called IMG_5673.
    • Other well known file types and extensions include: XLSX: Excel, PDF: Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc.
  5. A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.

File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.

Folder with file extensions showing. Before working with CSV files, ensure that you can see your file extensions in your operating system. Different file contents are denoted by the file extension, or letters after the dot, of the file name, e.g. TXT is text, DOCX is Microsoft Word, PNG are images, CSV is comma-separated value data.

To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.

  • In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
  • In Mac OS: Open Finder > In the menu, click Finder > Preferences, click Advanced, and select the checkbox for "Show all filename extensions".

2. Data Representation in CSV files

A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.

CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.

An example table data set and the corresponding CSV-format data is shown in the diagram below.

Pandas read csv function read_csv is used to process this comma-separated file into tabular format in the Python DataFrame. Here we look at the innards of a CSV file to examine how columns are specified.
Comma-separated value files, or CSV files, are simple text files where commas and newlines are used to define tabular data in a structured fashion.

Note that nearly any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
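To make the mapping from text to table concrete, here is a minimal sketch that parses a small piece of CSV text held in a Python string. The column names and values are made up for illustration, and io.StringIO is used only so that the example does not depend on a file existing on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content: a header row, then one line per record.
csv_text = """name,age,city
Alice,34,Dublin
Bob,29,Cork
"""

# io.StringIO lets read_csv treat the string as if it were an open file.
table = pd.read_csv(io.StringIO(csv_text))

# The first line became the column names; each later line became a row.
print(table.columns.tolist())
print(table.shape)
```

Each comma in the text marks a column boundary and each newline starts a new row, which is all the structure a CSV file carries.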

Other Delimiters / Separators – TSV files

The comma separation scheme is by far the most popular method of storing tabular data in text files.

The choice of the ',' comma character to delimit columns, however, is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.

When loading data with Pandas, the read_csv function is used for reading any delimited text file; you change the delimiter using the sep parameter.
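As a quick sketch of the sep parameter, the snippet below reads the same (made-up) table once from tab-separated text and once from semicolon-separated text. The strings stand in for the contents of real files.

```python
import io

import pandas as pd

# The same hypothetical table, written with two different delimiters.
tsv_text = "name\tage\nAlice\t34\nBob\t29\n"
ssv_text = "name;age\nAlice;34\nBob;29\n"

# sep tells read_csv which character separates the columns.
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_data = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls should produce the same DataFrame.
print(tsv_data.equals(ssv_data))
```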

Delimiters in Text Fields – Quotechar

One complication in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to enclose these fields.

The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation marks ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.

In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.

Other than commas in CSV files, Tab-separated and Semicolon-separated data is popular also. Quote characters are used if the data in a column may contain the separating character. In this case, the 'NickName' column contains semicolon characters, and so this column is "quoted". Specify the separator and quote character in pandas.read_csv.
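The quoting behaviour can be sketched with a few lines of semicolon-delimited text. The names and nicknames here are invented for illustration; only the 'NickName' column name comes from the example above.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data. The NickName values are wrapped in
# double quotes because one of them contains a semicolon.
quoted_text = (
    "Name;NickName;Age\n"
    'Shane;"Shano; The Boss";30\n'
    'Mary;"Mar";28\n'
)

# Semicolons inside the quote characters are not treated as separators.
people = pd.read_csv(io.StringIO(quoted_text), sep=";", quotechar='"')

print(people.shape)
print(people.loc[0, "NickName"])
```

Without the quotes around the first nickname, that line would contain four fields instead of three and the load would fail or misalign.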

3. Python – Paths, Folders, Files

When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.

Pandas searches your 'current working directory' for the filename that you specify when opening or loading files. The FileNotFoundError can be due to a misspelled filename, or an incorrect working directory.

Finding your Python Path

Your Python path can be displayed using the built-in os module. The os module provides operating system dependent functionality to Python programs and scripts.

To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

# Find out your current working directory
import os
print(os.getcwd())
# Out: /Users/shane/Documents/blog

# Display all of the files found in your current working directory
print(os.listdir(os.getcwd()))
# Out: ['test_delimted.ssv', 'CSV Weblog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.

Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
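A minimal sketch of changing the working directory is shown below. To keep it runnable anywhere, it creates a throwaway directory and a small 'test_data.csv' file first; in practice you would pass the path of your own data folder to os.chdir().

```python
import os
import tempfile

import pandas as pd

original_dir = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as data_dir:
    # Create a small CSV file in the temporary directory (illustrative data).
    with open(os.path.join(data_dir, "test_data.csv"), "w") as f:
        f.write("id,value\n1,10\n2,20\n")

    os.chdir(data_dir)  # make the data directory the working directory
    try:
        found_files = os.listdir()            # lists the new working directory
        data = pd.read_csv("test_data.csv")   # a bare filename now resolves
    finally:
        os.chdir(original_dir)  # always switch back afterwards

print(found_files)
print(data.shape)
```

Switching back in a finally block is a design choice for this sketch: it avoids leaving the process in an unexpected directory if the load fails.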

File Loading: Absolute and Relative Paths

When specifying file names to the read_csv function, you can supply either absolute or relative file paths.

  • A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
  • An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux)

It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.

Absolute vs relative file paths.
Loading the same file with Pandas read_csv using relative and absolute paths. Relative paths are directions to the file starting at your current working directory, where absolute paths always start at the base of your file system.
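The difference between the two kinds of path can be sketched as follows. The directory layout (a 'data' subdirectory holding 'test_file.csv') mirrors the example path above, but it is created in a temporary location here so the code runs on any machine.

```python
import os
import tempfile

import pandas as pd

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as base_dir:
    # Build <base_dir>/data/test_file.csv with some illustrative content.
    os.makedirs(os.path.join(base_dir, "data"))
    with open(os.path.join(base_dir, "data", "test_file.csv"), "w") as f:
        f.write("id,value\n1,10\n")

    os.chdir(base_dir)
    try:
        # Relative: interpreted starting from the current working directory.
        relative_path = os.path.join("data", "test_file.csv")
        # Absolute: the full path from the base of the file system.
        absolute_path = os.path.abspath(relative_path)

        from_relative = pd.read_csv(relative_path)
        from_absolute = pd.read_csv(absolute_path)
    finally:
        os.chdir(original_dir)

print(os.path.isabs(relative_path), os.path.isabs(absolute_path))
print(from_relative.equals(from_absolute))
```

Both calls load the same file; only the relative form would keep working if the whole project folder were copied to another computer.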

4. Pandas CSV File Loading Errors

The most common errors you'll get while loading data from CSV files into Pandas will be:

  1. FileNotFoundError: File b'filename.csv' does not exist
    A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
  2. UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
    A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
  3. pandas.parser.CParserError: Error tokenizing data.
    Parse Errors can be caused in unusual circumstances to do with your data format – try to add the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
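The first two errors can be reproduced and handled in a short sketch. The file names and contents are hypothetical, and passing the encoding parameter to read_csv is shown as an alternative to re-saving the file in a text editor; it assumes you know (or can guess) the encoding the file was written with.

```python
import os
import tempfile

import pandas as pd

# 1. A missing file raises FileNotFoundError, which can be caught.
try:
    pd.read_csv("this_file_does_not_exist.csv")
    missing_file_caught = False
except FileNotFoundError:
    missing_file_caught = True

# 2. A file saved as Latin-1 fails to decode with the default UTF-8 setting.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "latin1_data.csv")
    with open(path, "w", encoding="latin-1") as f:
        # \u00e9 is an accented 'e', stored as a single byte in Latin-1.
        f.write("name,city\nRen\u00e9e,Orl\u00e9ans\n")

    try:
        pd.read_csv(path)
        decode_error_caught = False
    except UnicodeDecodeError:
        decode_error_caught = True

    # Telling read_csv the actual encoding lets the same file load.
    data = pd.read_csv(path, encoding="latin-1")

print(missing_file_caught, decode_error_caught)
print(data.loc[0, "city"])
```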

Advanced Read CSV Files

There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:

Specifying Data Types

As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.

Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters. Which of these are available depends on your Pandas version; recent releases deprecate date_parser in favour of date_format.
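Here is a small sketch of what the dtype parameter changes. The 'zip_code' column is an invented example of a value that looks numeric but should stay as text; the 'name' and 'age' columns follow the dictionary shown above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: zip codes with leading zeros.
csv_text = "name,age,zip_code\nAlice,34,00123\nBob,29,04567\n"

# With inference, zip_code is read as an integer and the leading zeros vanish.
inferred = pd.read_csv(io.StringIO(csv_text))

# With explicit types, zip_code stays a string and age uses a 32-bit integer.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32, "zip_code": str},
)

print(inferred.loc[0, "zip_code"])
print(typed.loc[0, "zip_code"])
print(typed["age"].dtype)
```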

Skipping and Picking Rows and Columns From File

The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Finally, the usecols parameter can be used to specify which columns in the data to load.
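The three parameters can be compared side by side on a tiny made-up table. Note that in this sketch the row indices given to skiprows count lines of the file starting at zero, so line 0 is the header row.

```python
import io

import pandas as pd

# Hypothetical file contents: a header line plus four data lines.
csv_text = "id,name,score\n1,Ann,90\n2,Ben,85\n3,Cal,70\n4,Dee,95\n"

# nrows: read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# skiprows with a list: drop file lines 1 and 2 (line 0, the header, is kept).
without_rows = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# usecols: load only the named columns.
two_columns = pd.read_csv(io.StringIO(csv_text), usecols=["id", "score"])

print(first_two["id"].tolist())
print(without_rows["id"].tolist())
print(two_columns.columns.tolist())
```

Passing an int to skiprows instead skips that many lines from the very top of the file, which includes the header line unless column names are handled separately.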

Custom Missing Value Symbols

When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN include: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

# Advanced CSV loading example
data = pd.read_csv(
    "data/files/complex_data_example.tsv",     # relative python path to subdirectory
    sep='\t',                                  # Tab-separated value file.
    quotechar="'",                             # single quote allowed as quote character
    dtype={"salary": int},                     # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified.
    parse_dates=['birth_date'],                # Interpret the birth_date column as a date
    skiprows=10,                               # Skip the first 10 rows of the file
    na_values=['.', '??']                      # Take any '.' or '??' values as NA
)
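To see the na_values parameter on its own, here is a self-contained sketch using in-memory text rather than a file. The data is invented; the '.' and '??' tokens are the same custom markers used in the example above.

```python
import io

import pandas as pd

# Hypothetical export where missing salaries appear as '??', '.' or 'NA'.
csv_text = "name,salary\nAlice,50000\nBob,??\nCara,.\nDan,NA\n"

# '.' and '??' are added to the tokens already treated as missing by default.
data = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

print(data["salary"].isna().sum())
print(data["salary"].dtype)
```

Because NaN is a floating point value, a numeric column that contains missing entries is loaded as float64 rather than as an integer type.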

CSV Format Advantages and Disadvantages

As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will run into as you load, store, and exchange data in CSV format:

On the plus side:

  • CSV format is universal and the data can be loaded by almost any software.
  • CSV files are simple to understand and debug with a basic text editor
  • CSV files are quick to create and load into memory before analysis.

However, the CSV format has some negative sides:

  • There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
  • There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
  • File encodings can become a problem if there are non-ASCII compatible characters in text fields.
  • CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find however that your CSV data compresses well using zip compression.
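Two of these points, the loss of type information and the use of compression, can be seen in a short round trip. This is only a sketch: the DataFrame contents are made up, and gzip is used here as one convenient compression format that Pandas picks up from the '.gz' file extension, rather than zip specifically.

```python
import os
import tempfile

import pandas as pd

# A small hypothetical DataFrame with a proper datetime column.
frame = pd.DataFrame(
    {
        "name": ["Alice", "Bob"],
        "joined": pd.to_datetime(["2018-07-01", "2018-07-15"]),
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.csv.gz")

    # Write the DataFrame to CSV; compression is inferred from the extension.
    frame.to_csv(path, index=False)

    # Reading it back without hints: the dates return as plain text.
    plain = pd.read_csv(path)

    # Reading it back with parse_dates restores the datetime type.
    parsed = pd.read_csv(path, parse_dates=["joined"])

print(pd.api.types.is_datetime64_any_dtype(plain["joined"]))
print(pd.api.types.is_datetime64_any_dtype(parsed["joined"]))
```

The CSV file itself never records that 'joined' was a date, so that knowledge has to be supplied again every time the file is loaded.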

As an aside, in an attempt to counter some of these disadvantages, two prominent data science developers in both the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.

Additional Reading

  1. Official Pandas documentation for the read_csv function.
  2. Python 3 Notes on file paths, working directories, and using the OS module.
  3. Datacamp Tutorial on loading CSV files, including some additional OS commands.
  4. PythonHow Loading CSV tutorial.

Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/
