Python | Pandas Basics | Pandas DataFrames | Pandas Series

What is Pandas in Python?

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

To use pandas, you’ll normally begin with the following line of code:

import pandas as pd

Creating data

There are two core objects in pandas: the DataFrame and the Series.

DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a specific value. Each entry corresponds to a row (or record) and a column.

For instance, consider the following simple DataFrame:

pd.DataFrame({'Yes': [10, 20], 'No': [30, 40]})

Output

In this example, the “0, No” entry has a value of 30. The “0, Yes” entry has a value of 10, and so on.
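
As a minimal sketch of how entries are addressed, the snippet below rebuilds the small table and reads individual entries by row label and column name. The variable names and the use of the `.loc` accessor are illustrative choices, not something the text above prescribes.

```python
import pandas as pd

# Rebuild the small example table from above.
df = pd.DataFrame({'Yes': [10, 20], 'No': [30, 40]})

# Each entry sits at the intersection of a row label and a column name.
entry_0_no = df.loc[0, 'No']    # row 0, column 'No'
entry_0_yes = df.loc[0, 'Yes']  # row 0, column 'Yes'
entry_1_no = df.loc[1, 'No']    # row 1, column 'No'

print(entry_0_no, entry_0_yes, entry_1_no)  # prints: 30 10 40
```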

DataFrame entries are not restricted to integers. For example, here’s a DataFrame whose values are strings:

pd.DataFrame({'John': ['I like Python.', 'It is easy.'], 'Joseph': ['Pretty good.', 'Bland.']})

Output

We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (John and Joseph in this example) and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-of-lists constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is fine, but often we will want to assign these labels ourselves.
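
To make the default labelling concrete, here is a small sketch that inspects the row and column labels of the string DataFrame from above. The variable names are placeholders chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'John': ['I like Python.', 'It is easy.'],
                   'Joseph': ['Pretty good.', 'Bland.']})

# With no index supplied, pandas numbers the rows 0, 1, 2, ...
row_labels = list(df.index)

# The dictionary keys become the column labels.
column_labels = list(df.columns)

print(row_labels)     # prints: [0, 1]
print(column_labels)  # prints: ['John', 'Joseph']
```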

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

pd.DataFrame({'John': ['It is an interactive language.', 'It is object-oriented.'],
              'Joseph': ['Pretty easy.', 'It is class based.']},
             index=['Python', 'Java'])

Output
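
The following sketch shows what the custom index buys us: the row labels become the strings we supplied, and a row label plus a column name identifies one entry. The variable name `reviews` and the `.loc` lookup are assumptions made for this example.

```python
import pandas as pd

reviews = pd.DataFrame(
    {'John': ['It is an interactive language.', 'It is object-oriented.'],
     'Joseph': ['Pretty easy.', 'It is class based.']},
    index=['Python', 'Java'],
)

# The row labels are now the strings passed as index.
print(list(reviews.index))  # prints: ['Python', 'Java']

# Look up a single entry by row label and column name.
java_joseph = reviews.loc['Java', 'Joseph']
print(java_joseph)  # prints: It is class based.
```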

Series

A Series, on the other hand, is a sequence of data values. If a DataFrame is a table, a Series is a list. In fact, you can create one with nothing more than a list:

pd.Series([1, 2, 3, 4, 5])

Output

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

pd.Series([30, 25, 40], index=['2018 Sales', '2019 Sales', '2020 Sales'], name='Watch')

Output
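
A short sketch of how the labels and the name of that Series can be used once it exists. The variable name `sales` is a placeholder introduced here.

```python
import pandas as pd

sales = pd.Series([30, 25, 40],
                  index=['2018 Sales', '2019 Sales', '2020 Sales'],
                  name='Watch')

# The Series carries a single overall name rather than a column name.
print(sales.name)  # prints: Watch

# Values can be looked up by the labels supplied through index.
sales_2019 = sales['2019 Sales']
print(sales_2019)  # prints: 25
```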

The Series and the DataFrame are intimately related. It’s helpful to think of a DataFrame as actually being just a bunch of Series “glued together”.
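
One way to see the "glued together" picture is to build a DataFrame from two named Series and then pull one column back out. This is only an illustrative sketch: the `phone` figures are made up, and `pd.concat` with `axis=1` is just one of several ways to combine Series into a table.

```python
import pandas as pd

labels = ['2018 Sales', '2019 Sales', '2020 Sales']
watch = pd.Series([30, 25, 40], index=labels, name='Watch')
phone = pd.Series([50, 45, 60], index=labels, name='Phone')  # hypothetical data

# Place the two Series side by side; their names become the column names.
table = pd.concat([watch, phone], axis=1)
print(list(table.columns))  # prints: ['Watch', 'Phone']

# Selecting one column gives back a Series with the same values and labels.
column = table['Watch']
print(type(column).__name__)  # prints: Series
print(column.equals(watch))   # prints: True
```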

Reading data files

Being able to create a DataFrame or Series by hand is handy. However, most of the time we won’t be creating our own data by hand. Instead, we’ll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file, you get something that looks like this:

Product A,Product B,Product C
30,21,9
35,34,1
41,11,11

So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.
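
As a rough sketch, the sample values can be parsed without touching the disk by wrapping the text in `io.StringIO`, which `pd.read_csv()` accepts in place of a file path. The text is written here without trailing commas, and the variable names are placeholders.

```python
import io
import pandas as pd

# The sample CSV content as an in-memory string (no trailing commas).
csv_text = (
    "Product A,Product B,Product C\n"
    "30,21,9\n"
    "35,34,1\n"
    "41,11,11\n"
)

# read_csv treats the first line as the header and the rest as rows.
products = pd.read_csv(io.StringIO(csv_text))

print(list(products.columns))  # prints: ['Product A', 'Product B', 'Product C']
print(products.shape)          # prints: (3, 3)
```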

Let’s now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We’ll use the pd.read_csv() function to read the data into a DataFrame, like so:

data = pd.read_csv("C:/Users/user/Desktop/transaction_data (2).csv")

We can use the shape attribute to check how large the resulting DataFrame is:

data.shape

So our new DataFrame has 100,000 records split across 8 different columns. That’s 800,000 entries!
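
To illustrate how shape relates to the number of entries, here is a sketch on a tiny, hypothetical stand-in for the real dataset. The `size` attribute reports the total number of entries, which equals rows times columns.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 4 rows, 3 columns.
small = pd.DataFrame({'a': [1, 2, 3, 4],
                      'b': [5, 6, 7, 8],
                      'c': [9, 10, 11, 12]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = small.shape
print(n_rows, n_cols)  # prints: 4 3

# size is the total number of entries, i.e. rows * columns.
print(small.size)      # prints: 12
```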

We can examine the contents of the resulting DataFrame using the head() method, which grabs the first five rows:

data.head()
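
A small sketch of head() on a hypothetical ten-row DataFrame. It also shows that head() accepts an optional row count when something other than the default of five is wanted.

```python
import pandas as pd

# Hypothetical data: a single column holding the numbers 0 through 9.
numbers = pd.DataFrame({'n': range(10)})

first_five = numbers.head()   # default: first five rows
first_two = numbers.head(2)   # explicit row count

print(list(first_five['n']))  # prints: [0, 1, 2, 3, 4]
print(list(first_two['n']))   # prints: [0, 1]
```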

The pd.read_csv() function is feature-rich, with more than 30 optional parameters you can specify. For instance, you can see that the CSV file in this dataset has a built-in index, which pandas did not pick up automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col:

data = pd.read_csv("C:/Users/user/Desktop/transaction_data (2).csv", index_col=0)
data.head()

Output
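
The effect of index_col can be sketched on a tiny in-memory CSV. The content below is hypothetical (it is not the transaction dataset); it only mimics a file whose first, unnamed column holds a saved index.

```python
import io
import pandas as pd

# Hypothetical CSV: the first column has no header and stores an index.
csv_text = ",UserId,ItemDescription\n0,101,Mug\n1,102,Lamp\n"

# Without index_col, the saved index shows up as an extra data column.
without_index = pd.read_csv(io.StringIO(csv_text))
print(list(without_index.columns))  # prints: ['Unnamed: 0', 'UserId', 'ItemDescription']

# With index_col=0, the first column becomes the row labels instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(with_index.columns))     # prints: ['UserId', 'ItemDescription']
```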

Native accessors

In Python, we can access the property of an object by accessing it as an attribute. A book object, for instance, may have a title property, which we can access with book.title. Columns in a pandas DataFrame work in much the same way.

So, to access the ItemDescription column of data, we can use:

data.ItemDescription

Output

If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:

data['ItemDescription']

Output

These are the two ways of selecting a specific Series out of a DataFrame. Neither is more or less syntactically valid than the other, but the indexing operator [] has the advantage that it can handle column names with reserved characters in them (for example, a column named Item Description could not be accessed as data.Item Description).
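
The two access styles can be compared side by side in a short sketch. The data and the `Unit Price` column are hypothetical, chosen only to show a column name containing a space.

```python
import pandas as pd

# Hypothetical data; 'Unit Price' contains a space on purpose.
items = pd.DataFrame({'ItemDescription': ['Mug', 'Lamp'],
                      'Unit Price': [4.5, 19.0]})

# Attribute access and bracket access return the same Series.
by_attribute = items.ItemDescription
by_brackets = items['ItemDescription']
print(by_attribute.equals(by_brackets))  # prints: True

# Writing items.Unit Price would be a syntax error,
# so a name with a space has to go through the [] operator.
prices = items['Unit Price']
print(list(prices))  # prints: [4.5, 19.0]
```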

Indexing in Pandas

The indexing operator and attribute selection are nice because they work just as they do in the rest of the Python ecosystem. For a beginner, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you should use.

Index-based selection

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we can use the following:

data.iloc[0]

Output

Both loc and iloc are row-first, column-second. This is the opposite of the attribute and [] accessors shown earlier, which select a column first and only then a row.

This means that it’s marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:

data.iloc[:,0]

Output
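
Here is a minimal sketch of the row-first, column-second convention on a small, hypothetical table (the values are invented; only the column names echo the dataset discussed here).

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({'UserId': [101, 102, 103],
                       'ItemDescription': ['Mug', 'Lamp', 'Desk']})

# A single position selects a whole row (returned as a Series).
first_row = orders.iloc[0]

# ':' in the row slot means "all rows"; 0 then picks the first column.
first_column = orders.iloc[:, 0]

# Row position first, column position second.
single_entry = orders.iloc[1, 1]

print(first_row['ItemDescription'])  # prints: Mug
print(list(first_column))            # prints: [101, 102, 103]
print(single_entry)                  # prints: Lamp
```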

On its own, the : operator, which also comes from native Python, means “everything”. When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the userId column from just the first, second, and third rows, we would do:

data.iloc[:3,0]

Output

Or, to select just the second and third entries, we would do:

data.iloc[1:3,0]

Output

It’s also possible to pass a list:

data.iloc[[0,1,2],0]

Output
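
The three selections above can be checked on a small, hypothetical column of values. Note that, as with Python slices, the end of an iloc range is excluded.

```python
import pandas as pd

# Hypothetical data: five rows, one column of interest in position 0.
orders = pd.DataFrame({'UserId': [101, 102, 103, 104, 105],
                       'Quantity': [1, 2, 3, 4, 5]})

first_three = orders.iloc[:3, 0]         # rows 0, 1, 2
second_and_third = orders.iloc[1:3, 0]   # rows 1, 2 (3 is excluded)
picked = orders.iloc[[0, 1, 2], 0]       # explicit list of positions

print(list(first_three))       # prints: [101, 102, 103]
print(list(second_and_third))  # prints: [102, 103]
print(list(picked))            # prints: [101, 102, 103]
```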

Finally, it’s worth knowing that negative numbers can be used in selection. A negative number counts back from the end of the values, so -5: starts at the fifth row from the end and runs forward to the last one. For example, here are the last five rows of the dataset:

data.iloc[-5:]

Output
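
A brief sketch of negative positions on a hypothetical ten-row DataFrame:

```python
import pandas as pd

# Hypothetical data: a single column holding the numbers 0 through 9.
values = pd.DataFrame({'n': range(10)})

# -5: starts five rows from the end and runs to the last row.
last_five = values.iloc[-5:]

# -1 on its own selects the final row.
last_row = values.iloc[-1]

print(list(last_five['n']))  # prints: [5, 6, 7, 8, 9]
print(last_row['n'])         # prints: 9
```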

Manipulating the index

Label-based selection, the second paradigm (followed by loc), gets its power from the labels in the index. Importantly, the index we use is not immutable. We can manipulate the index in any way we see fit.

The set_index() method can be used to do the job. Here is what happens when we call set_index on the ItemDescription field:

data.set_index("ItemDescription")

Output
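
One detail worth sketching: set_index() returns a new DataFrame and, by default, leaves the original untouched. The data below is hypothetical, and the `.loc` lookup is included only to show the new labels in use.

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({'UserId': [101, 102, 103],
                       'ItemDescription': ['Mug', 'Lamp', 'Desk']})

# set_index returns a new DataFrame indexed by the chosen column.
by_item = orders.set_index('ItemDescription')

# The original still has ItemDescription as an ordinary column.
print(list(orders.columns))   # prints: ['UserId', 'ItemDescription']
print(list(by_item.columns))  # prints: ['UserId']

# Rows can now be looked up by item description.
lamp_user = by_item.loc['Lamp', 'UserId']
print(lamp_user)              # prints: 102
```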

Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

data['Country'] = 'UNITED STATES'
data['Country']

Output

Or an iterable of values:

data['index_backwards'] = range(len(data), 0, -1)
data['index_backwards']

Output
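
Both kinds of assignment can be sketched on a small, hypothetical DataFrame. A constant is broadcast to every row, while an iterable must supply exactly one value per row.

```python
import pandas as pd

# Hypothetical data: three rows.
orders = pd.DataFrame({'UserId': [101, 102, 103]})

# A constant value is repeated for every row.
orders['Country'] = 'UNITED STATES'

# An iterable needs one value per row; here it counts down from len(orders).
orders['index_backwards'] = range(len(orders), 0, -1)

print(list(orders['Country']))          # prints: ['UNITED STATES', 'UNITED STATES', 'UNITED STATES']
print(list(orders['index_backwards']))  # prints: [3, 2, 1]
```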