Data Analysis 2: Analyzing Tabular Data with Pandas

Can Gulmez
4 min read · Dec 29, 2021

Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Hi everyone! As you may remember, I previously wrote a post about numerical computing. Now, in the second post of the series, I’m going to talk about analyzing tabular data with Pandas. You can find everything you need about data analysis on my GitHub page, so I’m putting my GitHub page here.

Pandas is used for analyzing and manipulating tabular data, so it is as useful as NumPy.

Datasets usually come in Excel or CSV format, and the pandas.read_excel("file.xlsx") and pandas.read_csv("file.csv") functions read Excel and CSV datasets, respectively, for data analysis or machine learning tasks. You have probably heard of Excel, but you may not know the CSV format. A CSV (comma-separated values) file is a delimited text file that uses commas to separate values.
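Here is a minimal sketch of reading a CSV. The column names and values are made up for illustration, and an in-memory string stands in for a real "file.csv" so the snippet runs without any file on disk. pandas.read_excel is called the same way with a path to an .xlsx file, though it typically needs an extra package such as openpyxl installed.

```python
import io

import pandas as pd

# Hypothetical CSV content, used here instead of a real "file.csv" on disk.
csv_text = """name,city,age,salary
Ada,Izmir,34,5200
Bora,Ankara,28,4100
Cem,Izmir,45,6100
"""

# read_csv accepts a file path or any file-like object;
# io.StringIO makes the string behave like an open text file.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

With a real file you would simply pass its path, for example pd.read_csv("file.csv").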

If you load an Excel, CSV, or other file this way and check its type, you will notice that it is pandas.core.frame.DataFrame. DataFrame is a special object type that offers many handy functions and attributes. Let’s look at some of them:

  • The .describe() function shows summary statistics for the numerical columns.
  • The .columns attribute shows the dataset’s column labels.
  • The .index attribute shows the dataset’s index information.
  • The .info() function gives some basic information about rows, columns, data types, etc.
  • The .shape attribute (no parentheses) tells us the shape of our data as (rows, columns).
  • The .head() function shows the first rows (five by default).
  • The .tail() function shows the last rows (five by default).
  • The .sample() function shows randomly chosen rows.
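To make the list above concrete, here is a small sketch on a toy DataFrame. The data is hypothetical and only there so each call has something to show. Note which ones are called with parentheses and which are plain attributes.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "name": ["Ada", "Bora", "Cem", "Deniz", "Ece", "Fatih"],
    "city": ["Izmir", "Ankara", "Izmir", "Ankara", "Bursa", "Izmir"],
    "age": [34, 28, 45, 39, 23, 51],
    "salary": [5200, 4100, 6100, 5600, 3900, 7000],
})

print(type(df))        # the DataFrame type mentioned above
print(df.describe())   # count, mean, std, min, quartiles, max of numeric columns
print(df.columns)      # attribute: column labels
print(df.index)        # attribute: row index
df.info()              # prints column types and non-null counts
print(df.shape)        # attribute: (rows, columns)
print(df.head(2))      # first 2 rows
print(df.tail(2))      # last 2 rows
print(df.sample(2, random_state=0))  # 2 random rows; the seed makes it repeatable
```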

These functions tell us a lot, but we often want to select specific values or a specific range to analyze the data more easily. For this you can use the .iloc indexer, which selects rows and columns by integer position using square brackets rather than parentheses. You can find how .iloc is used in my GitHub repository.
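As a quick illustration of position-based selection with .iloc, here is a sketch on made-up data. Positions start at 0 and the end of a slice is excluded, just like Python lists.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "name": ["Ada", "Bora", "Cem", "Deniz"],
    "age": [34, 28, 45, 39],
    "salary": [5200, 4100, 6100, 5600],
})

first_row = df.iloc[0]         # first row, returned as a Series
first_two_rows = df.iloc[0:2]  # rows at positions 0 and 1
cell = df.iloc[1, 2]           # row position 1, column position 2
block = df.iloc[1:3, 0:2]      # rows 1-2 and columns 0-1
last_row = df.iloc[-1]         # negative positions count from the end

print(first_two_rows)
print(cell)
print(block)
```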

Sometimes the dataset has gaps, known as NaN (missing values). In my opinion, you should delete these, because if you are working on machine learning or AI in general, many algorithms cannot handle missing values and the system you build will raise errors instead of running. .dropna() is the suitable function. Apart from this, if you want to delete anything else in a DataFrame, such as rows or columns, you should use the .drop() function.
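Here is a small sketch of both functions on hypothetical data with a couple of missing values. Both .dropna() and .drop() return a new DataFrame by default and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (NaN).
df = pd.DataFrame({
    "name": ["Ada", "Bora", "Cem", "Deniz"],
    "age": [34, np.nan, 45, 39],
    "salary": [5200, 4100, np.nan, 5600],
})

# Drop every row that contains at least one NaN.
clean = df.dropna()

# Drop a column by label.
no_salary = df.drop(columns=["salary"])

# Drop a row by its index label.
without_first = df.drop(index=[0])

print(clean)
print(no_salary)
print(without_first)
```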

Of course, you can apply functions like .sum(), .max(), .min(), and .mean() to your data, just as in NumPy.
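A short sketch of these aggregations, again on made-up numbers. They work on a single column or on several columns at once.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "age": [34, 28, 45, 39],
    "salary": [5200, 4100, 6100, 5600],
})

total_salary = df["salary"].sum()   # sum of one column
oldest = df["age"].max()            # largest value in a column
youngest = df["age"].min()          # smallest value in a column
mean_age = df["age"].mean()         # average of a column

# Calling the function on the whole DataFrame works column by column.
column_means = df.mean()

print(total_salary, oldest, youngest, mean_age)
print(column_means)
```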

Apart from that, we may want to sort our values. Pandas provides two functions for this: .sort_values() sorts by the values in one or more columns, and .sort_index() sorts by the index.
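Here is a minimal sketch of both sorting functions on hypothetical data. Sorting by values keeps the original index labels attached to each row, which is why .sort_index() can restore the original order afterwards.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "name": ["Ada", "Bora", "Cem", "Deniz"],
    "age": [34, 28, 45, 39],
    "salary": [5200, 4100, 6100, 5600],
})

# Sort rows by a column, ascending by default.
by_age = df.sort_values("age")

# Sort in descending order.
by_salary_desc = df.sort_values("salary", ascending=False)

# Sort by index labels, which brings by_age back to the original order.
restored = by_age.sort_index()

print(by_age)
print(by_salary_desc)
print(restored)
```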

Lastly, I want to share four important functions with you.

  • Sometimes we want to look at the data from a different point of view. In this case, I recommend grouping your data with the .groupby() function.
  • If you have more than one dataset and want to combine them, Pandas provides a function for this: pandas.concat(), which stacks datasets along rows or columns.
  • You can also save the new datasets you create in CSV format. In this case, use the .to_csv() function.
  • Pandas is a very broad library, and thanks to the .plot() function (which relies on Matplotlib) you can even visualize your data.
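The functions above can be sketched together on two small, made-up datasets. Plotting is left as a comment because it needs Matplotlib installed, and .to_csv() is called without a path so it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Two hypothetical datasets with the same columns.
team_a = pd.DataFrame({
    "name": ["Ada", "Bora"],
    "city": ["Izmir", "Ankara"],
    "salary": [5200, 4100],
})
team_b = pd.DataFrame({
    "name": ["Cem", "Deniz"],
    "city": ["Izmir", "Ankara"],
    "salary": [6100, 5600],
})

# Stack the two datasets on top of each other and renumber the rows.
combined = pd.concat([team_a, team_b], ignore_index=True)

# Group rows by city and compute the mean salary of each group.
mean_by_city = combined.groupby("city")["salary"].mean()

# With no path given, to_csv returns the CSV as a string.
# Pass a path such as "combined.csv" (a made-up name) to write a file instead.
csv_text = combined.to_csv(index=False)

print(combined)
print(mean_by_city)
print(csv_text)

# Visualization (requires Matplotlib):
# combined.plot(x="name", y="salary", kind="bar")
```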

I’ve tried to cover Pandas and its functions. Of course, if you want to learn more, you can look at my GitHub repository.
