top of page
Search

MTA SUBWAY TRAIN DELAYS 2020 - 2024 PROJECT

  • okanhasturkk
  • May 11
  • 4 min read

Updated: Jun 24


This is my Tableau project using real data from NY State data.




My motivation for choosing this dataset is clear. First, this is a real-time raw dataset, which I prefer to work with. Second, as in my previous projects, it is dirty data and requires cleaning and standardization. Third and most important for this project is this dataset includes time and allows to build visualization for Tableau, which is the main reason and focus here. My focus here is to analyze MTA Subway Delays during 2024 by every aspect and time intervals as possible.


In compare to the other datasets, this dataset does not require much data cleaning. The cleaning that I made is as follows.


  • Within the LINE column dots were added to the rows that are “S Fkln” and “S Rock”.

  • Within the SUBCATEGORY column spaces were added to the rows “Door-Related” and “Sick/Injured”.


Those 2 cleaning were made in Excel before importing the data to the Tableau. The only cleaning that was done within Tableau is as follows:


Updated the name of the column from 'month' to 'date' to prevent confusion because of the often usage of month as an element of time as dimension.
Updated the name of the column from 'month' to 'date' to prevent confusion because of the often usage of month as an element of time as dimension.

This was all the cleaning and standardization that was required hence the remaining data was extremely clean.


In my EDA (Exploratory Data Analysis), the goal was to the include all the elements of the dataset, which refers to the columns. The only update that I wanted to do but could not realize was converting the weekdays and weekends to days because the data was registered only under Sunday and Monday, that is why they are going to be referenced as they are in this dataset as Weekday and Weekend.


Because I was focused on the year 2024 and this dataset encapsulates a time interval between 2020 and 2024, I had to create a filter which will enable to focus only on the last year, which I generated through a calculated field that can be seen by the Tables.


On the left side of the equation, I am extracting the YEAR from the 'Date' column, and I am going to equal the separated YEAR to MAX value of the 'Date' which is 2024 on this dataset. The use of brackets on the right side of the equation is that it specifies it as a special value, which is 2024 on our case, and then we can set our filter for 2024.
On the left side of the equation, I am extracting the YEAR from the 'Date' column, and I am going to equal the separated YEAR to MAX value of the 'Date' which is 2024 on this dataset. The use of brackets on the right side of the equation is that it specifies it as a special value, which is 2024 on our case, and then we can set our filter for 2024.
The Final step of our filter. I add the calculation to the filter section and choose 'True' for it to filter the data. The condition has been achieved.
The Final step of our filter. I add the calculation to the filter section and choose 'True' for it to filter the data. The condition has been achieved.

After establishing my filter, I go forward with my first visualization, which is total number delays in year 2024 by month and by division. The division column represents numbered subway lines as A and lettered subway names as B. In the dataset, these divisions are represented by A and B. I updated these to numbered and lettered subway lines by editing those aliases under the division tab under Marks for better clarity.


Not only I edited the aliases, I sorted the divisions by color and added the sum of delays as detail and labelled them, so that they can be individually noticed upon browsing on any part of the chart.
Not only I edited the aliases, I sorted the divisions by color and added the sum of delays as detail and labelled them, so that they can be individually noticed upon browsing on any part of the chart.
At the left bottom, the details can be seen clearly because of the labelling and detailing of the dimension. Finally, I added an average line on the top middle for the complete sum of delays regardless of divisions for extra detail which was done under analytics tab.
At the left bottom, the details can be seen clearly because of the labelling and detailing of the dimension. Finally, I added an average line on the top middle for the complete sum of delays regardless of divisions for extra detail which was done under analytics tab.
The second chart is similar to the first chart, the main difference is that I preferred an area chart here because I wanted to put emphasis on the overwhelming density of delays weekdays to weekends, which can be displayed by an area chart very clearly. Similar to previous chart, I applied the 2024 filter, added day type as label and changed the aliases of 1 and 2 to weekday and weekend and displaying them by colors as well.
The second chart is similar to the first chart, the main difference is that I preferred an area chart here because I wanted to put emphasis on the overwhelming density of delays weekdays to weekends, which can be displayed by an area chart very clearly. Similar to previous chart, I applied the 2024 filter, added day type as label and changed the aliases of 1 and 2 to weekday and weekend and displaying them by colors as well.
I preferred a text table here because I wanted to show the number of delays by lines on 2024 numbers only but I wanted to draw attention by colors. The dark blue represents total delays over 2260, which is center, where as light blue represents the opposite. With the help of grand total column, any classification can be done upon clicking on it.
I preferred a text table here because I wanted to show the number of delays by lines on 2024 numbers only but I wanted to draw attention by colors. The dark blue represents total delays over 2260, which is center, where as light blue represents the opposite. With the help of grand total column, any classification can be done upon clicking on it.
For clearer display, I minimized the stepped color count to 2.
For clearer display, I minimized the stepped color count to 2.
An extremely useful benefit of Grand Total column. with a click, the most delays by lines can be seen not only by numbers but also with colors as well.
An extremely useful benefit of Grand Total column. with a click, the most delays by lines can be seen not only by numbers but also with colors as well.

On my last visualization, I chose packed bubbles because there was an opportunity create a hierarchy between categories and subcategory, which is the perfect condition for packed bubbles. I bound subcategories to the category and updated percentages as label instead of numbers to display the reasons of the delay by categories. Let's have a look.


Delays by Category
Delays by Category
Delays by Subcategory
Delays by Subcategory
How I built the hierarchy.
How I built the hierarchy.
For both categories, I added a table calculation to the chart, just to display the percentages.
For both categories, I added a table calculation to the chart, just to display the percentages.

At the final stage of my representation, I connected all my charts to each other from the Dashboard segment under Actions, so that whenever an entry has been clicked, it can be displayed on all charts, such as:



Connecting every entry with each other under Actions.
Connecting every entry with each other under Actions.
Upon Clicking the Service Management under the Total Delays by Categories chart, it can be seen on which months, which days at which lane and at what percentage to the whole it takes place, thanks to connecting all entries within each other.
Upon Clicking the Service Management under the Total Delays by Categories chart, it can be seen on which months, which days at which lane and at what percentage to the whole it takes place, thanks to connecting all entries within each other.

 
 
 

Comments


bottom of page