
Introduction

It is common for a team of data scientists and data engineers to work on the same program simultaneously, or for different versions of a program to be deployed in different environments.

So it is useful to have tools that let you track changes in the source code, in configuration files or in documentation over time. The tools that let us do that are version control systems.

The most famous and frequently used version control system is Git.

The main aspect of Git is that every time you commit changes, Git takes a snapshot of what all your files look like at that moment and stores a reference to that snapshot.

A version control system can be centralized or distributed. In a centralized system, all members of the team are connected to a main server and all of them collaborate through that server.

However, if the central server is down, you can't collaborate. Distributed version control systems such as Git avoid this problem.

With Git, you have the whole history of the repository on your local computer, which gives you instant access to all changes. That means better performance.

Git is very easy to install. To install it you have to go here and follow the instructions.

In this article you will see the following topics:

How to configure Git
Git repo init & first Git commands
Git remote repository & SSH Connection
Branch Management & Pull Request
Merge Conflicts & Git History Management
Git Stash, Git Reflog & Git Clone

Part I: How to configure Git

After you have installed Git, the first step is to open Git Bash and tell it who you are. But before you start configuring Git, you have to know that Git lets us set configuration on three different levels:

SYSTEM → applied to all users on your computer and all their repositories
GLOBAL → applied to all repositories on the computer for the current user
LOCAL → only current repository

Now, we open Git Bash and say who we are with the following commands:

$ git config --global user.name "Moryba Kouate"
$ git config --global user.email "my.email@example.com"

To set up Notepad as the default text editor in Git, you can type:

$ git config --global core.editor "notepad"
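To see how the three levels interact, here is a minimal sketch that works only at the LOCAL level, inside a throwaway repository, so your real settings are not touched. The temporary directory, the name and the email are hypothetical placeholders.

```shell
# Create a throwaway repository (hypothetical location) so nothing else is modified.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# LOCAL level: written to .git/config of this repository only.
git config --local user.name "Local Name"
git config --local user.email "local@example.com"

# When the same key exists at several levels, the most specific one (LOCAL) wins.
effective_name="$(git config --get user.name)"
echo "effective user.name: $effective_name"

# List the settings of this repository together with the file each one comes from.
git config --local --list --show-origin
```

Replacing --local with --global or --system in the commands above would target the other two levels.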

Part II: Git repo init & first Git commands

In this part, you will see how to create a local repository and make your first commit. There are two ways to create a repository.

The first one is to create it locally and the second option is to clone an existing repository. But for now, we will see only how to create a repository locally.

Imagine you want to track changes in some files. The first step is to initialize a repository.

$ git init

You will now see the word “master” between parentheses (or “main”, depending on your Git version and configuration). Master is the main branch. Remember that each branch contains a history of snapshots.

The area that holds the files we want to commit is called the staging area. In this area, we can add one file or multiple files. Below are the commands to add some files or all files.

To add some files:

$ git add test.py README.md

To add all files:

$ git add .

If you want to add all files with a given extension, for example .py, you can use the following command.

$ git add *.py

To check the status of your repository, which branch you are on and whether there are changes to be committed, you can type the git status command. I personally prefer the short status, as below.

$ git status -s
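The steps so far can be tried end to end in a scratch directory. This is a minimal sketch, assuming two hypothetical files named as in the examples above; only one of them is staged, so the short status shows both states.

```shell
# Throwaway repository for the demo.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Two hypothetical files: one we stage, one we leave untracked.
echo "print('hello')" > test.py
echo "# My project" > README.md

git add test.py

# Short status: "A" marks a newly staged file, "??" an untracked one.
status="$(git status -s)"
echo "$status"
```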

To clear the Git Bash console, you can press Ctrl + L.

If I no longer want to track the changes of a specific file, without removing it from my local directory, I can use the following command:

$ git rm --cached name_of_file

The above command removes the file from the staging area but leaves the file in your working directory. If you want to remove all files of a specific directory from the staging area, you have to use the following command.

$ git rm -r --cached myfolder/
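A minimal sketch of what untracking does, assuming a hypothetical file settings.txt and a demo identity: after the command, Git no longer tracks the file, but the file is still on disk.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "secret=123" > settings.txt           # hypothetical file name
git add settings.txt
git commit -q -m "add settings"

# Stop tracking the file but keep it on disk.
git rm -q --cached settings.txt

tracked="$(git ls-files)"                  # files Git still tracks (none now)
ls settings.txt                            # the file itself is still there
git status -s
```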

To create a new file in the repository, you can use the command:

$ touch .gitignore

You can edit the file from the Git Bash console with the following command:

$ nano .gitignore

Then, for example, you can write in the .gitignore file the rules to ignore all files with the .class extension and everything in the bin folder.

*.class
bin/
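To check that the rules behave as expected, git check-ignore reports which paths are ignored and by which rule. A minimal sketch, assuming a few hypothetical files created only for the test:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Same rules as above: ignore compiled .class files and the whole bin folder.
printf '%s\n' '*.class' 'bin/' > .gitignore

# Hypothetical files to test the rules against.
mkdir bin
touch Main.class bin/app main.py

# check-ignore -v prints the rule that matches each ignored path.
git check-ignore -v Main.class bin/app

# Only .gitignore and main.py show up as untracked.
status="$(git status -s)"
echo "$status"
```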

You can record your staged changes, with a descriptive message, using the commit command. Git then creates a snapshot of those changes.

$ git commit -m "changes done"

To check the history of all the changes and commits made, you can use the command:

$ git log

Imagine that we forgot to include a change to a file, or that we need to change the message of the last commit. In this situation, after staging any forgotten change with git add, we have to use this command:

$ git commit --amend

The editor will then open, and you can modify the commit message and save it.
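A minimal sketch of amending, assuming hypothetical files and messages. Here the -m option supplies the new message directly, so the editor does not open.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "print('hello')" > test.py
echo "# My project" > README.md

git add test.py
git commit -q -m "chnages done"            # typo in the message, README.md forgotten

# Stage the forgotten file and rewrite the last commit with a better message.
git add README.md
git commit -q --amend -m "add test script and README"

git log --oneline
commit_count="$(git rev-list --count HEAD)"
last_message="$(git log -1 --format=%s)"
```

The history still contains a single commit, because --amend replaces the last commit instead of adding a new one.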

Part III: Git remote repository & SSH Connection

When we work as a team of data professionals, we need a common repository hosted on the internet and available 24/7. GitHub is one of the most famous hosting services for remote repositories.

Two other interesting services are Bitbucket, supported by Atlassian, and GitLab. For this part, I assume you already have a GitHub account. If you don’t have a GitHub profile, go here to register.

For example, if I want to connect to one of my GitHub repositories, I can type git remote add, followed by a name for the remote and the path of my repository.

You can copy the right path from the page of your GitHub repository.

$ git remote add origin https://github.com/your_name/your_app.git

Then you can push your local files with this command:

$ git push -u origin master

As a member of a team of data professionals, you will use the SSH protocol many times in order to access other computers.

With SSH you can easily log into a remote computer and run any commands or make any changes you wish on that computer. The goal of SSH is to make the connection between computers much more secure.

In fact, to create a connection between two computers, you need a public key on the remote computer and a private key on the local computer.

So the first thing to do is to generate SSH keys. The easiest way to generate a key is to open Git Gui, go to Help > Show SSH Key and then click on Generate Key.
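If you prefer the command line to Git Gui, the ssh-keygen tool from OpenSSH can generate the key pair as well. This is a minimal sketch, assuming a throwaway directory and an empty passphrase purely for the demo; for a real key you would normally keep the default location in ~/.ssh and set a passphrase.

```shell
# Hypothetical scratch directory so that no existing key is overwritten.
keydir="$(mktemp -d)"

# -t: key type, -C: comment (usually your email), -f: output file, -N: passphrase.
ssh-keygen -q -t ed25519 -C "my.email@example.com" -f "$keydir/id_ed25519" -N ""

# The .pub file is the public key, the part that is safe to share.
cat "$keydir/id_ed25519.pub"
```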

After you have generated the key, you can go to your GitHub settings and select the option SSH and GPG keys.

There you can create a new SSH key, where you paste the public key that you obtained from Git Gui. Now we can set the SSH path of our remote repository with this command:

$ git remote set-url origin git@github.com:name/application.git
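The two remote commands can be tried without any network access, because they only edit the local configuration. A minimal sketch, assuming the placeholder account and repository names used above:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Register the remote with its HTTPS path (placeholder account and repository).
git remote add origin https://github.com/your_name/your_app.git

# Switch the same remote to the SSH path.
git remote set-url origin git@github.com:your_name/your_app.git

# Show the fetch and push URLs now configured for each remote.
git remote -v
remote_url="$(git remote get-url origin)"
```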

Part IV: Branch Management & Pull Request

In this part, we will see how to create new branches, along with commands such as git log for a specific branch, git checkout, git switch and git branch. We will also see how to remove branches locally and in the remote repository.

Now, imagine that as a data engineer you have the task of implementing a program. The first step is to check which local branches I have with the following command.

$ git branch

To see all local and remote branches instead, I will use this command:

$ git branch -a

To create a branch, I can use:

$ git branch BA0003

There is also a command to create a branch and switch to it in one step. I will use the command below, where checkout changes branches and the -b option indicates that I want to create the new branch before switching to it.

$ git checkout -b BA0003

Now, in the parentheses, you will no longer see master, but (BA0003). Remember that you can change directory with the cd command.

$ cd base

To list the contents of your directory, you can use:

$ ls -l

Before looking at the log, it is important to move your files to the staging area with git add. Then we can create a commit with the command that we saw in the previous part. Now, to see the log of a specific branch, we can simply type:

$ git log BA0003

It is possible to go to a specific commit by typing the first few characters of its hash, as long as they identify the commit uniquely. So if I have, for example, the commit fee9e3b6dfaf02ef5f0b12c939cc9082487f2578, I will type:

$ git checkout fee9

Navigating to a commit is useful because sometimes we need to investigate a defect, so we want to see how the code worked before and run it.
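A minimal sketch of visiting an old commit, assuming a hypothetical file with two versions. The abbreviated hash is read with git rev-parse instead of being copied by hand from the log.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "draft" > app.txt
git add app.txt
git commit -q -m "first version"
first_commit="$(git rev-parse --short HEAD)"   # abbreviated hash of the first commit

echo "final text" > app.txt
git commit -q -am "second version"

# Go back to the first commit (detached HEAD) and look at the old content.
git checkout -q "$first_commit"
old_content="$(cat app.txt)"
echo "$old_content"

# Return to the branch we came from.
git switch -q -
new_content="$(cat app.txt)"
echo "$new_content"
```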

To delete a branch you have to use this command:

$ git branch -d BA0003

If you want to switch to the master branch you can easily type:

$ git switch master
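The branch commands of this part can be combined into one short session. This is a minimal sketch, assuming a hypothetical file app.py; the name of the main branch is read from Git because it may be master or main depending on your configuration. Note that git branch -d only deletes a branch whose work is already merged, so the sketch merges first.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "print('v1')" > app.py
git add app.py
git commit -q -m "initial commit"
main_branch="$(git symbolic-ref --short HEAD)"   # "master" or "main"

# Create the branch and switch to it in one step, then commit on it.
git checkout -q -b BA0003
echo "print('version 2')" > app.py
git commit -q -am "work on BA0003"
git log --oneline BA0003

# Go back to the main branch, bring the work in, and delete the branch locally.
git switch -q "$main_branch"
git merge -q BA0003
git branch -d BA0003

# Deleting the branch on the remote is a separate step
# (not run here, because this sketch has no remote):
#   git push origin --delete BA0003
git branch
```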

Now it is time to see what a pull request is. A pull request is a method of submitting contributions to a project.

In other words, you can pull changes from one branch to update another branch, and by submitting a pull request you ask other data professionals to review your changes and pull them into another branch.

After you push your modifications to your remote repository (GitHub), as I showed you before, you can go to the Pull requests section and create a new pull request.

Then you can see the changes requested by the reviewer. Obviously, you have to add your collaborators on GitHub to the repository that you want to share. GitHub provides a page where you can invite them.

Another useful command is git fetch, which downloads all changes from the remote repository into our local repository without modifying our working files.

But if I want to merge the fetched remote changes into my local branch, I need to type:

$ git merge
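To see the difference between fetching and merging without a GitHub account, a local bare repository can play the role of the remote. This is a minimal sketch under that assumption; the two collaborators, alice and bob, and the file app.txt are hypothetical.

```shell
work="$(mktemp -d)"

# A local bare repository stands in for the remote (instead of GitHub).
git init -q --bare "$work/remote.git"

# First collaborator: create a commit and push it.
git init -q "$work/alice"
git -C "$work/alice" config user.name "Alice Example"
git -C "$work/alice" config user.email "alice@example.com"
echo "first" > "$work/alice/app.txt"
git -C "$work/alice" add app.txt
git -C "$work/alice" commit -q -m "first version"
branch="$(git -C "$work/alice" symbolic-ref --short HEAD)"
git -C "$work/alice" remote add origin "$work/remote.git"
git -C "$work/alice" push -q -u origin "$branch"

# Second collaborator: clone the remote.
git clone -q -b "$branch" "$work/remote.git" "$work/bob"

# Alice publishes a new change.
echo "second version" > "$work/alice/app.txt"
git -C "$work/alice" commit -q -am "second version"
git -C "$work/alice" push -q

# Bob fetches: his files stay unchanged until he merges.
git -C "$work/bob" fetch -q origin
before_merge="$(cat "$work/bob/app.txt")"
git -C "$work/bob" merge -q "origin/$branch"
after_merge="$(cat "$work/bob/app.txt")"
echo "before merge: $before_merge, after merge: $after_merge"
```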

Part V: Merge Conflicts & Git History Management

What is a merge conflict? A merge conflict is an event that takes place when Git is unable to automatically resolve differences in code between two commits.

In fact, it is possible that multiple developers edit the same content and Git cannot tell which version of the code is the proper one.

To solve this kind of problem, you can open the file from Git Bash and edit it manually in order to resolve the conflict.
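A conflict is easy to reproduce on purpose. In this minimal sketch, two branches change the same line of a hypothetical file config.py; the merge stops, and the conflict is resolved by writing the final content by hand (here with echo instead of an editor), staging the file and committing.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "threshold = 1" > config.py
git add config.py
git commit -q -m "initial config"
main_branch="$(git symbolic-ref --short HEAD)"

# One branch changes the line...
git checkout -q -b feature
echo "threshold = 20" > config.py
git commit -q -am "feature: threshold 20"

# ...and the main branch changes the same line differently.
git switch -q "$main_branch"
echo "threshold = 300" > config.py
git commit -q -am "main: threshold 300"

# The merge stops because both branches changed the same line.
git merge feature || echo "merge stopped with conflicts"
cat config.py          # shows the <<<<<<<, =======, >>>>>>> conflict markers

# Resolve by hand (here we keep the feature value), then stage and commit.
echo "threshold = 20" > config.py
git add config.py
git commit -q -m "merge feature, keep threshold 20"
git log --oneline --graph
```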

Some important Git commands for history management are git rebase and git reset. Git rebase lets us move a branch onto a new base by replaying its commits on top of another commit. For more details and a practical example of git rebase, I suggest you visit this link.

As for git reset, you need to use it when you want to undo changes, for example when you have created a snapshot that you no longer need.

As you saw in the previous parts, when you want to move changes from the working tree to the staging area, you need to use git add. Then, to record the changes in your history, you need git commit. So if you want to follow the reverse process and go from the history back to the staging area, you need to type the command below, where HEAD~1 (the commit before the current one) is the commit you want to return to:

$ git reset --soft HEAD~1

But if you want to go back before the staging area, so that the changes are only in your working tree, you need to use:

$ git reset --mixed

So if you had added files to the staging area and you now type git status, you will notice that those files are back in the working directory as unstaged changes.

Then, if you need to go back to the beginning of all your actions and make your working directory clean, you will type:

$ git reset --hard
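The three modes can be compared side by side by looking at the short status after each one. This is a minimal sketch, assuming a hypothetical file a.txt with two commits, and assuming HEAD~1 (the previous commit) as the target of the soft reset.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "first" > a.txt
git add a.txt
git commit -q -m "first"
echo "second version" > a.txt
git commit -q -am "second"

# --soft: move the history back one commit; the change stays staged.
git reset -q --soft HEAD~1
after_soft="$(git status -s)"

# --mixed (the default mode): also unstage it; the change stays in the working tree.
git reset -q --mixed
after_mixed="$(git status -s)"

# --hard: discard it completely; the working tree matches the last commit again.
git reset -q --hard
after_hard="$(git status -s)"

printf 'soft:  [%s]\nmixed: [%s]\nhard:  [%s]\n' "$after_soft" "$after_mixed" "$after_hard"
```

In the short status, a letter in the first column means the change is staged, and a letter in the second column means it is only in the working tree.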

Part VI: Git Stash, Git Reflog & Git Clone

Imagine you need to start a new task immediately and leave the previous one unfinished. The solution is to save your current work in order to come back to it later. In this case, the command is:

$ git stash

Then you can use git switch to leave your current task and move to the branch for your new task.

So stash works as temporary storage for your changes. To list all the saved tasks, you can use:

$ git stash list

With git stash pop you can restore the most recently stashed changes. This recalls the idea of LIFO (last in, first out) in computer programming. If you want to know more about LIFO, you can read my article about Data Structures & Algorithms in Python.

To delete a specific stash, you can use the reference shown in the output of git stash list. The command will be:

$ git stash drop stash@{0}

To clear all stashes you can simply write this command:

$ git stash clear
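A minimal sketch of the stash workflow, assuming a hypothetical file task.txt with an unfinished change: the change disappears from the working tree while it is stashed and comes back with git stash pop.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "stable" > task.txt
git add task.txt
git commit -q -m "stable version"

echo "work in progress" > task.txt         # unfinished change

# Put the unfinished change aside.
git stash -q
during_stash="$(cat task.txt)"             # back to the committed content
git stash list

# Bring the most recent stash back and remove it from the list.
git stash pop -q
after_pop="$(cat task.txt)"
remaining="$(git stash list)"
```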

It is possible to lose some commits and then want to restore them. To solve this problem, we can use git reflog.

This command gives us information about each operation that moved the tip of a branch, such as each commit, checkout or reset on the master branch. To see all the details, a good practice is to use:

$ git log -g

So after you identify the lost commit, you can use the command below to restore it on a new branch.

$ git branch name_of_branch e57d

Notice that I give the branch a new name and put the commit hash after the name. Remember that the information in the reflog is stored for 90 days by default.

If you want to see only the reflog entries from the last hour, you can type:

$ git reflog --since="1 hour ago"
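A minimal sketch of a full recovery, assuming a hypothetical file notes.txt and a hypothetical branch name recovered. The commit is "lost" with a hard reset and then found again through the reflog, where HEAD@{1} means the position of HEAD one step ago.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo "first" > notes.txt
git add notes.txt
git commit -q -m "first commit"
echo "second version" > notes.txt
git commit -q -am "second commit"

# "Lose" the second commit by moving the branch one commit back.
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD pointed before the reset.
git reflog
lost_commit="$(git rev-parse "HEAD@{1}")"

# Restore the lost commit on a new branch (hypothetical name).
git branch recovered "$lost_commit"
recovered_message="$(git log -1 --format=%s recovered)"
echo "$recovered_message"
```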

One last useful Git command is git clone. With this command you can clone a remote repository. You need to go to your GitHub repository and copy the HTTPS or the SSH path. After that, open Git Bash and type:

$ git clone git@github.com:name/application.git

Great! You have cloned your remote repository.

Conclusion

Now you have a good understanding of how Git works and why it is so important for a Data Professional to know it. However, I think the most effective way to absorb these concepts deeply is to work on a project with some of your friends and try to develop a complex program.

In this way, you will face many problems that will push you to use Git commands in the most effective way and improve your understanding of the real utility of Git in a process of development.