Q. What is a data engineer?
A.
Data engineers create processing to treat data.
- Data engineers build processes that handle data.
They make some data pipelines for extracting data, Transformation, and Loading into database which is called ETL.
- They build data pipelines that extract data, transform it, and load it into databases. This is called ETL (extract, transform, load).
There could be some issues on the pipeline as bad consistency, late processing, or broken pipelines and so on.
- Data pipelines can run into issues such as inconsistent data, late processing, or broken pipelines.
Data engineers will treat these problems to maintain the pipelines.
- Data engineers fix these problems to keep the pipelines running.
Also They accept some ask from Data Scientiests or Analystics to add more pipeline or support more business data.
- They also accept requests from data scientists and analysts to add pipelines or to support more business data.
Data Engineers' job makes a result as formatted and cleaned data so that Data scientiest and Analysists can extract values from it.
- Their work produces formatted, cleaned data, which helps data analysts and data scientists extract value from it.
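The ETL steps described in this answer can be sketched in a few lines of Python. This is only a minimal illustration: the CSV content, the `payments` table, and the function names `extract`, `transform`, and `load` are all made up for the example, and an in-memory SQLite database stands in for a real target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from a file, an API, etc.
RAW_CSV = """user_id,amount
1, 10.50
2,
3, 7.25
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and convert the types."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["user_id"]), float(amount)))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT user_id, amount FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

The row with the missing amount is dropped in the transform step, so only cleaned data reaches the table that analysts would query.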
Q. Explain indexing.
A.
It's similar to a table of books.
- An index can be thought of as the index at the back of a book.
Many databases store lots of data and support query which search some specific data for users.
- Databases store large amounts of data and support queries that search for specific data.
If data is too much, won't be easy to run query.
- If the amount of data is very large, scanning all of it for every query becomes slow.
That's why Indexing is needed.
- That's why indexing is needed: it makes searches more efficient.
Indexing saves hash values of data(as primary key in SQL) in advance. The hash values are in a B+tree structure
- It stores the values of the indexed columns (such as a primary key in SQL) in advance, typically in a B+ tree structure.
and that's why databses can run finding query fast.
- As a result, lookup queries can be executed quickly.
But indexing occupies some spaces of disks and makes DML queries complicated.
- On the other hand, an index occupies some disk space and slows down DML statements (INSERT, UPDATE, DELETE), because the index has to be updated as well.
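The effect of an index can be observed with SQLite's `EXPLAIN QUERY PLAN`. The sketch below is an illustration only: the `users` table, the `idx_users_email` index, and the helper `query_plan` are hypothetical names, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

def query_plan(conn, sql, params):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)

lookup = "SELECT id FROM users WHERE email = ?"
params = ("user500@example.com",)

# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup, params)

# With an index on email, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, params)

print("before:", plan_before)
print("after: ", plan_after)
```

The plan before the index reports a scan of `users`, while the plan after it mentions `idx_users_email`. The cost is that every later INSERT, UPDATE, or DELETE on `users` must also maintain that index.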
Q. What is the Replication factor in Hadoop?
A.
It's the number of duplication blocks in HDFS.
- It's the number of copies of each block that HDFS keeps.
Some data storing into HDFS will be devided into blocks of a certain size which is 128mb by default.
- Data stored in HDFS is divided into blocks of a fixed size, which is 128 MB by default.
The blocks are duplicated the replication factor times and spread to other machines.
- Each block is copied as many times as the replication factor specifies, and the copies are spread across different machines.
Replication factor is 3 by default but if it increases, data processing speed would be faster by data locality
- The replication factor is 3 by default. Increasing it can make data processing faster, because better data locality means more tasks can run on a machine that already holds the block.
and it would be difficult to loss data, but could occupy more disk spaces.
- In addition, data loss becomes less likely, but the extra copies occupy more disk space.
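The trade-off between safety and disk space can be worked out with simple arithmetic. The sketch below uses the defaults mentioned above (128 MB blocks, replication factor 3); the function name `hdfs_footprint` and the 300 MB file are assumptions for illustration, and the calculation ignores metadata and any compression.

```python
import math

BLOCK_SIZE_MB = 128      # default block size mentioned above
REPLICATION_FACTOR = 3   # default replication factor mentioned above

def hdfs_footprint(file_size_mb,
                   block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Estimate how a file is split into blocks and how much raw disk it uses."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": num_blocks,
        # Every block is stored `replication` times across the cluster.
        "block_replicas": num_blocks * replication,
        # The last block only takes the space it needs, so raw usage is
        # simply the file size multiplied by the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }

print(hdfs_footprint(300))
print(hdfs_footprint(300, replication=5))
```

With these numbers, a 300 MB file becomes 3 blocks, 9 block replicas, and 900 MB of raw disk usage. Raising the replication factor to 5 keeps the same 3 blocks but needs 1500 MB.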