Group B

Q11: Design a distributed application using MapReduce which processes a log file of a system.

Log Processing using MapReduce

Solution and implementation for Q11 from Data Science Laboratory (ds).

11_setup.md
# Setup Guide — MapReduce Log Processor

---

## 1. Install Java

```bash
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version
```

---

## 2. Download & Extract Hadoop

```bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop
```

---

## 3. Set Environment Variables

```bash
nano ~/.bashrc
```

Add at the bottom:

```bash
export HADOOP_HOME=~/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

```bash
source ~/.bashrc
```

---

## 4. Configure Hadoop

**core-site.xml** — `nano ~/hadoop/etc/hadoop/core-site.xml`

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

**hdfs-site.xml** — `nano ~/hadoop/etc/hadoop/hdfs-site.xml`

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
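
Hadoop reads each setting from a `<property>` element that holds a `<name>` and a `<value>` child, so a mistyped tag can leave a setting undefined without an obvious error. The sketch below is one way to sanity-check a config file's contents before starting the daemons. `read_properties` and the inline `CORE_SITE` string are hypothetical names introduced for illustration; they are not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Inline copy of the core-site.xml content from this guide (illustration only).
CORE_SITE = """\
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property>, or raise ValueError."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        # A property without both children is treated as malformed here.
        if name is None or value is None:
            raise ValueError("each <property> needs a <name> and a <value>")
        props[name.strip()] = value.strip()
    return props


if __name__ == "__main__":
    for name, value in read_properties(CORE_SITE).items():
        print(f"{name} = {value}")
```

To check a real file, read its text first, for example with `open(path).read()`, and pass the result to `read_properties`.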

---

## 5. Fix JAVA_HOME in Hadoop Config

```bash
readlink -f $(which java)
# Output example: /usr/lib/jvm/java-8-openjdk-amd64/bin/java
```

```bash
nano ~/hadoop/etc/hadoop/hadoop-env.sh
```

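Find the `JAVA_HOME` line (uncomment it if needed) and set it to the JDK directory from the path printed above. For the example output, that is everything before `/bin/java`: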

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

---

## 6. Format Namenode

```bash
hdfs namenode -format
```

---

## 7. Set Up SSH (Passwordless)

```bash
sudo apt install openssh-server -y
sudo service ssh start
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost
```

---

## 8. Start Hadoop

```bash
start-dfs.sh
start-yarn.sh
jps
```
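
On a single-node setup, `jps` should list the HDFS daemons (NameNode, DataNode, SecondaryNameNode) and the YARN daemons (ResourceManager, NodeManager). The following sketch checks a captured `jps` listing for missing daemons. `missing_daemons` is a hypothetical helper and the process IDs in `SAMPLE_JPS` are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}

# Hypothetical jps output; real process IDs will differ.
SAMPLE_JPS = """\
4201 NameNode
4355 DataNode
4570 SecondaryNameNode
4812 ResourceManager
4960 NodeManager
5120 Jps
"""


def missing_daemons(jps_output):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        fields = line.split()
        # Each jps line is "<pid> <process name>".
        if len(fields) >= 2:
            running.add(fields[1])
    return EXPECTED_DAEMONS - running


if __name__ == "__main__":
    missing = missing_daemons(SAMPLE_JPS)
    print("all daemons running" if not missing else f"missing: {sorted(missing)}")
```

If a daemon is reported missing, its log file under `~/hadoop/logs` is usually the first place to look.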

---

## 9. Install Python

```bash
sudo apt install python3 -y
sudo apt install python3-pip -y
sudo apt install python-is-python3 -y
python --version
```

---

## 10. Verify Hadoop is Ready

```bash
hadoop version
hdfs dfs -ls /
```

11_run.md
# Run Guide — MapReduce Log Processor

---

## 1. Create mapper.py

```bash
nano mapper.py
```

Paste this code:

```python
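#!/usr/bin/env python3
"""Emit one (log level, 1) pair and one (IP, 1) pair per log line."""
import sys

for line in sys.stdin:
    # Expected line format: <ip> <level> <message...>
    parts = line.strip().split()

    # Skip blank or malformed lines with fewer than three fields.
    if len(parts) >= 3:
        ip = parts[0]
        log_type = parts[1]

        # Tab-separated key/value pairs, as Hadoop Streaming expects.
        print(f"{log_type}\t1")
        print(f"{ip}\t1")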
```

Save and close:
- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename

---

## 2. Create reducer.py

```bash
nano reducer.py
```

Paste this code:

```python
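#!/usr/bin/env python3
"""Sum the counts for each key. Input must arrive sorted by key."""
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    # Split on the first tab only; skip lines that have no tab.
    key, sep, value = line.strip().partition("\t")
    if not sep:
        continue
    try:
        value = int(value)
    except ValueError:
        continue  # skip lines whose count is not an integer

    if current_key == key:
        current_count += value
    else:
        # Key changed: emit the finished total before starting the new key.
        if current_key is not None:
            print(f"{current_key}\t{current_count}")
        current_key = key
        current_count = value

# Emit the total for the last key.
if current_key is not None:
    print(f"{current_key}\t{current_count}")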
```

Save and close:
- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename
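
The reducer only compares each key with the previous one, so it produces correct totals only when identical keys arrive on consecutive lines. Hadoop guarantees this through its shuffle and sort phase. The sketch below expresses the same idea with `itertools.groupby` and shows what happens on unsorted input. `reduce_sorted` is a hypothetical name, and the sketch assumes every input line is a well-formed `key<TAB>count` pair.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts for each run of consecutive identical keys."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


if __name__ == "__main__":
    sorted_input = ["ERROR\t1", "ERROR\t1", "INFO\t1"]
    unsorted_input = ["ERROR\t1", "INFO\t1", "ERROR\t1"]

    # Sorted input: one total per key.
    print(list(reduce_sorted(sorted_input)))
    # Unsorted input: ERROR is split into two separate runs.
    print(list(reduce_sorted(unsorted_input)))
```

The second call reports `ERROR` twice with a count of 1 each, which is why a local test of the scripts needs a `sort` step between the mapper and the reducer.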

---

## 3. Create log.txt

```bash
nano log.txt
```

Paste this data:

```
192.168.1.1 INFO User login
192.168.1.2 ERROR Disk failure
192.168.1.1 WARNING Low memory
192.168.1.3 INFO File uploaded
192.168.1.2 ERROR Network issue
```

Save and close:
- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename

---

## 4. Give Execute Permission

```bash
chmod +x mapper.py reducer.py
```

---

## 5. Test Locally First

```bash
cat log.txt | ./mapper.py | sort | ./reducer.py
```

Expected output:

```
192.168.1.1    2
192.168.1.2    2
192.168.1.3    1
ERROR          2
INFO           2
WARNING        1
```
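
As a cross-check that does not go through `mapper.py` or `reducer.py`, the counts for the sample log can be computed directly with `collections.Counter`. This is only a sketch for verifying the numbers, not a replacement for the MapReduce job. `count_keys` is a hypothetical helper, and `SAMPLE_LOG` repeats the contents of `log.txt` from step 3.

```python
from collections import Counter

SAMPLE_LOG = """\
192.168.1.1 INFO User login
192.168.1.2 ERROR Disk failure
192.168.1.1 WARNING Low memory
192.168.1.3 INFO File uploaded
192.168.1.2 ERROR Network issue
"""


def count_keys(log_text):
    """Count occurrences of each IP and each log level in one pass."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        # Same validity rule as the mapper: at least <ip> <level> <message>.
        if len(parts) >= 3:
            counts[parts[1]] += 1  # log level
            counts[parts[0]] += 1  # IP address
    return counts


if __name__ == "__main__":
    # sorted() compares by code point, so digits come before uppercase letters.
    for key, count in sorted(count_keys(SAMPLE_LOG).items()):
        print(f"{key}\t{count}")
```

Because digits sort before uppercase letters, the IP addresses are listed ahead of the log levels, the same ordering `sort` produces in the shell pipeline.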

---

## 6. Upload Log File to HDFS

```bash
hdfs dfs -mkdir -p /input
hdfs dfs -put -f log.txt /input
hdfs dfs -ls /input
```

---

## 7. Run MapReduce on Hadoop

```bash
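hdfs dfs -rm -r -f /output
# -files ships both scripts to the task nodes; it must come before -input/-output.
hadoop jar ~/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -files mapper.py,reducer.py \
  -input /input/log.txt \
  -output /output \
  -mapper mapper.py \
  -reducer reducer.py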
```

> If the JAR is not found, locate it first:
> ```bash
> find ~/hadoop -name "*streaming*.jar"
> ```
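
If the job is launched from a script rather than typed by hand, it can help to assemble the command as an argument list. The sketch below only builds the list and does not run Hadoop. `build_streaming_command` is a hypothetical helper, the JAR path is the one used in this guide, and the `-files` option is included on the assumption that the scripts should be shipped to the task nodes.

```python
import os
import shlex

# Path taken from this guide; adjust it if `find` reports a different location.
STREAMING_JAR = "~/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar"


def build_streaming_command(input_path, output_path,
                            mapper="mapper.py", reducer="reducer.py"):
    """Return the argv list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", os.path.expanduser(STREAMING_JAR),
        # Generic options such as -files go before the streaming options.
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command("/input/log.txt", "/output")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Hadoop is installed and running.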

---

## 8. View Output

```bash
hdfs dfs -cat /output/part-00000
```

Expected output:

```
192.168.1.1    2
192.168.1.2    2
192.168.1.3    1
ERROR          2
INFO           2
WARNING        1
```
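
The job writes IP counts and log-level counts into the same file, because the mapper emits both kinds of key. One possible way to separate them afterwards is to test each key with the standard `ipaddress` module, as sketched below. `split_counts` is a hypothetical helper, and `SAMPLE_OUTPUT` mirrors the expected output above with tab separators.

```python
import ipaddress

SAMPLE_OUTPUT = (
    "192.168.1.1\t2\n"
    "192.168.1.2\t2\n"
    "192.168.1.3\t1\n"
    "ERROR\t2\n"
    "INFO\t2\n"
    "WARNING\t1\n"
)


def split_counts(output_text):
    """Split job output into (log-level counts, IP counts) dictionaries."""
    levels, ips = {}, {}
    for line in output_text.splitlines():
        key, sep, count = line.partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        try:
            ipaddress.ip_address(key)
        except ValueError:
            levels[key] = int(count)  # not an IP, so treat it as a log level
        else:
            ips[key] = int(count)
    return levels, ips


if __name__ == "__main__":
    levels, ips = split_counts(SAMPLE_OUTPUT)
    print("levels:", levels)
    print("ips:", ips)
```

An alternative design is to have the mapper prefix each key with its type, which avoids guessing the type from the key's shape.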

---

## Troubleshooting

**Output folder already exists:**
```bash
hdfs dfs -rm -r /output
```

**log.txt not in HDFS:**
```bash
hdfs dfs -mkdir /input
hdfs dfs -put log.txt /input
```
